A reference-free clustering method for the analysis of molecular break-junction measurements

Single-molecule break-junction measurements are intrinsically stochastic in nature, requiring the acquisition of large datasets of “breaking traces” to gain insight into the generic electronic properties of the molecule under study. For example, the most probable conductance value of the molecule is often extracted from the conductance histogram built from these traces. In this letter, we present an unsupervised and reference-free machine learning tool to improve the determination of the conductance of oligo(phenylene ethynylene)dithiol from mechanically controlled break-junction (MCBJ) measurements. Our method allows for the classiﬁcation of individual breaking traces based on an image recognition technique. Moreover, applying this technique to multiple merged datasets makes it possible to identify common breaking behaviors present across different samples, and therefore to recognize global trends. In particular, we ﬁnd that the variation in the extracted molecular conductance can be signiﬁcantly reduced resulting in a more reliable estimation of

The development of experimental tools for the collection of large datasets plays a key role in molecular electronics. [1][2][3][4] In particular, recent technological advancements in break-junction (BJ) techniques such as the scanning tunneling microscopy one (STM-BJ) 2,4 or the mechanically controlled one (MCBJ) 5 have made it possible to acquire statistically relevant datasets. Their measurement principle consists of repeatedly forming a quantum point contact in the presence of molecules while at the same time measuring the current flowing through it. Each breaking trace provides information about the conformation and configuration of the junction. 6 However, it is only through the statistical analysis of thousands of such traces, that an in-depth mapping of the breaking dynamics [7][8][9] and a meaningful interpretation of the molecular junction behavior can be obtained. [10][11][12] In break-junction measurements, molecules are usually identified by the presence of plateau-like features after the breaking point of the last gold-gold atom connection, which has conductance equal to the conductance quantum (G 0 ¼ 2e 2 =h, where e is the elementary charge and h is Planck's constant). Since the behavior of the molecules in the junction can vary from one breaking trace to another (e.g., due to different injection points, number of molecules, electrode shapes, anchoring configuration, electronic coupling, level alignments, etc…), the recorded breaking traces may exhibit diverse features. A common way to process these traces and obtain statistical information about the most probable conductance value of the molecule (G M ) is by building a conductance histogram and fitting the prominent peak with a lognormal distribution. 13 In the following, we illustrate why this approach may lead to inaccurate data interpretation and conclusions. For this purpose, we use a dataset recorded on an oligo(phenylene ethynylene)dithiol (OPE3) molecule consisting of more than 50 000 breaking curves.
These curves are obtained from six MCBJ samples and recorded at different breaking speeds and bias voltages [see details in Table S1 in Sec. II of the supplementary material), forming in total 16 datasets, all carried out at room temperature and under ambient conditions. Figure 1 shows two examples of conductance histograms built from these 16 datasets of breaking curves recorded at 100 mV. Although both datasets are recorded with the same experimental settings, they exhibit different molecular yields 14 (details about the calculation of the molecular yield are given in Sec. IV of the supplementary material). From this comparison plot, two main observations can be made: (i) the peak shapes and relative amplitudes are different for the two datasets, even though the same molecule is measured and (ii) the extracted values of G M are not the same and differ by up to a factor of 4, when comparing the two most extreme values.
In total, 11 different datasets have been recorded with a bias voltage of 100 mV, each exhibiting different molecular yields, varying from 3% to 63%. 14 The inset of Fig. 1 presents the extracted value of G M for all of them. The graph shows a considerable spread of about half an order of magnitude in conductance, and an apparent increase in G M for the increasing molecular yield. The dependence of G M on the molecular yield raises an important point about the correct interpretation of the extracted G M values. This is particularly important for datasets exhibiting various types of breaking curves in which the most probable conductance value obtained from the raw histogram cannot be attributed to a unique molecular conformation, and more importantly, cannot be considered as a universal conductance value associated with that of a single and fully stretched molecule.
The problem of classification of breaking traces has recently been tackled using machine learning (ML) tools. 15,16 Generally speaking, ML algorithms can be subdivided into two main categories: supervised and unsupervised learning. 17,18 Supervised learning is used when the nature of the desired ML model output is known. For example, recently, supervised learning was used to train an artificial neural network for classifying experimental breaking curves of gold breakjunctions based on labeled traces obtained by molecular dynamics simulations. 19 Moreover, a "deep" neural network has recently been applied to single-molecule measurements for DNA sequencing applications. 20 In contrast to that, the outcome of the unsupervised ML model is not predefined and the ML algorithms are used to detect the underlying structures of a given dataset. This approach allows, e.g., to classify the data according to specific characteristic features. Unsupervised learning has successfully been applied to the classification of breaking curves, 15,21,22 highlighting the importance of using more sophisticated tools to identify different types of breaking behavior in a given set of traces. However, the classification algorithm applied in those studies needed a reference vector, the choice of which may affect the clustering outcome, as shown in Fig. S2 of the supplementary material. Recently, an unsupervised clustering approach to identify the hierarchical data structure has been reported, 23 but in this case, the clustering required several parameters, and was performed on the 2D conductancedisplacement histograms, and not on the individual breaking traces.
Our approach, schematically depicted in Fig. 2(a), aims to identify features in the experimental breaking traces and group the traces accordingly. The general workflow for the unsupervised ML classification can be summarized as follows: (i) construction of the feature space containing the relevant information about the shape of every breaking curve and (ii) applying a ML clustering algorithm in the constructed feature space that groups feature vectors into clusters. In the following, the workflow is explained in detail. The starting point of the method is the set of individual breaking traces, of which several examples are presented in Fig. 2(b). Each trace exhibits a specific shape and corresponds to a different breaking scenario of the junction. The blue trace, for example, shows a sharp conductance drop below 1 G 0 followed by an exponential conductance decrease as a function of the electrode displacement. This behavior is indicative of tunneling across a barrier when no molecule is bridging the electrodes. On the other hand, the green and red traces exhibit a step-like behavior below 1 G 0 , which is associated with at least one OPE3 molecule trapped between the electrodes. However, the two curves do not have the same shape. The green trace has a conductance plateau around 1 Â 10 À3 G 0 , while the red one exhibits a longer plateau at a lower conductance (%1 Â 10 À4 G 0 ). These two traces exemplify the variability of molecular junctions during breaking and illustrate how MCBJ measurements allow to stochastically probe different molecular conformations/behaviors.
The creation of the applied feature space is partly inspired by the well-known MNIST dataset for handwritten digits, in which the images of the digits are reduced to 28 Â 28 pixel images. 17 The MNIST dataset is often used to train various supervised ML models to identify handwritten digit images. During the learning process, the ML model identifies relevant features related to each digit in the training set using the discretized images. We employ the same principle to construct the feature space for our breaking trace classification approach, even though it is an unsupervised learning problem. As any image constitutes a two-dimensional (2D) representation of the shape of an object, one can naturally think of transforming every breaking curve into an individual 2D image. In our case, the created images are individual 2D histograms. Figure 2(c) illustrates how a breaking curve is transformed into an image by defining a region of interest (ROI), i.e., the boundaries of the final image. In the case of the OPE3 dataset, the ROI is defined in a conductance range between 1 Â 10 À6 and 1 G 0 and an electrode displacement range of 0-2 nm. The conductance range excludes the behavior of the metallic contact and of the measurement  Table S1 in Sec. II of the supplementary material), respectively. The black dashed lines and the red/blue shaded regions correspond to log-normal distribution fits to the red/blue curves. The red/blue dots highlight the maximum of the log-normal distribution fits allowing to extract G M . The inset shows G M for all samples as a function of the molecular yield. The red/blue data points highlight the extracted G M values using the red/blue histograms. noise-floor, while focusing purely on the behavior after the breaking of the junction. In addition to the ROI, the number of bins along the conductance and displacement axes is defined, i.e., the resolution of the individual 2D histograms. For instance, the breaking curve shown in Fig. 2(c) is transformed into an M Â N pixel image. To construct the feature space, every 2D histogram is converted into a feature vector associated with a specific breaking curve. Each component of the obtained vectors represents one dimension in the feature space, meaning that one has to deal with a high-dimensional space. For example, in the case of 28 Â 28 pixel images [ Fig. 2(c)], the resulting feature space has 784 dimensions.
The last step of the classification task is to choose an appropriate clustering algorithm for the created feature space. For this work, we use the K-meansþþ method (from the Scikit-Learn Python library), which is one of the most popular clustering techniques (see details about the algorithm principle in Sec. VIII of the supplementary material). Its popularity is mainly due to its conceptual simplicity, low computational cost, and scalability, unlike more advanced clustering techniques. Another important advantage of the K-meansþþ algorithm is its ability to deal with high-dimensional spaces, which is a necessary requirement for our feature space. We note that other more advanced algorithms, e.g., taking into account nonisotropic cluster shapes, have been tested but fail in the case of such a high-dimensional feature space (see details in Sec. IX of the supplementary material).
To obtain a reduced representation of the high-dimensional space while preserving the maximum data variance, we employ a method called principal component analysis (PCA). This technique consists in projecting every feature vector onto the first three eigenvectors of the covariance matrix of the analyzed data. Figure 2(d) displays the reduced feature vector distribution, as well as the classification results obtained for one of the OPE3 datasets (sample 5a, see Table S1 in Sec. II of the supplementary material). We note that the PCA is not used in the clustering algorithm but only to reduce the 784-dimensional feature vectors to three dimensions for visualization purposes. The three clusters obtained from the high-dimensional clustering algorithm are plotted in different colors. It is important to realize that each point in the scatter plot corresponds to a single breaking trace. The clusters can subsequently be used to construct the 2D histograms belonging to the different classes, as shown in Fig. 2(d). The plot shows that the blue cluster (class 1) mainly contains breaking traces without any molecular signatures, while the green (class 2) and red (class 3) clusters are related to curves with plateau-like features. The green cluster contains breaking traces with conductance plateaus around 1 Â 10 À3 G 0 , while for the red class, longer plateaus are observed at lower conductance values (% 1 Â 10 À4 G 0 ).
We now apply this approach to investigate the influence of the molecular yield on the most probable conductance value. For this purpose, we group all the datasets recorded at a fixed bias voltage (V) of 100 mV, i.e., in 11 sets with a cumulative amount of traces exceeding 40 000 curves (see Table S1 in the supplementary material). The conductance and 2D histograms for the full dataset are shown in Fig. 3(a), alongside the histograms of the three classes identified using our clustering method [see Figs Fig. 3(c)], while the histograms of class 3 show flat and long plateaulike features around 1 Â 10 À4 G 0 [ Fig. 3(d)]. In the following, we focus on class 2 and class 3.
As the clustering algorithm has been applied to the full dataset, and all the classes were found in all the datasets, the three identified clusters are universal, allowing for tracking of these classes and their respective occurrence across them (see details in Sec. V of the supplementary material). The extracted values of G M for classes 2 and 3 as a function of the molecular yield are shown in Fig. 4(a). For both classes, G M remains largely unaffected by the molecular yield, with extracted values of 5.0 6 1.1 Â 10 À4 G 0 and 1.1 6 0.1 Â 10 À4 G 0 for class 2 and class 3, respectively. The horizontal red dashed line indicates the most probable conductance value obtained by considering all curves belonging to class 3, while the shaded area corresponds to the standard  Table S1 of the supplementary material). The blue, green, and red points correspond to the reduced feature vectors related to the three different clusters/classes formed after using the K-meansþþ method. The 2D histograms built from the breaking curves of each class are displayed according to the cluster color.

ARTICLE
scitation.org/journal/apl deviation of the red data points in Fig. 4(a). The plot also shows that the most probable conductance value of the unfiltered histogram is dominated by class 3, consisting of long traces up to 1.5 nm around 1 Â 10 À4 G 0 . Such traces are commonly considered to originate from a single and fully stretched molecule bridging the two electrodes. However, the plot also demonstrates that it is only after the removal of class 1 and class 2 that the molecular conductance of these "ideal" molecular junctions is unraveled. In contrast to the unfiltered data, the systematic dependence of G M on the molecular yield is now absent for classes 2 and 3. In addition, the standard deviation in G M for class 3 is about 5 times smaller than for the unfiltered dataset, allowing for a more accurate determination of the molecular conductance of the fully stretched molecule. Moreover, a large portion of the unfiltered conductance values lies outside the standard deviation of the conductance of class 3. These observations highlight the importance of data classification methods in break-junction measurements.
Finally, we also apply our method to a second dataset series with the aim to investigate the influence of the bias voltage on the determination of G M . For this study, to avoid sample-to-sample variations, six OPE3 datasets successively recorded on the same sample for different bias voltages (V ¼ 50, 100, 150, 200, 250, and 300 mV) are merged to obtain a dataset containing more than 10 000 curves [see the conductance and 2D histograms in Fig. S5(a)  The extracted G M values vs bias voltage are shown in Fig. 4(b). When considering the unfiltered data, a pronounced increase in G M is observed for an increasing bias voltage. Classes 2 and 3, on the other hand, exhibit a smaller dependence. As a comparison, the graph highlights the extracted conductance range of class 3 for the yield dependence. While the most probable conductance of class 3 recorded at 50 mV and 100 mV lie at the edge of the conductance range, the extracted G M in a bias voltage range of 150-300 mV lie outside. This observation suggests that the bias voltage indeed has the effect of increasing the conductance, as expected in the case of electron tunneling through a single, broadened level.
In our analysis, similar to previous studies, 15 the number of clusters is a free parameter and needs to be defined empirically. For the purpose of this letter, we have used 3 clusters. Nevertheless, this choice can be rationalized as follows: one can assume that each cluster corresponds to one particular molecular configuration with a distinct conductance. However, each cluster still includes local conformational and configurational changes which may lead to conductance fluctuations.
In our case, class 1 corresponds to the formation of junctions in which only single-barrier tunneling is observed, without any molecule bridging the gap. On the other hand, classes 2 and 3 show molecular signatures such as the formation of well-defined plateaus. Even though both classes have a molecular origin, clear differences between the two are observed. Class 2 is characterized by a slanted plateau with a higher conductance, whereas class 3 exhibits flat plateaus at lower conductance values. A possible explanation of this behavior may be related to variations in the anchoring of the molecule to the electrodes and the resulting changes in the injection point of the charges into the molecules. The long and flat plateaus present in class 3 are commonly believed to originate from a single-molecule bound to the two electrodes. In this scenario, charge transport occurs via the covalent bonds Au-S. For class 2, on the other hand, charges may also be injected through-space into the benzene rings via the overlap between its porbitals and the gold electrode wavefunctions. In this case, the sliding of the molecule on top of the electrodes and the resulting change in the orbital overlap may explain the gradual decay in conductance and  the shorter plateaus. To gain more insights into the junction formation, density functional theory and/or molecular dynamics calculations would be required. Such calculations, however, are beyond the scope of this letter.
To summarize, we demonstrated that unsupervised ML methods applied to MCBJ measurements allow for significant improvements in the extraction of the molecular conductance. Using our reference-free approach, we identify different molecular classes, which we track across several datasets. As such, we find that the two identified molecular classes are largely independent on the molecular yield, in contrast to the unfiltered data. Moreover, we find that the standard deviation of the extracted molecular conductance of the fully stretched molecule can be reduced by a factor of 5. Finally, by applying our approach on datasets with varying bias voltages, we observe a small dependence on the molecular conductance. The obtained results highlight the importance of using advanced and appropriate tools, such as ML algorithms, to efficiently analyze break-junction data and extract meaningful statistical molecular information.
See supplementary material for the influence of the reference vector in the method proposed by Lemmer et al., 15 a description of the whole OPE3 datasets, the extraction of the most probable conductance, and the calculation of the molecular yield. A more detailed description of the clustering algorithm and results are also presented.