Maximum volume simplex method for automatic selection and classification of atomic environments and environment descriptor compression

Fingerprint distances, which measure the similarity of atomic environments, are commonly calculated from atomic environment fingerprint vectors. In this work we present the simplex method which can perform the inverse operation, i.e. calculating fingerprint vectors from fingerprint distances. The fingerprint vectors found in this way point to the corners of a simplex. For a large data set of fingerprints, we can find a particular largest volume simplex, whose dimension gives the effective dimension of the fingerprint vector space. We show that the corners of this simplex correspond to landmark environments that can by used in a fully automatic way to analyse structures. In this way we can for instance detect atoms in grain boundaries or on edges of carbon flakes without any human input about the expected environment. By projecting fingerprints on the largest volume simplex we can also obtain fingerprint vectors that are considerably shorter than the original ones but whose information content is not significantly reduced.


I. INTRODUCTION
Materials science has become to a large extent a data driven science.Several data banks exist that contains not only structural data, but calculated properties as well; many exceed the hundreds of thousands structural properties in number, with their number growing dramatically [1][2][3][4].Molecular dynamics simulations typically also generate very large data sets.Such large data sets can not any more be inspected by eye and tools for classifying the structures in an automatic way are needed.Atomic environments can be described in a quantitative fashion by descriptors called "atomic environment fingerprints" [5][6][7][8][9], that can also provide a description for entire crystalline structures [10].Atomic environment fingerprints are also used as inputs for supervised machine learning schemes [11][12][13] of potential energy surfaces.For such a use it is desirable that the fingerprint is able to detect any difference in the environment [14] while keeping the fingerprint vector as short as possible.
One of our goals will be the detection of grain boundaries, that are the disordered regions between one or two ordered phases.Grain boundaries have an important influence on physical properties of the system including strength, conductivity, ductility, and crack resistance to name but a few [15][16][17][18][19][20].
Several methods have been proposed in the literature to distinguish between certain reference crystalline structures and disordered and mainly liquid structures in melting and nucleation simulations such as Steinhardt parameters [21] and common neighbour analysis (CNA) [22].These methods have also been used to study dislocations, local ordering and grain boundaries [23][24][25][26][27].One of the disadvantages of these methods is that they are based on a sharp cutoff, and they end up lacking smoothness with respect to particle displacements occurring in MD or during relaxations.As its name suggests, in the adaptive common neighbour analysis [28] the cutoff is adapted to the environment of each atom.Although more robust compared to CNA, it remains sensitive to thermal vibrations.Different predefined crystalline structures can be distinguished by polyhedral template matching [29].SOAP [5] fingerprints coupled to machine learning methods were recently also used to predict properties of grain boundaries [30].Based on a formula to calculate the entropy for a system interacting only via pairwise forces, an atomic entropy can be obtained which allows to distinguish between liquid, FCC, BCC and HCP crystalline phases [31].
The common characteristic of all existing methods is that they require some human input about the relevant structures that are expected to be encountered in the simulation.This is in contrast to our method which selects all the relevant structures fully automatically based on a large pool of structures.The method is also applicable without any adjustments to any molecular system.

A. fingerprints and fingerprint distances
In this section we provide a short review of the overlap matrix fingerprint method, that we use to describe the local atomic environment.A complete description can be find in the original paper detailing the method [10].
In order to calculate the overlap matrix (OM) fingerprint for an atom k in a structure, we take into account the relative position of all the neighbours of that atom within a cutoff sphere (centered on atom k) of radius R c .Neighbours include all the relevant periodic images of an atom when dealing with an atom at the edge of a repeating unit for a periodic system.To each one of the atoms is associated a minimal set of normalized atomcentered Gaussians G ν (r − R i ), centered on the atom itself.The width of each Gaussian is given by the covalent radius of the atom on which it is centered.For carbon with its strong directional bonding we have used a set of s and p-type orbitals (ν = s, p x , p y , p z ) and denote the resulting fingerprint by OM[sp], for aluminum with its metallic bonding we have used only ν = s and denote the fingerprint by OM[s].We then calculate the overlap between Gaussian functions in the sphere.
Next, the overlap matrix S k i,ν,j,µ is multiplied by the amplitude functions f 2 is a cutoff function that vanishes at and beyond r = 2w = R c with two continuous derivatives.w gives the length scale over which f c (r) drops to zero and we typically choose it so that about 50 atoms are contained within the cutoff radius R c .The matrix whose columns are denoted by the composite index i, ν and whose rows are given by the composite index j, µ is then diagonalized to obtain the eigenvalues.Finally, the vector V k containing all the eigenvalues of the matrix Sk i,ν,j,µ is the fingerprint of atom k.It has a length L = 4N sphere for OM[sp] and L = N sphere for OM [s] where N sphere is the number of atoms in the sphere around the central atom.
By construction the fingerprint is robust against displacements of the atoms across the boundary of the sphere radius, and therefore it is possible to calculate derivatives of the fingerprints with respect to infinitesimal structural change around the atom k.The fingerprint vectors V k characterize the atomic environments around atom k and the fingerprint distance d i,j is a measure of the dissimilarity between two environments i and j.The fingerprint distance is obtained from the euclidean norm of the difference vector throughout this study:

B. Obtaining fingerprint vectors from fingerprint distances
The above formula 3 gives a trivial recipe to obtain fingerprint distances d i,j from a set of points represented by the fingerprint vectors in a space of dimension L. In the following we will derive the formulas for the inverse operation.Given a set of pairwise fingerprint distances d i,j we want to construct a set of points x i that will satisfy these constraints.The solution of this problem is not unique.The solution can however be made unique by requiring that the first point be the origin, x 0 = 0, and that for each consecutive point the number of nonzero components increases by one.Hence the points x i have the following structure: So after placing the first point at the origin, the next point lies on the positive x-axis at the right distance, the following on the xy plane (y>0), and so on.The components of the set of points x i 's can be obtained recursively from simple relations between the distances among the vectors V i 's.
The distance between x N and the origin, x 0 , is simply given by the norm of the vector: For M < N , the difference between column N and M is related to the distance between points x N and x M as By taking the difference between d 2 M,N and d 2 0,N we obtain a simplified set of equations: In Eq. 7, the unknowns x i,N depends only on other column elements x j,M with M < N .
We can write for M < N in general: and for M = N we have: The geometrical body having as corners the above calculated points is a N -dimensional simplex with volume x 1,1 x 2,2 . . .x N,N /N !.The above construction can be done for any set of Nenv(Nenv−1) 2 distances as long as the original V i 's giving rise to the distances via Eq. 3 are linearly independent.Since the number of environments N env is typically much larger than the length L of the fingerprint vectors, at most L points (including in the count the origin) can be obtained.If the number of linearly independent fingerprint vectors is less than L, x i,i will become zero for some i < L and it is thus not possible to increase the dimension of the simplex.In the context of our fingerprints, it turns out that the x i,i typically are not exactly zero but become very small which means that all the fingerprint vectors are essentially contained in a sub volume whose dimension is smaller than L. The component that is orthogonal to this subspace is then very small and can be neglected.This is the basic property which will be exploited for the fingerprint compression later in the paper.

C. Construction of the largest volume simplex
Now, we will describe how we can use the construction outlined above to obtain the largest volume simplex which we will simply denote by largest simplex (LS).We do this since we are interested to find the effective dimension l of the space spanned by the fingerprints which gives the number of the highly distinctive landmark environments together with these environments.We start by identifying the two environments characterized by the largest distance.This defines the origin x 0 and the first point along the x-axis, i.e. x 1 , and in this way the first two corners of the simplex, which is at this stage just a line.To enlarge in the next step the dimension of the simplex by one we search for the environment that will give the largest area triangle if the point x 2 , corresponding to this environment, is used as the third corner.We then increase the dimension of the simplex step by step and we choose the new corners in each step in such a way that the volume of the new simplex will be maximal.The procedure is stopped if in a certain step l the volume collapses to a very small value because additional fingerprint vectors are quasi linearly dependent on the previous ones.In this way an effective dimension l of the entire fingerprint space can be determined.Once this maximum volume simplex is constructed we can express other fingerprint vectors in the basis of the vectors x i spanning the LS simplex.To get the expansion coefficients, we just perform the same steps of Eqs. 8 to 12. that would be needed to add a corner to the simplex.However in this case we know already that the x l+1,l+1 from Eq. 12 will be negligible because we stopped the maximum volume simplex construction exactly for the reason that we could not find any point that gave a large x l+1,l+1 .

III. APPLICATIONS
In this section, we show some applications of the LS.In section III A we apply the methodology to the study of a variety of C 60 molecules, to identify the most distinct environments and group the most similar ones.In section III B we use the method to find the grain boundaries in a Al nanocrystalline material.In III C we exploit the LS to reduce the dimensions of the fingerprint and compare its performance with CUR decomposition method [32].

A. C60 clusters
Our first system to be studied consists of 5000 C 60 structures, i.e. 5000 × 60 atomic environments, that exhibit several structural motifs including sheets, chains, and cages.These structures were generated by minima hopping [33] runs coupled to DFTB [34].Our aim is to identify the most distinct atomic environments as well as to classify the environments.We use OM[sp] with a cutoff radius of R c = 2w = 6 Å and follow the approach described in section II to generate the LS with N = 60.In Fig. 2 we show the first twenty corners of the LS. which represent twenty highly distinct landmark environments in the data set.In agreement with basic chemical intuition the first two corners representing the two most different chemical environments are a four fold coordinated atom and a carbon atom at the end of a linear chain with only one nearest neighbor as shown in Fig. 2b and Fig. 2a.Other two fold coordinated atoms in chains are also represented by higher order corners of the LS as shown in Fig. 2f, 2q, 2r, and 2c.In Fig. 2c the reference atom is part of a chain but the chain points inside the cage which shows that our method can distinguish between chains that point inward or outward since it is not based solely on its nearest neighbours, but on its general environment.
The forth corner of the simplex is an atom with one nearest neighbor and near a hole in the C 60 shown in Fig. 2d.Other corners of the simplex also clearly represent truly different environments.For instance, the 8th corner of the LS shown in Fig. 2h is an atom in a graphite flake and the 16th corner of the LS is an atom in a fragmented part shown in Fig. 2p.Our data set contains only a few fragmented structures in the data set which are of type 2p and the LS could correctly recognize them as highly distinct environments.
Next, we employ the corners of the LS to analyse structures.Based on the fact that each corner represents highly distinct landmark environments, we can assume that each fingerprint that has a small distance to any of these corners represents an environment that is similar to the corresponding landmark environment.So, we assign each atomic environment to its closest corner if the fingerprint distance is less than a threshold value δ which we take to be 0.5.With this criterion, we calculate the number of environments which belong to each class as shown in Fig. 1.The environments which do not belong to any corner of the LS, because their fingerprint distance to the their closest corner is larger than δ, are shown in the blue bar in Fig. 1.Since the first corner is at the origin, Fig. 1 starts at zero.The energetic minimum of the C 60 molecule is the fullerene molecule.In this structural motif, the atomic environments for all of the carbon atoms are equivalent.This is not true any more if the fullerene has a so-called Stone-Wales defect [35].In the following we look at such a structure as well as a 60 atom graphite flake and categorize the atoms according to their fingerprint distance to the landmark environments, i.e. the corners of the LS.None of the atomic environments of these two structures is actually a landmark environment of the simplex.For the visualization, we assign a color to each corner of the simplex.All the atomic environments in the data that have a short fingerprint distance to this corner are then shown in this color.
Our method automatically classifies the atoms of the structure shown in Fig. 3 a into three types and we can easily verify by visual inspection that these three classes are in agreement with chemical intuition: We see an atom surrounded by two pentagons and one hexagon (corner 47 shown in Fig. 3 b); one pentagon and two hexagons (corner 38 shown in Fig. 3 c); or three hexagons (corner 23 shown in Fig. 3 d).As can be seen from Fig. 1, a large number of atomic environments in our data set are similar to these corners.
Another example is shown in Fig. 4. The atoms of the structure in a are similar to one of the 6 different corners of the simplex.These are shown in Fig. 4 b, c, d, e,  f, and g.So indeed groups of environments that have a short distances to a landmark environment share similar chemical environments.

B. Grain boundary networks in nanocrystalline Al
In our second application, we study a nanocrystalline Al aggregate with 255064 atoms containing grain boundary networks.The details on the generation of the nanocrystalline Al used here can be found elsewhere [31].We use the OM[s] fingerprint with a cutoff radius of R c = 5 Å to build the LS.We take N = 46 which is the same as the length of the fingerprint.Having generated the LS, we assign a different color to each of the corners of the simplex for the following visualizations.These corners are the most distinct environments in the nanocrystalline Al, i.e. each corner can represent a class of diverse environments in the data.We again categorize the atoms in the system according to their similarity to the corners of the LS and assign them the same color as the corners they resemble most.Visual inspection of Fig. 5 shows that the simplex method can find all the grain boundary networks, in agreement with the findings of Piaggi [31].In addition, it can also recognize differences between different grain boundaries and find different kinds of ordered-disordered phases as shown in Fig. 6.
In Fig. 6 we showed the first 20 corners of the LS.Fig. 6a shows a perfect crystalline FCC phase.Figs.6c and 6r show the defective crystalline FCC phases where one nearest neighbor of the central atom is missing.The corners shown in Figs.6e, 6n, 6p, and 6s correspond to atoms on a twisted grain boundary.The configurations from Figs. 6b, 6d, 6h, 6l, and 6t represent environments located on the boundary between ordered and disordered phases.Finally, some corners of the LS represent atoms in disordered phases such as those shown in Figs.6i and  6j.

C. The compression of the fingerprints
In section II we showed that once the LS is found, the original fingerprints can be projected onto the LS.In this section we will show that these projections can be regarded as a new fingerprint whose length is much shorter than the original fingerprint while containing most of the information of the original fingerprint.This is an example of data compression, a problem for which many algorithms are available such as CUR [32] decomposition.Assuming that F is the fingerprint matrix with dimension L × N where L is the length of the fingerprint and N is the number of atomic environments N = N env , i.e. ith column of F contains the fingerprint vector of atomic environment i, one can write F ∼ CU R in which C and R contain k selected columns and rows of F and U = C + F R + where A + indicates the pseudo-inverse of A and k < r = rank(F ).In order to find the reduced selected number of rows of matrix F , one writes its SVD decomposition as F = Ū DV T , where Ū (left singular where u ξ i is the ith component of ξth left singular vector and k is the number of rows that should be selected.Frequently, rows are selected with probability proportional to the leverage score.We employed a deterministic method [37,38] and select the row with the highest leverage score at each time.Then, the selected row is removed from the matrix and the rest of the rows become orthogonalized with respect to it.To select other rows this procedure is repeated.The selected rows are the most important features.One can also select columns of the matrix F , i.e. the most important atomic environments by following the same procedure but for F T .The selected rows and column are stored in R and C respectively.In the following, we employ the LS and CUR method to  Software Ovito [36] is used for the visualization.
reduce the length of the fingerprint by selecting the components of the fingerprint that contain the most important information.
In order to investigate whether the compressed fingerprint conserves the information encoded in the original fingerprint, we correlate all the pairwise fingerprint distances obtained by the original and compressed fingerprints [14].
Obviously fingerprint distances that are large with the original fingerprint should remain large with the compressed fingerprint.In the same way short distances should remain short.If this is the case all the points in a correlation plot between the fingerprint distances arising from the original and the compressed fingerprint will lie on or close to the diagonal.If there are points far away from the diagonal and in particular if some fingerprint distances of the compressed fingerprint are small whereas the original distances are large, there is a loss of information.
In Fig. 7 we show the correlation plot between the original fingerprints and the CUR-reduced and LS-reduced fin-gerprints using OM and SOAP [5] for our above-mentioned test of 1000 C 60 clusters with 1000 × 60 atomic environments.We used the same fingerprint parameters for OM as in section III A. For SOAP, we used the following parameters: l max = n max = 8, r δ = 4.0 Å, σ = 0.5 Å.The cutoff radius is the same 6 Å in both OM and SOAP.The software QUIP [39] is used to generate the SOAP fingerprints.The length L of the original fingerprints is 240 for OM and 325 for SOAP.We reduced the length of the fingerprints to l = 16.As can be seen in Fig. 7, the correlation is perfectly diagonal in the case of LS which indicates that vast majority of the information of the original fingerprint is retained in the LS-reduced fingerprint.There are however some deviations from the diagonal in the correlation plot between the original fingerprint and CUR-reduced fingerprint which indicates that some information is lost in the CUR-decomposition.The length of the reduced fingerprints is l = 16 while the length of the original fingerprint L is 240 for OM and 325 for SOAP.

IV. CONCLUSION
We have introduced an algorithm to construct a largest volume simplex in the space spanned by a large set of atomic environment fingerprint vectors.The number of corners of this simplex gives the effective dimension of the fingerprint vector space.The corners themselves represent landmark environments that can be used to analyse structures with a large number of atoms in a fully automatic way.So, in contrast to other methods, it is not necessary to include into our analysis tool criteria that are based on human expectations of what kind of environments are expected to be encountered in this system.We show that this analysis method can be used to detect grain boundaries and other typical environments in multi-grain metallic systems and to classify atomic environments in a carbon cluster in a way that is consistent with basic chemical intuition.Since only those components of the fingerprint vector that are inside the space spanned by the LS are relevant, projecting the fingerprint into the space spanned by the simplex reduces the length of the fingerprint without any significant loss of information.Therefore the method can also be used as a data compression method for fingerprints.

Figure 1 :
Figure 1: The number of atomic environments in the data set of C 60 structures which are similar to one of the corners of the LS.The blue bar represents environments which are not similar to any corner based on the the threshold value δ = 0.5.

Figure 2 :
Figure 2: The first twenty corners of the LS, i.e. the twenty most distinct atomic environments.The corners are shown in a different color than the rest of the atoms.The relative size and the colors of the atoms is due to the visualization purposes and is not physically important.

Figure 3 :
Figure 3: a) A C 60 with a Stone-Wales defect: The atoms are colored according to their closest corners which is shown by the same color in the other three images.b) corner 47; c) corner 38; and d) corner 23 of the LS.

Figure 4 :
Figure 4: a) A graphite flake whose atoms are colored according to their closest corners.b) corner 55; c) corner 33; d) corner 26; e) corner 25; f ) corner 9; and g) corner 7 of the LS.

Figure 5 :
Figure 5: Nanocrystalline Al containing grain boundaries.The LS is employed to find the grain boundary networks.5a view from top.5b view from front.5c view from left.5d view from right.5e perspective view.5f slice view.Software Ovito[36] is used for the visualization.

Figure 6 : 7 Figure 7 :
Figure 6: The first twenty most distinctive atomic environments in the nanocrystalline Al found by the LS.Red atoms are the central atoms whose local environment is one of the corners of the simplex and the atoms in their vicinity are depicted in blue.