Sensitivity and Dimensionality of Atomic Environment Representations used for Machine Learning Interatomic Potentials

Faithfully representing chemical environments is essential for describing materials and molecules with machine learning approaches. Here, we present a systematic classification of these representations and then investigate (i) the sensitivity to perturbations and (ii) the effective dimensionality of a variety of atomic environment representations, across a range of material datasets. Representations investigated include Atom Centred Symmetry Functions, Chebyshev Polynomial Symmetry Functions, Smooth Overlap of Atomic Positions, the Many-Body Tensor Representation and the Atomic Cluster Expansion. In both areas (i) and (ii), we find significant variation across the different representations in these tests, pointing to shortcomings and the need for further improvements.


I. INTRODUCTION
Machine Learning (ML) as a predictive modelling tool has gained much attention in recent years in fields ranging from biology 1 and chemistry 2 to materials science 3,4 , building on decades of successful applications in image recognition, natural language processing and artificial intelligence (AI) 5,6 . While machine learning has become increasingly popular in many fields due to its relative simplicity of application and powerful predictive capabilities, a key driving force is the increasing availability of information through data repositories, archives and databases of materials and molecules that include billions of structures 3,7 . In recent years, a large number of ML models for materials and molecules have been developed and many novel representation methods have been proposed [8][9][10][11][12][13][14][15] . Representations are the feature sets used as input data for data-driven machine learning models, which, after training, can be used to predict quantities of interest.
Some of these databases are focused on molecular configurations: PubChem 16 , DrugBank 17 and ChEMBL 18 are widely used in biological, chemical, and pharmacological applications driven by high-throughput screening for drug design and novel molecule discovery. Other databases, such as the Cambridge Structural Database (CSD) 19 , include crystals as well as small molecules. While these databases contain mostly molecular structures with their chemical or biological properties, recent efforts such as the Materials Project 20,21 , Automatic FLOW for Material Discovery Library (AFLOWLIB.org) 22,23 , Open Quantum Materials Database (OQMD) 24,25 , and the NOMAD archive 26 (which collates data from many other databases) now provide extensive electronic structure results for molecules, bulk materials, surfaces and nanostructures in a range of ordered and disordered phases, with chemical composition ranging from inorganic systems to metals, alloys, and semiconductors. With recent advancements in materials databases, access is available to billions of properties of materials and molecules and millions of highly accurate calculations through online archives. With such an extensive and diverse collection of molecular and materials data to hand, we can ask questions such as how to identify the most informative subset of data for a particular class of materials, and how best to interpret it for prediction through ML models. These questions are relevant for both classification and regression applications: for example, classifying materials as metallic or semiconducting, or predicting the band gap from structure, respectively. In both cases, the input data is constructed from information ranging from basic physical and chemical properties to precise geometrical information based on atomic structure and bonding topology.
a) Electronic mail: b.onat@warwick.ac.uk
b) Electronic mail: j.r.kermode@warwick.ac.uk
There have been many studies to determine these inputs, which are variously named descriptors, fingerprints or representations 13,[27][28][29][30][31][32] . Here, we define the prediction problem as

t = f(M_a, V_b(χ_k)),    (1)

where t is the target property of the material, χ_k represents the structure with index k within the database, M_a is the machine learning model with identifier a, and V_b is the input representation determined with identifier b that is used for optimizing f. We define representations as

V_b(χ_k) = g({h_i({x_j}, {α_m})}),    (2)

where x_j are the coordinates of the atom with index j within structure k, α_m is a physical or chemical property of the material and chemical environment, h_i is the descriptor mapping from the input space to a hyper-dimensional space with each dimension being an input feature of model M, and g is the encoding function that combines all of the h_i descriptors to make an overall representation. In Equation 1, the optimization of the parameters of f can be performed either for a single model or across several. A central question is which descriptors h_i to use and how to combine these to form representations. These can vary from defined sets of physical and chemical properties and atomic geometries to specialized hyper-dimensional mapping algorithms (see Figure 1). These representations and their building blocks, the descriptors, can be classified into three broad classes based on their construction: (i) atomic neighborhood density definitions, (ii) topology expansions and (iii) property-based selections. The representations can be further grouped according to their combination rules: mapping basis or filtering functions, histograms, connectivity maps or graphs, and finally direct contributions from material properties. Within each category, there have been various developments of representations and many successful applications, which we review briefly in Section II below, before specialising on atomic neighbourhood densities in Section III. Section IV lists the data sources used here for evaluating representations, and Section V describes our analysis methodology.
Results and discussion follow in Section VI.

A. Property-based representations
The idea of classifying molecules based on the relationships between their structure and the resulting activities or properties is an underpinning approach of modern chemistry 33 . The use of theoretical descriptors to identify these relations was first established in quantitative structure-activity/property relationship (QSAR/QSPR) modelling 33,34 , starting with the work of Wiener and Platt in 1947, who used indices based on chemical graph theory 33 . We use the name QSAR to refer to both structure-activity and structure-property relationship models. QSAR models are highly successful predictive modelling approaches that have been widely applied in chemistry and biology 33,34 . These models are determined using sets of descriptors that are defined through a selection of physical and chemical properties of materials and/or structural descriptors of molecules, which can be used in Equation 2 to build a representation V as a set of h descriptors, each including only one of the α m values for the selected property. Although these models are mainly based on simple input properties such as atomic numbers, valences, and hydrogen content, they also use chemical topology information, with atoms and their molecular bonds representing the nodes and edges of graphs, providing connectivity information without the need for atomic coordinates [33][34][35] . While QSAR models use graph-based topology, they mostly reduce the full connectivity information to indices for a subset of graphs or to fractions of the bonding information. However, the same topology information can also be used to build force fields 36 , for example through the use of tables of bonds, angles and dihedrals as well as non-bonded terms.
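To make the graph-theoretic indices concrete, the following sketch (an illustrative helper, not taken from the works cited above) computes the Wiener index: the sum of shortest-path bond distances over all atom pairs of a hydrogen-suppressed molecular graph, exactly the kind of single-number topology reduction described here.

```python
from collections import deque

def wiener_index(adjacency):
    """Wiener index: sum of shortest-path bond distances over all atom pairs.

    `adjacency` maps each atom index to the atoms it is bonded to; the
    molecule is treated as an unweighted, undirected graph.
    """
    n = len(adjacency)
    total = 0
    for source in range(n):
        # Breadth-first search gives shortest path lengths on an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adjacency[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
    return total // 2  # each pair was counted twice

# n-butane carbon skeleton C0-C1-C2-C3 (a path graph):
butane = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(wiener_index(butane))  # → 10
```

The branched isomer isobutane gives 9 instead of 10, illustrating how such an index distinguishes isomers from connectivity alone, without atomic coordinates.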
QSAR models have mainly been used for classification to enable high-throughput screening of molecules that share similar chemical or physical properties. The success of QSAR mostly depends on the choice of descriptor sets, and in particular, determining the optimum number of descriptors in such sets 34 . For example, compressed-sensing approaches have been developed that consider the optimal set of physical, chemical and geometric properties of materials to allow systematic selection of properties 35,37 .
Direct property-based approaches also employ reduction techniques to identify optimal models M in Equation 1, by choosing mathematical operators for g that combine descriptors from a given set. This optimization scheme, referred to as sure independence screening and sparsifying operator (SISSO), has been successfully applied as a classification technique to group materials (e.g. metals vs. non-metals) and as a regression method to determine a target property of interest 38 . Although traditional QSAR models use only reduced topology information within indices, a recent extension, atom2vec 13 , uses the connectivity information of each species in the chemical stoichiometry, but not the full chemical topology information. In this model, the stoichiometries of materials or molecules are modelled as 'words'. A representation is formed from the overall connectivity of each atom to the chemical environments defined by its stoichiometry. For example, Bi2Se3 can be described by the bonding of Bi atoms to the '(2)Se3' environment, while conversely Se atoms have '(3)Bi2' environments. All such pairs in all materials of a database are collected into a single atom-to-environment connectivity matrix. This simple approximation enables the approach to be used to screen billions of materials and make predictions such as identifying possible candidates for Li-ion battery elements 39 based only on chemical composition and stoichiometry.
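A toy version of such an atom-to-environment connectivity matrix can be built from compositions alone; the helper names and environment-string format below are illustrative, chosen to mirror the '(2)Se3' notation used above, and are not the atom2vec reference implementation.

```python
from collections import Counter

def environment_pairs(formula):
    """Decompose a composition dict into (atom, environment) pairs.

    For each species, the environment string encodes its own count and the
    rest of the composition, e.g. Bi in Bi2Se3 sees the '(2)Se3' environment.
    """
    pairs = []
    for species, count in formula.items():
        rest = "".join(f"{s}{c}" for s, c in sorted(formula.items()) if s != species)
        pairs.append((species, f"({count}){rest}"))
    return pairs

def connectivity_matrix(formulas):
    """Rows: atoms; columns: environments; entries: occurrence counts."""
    counts = Counter()
    for f in formulas:
        counts.update(environment_pairs(f))
    atoms = sorted({a for a, _ in counts})
    envs = sorted({e for _, e in counts})
    matrix = [[counts[(a, e)] for e in envs] for a in atoms]
    return atoms, envs, matrix

atoms, envs, M = connectivity_matrix([{"Bi": 2, "Se": 3}, {"Li": 1, "Se": 1}])
print(atoms)  # → ['Bi', 'Li', 'Se']
print(envs)
```

Adding more compositions only grows the environment vocabulary, which is what makes this kind of representation cheap enough to scan very large databases.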
Recent efforts also address the direct usage of electronic structure data to define descriptors. In these methods, the essence of the electronic structure information is extracted through histograms over density-of-states (DOS) 40 or over the selected band structures 41 into the descriptor vectors.
B. Topology-based representations

QSAR approaches use materials properties directly in their model descriptors; for improved predictive power, one may need to incorporate more structural information to describe chemical bonding and the full atomic-scale topology. In this group of descriptors, topological information is derived from the full connectivity graph of the atomic environment. Examples of this class of methods include the Coulomb and Sine Matrices 42 , n-gram graphs 43 , and graph-based Neural Network (NN) models such as DTNN 44 , SchNet 12 and CGCNN 45 . In all these models, the representations are built on the connectivity map of the atoms and hence on the chemical topology of the material. While Coulomb and Sine matrices use atom-to-atom connection information directly and are thus based on the inferred chemical bonding between atoms, the input representations of graph neural network models are defined through neighborhood analyses of atomic connectivity graphs and can be constructed from the full n-body topology.
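The Coulomb matrix, the simplest member of this class, can be assembled in a few lines from nuclear charges and positions; this is a minimal sketch rather than the reference implementation of Ref. 42.

```python
import numpy as np

def coulomb_matrix(Z, R):
    """Coulomb matrix: M_ii = 0.5 Z_i^2.4 and M_ij = Z_i Z_j / |R_i - R_j|.

    Z: (n,) nuclear charges; R: (n, 3) Cartesian positions in Angstrom.
    """
    Z = np.asarray(Z, dtype=float)
    R = np.asarray(R, dtype=float)
    n = len(Z)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                M[i, i] = 0.5 * Z[i] ** 2.4  # self-interaction fit term
            else:
                M[i, j] = Z[i] * Z[j] / np.linalg.norm(R[i] - R[j])
    return M

# H2 molecule with a 0.74 Angstrom bond:
M = coulomb_matrix([1, 1], [[0.0, 0.0, 0.0], [0.74, 0.0, 0.0]])
```

The matrix is symmetric and invariant to translation and rotation, but not to atom ordering, which is why sorted-eigenvalue or row-norm-sorted variants are used in practice.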
Another approach to using topology information such as bonds, angles and higher-order many-body contributions is to group them into sets of building blocks and compute histograms over these blocks across the input datasets. Successful implementations of this approach include Bag-of-Bonds (BoB) 46 and its extensions to higher-order contributions, for example Bag-of-Angles and k-bags.
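The bagging idea can be sketched as follows: pairwise Coulomb terms are grouped by element pair, sorted, and zero-padded to a fixed per-bag length so that every molecule maps to the same feature layout (a sketch under those assumptions, not the BoB reference code).

```python
import numpy as np

def bag_of_bonds(Z, R, bags):
    """Group pairwise Coulomb terms Z_i Z_j / R_ij into per-pair-type 'bags'.

    `bags` fixes the bag keys and their lengths up front; each bag is sorted
    in descending order and zero-padded so the feature length is constant.
    """
    Z = np.asarray(Z, float)
    R = np.asarray(R, float)
    collected = {key: [] for key in bags}
    n = len(Z)
    for i in range(n):
        for j in range(i + 1, n):
            key = tuple(sorted((int(Z[i]), int(Z[j]))))
            if key in collected:
                collected[key].append(Z[i] * Z[j] / np.linalg.norm(R[i] - R[j]))
    features = []
    for key, size in bags.items():
        bag = sorted(collected[key], reverse=True)
        features.extend(bag + [0.0] * (size - len(bag)))
    return np.array(features)

# Water: one bag for O-H pairs and one for H-H pairs, padded to fixed sizes.
bags = {(1, 8): 4, (1, 1): 2}
v = bag_of_bonds([8, 1, 1],
                 [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]],
                 bags)
```

Sorting within each bag restores permutation invariance, and the padding makes vectors from molecules of different sizes directly comparable.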
Topology information provides atomic connectivity data and is very useful for structural similarity comparisons, since this information is automatically invariant under changes in the atomic positions. Since all these models use chemical and physical properties of materials and molecules as well as predefined topological information, they inevitably depend on the availability of information from either experimental results or highly accurate ab initio calculations. For novel material discovery, one may need to use electronic structure calculations to generate new data comprising the properties and topology of uncharted material compositions. However, such database construction for new structures is extremely expensive.

C. Atomic Neighbourhood Density representations
Modelling materials and molecules at the atomic scale is essential to identify novel materials for applications of interest or to determine candidate molecules with a specific function in a desired medium. When accuracy and reliability are of concern, preference is usually given to ab initio calculations. However, these first-principles calculations are inaccessible for systems requiring long time-scales and/or large numbers of atoms. To significantly reduce the computational cost, interatomic potentials or force fields are employed to define the interactions between atoms using parametrized functional forms in place of the electronic interactions. Although these approaches allow larger numbers of atoms to be included in simulations and provide access to much longer time-scales, their prediction accuracies are typically limited as a result of the simplified models describing interatomic interactions.
A prominent recent trend has been to address this limitation with machine learning interatomic potentials (MLIPs), i.e. by replacing fixed functional forms for the atomic interactions with data-driven machine learning models that map from atomic positions to the potential energy surface, via descriptors based on the local neighbourhood of each atom. This approach, based on extracting atomic densities, has been applied in pioneering models using symmetry functions 8 and more recently the so-called smooth overlap of atomic positions (SOAP) 9,47 . Symmetry functions are widely and successfully used in neural network interatomic potentials (NNPs) 8,32,48,49 for many molecules and materials, including water, crystals in various phases 50 and amorphous solids at various concentrations 51,52 , while SOAP is typically incorporated in Gaussian Approximation Potentials (GAP) 9,47 . GAP has been applied to many molecules and materials 4,[53][54][55] .
In this final class of descriptors, there have been many developments in recent years. In addition to atom-centered symmetry functions (ACSF) and SOAP descriptors, different basis expansions such as Bispectrum and Chebyshev polynomials have also been utilized in SNAP potentials 56,57 and NNPs 58 , respectively. Further advances are also provided in ML models based on atomic and many-body expansions with tensor representations such as Many-Body Tensor Representation (MBTR) 10 .
An alternative approach is to employ linear regression using a symmetric polynomial basis. This approach was pioneered by Bowman and Braams 59 , while more recent symmetric bases are the Moment Tensor Potentials (MTP) 15 and the Atomic Cluster Expansion (ACE) 60 . The MTP and ACE bases are also based on density projections and are therefore closely related to the descriptors described above. In particular, ACE can also be seen as a direct generalisation of SOAP and SNAP, providing features of arbitrary correlation order.
Recent works have demonstrated that the atomic-density representations can be unified in a common mathematical framework 61 , and that the descriptors can be decorated with additional properties to extend their capabilities, such as in λ-SOAP representations for learning from tensor properties 61 .
For all these ML models, different combinations of descriptors and representations have been used in many studies and for a wide range of materials 55,62,63 . While different approaches have been utilized in these studies, comparative assessments of the approaches have thus far been limited. Studies of the property- and topology-based descriptors on QSAR feature sets and of the performance of graph-based NN models show that the selection of optimal descriptor sets is very important to ensure high predictive accuracy of the ML models 63 . A number of studies of the atomic neighborhood density approaches provide analyses based primarily on trained models' accuracy on available datasets 63 , but so far there is no performance assessment of descriptors and representations for materials and molecules focused solely on the structure-to-hyper-dimensional encoding.
Following the works by Huo 10 and Jäger 64 , we identify the essential properties of descriptors and representations for encoding materials and molecules as follows: (i) Invariance: descriptors should be invariant under symmetry operations of rotation, permutation and translation.
(ii) Sensitivity (local stability): small changes in the atomic positions should result in proportionate changes in the descriptor, and vice versa.
(iii) Global Uniqueness/Faithfulness: the mapping of the descriptor should be unique for a given input atomic environment (i.e. the mapping is injective).
(iv) Dimensionality: relatedly, the dimension of the spanned hyper-dimensional space of the descriptor should be sufficient to ensure uniqueness, but no larger.
(v) Differentiability: descriptors should be continuous and differentiable functions of the atomic positions.
(vi) Interpretability: features of the encoding can be mapped directly to structural or material properties for easy interpretation of results.
(vii) Scalability: ideally, descriptors should be easily generalized to any system or structure, with a preference for no limitations on the number of elements, atoms, or properties.
(viii) Complexity: descriptors should have a low computational cost, so that the method is fast enough to scale to the required simulation sizes and to be used in high-throughput screening of big data.
(ix) Discrete Mapping: descriptors should always map to the same hyper-dimensional space with constant-size feature sets, regardless of the input atomic environment.
In this article, we concentrate our efforts on analysing the sensitivity (i.e. local stability) and compressibility (i.e. dimensionality) of the selected set of descriptors and representations, and do not address uniqueness/faithfulness; for an interesting recent investigation of this important issue see Ref. 65.
While there have been many related efforts on the analysis of property- and topology-based descriptors 35,37,38 , we choose to focus here on the atomic neighborhood density based descriptors, along with representative descriptors from the three groups of mapping basis functions, tensor representations, and polynomial representations, as follows: ACSF, Chebyshev polynomials in SF (CHSF), SOAP, SOAPlite, MBTR, and ACE 60 . We provide an analysis of the sensitivity and local stability of the descriptors, as well as testing their invariance under symmetry operations. We further assess the descriptors' information-packing ability using CUR matrix decomposition, furthest point sampling (FPS) and principal component analysis (PCA) dimension reduction techniques. The remainder of this article is organised as follows: in Section III we give a brief summary of the selected representations and descriptors; we describe the datasets in Section IV and the methods that we use for the analysis in Section V. We provide our results and discussion in Section VI.
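The PCA part of the dimensionality assessment can be sketched in a few lines of numpy: the effective dimension is estimated as the number of principal components needed to capture a chosen fraction of the variance of a descriptor matrix. The 99% threshold below is illustrative, not the value used in our analysis.

```python
import numpy as np

def effective_dimension(features, variance_threshold=0.99):
    """Smallest number of principal components whose cumulative explained
    variance reaches `variance_threshold`, for an (n_samples, n_features)
    descriptor matrix."""
    X = features - features.mean(axis=0)
    # Singular values of the centred matrix give the PCA variances.
    s = np.linalg.svd(X, compute_uv=False)
    var = s**2
    cumulative = np.cumsum(var) / var.sum()
    return int(np.searchsorted(cumulative, variance_threshold) + 1)

# Descriptors that only ever vary along two of their ten components:
rng = np.random.default_rng(0)
X = np.zeros((500, 10))
X[:, 0] = rng.normal(size=500)
X[:, 3] = rng.normal(size=500)
print(effective_dimension(X))  # → 2
```

CUR and FPS address the complementary question of which original feature columns (rather than linear combinations) carry the information, which keeps the selected features interpretable.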

III. ATOMIC NEIGHBORHOOD DENSITY REPRESENTATIONS
The descriptors sharing neighborhood density extraction can be unified in a general atomic expansion following the works in Refs. 60 and 61, and can be defined by

V_{i,n} = Σ_{j≠i} b_{i,n}(x_j − x_i).    (3)

Here, x_i is the position of atom i and b_{i,n} can be expressed with radial and angular basis functions or as a polynomial expansion. In the following, we give a brief summary of selected atomic neighborhood density representations following the above definition. These are Atom Centred Symmetry Functions (ACSF), Chebyshev Polynomial Symmetry Functions (CHSF), Smooth Overlap of Atomic Positions (SOAP and SOAPlite), the Many-Body Tensor Representation (MBTR) and the Atomic Cluster Expansion (ACE).

A. Atom Centered Symmetry Functions (ACSF)
The descriptors of atom centred symmetry functions were introduced by Behler and Parrinello with their atomic neural network potentials (NNP) 8,66 and further used in a wide variety of applications 8,48,50,52,64,66 . The method extracts atomic environment information for each atom in the configuration using radial contributions b_r(R_ij) and angular contributions b_a(R_ij, R_ik, A_ijk), where the distances are R_ij = |R_j − R_i| and the angle-dependent contributions are defined through their cosines via A_ijk = cos θ_ijk = R_ij · R_ik /(|R_ij||R_ik|). These contributions are invariant under the symmetry operations of translation and rotation, hence the name symmetry functions. While many symmetry functions have been proposed 8 , two choices of functions commonly used for b_r and b_a in many applications 8,49,52,58,67,68 take the radial function centred on atom i to be

G_2^i = Σ_{j≠i} e^{−η(R_ij − R_s)²} f_c(R_ij),    (7)

and the angular function centred on atom i to be

G_4^i = 2^{1−ζ} Σ_{j,k≠i} (1 + λ A_ijk)^ζ e^{−η(R_ij² + R_ik² + R_jk²)} f_c(R_ij) f_c(R_ik) f_c(R_jk),    (8)

where f_c is a cutoff function given by

f_c(R) = ½ [cos(πR/R_c) + 1] for R ≤ R_c, and 0 otherwise,    (9)

and R_c is the cutoff distance.
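A minimal sketch of the radial symmetry function G_2 and its cosine cutoff follows; the helper names are illustrative and this is not the reference NNP code.

```python
import numpy as np

def cutoff(r, r_c):
    """Behler-style cosine cutoff f_c, smoothly decaying to zero at r_c."""
    return np.where(r <= r_c, 0.5 * (np.cos(np.pi * r / r_c) + 1.0), 0.0)

def g2(positions, i, eta, r_s, r_c):
    """Radial symmetry function G_2 for atom i:
    sum over neighbours j of exp(-eta (R_ij - R_s)^2) f_c(R_ij)."""
    R = np.asarray(positions, float)
    r_ij = np.delete(np.linalg.norm(R - R[i], axis=1), i)  # drop self-distance
    return float(np.sum(np.exp(-eta * (r_ij - r_s) ** 2) * cutoff(r_ij, r_c)))

# Toy dimer with the neighbour at exactly R_s, so the Gaussian factor is 1
# and the value reduces to f_c(2.0).
val = g2([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]], i=0, eta=4.0, r_s=2.0, r_c=6.5)
```

Because only interatomic distances enter, rigidly rotating or translating the positions leaves the value unchanged, which is the invariance property discussed above.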
In this work, we used G_2 and G_4 with two parameter sets: the first is the traditional parameter set taken from Refs. 8, 48, 49, and 66 that is used in many ACSF-based NN potential models, and the second is extracted from the automatic ACSF parameter generation proposed in Ref. 68. For the rest of this work, we label the representation with the original parameter set as ACSF and the newer, systematically extended parameter set as ACSF-X, for which we take the same parameters as Ref. 68.

B. Chebyshev Polynomial Representation within Symmetry Functions (CHSF)
By introducing radial and angular bases that are invariant under translational and rotational symmetries, one can define different functions that provide invariance under symmetry operations. Another example that exploits this idea defines radial and angular functions as expansions

b_r(R_ij) = Σ_{α=0}^{N_α} c_α^(2) φ̃_α(R_ij),    (10)

b_a(A_ijk) = Σ_{α=0}^{N_α} c_α^(3) φ̃_α(A_ijk),    (11)

where N_α is the expansion order and the basis functions φ_α and their duals φ̃_α are defined in terms of the Chebyshev polynomials T_α via

φ_α(x) = T_α(2x/T_c − 1).    (12)

In Equations 10 to 12, α is the order of the polynomials, k = 1 except for α = 0 where k = 2, and (x, T_c) are taken to be (r, r_c) for radial functions and (A_ijk, π) for angular functions, respectively 58 . The atomic descriptors V_b are then defined using the sets of coefficients

c_α^(2) = Σ_{j≠i} w_{t_j} φ_α(R_ij) f_c(R_ij),    (13)

c_α^(3) = Σ_{j,k≠i} w_{t_j} w_{t_k} φ_α(A_ijk) f_c(R_ij) f_c(R_ik),    (14)

where w_t is the weight for species t. For single-species configurations w_t = 1, while for multi-species configurations, both structural and compositional parts contribute to the final descriptor.
A practical advantage of using polynomial expansion for radial and angular functions is the reduced number of input parameters as the only parameter for the expansion is the expansion order N α of the Chebyshev polynomials. In this work, we select N α = 9 for both radial and angular polynomial functions.
In this representation, the radial and angular contributions b_r and b_a are separate functions of distances and angles, respectively: b_r is defined by Eqn. 10 and provides a histogram of the distances present in the atomic environment. The angular contributions b_a, however, differ significantly between the two methods. While the Chebyshev polynomial variant defines b_a to be a histogram of angles using only A_ijk in Eqn. 11, ACSF combines both distances and angles in Eqn. 8 and thus defines a histogram of triangles.
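The radial expansion coefficients can be sketched directly with numpy's Chebyshev class; the species weights are omitted (single-species case, w_t = 1) and the helper name is illustrative.

```python
import numpy as np

def chebyshev_radial_coeffs(distances, n_alpha, r_c):
    """Expansion coefficients c_alpha of the radial distribution:
    c_alpha = sum over neighbours j of T_alpha(2 R_ij / r_c - 1) f_c(R_ij)."""
    r = np.asarray(distances, float)
    fc = 0.5 * (np.cos(np.pi * r / r_c) + 1.0) * (r <= r_c)
    # Map distances from [0, r_c] onto the Chebyshev domain [-1, 1].
    x = 2.0 * r / r_c - 1.0
    coeffs = []
    for alpha in range(n_alpha + 1):
        # A coefficient vector with a single 1 selects the polynomial T_alpha.
        t_alpha = np.polynomial.chebyshev.Chebyshev([0] * alpha + [1])
        coeffs.append(float(np.sum(t_alpha(x) * fc)))
    return np.array(coeffs)

# Neighbour distances resembling a diamond-Si environment:
c = chebyshev_radial_coeffs([2.35, 2.35, 3.84], n_alpha=9, r_c=6.5)
```

As noted in the text, the expansion order N_α is the only free parameter here, in contrast to the per-function η, R_s, λ, ζ choices of ACSF.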

C. Smooth Overlap of Atomic Positions (SOAP)
Descriptors can be constructed for extracting neighboring atomic environments using the smooth overlap of atomic positions (SOAP) approach 9 . In this method, the atomic density centred at atom i is defined by a sum of atom-centred Gaussians,

ρ_i(r) = Σ_j exp(−|r − R_ij|² / (2σ²)),

with the overall atomic density of a structure χ given by ρ̂(r) = Σ_i ρ_i(r), and one can build the SOAP kernel K(χ, χ′) with

K(χ, χ′) = ∫ dR̂ |∫ ρ_χ(r) ρ_χ′(R̂ r) dr|^ζ,    (15)

where the exponent ζ > 1 and the integral is calculated over all possible rotations R̂ of the overlapping densities of the χ and χ′ environments. In practice, as is elegantly shown in Ref. 9, an equivalent kernel can be rewritten in the form K(χ, χ′) = p̂(χ) · p̂(χ′) by selecting a set of orthonormal radial basis functions g_n(r) and angular basis functions given by the spherical harmonics Y_lm(θ, φ) to expand the atom-centred density at atom i,

ρ_i(r) = Σ_{nlm} c^i_{nlm} g_n(r) Y_lm(θ, φ),    (16)

and using the power spectrum of the expansion coefficients c^i_{nlm}, given by

p̂^i_{nn′l} = Σ_m (c^i_{nlm})* c^i_{n′lm},    (17)

where n and n′ are indices for the radial basis functions and l, m are the angular momentum numbers of the spherical harmonics. In SOAP as defined in Ref. 9, the radial basis functions g_n(r) are given in terms of polynomials. The representation of the atomic environment χ is then defined by p̂(χ), where p̂^i(χ) can be identified as the atomic descriptor for atom i. The SOAP descriptors and representation are specified by the expansion orders n_max for the radial basis and l_max for the angular basis. In this work, for compatibility with SOAPlite and with the polynomial expansion order of the Chebyshev polynomials in SF, we select n_max = 9 and l_max = 9.
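The reduction from expansion coefficients to the invariant power spectrum (Eqn. 17) can be sketched directly. The coefficients below are random toy values; in a real implementation they would come from projecting the atomic density onto the g_n Y_lm basis.

```python
import numpy as np

def power_spectrum(c, n_max, l_max):
    """Power spectrum p_{n n' l} = sum over m of conj(c_{nlm}) c_{n'lm},
    from coefficients c[n][l], an array over m = -l..l for each (n, l).
    The real part is kept: imaginary parts cancel in the symmetrised
    n <-> n' combination."""
    p = []
    for n in range(n_max):
        for n2 in range(n_max):
            for l in range(l_max + 1):
                p.append(np.sum(np.conj(c[n][l]) * c[n2][l]).real)
    return np.array(p)

# Toy coefficients: n_max = 2 radial channels, l_max = 1.
rng = np.random.default_rng(1)
c = [[rng.normal(size=2 * l + 1) + 1j * rng.normal(size=2 * l + 1)
      for l in range(2)] for _ in range(2)]
p = power_spectrum(c, n_max=2, l_max=1)
```

Summing over m is what removes the rotation dependence: a rotation mixes the m components within each l channel unitarily, leaving each p_{nn′l} unchanged.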

D. Modified Basis Expansion for SOAP (SOAPlite)
Introducing a different radial basis function and treatment of the spherical harmonics basis, a modified version of SOAP referred to as SOAPlite has been proposed recently 64 . In this version of SOAP, the radial basis functions are replaced by

φ_nl(r) = r^l e^{−α_nl r²},    (22)

where the α_nl are decay parameters of the non-orthonormal functions φ_nl(r), chosen so that each φ_nl decays to 10⁻³ at a cutoff radius taken from a set spaced in (R_c − 1)/n_max steps between 1 Å and R_c. The method also selects the real (tesseral) spherical harmonics for the angular basis, as described in Ref. 64. For a fair comparison, we select n_max = 9 and l_max = 9 for SOAPlite, with all other parameters taken as the defaults implemented in the GAP 9,47 and QUIP 69,70 codes.
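Our reading of this decay-parameter prescription can be sketched as follows: for each cutoff radius r_n in the stepped set, α is obtained by solving r_n^l exp(−α r_n²) = 10⁻³. This is an illustrative reconstruction, not the SOAPlite reference code.

```python
import math

def soaplite_alphas(n_max, l, r_c, threshold=1e-3):
    """Decay parameters alpha_nl such that r^l exp(-alpha r^2) drops to
    `threshold` at a set of cutoffs spaced between 1 A and r_c."""
    # Cutoff radii in (r_c - 1)/n_max steps between 1 A and r_c, as in the text.
    step = (r_c - 1.0) / n_max
    cutoffs = [1.0 + step * (n + 1) for n in range(n_max)]
    alphas = []
    for r_n in cutoffs:
        # Solve r_n^l exp(-alpha r_n^2) = threshold for alpha.
        alphas.append((l * math.log(r_n) - math.log(threshold)) / r_n**2)
    return alphas

alphas = soaplite_alphas(n_max=9, l=0, r_c=6.5)
```

The staggered cutoffs give the basis a spread of length scales, playing the role that orthonormalization of the polynomial basis plays in the original SOAP.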

E. Many-Body Tensor Representation (MBTR)
The many-body tensor representation (MBTR) 10 constructs representations of structures by defining contributions from k atoms in k-body terms with geometry functions g_k. In MBTR, these contributions from k atoms are smoothed with a probability distribution D, and the resulting contributions to the representation are given by

f_k(x, z_1, …, z_k) = Σ_i w_k(i) D(x, g_k(i)) Π_{j=1}^{k} C_{z_j, Z_{i_j}},    (23)

where Z are atomic numbers, C_z is an element-element correlation matrix consisting of Kronecker δ values, w_k are weighting functions and g_k are scalars for k atoms, while i = (i_1, · · · , i_k) runs over tuples of neighbouring atoms indexed by j. Common selections for the functions g_k are the atomic number for g_1, the (inverse) distance for g_2 and the angle for g_3. In this work, we select g_r = g_2 and g_a = g_3 for the geometry functions, with an exponential decay weighting function of the form w_k = e^{−β d_k}, where d_k is the distance spanned by the atom tuple and β is taken to be 4.0 for k = 2 and 3.0 for k = 3. We select a Gaussian distribution for D, and the continuous broadened results are discretized to N_x = (x_max − x_min)/∆x values using steps ∆x, where N_x is 100 and the [x_min, x_max] intervals are [0.1, 1.1] and [−0.15, π + 0.1π] for distances and angles, respectively.
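The k = 2 slice of this construction can be sketched as a weighted, Gaussian-broadened histogram; the inverse-distance geometry function and the σ value below are illustrative choices, and the element-correlation factor is omitted for a single species.

```python
import numpy as np

def mbtr_k2(distances, beta, x_min, x_max, n_x, sigma):
    """k=2 MBTR slice: a Gaussian-broadened histogram over the inverse
    distances, with exponential decay weighting w_2 = exp(-beta * R_ij)."""
    grid = np.linspace(x_min, x_max, n_x)
    out = np.zeros(n_x)
    for r in distances:
        weight = np.exp(-beta * r)
        g2 = 1.0 / r  # inverse distance as the k=2 geometry function
        # Normalised Gaussian broadening D(x, g2):
        out += weight * np.exp(-((grid - g2) ** 2) / (2 * sigma**2)) \
               / (sigma * np.sqrt(2 * np.pi))
    return grid, out

# Neighbour distances resembling a diamond-Si environment:
grid, f2 = mbtr_k2([2.35, 2.35, 3.84], beta=4.0, x_min=0.1, x_max=1.1,
                   n_x=100, sigma=0.02)
```

The broadening makes the histogram a smooth function of the atomic positions, which is what gives MBTR the differentiability property listed in Section II.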

F. Atomic Cluster Expansion (ACE)
The ACE 60 method constructs a complete basis of invariant polynomials. Each basis function may be interpreted as an invariant feature, which can then be collected into a descriptor map. Similarly to SOAP, ACE starts with a density projection,

A_{i,nlm} = Σ_{j≠i} g_n(R_ij) Y_lm(R̂_ij),    (24)

where g_n is a radial basis. The atomic positions are not smeared as in SOAP, SOAPlite and MBTR. Isometry-invariant features are then obtained by integrating the N-correlations over the symmetry group: for n = (n_α)_{α=1}^{N}, l = (l_α)_{α=1}^{N}, m = (m_α)_{α=1}^{N} we obtain

B_{i,nlm} = ∫_{SO(3)} Π_{α=1}^{N} A_{i,n_α l_α m_α}(R̂ ∘ χ) dR̂.    (25)

Finally, one selects a linearly independent subset of the B_{nlm} basis functions. A detailed description of this construction is provided in Ref. 71. Aside from the lack of smearing and the choice of radial basis, the 2-correlation functions are equivalent to SOAP, while the 3-correlation functions are equivalent to SNAP. Since the ACE construction readily applies to higher-order correlations, we will use up to 5-body correlations in order to test the effect of introducing significantly higher correlations into the descriptor. To control the size of the feature set, we use an a priori chosen sparse selection as described in Ref. 71. To complete the specification of the ACE descriptors we must define the radial basis: here we choose

g_n(r) = (R_c − r)² P_n(r),    (26)

where (P_n) is a basis of orthogonal polynomials such that g_n(R_c) = g_n′(R_c) = 0. In this work, the ACE method is used as implemented in the SHIPs package of Ref. 72.
In common with ACSF, SOAP and SOAPlite, ACE provides a histogram of triangles by combining the radial basis and spherical harmonics in Eqns. 16 and 24. In all representations, we select a cutoff distance of R_c = 6.5 Å.
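A radial basis with a polynomial envelope of the kind just described can be sketched as follows; the choice of Chebyshev polynomials for P_n and the squared envelope (R_c − r)², which enforces both the value and the derivative vanishing at the cutoff, are assumptions of this sketch rather than the SHIPs defaults.

```python
import numpy as np

def ace_radial_basis(r, n, r_c):
    """g_n(r) = (r_c - r)^2 P_n(x(r)): the squared envelope makes both
    g_n and its derivative vanish at r_c for any polynomial P_n.

    P_n is taken here as the Chebyshev polynomial on x = 2 r / r_c - 1."""
    x = 2.0 * r / r_c - 1.0
    p_n = np.polynomial.chebyshev.Chebyshev([0] * n + [1])
    return (r_c - r) ** 2 * p_n(x)

r_c = 6.5
# Check the two cutoff conditions numerically:
g_at_cut = ace_radial_basis(np.array([r_c]), n=3, r_c=r_c)[0]
eps = 1e-6
dg = (ace_radial_basis(np.array([r_c]), 3, r_c)[0]
      - ace_radial_basis(np.array([r_c - eps]), 3, r_c)[0]) / eps
```

Smoothly vanishing value and slope at R_c matter for MLIPs: they guarantee continuous energies and forces as atoms enter and leave the cutoff sphere.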

G. Modified Chebyshev Polynomial Symmetry Functions (CHSF-mix)
Noting the histogram of triangles provided by ACSF, SOAP, SOAPlite and ACE through Eqn. 8, Eqn. 17 and Eqn. 25, we examine the contributions from the angular terms with A_ijk and the radial basis of CHSF. Combining the radial b_r and angular b_a basis expansions of Chebyshev polynomials, we introduce a new b_a that combines both,

b_{a,nn′l} = Σ_{j,k≠i} φ_n(R_ij) φ_{n′}(R_ik) φ_l(A_ijk) f_c(R_ij) f_c(R_ik),    (27)

where the α index is substituted with l for the angular basis and with n, n′ for the radial basis sets, with the choices of (x, T_c) selected in φ_α(R_ij) and φ_α(A_ijk) as in Eqns. 13 and 14, respectively.
We analyse the benefit of this novel modification, CHSF-mix, over CHSF in Section VI A, using N_n = 9 and N_l = 9.

H. Descriptor implementations
All our analyses are carried out using our DescriptorZoo code 73 (github.com/DescriptorZoo), which includes implementations of the CUR, FPS and PCA analyses and uses AMP 74 , DScribe 42,75 , qmml-pack 76 , QUIP 69,70 and GAP 9,47 with its Python interface quippy; aepy 77 , a wrapper for the AENET 49,58,78 Fortran code implementing NN ML models based on ACSF and Chebyshev polynomial descriptors (CHSF); CHSF.jl for both CHSF and CHSF-mix 79 ; and the SHIPs.jl 71 code for the ACE representation (labelled ACE-SHIPs in the results below).

IV. DATASETS OF MATERIALS AND MOLECULES FOR ANALYSES
We used a wide range of materials and molecular databases to provide datasets for testing the various representation methods. For diversity, we selected a range of materials and molecular systems: Si for single-species tests, TiO2 for metal oxides, AlNiCu for metals and metal alloys, and molecular configurations containing the elements C, H, O, and N.
Si dataset: This dataset was constructed from the available GAP Si potential database of Ref. 80 plus the Si molecular dynamics (MD) database of Ref. 50. While the overall dataset includes various crystalline phases of Si, it also includes MD data. In total it comprises 3,583 structures with 242,139 atomic environments.
TiO2 dataset: We used a TiO2 dataset that was designed for building artificial neural network (ANN) potentials by Artrith et al. 49,58 using the AENET package. This dataset includes various crystalline phases of TiO2 as well as MD data extracted from ab initio calculations. The dataset includes 7,815 structures with 165,229 atomic environments in the stoichiometric ratio of 66% O to 34% Ti.
AlNiCu dataset: This dataset is formed from two parts: single-species datasets for Al, Ni, and Cu from the NOMAD Encyclopaedia, and multi-species datasets including Al, Ni and Cu from the NOMAD Archive. All single-species data was fetched from the NOMAD Encyclopaedia after removing duplicate records with degenerate atomic environments (e.g. equivalent structures from different ab initio calculations uploaded to NOMAD). For the multi-species data, we used only the final configuration step of each NOMAD Archive record, since these records include all intermediate calculation cycles, with the last configuration entry typically corresponding to a fully relaxed configuration. In our dataset, the NOMAD unique reference access IDs are retained along with a subset of their meta-information that includes whether the supplied configuration is from a converged calculation, as well as the DFT code, version, and type of DFT functionals, with the total and potential energies. This dataset consists of 39

V. ANALYSIS METHODS
Our analysis of the representations is based on the desired features for encoding structural information of materials listed above: invariance, sensitivity, dimensionality, differentiability, interpretability, scalability, complexity, and discrete mapping. While the importance of many of these features depends on the application, we focus here on invariance, sensitivity, and dimensionality.
Invariance of a representation or a descriptor under symmetry operations such as translation and rotation is of high concern in developing mapping methods, since the properties of a material should be identical under these changes of the configuration description. Structural representations are constructed to satisfy these conditions; otherwise, every possible transformation of the material would have to be included in the training of the machine learning model, leading to unaffordable numbers of permutations. Even a successfully constructed model of this kind can produce undesired predictions for uncharted transformations that are not in the training datasets.
Sensitivity is also an important property of a descriptor, since any application needs distinguishable and unique descriptor values. How sensitive the descriptor is to changes in the structure of the material determines the outcome of similarity analysis or molecular dynamics (MD) simulations. For example, if a descriptor produces exactly the same values under any perturbation, the outcomes will be indistinguishable; in MD simulations, such insensitivity will result in inaccurate forces and hence inaccurate dynamics.

Dimensionality, on the other hand, is more related to the under- or over-determination of the feature space for the machine learning application. While under-determination in the mapping of the hyper-dimensional space can easily lead to inaccurate predictions from an ML model, over-determination may also lead to undesired predictions, depending on how the over-determined features are eliminated.

A. Invariance and Sensitivity Analysis
Our analyses of invariance, local stability and sensitivity are carried out using diamond cubic Si structures. All the descriptors selected in this work start from a neighbour analysis based on an atom-centred perspective, as described in Sec. III, and therefore generate values that are invariant under translational symmetries by definition. However, whether they maintain invariance under rotation needs to be verified. To evaluate this, we perturbed a 4 × 4 × 4 cubic diamond crystal Si (c-Si) cell as follows: rotating the structure around the y-axis in a non-periodic unitcell, we calculated the difference dV_all of the descriptors and representations from the reference non-rotated structure (see Fig 2a).
For the sensitivity analysis, we perform three types of perturbations to a 4 × 4 × 4 c-Si unitcell as follows:
1. Mixed Perturbation: the central Si atom (dark blue in Fig 2b) is moved along the [100] direction (i.e. the line joining the light green atoms on the x-axis) within a periodic supercell by a distance dx, ranging from 10⁻⁸ Å to 0.1 Å.
2. Radial Perturbation: the atoms in the 1st, 2nd, 3rd, and 4th neighbour shells at different distances from the central dark blue atom are perturbed along radial and tangential directions (see Fig 2c). For the radial perturbation, atoms in each shell are moved along the vector that separates the atom from the central atom. The position change dx ranges from 10⁻⁸ Å to 0.1 Å.
3. Tangential Perturbation: neighbouring atoms in the same shells are perturbed on the sphere inscribed by the distance vector R_ij (see Fig 2c). This perturbation only changes the angular contributions to the descriptors, since the radial distances R_ij are kept fixed. The position change ranges from 10⁻⁵ Å to 0.1 Å.
For case 1, we look at the difference in the full structure representation dV (comprising all atomic descriptors) with respect to the unperturbed crystal, to investigate whether the full representation is sensitive to small perturbations of a single atom in the structure. In this mixed perturbation test case, all the neighbour distances change from the perspective of the perturbed central atom i, and the difference dV from the reference unperturbed structure depends on both radial and angular contributions.
For cases 2 and 3, we look at the difference of the descriptor dV i of the central atom i with respect to an unperturbed neighbourhood. This allows us to test the sensitivity of the individual radial and angular contributions.
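The perturbation tests above can be sketched in a few lines of Python. Here `descriptor` and `perturb` are placeholders for any of the representations and perturbation modes considered in this work, not the actual implementations used:

```python
import numpy as np

def sensitivity_curve(descriptor, atoms_ref, perturb, amplitudes):
    """Perturb a reference structure by increasing amplitudes dx and
    record the norm of the change in the descriptor relative to the
    unperturbed reference.

    `descriptor(atoms) -> np.ndarray` and `perturb(atoms, dx) -> atoms`
    are placeholders for a concrete representation and perturbation mode.
    """
    v_ref = descriptor(atoms_ref)
    dV = []
    for dx in amplitudes:
        dV.append(np.linalg.norm(descriptor(perturb(atoms_ref, dx)) - v_ref))
    return np.array(dV)

# amplitudes span 1e-8 A to 0.1 A on a logarithmic grid, as in the tests above
amplitudes = np.logspace(-8, -1, 29)
```

Plotting the resulting dV against the amplitudes on a log-log scale gives the sensitivity curves analysed below.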

Sensitivity to perturbations
We consider the question of whether the representation V = V({R_ij}_j) changes in a locally smooth and stable manner near some reference configuration R̄ = {R̄_ij}_j. Small changes in the configuration should lead to proportionate changes in the representation, a property that we call sensitivity. This requirement is necessary to represent an arbitrary smooth function with the same symmetries and to inherit its regularity, which in turn is key to obtaining accurate fits with few parameters or basis functions.
To illustrate this concept, consider a "feature map" v : ℝ → ℝ which is strictly increasing and hence invertible, but assume v(x) ∼ a₂x² as x → 0. It follows that x(v) ∼ √(v/a₂), i.e. the inverse has a square-root singularity. Suppose now that we wish to represent the linear function f(x) = x as f(x) = g(v(x)); then g(v) = f(x(v)) = x(v), i.e., it inherits the singularity, which makes it challenging or even impossible to obtain an accurate fit.
In general, we consider paths R(t) = {R_ij(t)}_j with R(0) = R̄ and expand the change in the descriptor to leading order, ‖V_t − V_0‖ ∼ c t^k as t → 0, for some k ≥ 1. We call a descriptor V linearly stable at R̄ if k = 1 for all possible perturbation paths, i.e., if the change in the descriptor is linear as the perturbation amplitude t → 0. If k > 1 then we call V linearly unstable at R̄. In our sensitivity analyses we choose different perturbation paths R_ij(t), leading to different paths V_t in descriptor space. From (30) we can then observe the stability or instability of a descriptor by analysing the slopes on a logarithmic scale. A linear slope, i.e., linear stability, is guaranteed to fulfil our sensitivity requirement. However, in certain high-symmetry settings this requirement must be relaxed, as we will see in Sec. VI A.
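Numerically, the exponent k can be estimated as the slope of log‖V_t − V_0‖ against log t, e.g. by a least-squares fit. A minimal sketch (not the analysis code used in this work):

```python
import numpy as np

def stability_exponent(amplitudes, dV):
    """Estimate the leading-order exponent k in ||V_t - V_0|| ~ c t^k
    from a least-squares fit of log||dV|| against log t.
    A slope near 1 indicates linear stability; a slope near 2,
    linear instability."""
    k, _ = np.polyfit(np.log(amplitudes), np.log(dV), 1)
    return k

# toy feature maps: a quadratic response gives k = 2, a linear one k = 1
t = np.logspace(-6, -2, 20)
k_quadratic = stability_exponent(t, t ** 2)   # linearly unstable
k_linear = stability_exponent(t, 3.0 * t)     # linearly stable
```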

B. Dimensionality Analyses
The descriptors analysed here are all constructed from feature sets extracted from the mapping of the atomic neighbourhood density to hyper-dimensional spaces. Their central objective is to cover all possible perturbations of the structure to ensure a faithful representation for the ML model. However, strictly following this requirement can lead to over-determination in the hyper-dimensional space. In other words, the representation may cover only a small subspace of the full hyper-space, with the subspace depending on the parameter set used. In the case of over-determination, feature sequences may contain many zero entries, or, for multi-species systems, may need to be padded with zeros to account for species missing from individual environments. Both of these cases lead to high sparsity in the descriptors, which in principle could be eliminated by carefully selecting parameters to remove unnecessary features from the final descriptor sets and hence from the representations. Moreover, using an over-determined mapping is likely to induce overfitting and noise in subsequent ML training. In ML applications, such non-informative data should be eliminated before the actual training of the model to reduce the training error and increase the accuracy of the resulting models.
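As a minimal illustration of this pre-processing step (not part of any of the descriptor implementations considered here), all-zero or constant feature columns can be dropped before training:

```python
import numpy as np

def drop_uninformative(X, tol=1e-12):
    """Remove feature columns that are all zero or constant across the
    dataset; X has shape (n_environments, n_features)."""
    keep = X.std(axis=0) > tol
    return X[:, keep], keep

# example: the middle column is all-zero padding and is removed
X = np.array([[1.0, 0.0, 2.0],
              [3.0, 0.0, 5.0]])
X_reduced, kept = drop_uninformative(X)
```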
Due to the well-known curse of dimensionality, the dimension of the parameter space of a global optimization problem is the key determiner of the difficulty of obtaining optimal solutions. When representations form the input data for an optimization problem, their dimensionality thus has a crucial role in determining its complexity. As the dimensionality increases, the number of possibilities rises combinatorially, drastically hindering the task of optimization. To keep the number of features at an affordable level for optimization while maintaining an accurate description, one can use dimensionality reduction techniques such as CUR decomposition 68 , farthest point sampling (FPS) 68 , the Pearson correlation coefficient (PC) 68 , or principal component analysis (PCA) 50,52 . While these techniques help to identify the most informative features in the descriptors, they also help to analyse whether the features of a descriptor provide sufficiently informative data, through an analysis of its "compressibility", and whether the representation leads to an over-determined embedding. Here, we have used CUR and FPS as implemented in Ref. 68, as well as PCA, to select the optimum number of features for the descriptors.

VI. RESULTS AND DISCUSSIONS
A. Sensitivity

Sensitivity to Rotations
In Fig. 3, we present the norm of the difference vector dV between the full structure representations of the rotated and the reference c-Si system for each approach considered.
Our analyses show that all descriptors maintain rotational invariance with high precision, with all errors below 10⁻⁹ (still above the machine precision ε ∼ 10⁻¹⁶). Together with the built-in invariance with respect to translations and permutations of like atoms of all the representations based on density projections, our analyses indicate that all approaches considered in this work fulfil the invariance properties with respect to rotation, translation and permutation for the structural representations.
For the sake of comparing the outputs from different approaches, taking into account the numerical precision of the underlying codes, we selected 10⁻⁸ as the lower bound in all our subsequent sensitivity analyses, following the lowest precision observed in these rotation tests.

Sensitivity to Perturbations
Further analyses were carried out for the sensitivity of the representations under atomic motions in structures, as described in Section V A. Recall from Sec. V A 1 that a slope of one indicates local stability (and smooth invertibility of the descriptor), while a slope greater than one leads to singularities in the representation. Fig 4 shows the results for the Mixed Perturbation, with the representation changes dV normalised so that all curves pass through the point (0.1, 0.1) to enable direct comparison. All representations have linear sensitivity within the entire range of the path, except for MBTR, which shows a mild preasymptotic sign of instability (change of slope from 1 to 2 above 0.01 Å) that is unlikely to cause any significant deterioration in the stability of the representation.

Radial perturbation of reference crystal
A deeper analysis of the sensitivity of representations can be made by analysing the responses of the atomic descriptors to different perturbation modes. As described in §V A, we calculated the change in the descriptor of the central atom i to a radial perturbation of a neighbour j as shown in Fig. 2c. The sensitivity curves corresponding to the radial perturbation of a neighboring atom in the 4 th shell are given in Fig. 5 (a). All descriptors have slope 1 sensitivity curves, indicating linear stability under this perturbation. This is unsurprising since all descriptors provide a relatively high resolution of the 2-body histogram.

Tangential perturbation of reference crystal
Next, we repeat the test of the foregoing section with a tangential perturbation of an atom in the 1 st shell. The resulting sensitivity curves are given in Fig. 5 (b), clearly showing slope 2 for all descriptors. Thus, according to § V A 1 all descriptors are unstable with respect to tangential perturbations, raising concerns due to the resulting singularity in the inverse of the descriptor map. However, the origin of this instability is invariance with respect to reflections about a plane, and fitting any target function with the same reflection symmetry need not be affected by the singularity in the inverse descriptor map.
Concretely, let i denote the centre atom and k the neighbour that is being perturbed in the tangential direction, i.e., dR_ij = 0 for j ≠ k, |dR_ik| = 1 and dR_ik ⊥ R⁰_ik. If R⁰ is symmetric under reflection through the plane that contains the origin and is orthogonal to dR_ik (this is the case here), then the configuration R_−t is the reflection of R_t to within O(t²) accuracy. Since all descriptors V we consider are invariant with respect to reflections, they necessarily satisfy dV_t/dt|_{t=0} = 0 (this is true for any function of the distances r_ij and the cosines A_ijk = cos θ_ijk), and hence V_t ∼ a₂t² as t → 0. In particular, the inverse descriptor map V → R must contain a square-root singularity along the path V_t.
On the other hand, suppose we aim to represent a property, e.g., a site potential ε = ε({R_ij}_j); then ε will also satisfy this reflection symmetry, which indicates that the square-root singularity is again removed. To illustrate this point further we modify the one-dimensional example from Sec. V A 1: assume that we wish to represent f(x) = x² = g(v(x)); then g(v) = f(x(v)) = x(v)² ∼ v/a₂ as v → 0, i.e., the singularity is removed in this case. More generally, this occurs whenever f(x) ∼ b₂x² as x → 0, i.e., when f is symmetric about the origin to leading order.

Tangential perturbation of a perturbed crystal
The analysis of the previous paragraph suggests that any perturbation of the structure's configuration that breaks the symmetry should lead to the linearly stable slope-1 case. We therefore test which descriptors are capable of capturing this symmetry breaking. We break the symmetry in several ways: we perturb an atom in a random tangential direction; we perturb atoms in the second shell, which does not exhibit the same reflection symmetry; and we perturb the reference crystalline structure in the radial direction from a chosen centre atom i before applying the tangential perturbations. The results are shown in Fig 5(b)-(d).
As predicted, any such symmetry breaking leads to changes in the slopes of the sensitivity curves of the descriptors in the limit t → 0. However, there are differences in how well each descriptor captures the symmetry breaking. First, there is some variation across descriptors in how significant the preasymptotic slope-2 regimes are, which indicates a reduced sensitivity. The most concerning effect, however, is the "dip" in the CHSF descriptor in Fig 5(c), highlighting a region of significantly reduced sensitivity (it can almost be thought of as a blind spot) for atomic displacements, where the perturbation barely changes the output values of the representations. To test whether adding additional features can remove this dip, we implemented an extended CHSF descriptor, labelled CHSF-mix, for which the radial and angular histograms are fully mixed, giving a similar description of the 3-body histogram to SOAP and ACE. This addition clearly removes the reduced sensitivity regions.

B. Dimensionality of Representations
In the second phase of our analysis, we consider four different datasets selected from those described in Section IV, namely Si, CHON, AlNiCu, and TiO₂. Each dataset contains a diverse range of configurations with thousands of structures. To identify how the dimensionality of the representations changes with different datasets using the same parameters, we used the CUR and FPS feature selection techniques and analysed the reduced dimensions of the representations by comparing them with the outcomes of PCA calculations. This analysis can also be regarded as a measure of the compressibility of each representation.

CUR decomposition
As a first step we analysed the representations including all element-wise descriptors. MBTR is considered as a representation-only approach, since its output cannot be broken down into element-wise descriptors. In Fig. 6, the total error between the full feature sets of each dataset and the reduced feature sets extracted from the CUR analyses is presented. As each approach provides a different number of features for the datasets at hand, depending on the selection of (hyper)parameters, we provide a complete list of the number of non-zero features in each representation for the corresponding datasets in Table I and in the legends of each panel of Fig. 6.
After removing any features of ACSF-X that are all zero for the Si and AlNiCu datasets (see Fig 6a), the CUR selection method of Section V B is applied across each dataset: features from the full feature set are selected one-by-one and added to the new feature set, and the error with respect to the full feature representation is calculated at each step. Using this method, the features that contribute most to the representation can be selected. As the features with lower contributions are added to the new feature set, the error cannot be reduced further and the overall error becomes constant, equal to zero within numerical precision. The number of features selected at the beginning of this plateau can be counted to determine the size of the compressed representation; conversely, the number of remaining features can be thought of as the over-determination of the representation.
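The essence of this procedure can be sketched with a simplified, deterministic leverage-score-based CUR column selection and the associated reconstruction error; the actual analysis follows the implementation of Ref. 68, which may differ in detail:

```python
import numpy as np

def cur_column_order(X, k):
    """Rank feature columns by CUR leverage scores computed from the
    top-k right singular vectors of X (n_samples x n_features).
    Returns feature indices, most important first."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    leverage = (Vt[:k] ** 2).sum(axis=0)
    return np.argsort(leverage)[::-1]

def reconstruction_error(X, cols):
    """Error of approximating X from the selected columns via least
    squares, analogous to the error curves in Fig. 6."""
    C = X[:, cols]
    coeffs, *_ = np.linalg.lstsq(C, X, rcond=None)
    return np.linalg.norm(X - C @ coeffs)
```

For an exactly rank-k feature matrix, the error plateaus at zero once k well-chosen columns are included, mirroring the plateaus discussed above.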
In Fig. 6, one can see three types of results: (i) error curves that gradually decrease to the point where the error plateaus, as for the TiO₂ results of ACE, ACSF-X and MBTR; (ii) errors that decrease step-wise to a plateau, as in the SOAP results; and (iii) errors that drop rapidly, as for the AlNiCu results of ACSF, ACSF-X, ACE and CHSF.
SOAP and SOAPlite, in Figs. 6c and d, respectively, can be directly compared since they differ only in the choice of radial basis function. The results for the Si, CHON, and TiO₂ datasets can be examined for both approaches. While SOAPlite has close to exponential error decay up to ∼70% of selected features, in SOAP this regime extends only up to 30%. After this, the SOAP error has a more step-like character.
Similar behaviour can also be seen in the type (iii) results, i.e. for ACSF, ACSF-X, CHSF, and ACE. For these methods, the first ∼10% of features account for most of the error reduction, while the remaining features only slightly reduce the error.
In most of our dimensionality reduction results, one can see distinct constant-error regions associated with over-determination. These features should ideally be eliminated before using the representation in ML models. For example, consider the results for ACSF-X and ACSF in Figs. 6a and b. As the aim of ACSF-X is to extend the number of features from the widely-used and well-tested standard ACSF parameter set, some of the features are expected to be irrelevant for representing the structures in our datasets. It is thus unsurprising that the dimension reduction analysis in Fig. 6a shows that only a fraction of features contribute significantly: around 14% for Si, 33% for CHON, 54% for AlNiCu, and 33% for TiO₂. A smoother feature reduction can be seen for the standard ACSF parameter set in Fig. 6b, where the error decay is very similar across all four datasets, with about 85% of the total features sufficient and thus around 15% of the features redundant.
A similar result is seen for SOAP, where around 75% of the features of the SOAP representations are sufficient to cover the structural variance across all datasets. This result is more striking than that for ACSF, since SOAP has about an order of magnitude more features in its representation. We can conclude that both ACSF and SOAP are robust approaches that cover the hyper-dimensional space of structural representation for a wide range of crystals and molecules.
The ACE AlNiCu results are significantly different from those for the other datasets. To identify whether the degree of the polynomial is the reason for this pattern, we carried out additional analysis with ACE, increasing the degree of the polynomial from 6 to 15 in steps of 3, as shown in Fig. 6h. Increasing the polynomial degree significantly increases the number of features; however, we find that the percentage of selected features in the final representation does not change significantly, and the pattern of error decay remains quite different from the rest of the datasets (e.g. Si, where 25% of the features can be removed from the representation, although it has an order of magnitude fewer features in the descriptor vectors than with the 15th degree polynomial expansion). To further investigate the extensive redundancy identified for MBTR features across all four datasets, we consider whether the discretised smearing of positions and angles used by the MBTR representation leads to clustering of the features in representation space. To identify any clustering of the data, we standardise the features of all the MBTR representations of Si and show the resulting error curve, labelled Si(STD). When compared with the Si curve, Si(STD) reduces the error by around an order of magnitude relative to the raw representation when less than ∼20% of features are selected. Standardisation thus suggests an even smaller selection of about only 20% of the full feature set. This may be due to the Gaussian smearing with D in MBTR, which significantly increases the correlation between features.

Principal Component Analysis
The CUR selection process is closely related to the principal components of the representation data for each dataset. To show whether these principal components are related to the final feature selections in CUR, we further analysed the datasets using PCA. In Fig. 7, the fraction of variance explained by the principal components of the four datasets is given for each representation. Since the PCA variances decrease from one to very small values, we determine a lower bound below which we consider the variance to be zero, shown as the zero level in our figures. The PCA variance results for the principal components follow the same trends as the CUR curves discussed above. Although the CUR results directly probe the hyper-dimensional space of the representation, the PCA results are not solely based on the selected features but are a collective property of all features, based on the covariance matrix from which the principal components are extracted. As seen in Fig. 7, selecting from the highest to the lowest variances and extracting the features with the highest values in the covariance matrix gives similar results to the CUR decomposition.
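The explained-variance fractions shown in Fig. 7 can be obtained directly from the SVD of the centred feature matrix; a minimal sketch:

```python
import numpy as np

def explained_variance_fraction(X):
    """Fraction of total variance explained by each principal component,
    computed from the singular values of the column-centred
    representation matrix X (n_samples x n_features)."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    var = s ** 2
    return var / var.sum()
```

The fractions sum to one and decrease monotonically, so the effective dimensionality can be read off as the number of components above a chosen zero level.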

Farthest Point Sampling
CUR and PCA are both based on the singular value decomposition (SVD), selecting features as orthogonal dimensions of the hyper-dimensional space and treating each feature as linearly independent, since linearly dependent features cannot span the space. However, neither of these methods considers whether the selected features have non-linear dependencies on other features. Another approach to select features without using the SVD is Farthest Point Sampling (FPS). In FPS, the feature selected at each iteration is chosen as the farthest from those already selected. Hence, FPS does not provide information on whether the selected feature is indeed linearly independent of those already selected. This can be understood by considering each feature's distance from the others late in the selection process, when few features remain to be selected. These remaining features either represent very small distances, as minor additions to the previously selected and clustered features, or repeat similar distances.
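A minimal sketch of FPS over feature columns follows; the choice of starting feature (largest norm) and of the Euclidean distance metric are conventional assumptions here, and the implementation of Ref. 68 may differ:

```python
import numpy as np

def fps_feature_order(X, n_select):
    """Farthest point sampling over feature columns of X
    (n_samples x n_features): start from the column with the largest
    norm, then iteratively pick the column farthest (in Euclidean
    distance) from all columns selected so far."""
    cols = X.T  # one row per feature
    first = int(np.argmax(np.linalg.norm(cols, axis=1)))
    selected = [first]
    dmin = np.linalg.norm(cols - cols[first], axis=1)
    for _ in range(n_select - 1):
        nxt = int(np.argmax(dmin))  # farthest from the selected set
        selected.append(nxt)
        dmin = np.minimum(dmin, np.linalg.norm(cols - cols[nxt], axis=1))
    return selected
```

Note that the selection criterion is purely distance-based, which is exactly why FPS cannot certify linear independence of the selected features.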
In Fig. 8, the FPS results for feature selection show that there are significant differences in error reduction, and hence in dimension compression, for the SOAP, SOAPlite, ACE and MBTR representations in comparison to our earlier CUR results. However, these FPS results do not give insight into each feature's contribution as a dimension of the hyper-dimensional space. The relatively small reduction of the errors in the FPS selection, and the constant regions, are due to its selection criterion, which is based only on hyper-spatial distance. This poses a limitation if one would like to find the full extent of the representation and remove non-informative features from the descriptors. Moreover, determining the cutoff for the features according to the error is not obvious, since there may not be a plateau in the error (as is seen for ACSF-X, where only 25% of features contribute) but instead a more gradual decrease, as in the results for all other representations.

VII. CONCLUSION
We have carried out a comprehensive assessment of the sensitivity and dimensionality of atomic environment representations, using several methods to analyse the sensitivity under rotation and various perturbations. Our results show that although many representations provide an overall acceptable sensitivity, there is still room to balance the sensitivities to radial and angular perturbations. A full understanding of local stability is essential for using these representations in ML regression models. We thus conclude that further investigation of how insensitivities affect applications of interatomic potentials, and hence observables in MD simulations, is necessary to improve ML-driven simulation approaches.
We also carried out dimensionality analysis of various representations, which has identified significant opportunities to eliminate unnecessary information that may reduce the accuracy of predictions from ML models. This is expected to become increasingly important as more complex representations are developed, and especially when incorporating property-based descriptors alongside atomic environment representations.

VIII. DATA AVAILABILITY
The data that support the findings of this study are openly available in github.com/DescriptorZoo/Materials-Datasets at http://doi.org/10.5281/zenodo.3871650 (Version v1.0), extracted from the open-access NOMAD Archive (http://nomadcoe.eu) 81 . The details of all datasets are given in Section IV. The corresponding citations for the other data used in this study are available from the following publications: the GAP Si potential database from Ref. 80, the Si molecular dynamics (MD) database from Ref. 50, and the TiO₂ dataset from Ref. 49 through web access from Ref. 82. The codes used to create the representations are detailed in Section III H.