A smooth basis for atomistic machine learning

Machine learning frameworks based on correlations of interatomic positions begin with a discretized description of the density of other atoms in the neighbourhood of each atom in the system. Symmetry considerations support the use of spherical harmonics to expand the angular dependence of this density, but there is as yet no clear rationale to choose one radial basis over another. Here we investigate the basis that results from the solution of the Laplacian eigenvalue problem within a sphere around the atom of interest. We show that this generates a basis of controllable smoothness within the sphere (in the same sense as plane waves provide a basis with controllable smoothness for a problem with periodic boundaries), and that a tensor product of Laplacian eigenstates also provides a smooth basis for expanding any higher-order correlation of the atomic density within the appropriate hypersphere. We consider several unsupervised metrics of the quality of a basis for a given dataset, and show that the Laplacian eigenstate basis has a performance that is much better than some widely used basis sets and competitive with data-driven bases that numerically optimize each metric. Finally, we investigate the role of the basis in building models of the potential energy. In these tests, we find that a combination of the Laplacian eigenstate basis and target-oriented heuristics leads to equal or improved regression performance when compared to both heuristic and data-driven bases in the literature. We conclude that the smoothness of the basis functions is a key aspect of successful atomic density representations.

I. INTRODUCTION
Machine learning (ML) has gained increasing importance in the field of atomistic modeling during the last decade. Successful applications have involved both supervised learning (often in the form of regression of atomic-scale properties) and unsupervised learning (e.g., clustering of large compound/structure databases). Most ML methods, whether supervised or unsupervised, rely on representations of the atomic structures of interest. These representations are constructed as a set of numerical descriptors (or features) which act as the inputs to the ML model. Although several classes of descriptors are now available, their performance is often limited by the choice of basis that is used to project the atomic density correlations 1 . As a result, the identification of a suitable basis is a key step in building more effective descriptors and, ultimately, ML models.
Desirable properties of atomic-scale descriptors include equivariance, uniqueness, interpretability, and low computational cost 2 . However, most of these are properties of broad classes of descriptors, which will not be our focus here. Instead we shall confine our attention to properties that are directly affected by the choice of basis, such as low dimensionality and high information content of the descriptor space, sensitivity to changes in the atomic configuration, and good performance on regression tasks. The spherical harmonics are (almost 3 ) always used as the angular basis for atomic density expansions 1,4-8 . This is because of their symmetry properties with respect to rotation and inversion, which make it possible to generate features with the desired equivariant behavior. However, the best choice of the radial basis is still an open question, with a variety of different radial bases currently in use in different packages for atomic-scale representations and machine learning 4,9-12 .
Radial bases have traditionally been regarded as a component of atomic density-based approaches with a very large design space, but not one with a significant effect on the quality of the descriptors. As a result, the radial basis functions have typically been chosen for their simplicity and/or computational efficiency. However, these considerations become irrelevant when one uses cubic splines to evaluate the radial basis functions, which can then all be computed equally efficiently 11 . This is true regardless of their complexity, and regardless of whether the basis is "mollified" as a result of Gaussian smearing of the atomic densities. The focus therefore shifts to the search for a radial basis with optimal properties, regardless of its complexity.
In the present paper, we shall show that the radial basis, and, perhaps even more importantly, how it is truncated, can have a significant impact on the quality of the resulting descriptors. Following a comparison with several other radial bases, ranging from traditional bases to more recent data-driven contractions 13 , we identify the Laplacian eigenstates (LE) within a sphere around each atom 14,15 as a particularly elegant basis that results in density expansion coefficients and structural descriptors exhibiting all of the desirable properties listed above. We then proceed, motivated by the recent focus on systematic body-ordered expansions 3,7,8 , to investigate the behavior of this and other bases in the context of the higher-order equivariant features that arise in the symmetry-adapted expansion of many-body density correlations. We find that the LE basis, and the truncation strategy we propose, can be successfully combined with several of the heuristic arguments that are commonly applied to the construction of machine-learning potentials 7,16-18 .

A. Density expansion and density correlations
Machine learning of properties of local atomic environments begins by constructing an equivariant discretization of the neighbor density around each atom i in a structure A,

ρ_i(x) = Σ_{j∈A} g(x − r_{ji}) ≈ Σ_{nlm} c^i_{nlm} ψ_{nlm}(x),   (1)

where r_{ji} = r_j − r_i is the vector between the central atom and its j-th neighbor, g(x) is a localized density function (usually a Dirac delta function or a Gaussian), the c^i_{nlm} are the discrete expansion coefficients for the i-th environment, and the ψ_{nlm}(x) are orthonormal basis functions. A number of different basis sets have been proposed for this purpose, the vast majority of which use spherical harmonics to represent the angular dependence of the density in conjunction with various radial basis functions R_{nl}(x):

ψ_{nlm}(x) = R_{nl}(x) Y_l^m(x̂).   (2)

We use real-valued spherical harmonics Y_l^m(x̂) so that the expansion coefficients c^i_{nlm} are real. The same expressions can be written in a form that borrows from the Dirac bra-ket notation of quantum mechanics,

⟨x|ρ_i⟩ = Σ_{j∈A} ⟨x|r_{ji}; g⟩ ≈ Σ_{nlm} ⟨x|nlm⟩ ⟨nlm|ρ_i⟩,   (3)

in which the terms are in one-to-one correspondence with those in Eq. (1). The expansion coefficients can be written as

c^i_{nlm} = ⟨nlm|ρ_i⟩ = Σ_{j∈A} ⟨nlm|r_{ji}; g⟩.   (4)

A pedagogic introduction to the use of this notation is given in Section 3.1 of Ref. 1. We will employ both the functional and the bra-ket notation, using the former when discussing specific basis functions and the latter when manipulating expansion coefficients in a way that does not depend on the choice of the basis. In particular, we will use bra-ket notation to express the equivariant coefficients that discretize the ν-point correlations of the neighbor density as ⟨q|ρ_i^{⊗ν}; σ; λµ⟩. This notation singles out the components of the ν-point density correlation ρ_i^{⊗ν} that acquire a phase of (−1)^λ σ under inversion and transform under rotation as Y_λ^µ. The indices in the ket thus identify the symmetry properties of the features, while the q in the bra is in general a composite index that we use to enumerate concisely all features associated with a given representation that share the same symmetry.
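To make Eq. (4) concrete, the following minimal sketch (in Python, with illustrative names of our own choosing) evaluates the expansion coefficients of a single environment for the simplest case of a Dirac-delta density g; any orthonormal radial basis, such as the LE functions defined in Appendix A, can be supplied through the generic radial_fn argument.

```python
import numpy as np
from scipy.special import sph_harm

def real_sph_harm(l, m, theta, phi):
    """Real spherical harmonics Y_l^m built from scipy's complex ones
    (scipy convention: sph_harm(m, l, azimuthal, polar))."""
    if m > 0:
        return np.sqrt(2.0) * (-1.0) ** m * sph_harm(m, l, phi, theta).real
    if m < 0:
        return np.sqrt(2.0) * (-1.0) ** m * sph_harm(-m, l, phi, theta).imag
    return sph_harm(0, l, phi, theta).real

def density_coefficients(neighbors, radial_fn, n_max, l_max):
    """Eq. (4) for a Dirac-delta g: c^i_nlm = sum_j R_nl(r_j) Y_l^m(r^_j).
    `neighbors` is an (n_neigh, 3) array of vectors r_ji, and
    `radial_fn(n, l, r)` is any orthonormal radial basis."""
    r = np.linalg.norm(neighbors, axis=1)
    theta = np.arccos(np.clip(neighbors[:, 2] / r, -1.0, 1.0))  # polar angle
    phi = np.arctan2(neighbors[:, 1], neighbors[:, 0])          # azimuth
    return {(n, l, m): float(np.sum(radial_fn(n, l, r) *
                                    real_sph_harm(l, m, theta, phi)))
            for n in range(n_max)
            for l in range(l_max + 1)
            for m in range(-l, l + 1)}
```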
As discussed in Ref. 8, these equivariant features can be built by starting from the density expansion coefficients that define the ν = 1 equivariants

⟨n|ρ_i^{⊗1}; 1; λµ⟩ = ⟨nλµ|ρ_i⟩   (5)

and iterating an angular-momentum sum rule

⟨q l_2; n l_1|ρ_i^{⊗(ν+1)}; σ; λµ⟩ = Σ_{m_1 m_2} ⟨n|ρ_i^{⊗1}; 1; l_1 m_1⟩ ⟨q|ρ_i^{⊗ν}; σ(−1)^{l_1+l_2+λ}; l_2 m_2⟩ ⟨l_1 m_1; l_2 m_2|λµ⟩,   (6)

which uses modified Clebsch-Gordan (CG) coefficients ⟨l_1 m_1; l_2 m_2|λµ⟩ that are appropriate for the combination of real spherical harmonics. The atom-centered density correlations (ACDC) that can be constructed from the ⟨q|ρ_i^{⊗ν}; σ; λµ⟩ encompass many of the representations that have been proposed over the past decade, including atom-centered symmetry functions 19 and the SOAP power spectrum 4 (both equivalent to ν = 2 invariants), the SOAP bispectrum 6 (equivalent to ν = 3 invariants), λ-SOAP features (ν = 2 equivariants), and more recent systematic expansions such as the atomic cluster expansion 7 , in which the invariant correlations are used as a basis to obtain interatomic potentials as linear fits. Moment tensor potentials 3 achieve a similar goal with a formalism based on Cartesian rather than spherical coordinates. Equivariant neural networks 20-25 can also be shown to be very closely related to this construction 26 . Given that the number of equivariant features generated by iterating Eq. (6) grows as the size of the density expansion basis to the power ν, 27 the choice of a concise yet informative discretization is crucial to keep the computational cost under control.
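As a minimal illustration of the CG iteration, for λ = µ = 0 a single application of Eq. (6) to two sets of ν = 1 coefficients reduces (up to a normalization that we omit) to the familiar power-spectrum invariants; the sketch below assumes the coefficient dictionary produced by the previous example.

```python
def power_spectrum(coeffs, n_max, l_max):
    """lambda = mu = 0 invariants from one application of Eq. (6): with
    real spherical harmonics the rotation-invariant combination of two
    sets of nu = 1 coefficients is p_{n1 n2 l} = sum_m c_{n1 l m} c_{n2 l m}
    (the SOAP power spectrum, up to a 1/sqrt(2l + 1) normalization)."""
    return {(n1, n2, l): sum(coeffs[(n1, l, m)] * coeffs[(n2, l, m)]
                             for m in range(-l, l + 1))
            for n1 in range(n_max)
            for n2 in range(n1, n_max)   # lexicographic n1 <= n2, cf. Ref. 8
            for l in range(l_max + 1)}
```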

B. Laplacian eigenstates
There are good reasons why the spherical harmonics are used to expand the angular dependence of the density. They are the basis functions of the irreducible representations of SO(3), and they are the eigenstates of the kinetic energy operator on the surface of a sphere (the Legendrian). Because they are irreducible representations of SO(3), the set of (l_max + 1)² spherical harmonics Y_l^m(x̂) with l ≤ l_max provides an equivariant representation of the density that treats all rotated densities on an equal footing, and because they are the Legendrian eigenstates with the smallest eigenvalues, this is the smoothest set of (l_max + 1)² orthonormal basis functions one can construct within a sphere, in the sense that they have the smallest mean square gradients over the sphere [see Eq. (10)].
All of the direct product bases of spherical harmonics and radial basis functions that have been proposed in the past lead to equivariant representations of the density. However, none of these previously proposed basis sets is explicitly built to optimize some quantitative measure of its smoothness. Optimum smoothness is achieved by choosing the combinations of spherical harmonics and radial basis functions to be the lowest energy eigenstates of the kinetic energy operator within the sphere. This ensures that the combined radial and angular basis respects the uniformity of three-dimensional space within the sphere in the same way as the spherical harmonic basis respects the isotropy of the two-dimensional surface of the sphere.
For a sphere of radius a, the resulting Laplacian eigenstate (LE) basis functions are the solutions with the lowest eigenvalues of the equation

−∇² ψ_{nlm}(x) = E_{nl} ψ_{nlm}(x),   (7)

subject to the boundary condition

ψ_{nlm}(x) = 0 for |x| = a,   (8)

and the orthonormality condition

∫_{|x|≤a} dx ψ_{n'l'm'}(x) ψ_{nlm}(x) = δ_{nn'} δ_{ll'} δ_{mm'}.   (9)

The solutions are given by Eq. (2), with radial functions R_{nl}(x) whose explicit form is given along with that of the eigenvalues E_{nl} in Appendix A.
The connection between small eigenvalues of −∇² and smooth eigenfunctions can be understood by considering the Dirichlet problem applied to the 3D sphere. As solutions of the more general Dirichlet problem in a region Ω of N-dimensional real space, the eigenstates of −∇² have the following property: each successive eigenstate u(x), starting from the one with the lowest eigenvalue, minimises the Rayleigh quotient

Q[u] = ∫_Ω |∇u(x)|² dx / ∫_Ω |u(x)|² dx,   (10)

subject to the constraint that u(x) is zero on the boundary of Ω and orthogonal to all eigenstates with smaller eigenvalues. If we take Q as a measure of smoothness, the LE basis truncated via the eigenvalue criterion is thus the smoothest possible basis within the region Ω, which in our case is the 3D sphere, and the eigenvalue E_{nl} quantifies the smoothness of each solution. Truncating the LE basis by retaining the functions with E_{nl} ≤ E_max is analogous to truncating the spherical harmonic basis by retaining the functions with l ≤ l_max, and it reduces the specification of the basis to a single parameter. The constraint that E_{nl} ≤ E_max implies that fewer spherical harmonics will be retained in the basis as n increases, so it will contain fewer functions than a direct product basis with the same values of n_max and l_max. This is illustrated in Fig. 1, which shows the regions of the (n, l) plane that are included in the LE bases with E_max = E_{n_max,0} for various values of n_max.
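The following sketch illustrates the LE construction and its eigenvalue truncation: the eigenvalues follow from the zeros z_{nl} of the spherical Bessel functions, E_{nl} = (z_{nl}/a)², with radial functions R_{nl}(x) ∝ j_l(z_{nl} x/a). The root-bracketing grid is a heuristic of our own, adequate for the modest values of n and l used here.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.special import spherical_jn

def spherical_bessel_zeros(l, n_zeros):
    """First n_zeros positive roots z_nl of j_l, located by scanning for
    sign changes on a fine grid and polishing each bracket with brentq."""
    x = np.linspace(max(l, 1e-3), (n_zeros + l + 10) * np.pi, 20000)
    f = spherical_jn(l, x)
    brackets = np.where(np.sign(f[:-1]) != np.sign(f[1:]))[0]
    return np.array([brentq(lambda t: spherical_jn(l, t), x[i], x[i + 1])
                     for i in brackets[:n_zeros]])

def le_selection(a, E_max, n_big=40, l_big=40):
    """(n, l) pairs retained by the eigenvalue criterion E_nl <= E_max,
    with E_nl = (z_nl / a)**2; the radial functions are then
    R_nl(x) ~ j_l(z_nl x / a), normalized over the sphere. Fewer radial
    functions survive as l grows, reproducing the regions of Fig. 1."""
    keep = []
    for l in range(l_big):
        E = (spherical_bessel_zeros(l, n_big) / a) ** 2
        if E[0] > E_max:
            break  # eigenvalues only increase with l
        keep.extend((n, l) for n, e in enumerate(E) if e <= E_max)
    return keep
```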

C. Basis-set smoothness
Smoothness is a desirable feature of machine-learning schemes because it results in a well-behaved interpolation of machine-learned properties to configurations that are not present in the training set. This is an established concept in the atomistic machine learning literature, where smoothness is often enforced through the regularizing effect of including penalty terms in the loss. Smoothing strategies range from the naive Gaussian prior of kernel ridge regression (KRR) to more sophisticated regularization schemes such as that of Ref. 28. The Gaussian smoothing of atomic densities first introduced in SOAP 5 can also be considered as a regularization. For models involving a basis set representation of the atom density, the regularity of the approximation also depends on the smoothness of the basis functions. Here we intend to show how using a quantifiable smoothness criterion to guide both the choice of the basis set and the truncation of higher-order density correlation features is beneficial to the performance of atomistic ML models.
The Rayleigh quotient in Eq. (10) is clearly just one of many possible measures of the regularity of a function: one could take its mean square curvature, its Lipschitz constant, etc. Perhaps the best way to motivate our choice is to note that the solutions of the analogous eigenvalue problem with periodic boundary conditions are plane waves. The notion that a truncated Fourier expansion is a natural way to enforce regularity is well established, and indeed in plane wave electronic structure codes it is customary to express the basis set truncation in terms of a kinetic energy cutoff. The LE construction yields an analogous basis that is adapted to the symmetry and the boundary conditions of the present problem. This idea has already been exploited in atomistic machine learning: Ref. 15 and the existing DimeNet framework 14 already use spherical Bessel functions to evaluate radial descriptors, the latter arguing that their bounded maximum frequencies confer both smoothness and stability on the resulting model predictions.

D. Unsupervised measures of basis-set quality
We shall now describe three different metrics that can be used to assess the quality of basis sets such as the LE basis introduced in Sec. II.B: a residual variance 13 metric ℓ_X, a residual Jacobian variance metric ℓ_J, and a Jacobian condition number 29 metric ℓ_CN. For the first two of these, it is possible to construct a basis of any desired size that, for a given dataset, is optimal with respect to the metric, and so provides a gold standard against which other bases can be compared. We shall explain how to do this in Sec. II.E. We do not yet know how to construct a finite basis that is optimal for the Jacobian condition number metric, but we do at least know the value that the metric takes in the limit of a complete basis 30 , so we can use that as the gold standard instead.

Residual variance
Since the expansion in Eq. (1) aims to discretize the i-centered neighbor density, a natural criterion for the quality of a basis is how well it minimizes the error in the approximation of ρ_i(x). For a given dataset, this can be defined as the normalised L² residual

ℓ_X = ( Σ_i ∫ dx [⟨x|ρ_i⟩ − Σ_q ⟨x|q⟩⟨q|ρ_i⟩]² ) / ( Σ_i ∫ dx ⟨x|ρ_i⟩² ),   (12)

where q is a shorthand for the nlm basis indices, and i runs over all atomic environments in the dataset. If the discrete basis is orthonormal, then one can compute the numerator in this expression as the difference between the mean L² norm of the neighbor density and the norm of the coefficients c^i_q = ⟨q|ρ_i⟩, or equivalently as the difference between the norm of the coefficients in a complete basis and those in a truncated discretization. The numerator in Eq. (12) is not exactly a difference between two variances, because the density and the coefficients are usually not centered to have zero mean over the data set, but we shall nevertheless call ℓ_X a residual variance metric following Ref. 13.
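For an orthonormal basis, ℓ_X can be computed directly from the expansion coefficients, as in this sketch; the (nearly) complete reference basis against which the truncation is measured is an assumption of the example.

```python
import numpy as np

def residual_variance(c_ref, c_trunc):
    """ell_X of Eq. (12) for orthonormal bases: rows index environments i,
    columns index basis functions. `c_ref` holds the coefficients in a
    (nearly) complete reference basis, `c_trunc` in the truncated one, so
    the lost variance is the difference of the squared coefficient norms."""
    total = np.sum(c_ref ** 2)
    return (total - np.sum(c_trunc ** 2)) / total
```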

Residual Jacobian variance
The Jacobian of the features quantifies their sensitivity to atomic displacements. It can be written in terms of the derivatives of the continuous atom density as

⟨x|∂_{jα}ρ_i⟩ = ∂⟨x|ρ_i⟩/∂r_{jα},   (13)

where α runs over the Cartesian axes, and in terms of discrete basis states as

⟨q|∂_{jα}ρ_i⟩ = ∂⟨q|ρ_i⟩/∂r_{jα}.   (14)

The Jacobian arises in the calculation of derivatives of ML models (e.g. forces), and it has been used in the past to assess the quality of atomistic representations -- both directly 30 and via the sensitivity matrix 2 J_i^T J_i. In the present context, it is natural to define a residual Jacobian variance metric

ℓ_J = ( Σ_{ijα} ∫ dx [⟨x|∂_{jα}ρ_i⟩ − Σ_q ⟨x|q⟩⟨q|∂_{jα}ρ_i⟩]² ) / ( Σ_{ijα} ∫ dx ⟨x|∂_{jα}ρ_i⟩² )   (15)

by analogy with Eq. (12), in which the numerator can again be written equivalently as the difference between the norm of the coefficients ⟨q|∂_{jα}ρ_i⟩ in a complete basis and those in a truncated discretization.

Jacobian condition number
We shall base our last metric on the condition number of the Jacobian, that is the ratio between the largest and smallest non-zero singular values of J_i. As discussed in Ref. 30, in the limit of a sharp Gaussian smearing σ and of a complete basis, all the non-zero singular values of J_i take the same value, because the elements of the sensitivity matrix satisfy

(J_i^T J_i)_{jα,j'α'} ∝ δ_{jj'} δ_{αα'},   (16)

and so the condition number tends to one. An appropriate Jacobian condition number metric is therefore

ℓ_CN = (1/n_train) Σ_i log_10 [ s_max(J_i) / s_min(J_i) ],   (17)

where n_train is the number of i-centered environments in the training set and s_max(J_i) and s_min(J_i) are the largest and smallest non-zero singular values of J_i (the square roots of the largest and smallest eigenvalues of J_i^T J_i). From a numerical perspective, condition numbers in the hundreds or thousands are perfectly manageable with floating point arithmetic. However, we would argue that it is beneficial to approach the limit ℓ_CN → 0, because this implies that the discretized expansion coefficients are equally sensitive to all types of distortions; i.e., that the basis set does not artificially make the descriptors more sensitive to some deformations than to others. Indeed, it has recently been shown that numerical instabilities that result in moderate values of the Jacobian condition number can substantially affect the performance of ML models 31,32 . For the ℓ_CN metric to be meaningful, there must be at least three times more functions in the basis set than the maximum number of neighbors of any atom i in the data set, so that all the sensitivity matrices J_i^T J_i have full rank. Furthermore, the cutoff that is applied in defining the density must be sharp (so that all retained neighbors j of atom i are fully inside the cutoff), and no feature tuning through a radial scaling of the neighbor contributions 17,33 can be applied. Incidentally, the fact that modulating the feature sensitivity in a physically-motivated manner has a significant positive impact on model performance reinforces the notion that it is better to avoid the unpredictable, noisy modulation of the sensitivity that is caused by the choice of a non-smooth basis.
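A sketch of how ℓ_CN might be evaluated in practice, assuming the per-environment Jacobians have already been assembled with features as rows and the 3n_neigh atomic displacements as columns:

```python
import numpy as np

def jacobian_cn_metric(jacobians, rtol=1e-10):
    """ell_CN of Eq. (17): average log10 of the ratio between the largest
    and smallest non-zero singular values of each environment Jacobian
    J_i (shaped n_features x 3 n_neighbors, with elements as in Eq. (14))."""
    score = 0.0
    for J in jacobians:
        s = np.linalg.svd(J, compute_uv=False)
        s = s[s > rtol * s[0]]            # drop numerically zero values
        score += np.log10(s[0] / s[-1])
    return score / len(jacobians)
```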

E. Metric-optimized bases
The logical next step after having defined metrics to assess the quality of a basis is to look for a construction that yields a basis that is optimal with respect to each metric for a given data set. For the residual variance metric ℓ X , this step has already been taken in Ref. 13, so here we only sketch the main ideas.
Starting from a basis set expansion in a large set of primitive basis functions {|p⟩}, one considers an orthogonal contraction of the expansion coefficients

⟨q|ρ_i⟩ = Σ_p U_{pq} ⟨p|ρ_i⟩,   (18)

in which the orthogonal matrix U is chosen to diagonalize the real symmetric "covariance" matrix 13

C_{pp'} = Σ_i ⟨p|ρ_i⟩ ⟨p'|ρ_i⟩,   (19)

such that

Σ_{pp'} U_{pq} C_{pp'} U_{p'q'} = γ_q δ_{qq'}.   (20)

Then since

Σ_i ⟨q|ρ_i⟩² = γ_q,   (21)

it is clear that the functions |q⟩ with the largest eigenvalues γ_q provide an optimum contracted basis for the minimization of ℓ_X in Eq. (12).
Symmetry considerations imply that for a randomly-oriented data set the cross-correlations between features with different SO(3) character must be zero, and so in practice one computes separate blocks of the covariance matrix for each equivariant component,

C^λ_{nn'} = Σ_i Σ_µ ⟨nλµ|ρ_i⟩ ⟨n'λµ|ρ_i⟩.   (22)

As discussed in Ref. 13, it is possible to compute explicitly the radial functions associated with the optimal combinations, and to evaluate the expansion coefficients by approximating them with splines. Since the number of spherical harmonics increases with increasing λ, we retain the covariance eigenstates with the largest values of γ_q/(2λ + 1) so as to achieve the most efficient contraction to a given number of basis functions. Since it gives the optimum value of the metric ℓ_X for a given dataset, we shall refer to the resulting basis as the X-OPT basis set. An optimal contracted basis for the Jacobian variance metric ℓ_J in Eq. (15) can be constructed in the same way, by computing the Jacobian covariance matrix in the primitive basis,

C^{J,λ}_{nn'} = Σ_i Σ_{jα} Σ_µ ⟨nλµ|∂_{jα}ρ_i⟩ ⟨n'λµ|∂_{jα}ρ_i⟩,   (23)

diagonalizing it, and using the eigenvectors with the largest eigenvalues when weighted by 1/(2λ + 1) to construct the contracted basis functions. We shall refer to the resulting basis as the J-OPT basis set. Both this and the X-OPT basis can be related to the maximization of a Rayleigh quotient akin to Eq. (10), as we discuss in Appendix B and in the SI.
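A compact sketch of this construction (the data layout is our own illustrative choice, with the µ components of each λ block stacked as extra samples):

```python
import numpy as np

def xopt_contract(coeffs_by_lambda, n_keep):
    """Sketch of the X-OPT construction: for each angular block lambda,
    diagonalize the covariance (Eq. (22)) of the primitive coefficients
    and rank all eigenvectors by gamma_q / (2 lambda + 1), keeping the
    n_keep best. `coeffs_by_lambda[lam]` has shape
    (n_envs * (2 lam + 1), n_primitive)."""
    pool = []
    for lam, c in coeffs_by_lambda.items():
        gamma, U = np.linalg.eigh(c.T @ c)        # ascending eigenvalues
        for g, u in zip(gamma[::-1], U[:, ::-1].T):
            pool.append((g / (2 * lam + 1), lam, u))
    pool.sort(key=lambda item: -item[0])
    return pool[:n_keep]   # (score, lambda, contraction vector) triples
```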

F. Higher correlation orders
Our discussion so far has focused on the discretization of the neighbor density, which when expressed using spherical harmonics for the angular part leads to the first-order atom-centered equivariant correlations ⟨n|ρ ⊗1 i ; 1; λµ⟩. High-accuracy models, however, require higher orders of ACDCs, either built explicitly or through an equivariant network architecture. One should therefore consider how the above ideas generalize to high-order equivariants.

Laplacian eigenstates
The solution of the Laplacian eigenvalue problem in Eq. (7) generates the smoothest possible basis in the 3D sphere Ω. More generally, an analogous Dirichlet problem can be solved in the Ω^ν hypersphere, where the density ν-correlation ρ_i^{⊗ν} is defined. The associated Laplacian eigenvalue equation is

−Σ_{k=1}^{ν} ∇²_{x_k} Ψ({x_k}) = E Ψ({x_k}).   (24)

Since this is separable into ν distinct eigenvalue equations, one for each 3D sphere which constitutes the tensor product, the eigenstates are

Ψ_{n_1l_1m_1 ⋯ n_νl_νm_ν}({x_k}) = ∏_{k=1}^{ν} ψ_{n_k l_k m_k}(x_k),   (25)

and the eigenvalues are

E = Σ_{k=1}^{ν} E_{n_k l_k},   (26)

where ψ_{nlm}(x) and E_{nl} are the eigenstates and eigenvalues of Eq. (7). Furthermore, when a tensor product of atomic densities ρ_i^{⊗ν}({x_k}) is expanded in this basis, the resulting integral is also separable, giving

∫_{Ω^ν} dx_1 ⋯ dx_ν [∏_{k=1}^{ν} ψ_{n_k l_k m_k}(x_k)] ρ_i^{⊗ν}({x_k}) = ∏_{k=1}^{ν} c^i_{n_k l_k m_k},   (27)

where the c^i_{nlm} = ⟨nlm|ρ_i⟩ are the equivariant ρ_i^{⊗1} density expansion coefficients we have considered above.
The product of these separable coefficients is an order-ν equivariant descriptor in the uncoupled angular momentum basis,

∏_{k=1}^{ν} c^i_{n_k l_k m_k} ≡ ⟨n_1 l_1 m_1; … ; n_ν l_ν m_ν|ρ_i^{⊗ν}⟩.   (28)
As discussed in Ref. 8, these can be transformed into the irreducible coupled form ⟨q|ρ_i^{⊗ν}; σ; λµ⟩, which can also be constructed iteratively using Eq. (6). Considering the iterative construction, it is easy to see that at each order the coupling combines the projections of degenerate LE functions with the same values of n and l, and so the irreducible SO(3) equivariant basis states are also eigenstates of the Laplacian in Ω^ν.
It follows that an ACDC construction based on LE functions will generate the smoothest possible (in the sense of a multi-dimensional generalization of Eq. (10)) equivariant basis in which to expand the ν-neighbor density correlations, provided the expansion is truncated appropriately. In particular, smoothness will be achieved at each order ν by retaining all equivariants whose Ω^ν Laplacian eigenvalues lie below some threshold E_max(ν):

Σ_{k=1}^{ν} E_{n_k l_k} ≤ E_max(ν).   (29)

The choice E_max(ν) = E_max(1) + (ν − 1)E_{1,0}, where E_{1,0} is the smallest of the eigenvalues E_{nl}, is especially convenient because it ensures that every LE basis function with E_{nl} ≤ E_max(1) is included in at least one ν-neighbor equivariant, and that no LE basis function with E_{nl} > E_max(1) is ever used. Fig. 2 shows that the application of this threshold at each iteration reduces by at least an order of magnitude the number of equivariants relative to the number obtained by computing all possible combinations of the ν = 1 features with E_{nl} ≤ E_max(1).

[Fig. 2: Number of equivariant features retained at each correlation order. Dashed lines correspond to the features retained by applying the threshold E_max(ν) = E_max(1) + (ν − 1)E_{1,0} at each iteration, with E_max(1) = E_{n_max,0}. Dotted lines correspond to taking the outer product density basis for ν = 1, and the outer product of the lower-order LE features for ν = 2, 3.]

An additional reduction in the number of features can be achieved by applying selection rules that discard linearly dependent terms, such as considering only lexicographically-sorted (n, l) terms 8 . However, this still does not eliminate the exponential scaling of the number of features with ν. An increasingly aggressive truncation might therefore be preferable for larger ν, especially in situations where higher body-order terms are only expected to make a small contribution to the structure-property relations that are the goal of the calculation.
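For ν = 2, the threshold of Eq. (29) reduces to a simple pair selection, sketched below with the E_max(2) = E_max(1) + E_{1,0} choice discussed above; the dictionary layout is our own.

```python
def le_pairs(E_nl, E_max1, E_min):
    """nu = 2 selection by Eq. (29), with E_max(2) = E_max(1) + E_{1,0}:
    keep lexicographically sorted pairs of (n, l) functions whose LE
    eigenvalues sum below the threshold. `E_nl` maps (n, l) -> E_nl for
    all nu = 1 functions with E_nl <= E_max(1); E_min = E_{1,0}."""
    E_max2 = E_max1 + E_min
    keys = sorted(E_nl)
    return [(k1, k2)
            for a, k1 in enumerate(keys)
            for k2 in keys[a:]
            if E_nl[k1] + E_nl[k2] <= E_max2]
```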

Residual variance
The selection of high-ν equivariants on the basis of their Ω^ν Laplacian eigenvalues is clearly only a viable option for the LE basis: it cannot be done for the metric-optimized X-OPT and J-OPT basis sets introduced in Sec. II.E. There are, however, alternative ways to select the high-ν features for these basis sets, 8,34,35 as we shall now briefly describe for the case of the X-OPT (minimal residual variance) basis.
The basic idea is to retain as much variance as possible in the high-ν density expansion coefficients, just as the X-OPT basis retains maximum variance in the ν = 1 coefficients. The limit on the amount of variance that can be retained with a finite basis is determined by the CG iteration in Eq. (6), which implies that the square modulus of each combination of features is the product of the square moduli of the lower-order features being combined: 13

Σ_{σλµ} ⟨q l_2; n l_1|ρ_i^{⊗(ν+1)}; σ; λµ⟩² = Σ_{m_1} ⟨n|ρ_i^{⊗1}; 1; l_1 m_1⟩² Σ_{m_2} ⟨q|ρ_i^{⊗ν}; σ'; l_2 m_2⟩²,   (30)

where σ' is the parity of the lower-order feature being combined. Clearly, therefore, the fraction of the variance that can be retained in the high-ν equivariants of a finite nlm basis decreases exponentially with increasing ν. The goal is nevertheless to retain as much of this limiting variance as possible, while at the same time mitigating the exponential increase in the number of equivariants with ν.
The N-body iterative contraction of equivariants (NICE) scheme 8 does this by performing a principal component analysis (PCA), and then evaluating the combinations with the highest variance so as to yield the optimal features for a given size of the pool of ACDC equivariants. However, this a posteriori contraction scheme requires the evaluation of a large number of features before applying the projection step, which can become quite expensive. We have therefore also tested a cheaper iterative variance optimization (IVO) scheme that pre-selects a smaller number of high-order features, and functions in a similar way to the iterative CUR selection of Refs. 34,35. This IVO scheme is described in Appendix C, and its high-ν equivariants are compared with those of the PCA scheme for the X-OPT basis in our benchmark calculations in Sec. III.

Jacobian condition number
Consider the Jacobian J_i computed for an equivariant of body order ν, with elements J_{qσλµ,jα} = ⟨q|∂_{jα}ρ_i^{⊗ν}; σ; λµ⟩. As we show in Appendix D, in the sharp Gaussian smoothing and complete basis set limit, J_i^T J_i can be evaluated in closed form, yielding

Σ_{qσλµ} ⟨q|∂_{jα}ρ_i^{⊗ν}; σ; λµ⟩ ⟨q|∂_{j'α'}ρ_i^{⊗ν}; σ; λµ⟩ ∝ δ_{jj'} δ_{αα'},   (31)

which again implies a condition number of one. [Note that this expression involves summing over multiple equivariant channels, and that in general invariant features cannot have uniform sensitivity: for high-symmetry structures (and some non-symmetric ones, in the case of degenerate low-order descriptors) ⟨q|∂_{jα}ρ_i^{⊗ν}; 1; 00⟩ is strictly singular. 30 ]

III. BENCHMARKS
We shall now compare the performance of the LE basis with that of the data-driven X-OPT and J-OPT bases and the Gaussian-type orbital (GTO) basis presented in Ref. 11. To do this we shall use the three metrics introduced in Section II D, as well as regression exercises involving DFT potential energies as the supervised target.

A. Chemical datasets and parameters
We have performed benchmark calculations on the AIRSS 36 carbon dataset 37 and the random methane dataset 38 (using carbon-centered environments only). The random selection of carbon-centered environments in the methane dataset provides a highly uniform distribution of neighbors, whereas the AIRSS carbon structures are more representative of the sort of systems for which potential energies are typically learned. A subset of the 10,000 lowest-energy structures in the random methane dataset was also employed to illustrate how a data-driven basis can be optimized for predominantly tetrahedral structures. A tetrahedral order parameter 39 is used in the SI to establish the tetrahedral character of the low-energy methane structures.
The cutoff radius r_cut was set to 3.5 Å for the random and low-energy methane datasets, and to 5.0 Å for the condensed phase carbon dataset so as to include more coordination shells. Gaussian-smoothed densities were used, with a Gaussian smoothing parameter of σ = 0.2 Å. Since all the C-H distances in the methane dataset are below r_max = 3 Å, the hydrogen density with this smoothing parameter is essentially zero at r_cut = 3.5 Å, so we used a radius of a = r_cut when defining the LE basis functions. For the learning exercises on the condensed phase carbon dataset, we also used a = r_cut, but introduced a cutoff function to damp the density to zero at this radius. The precise form of the cutoff function is given along with the other details of the LE expansion in Appendix A.
The LE expansion was truncated by retaining basis functions with E nl ≤ E max as described in Sec. II.B. The more traditional GTO and "rectangular" X-OPT bases, which correspond to using rectangular cutoffs in Fig. 1, were truncated via three representative (n max , l max ) combinations (small (6,4), medium (8,6), and large (12,9)), and they are indicated by square markers in our figures.
In what follows we report only selected results that highlight the most important observations for each type of benchmark. A more comprehensive series of plots, together with a more detailed discussion, can be found in the SI.
B. Density expansion coefficients

Residual variance
The residual variance tests conducted on methane are shown in Fig. 3. The X-OPT basis (unsurprisingly) performs best for this metric, while the rectangular bases generally suffer from the exclusion of high-l channels. The LE basis is competitive with all the others, and, given an X-OPT basis of a certain size, it only takes a slightly larger LE basis to match it in terms of captured variance. The benefits of using a data-driven basis are more pronounced for the low-energy methane configurations, in which the H atoms are non-uniformly distributed around the central carbon.

[Fig. 3: Residual variance tests for methane. The curves are the results obtained with the X-OPT, LE, and J-OPT basis sets truncated to the specified number of (n, l) basis functions (or expansion coefficients) as described in the text. The square markers correspond to full outer-product X-OPT and GTO basis sets with (n_max, l_max) = (6, 4), (8, 6) and (12, 9).]

Jacobian variance
The Jacobian variance tests for methane shown in Fig. 4 confirm that the J-OPT basis is optimum for this alternative metric. Once again, the (n_max, l_max)-truncated bases suffer from the lack of high-l coefficients. The LE basis is competitive with the data-adapted bases for random methane, although noticeably less so for the heavily optimizable low-energy methane structures.
Comparing Figs. 3 and 4 reveals the trade-off between optimizing the variance and the Jacobian variance: although one has to choose one target property to optimize, the resulting basis set also performs well for the other task. This is especially true of the X-OPT basis, which does remarkably well on the Jacobian variance test. Another key observation is that tuning the number of radial functions within each l channel results in a systematically more efficient encoding of information than is present in an outer-product basis of comparable size. This is particularly evident when comparing the two truncation methods for the X-OPT basis. When using the same number of radial functions for each angular momentum channel, the improvement over a primitive GTO basis is barely noticeable, whereas using an adaptive truncation of the radial basis yields an improvement of a factor between 2 and 5 in both ℓ_X and ℓ_J.

Jacobian condition number
The Jacobian condition number (CN) tests are shown in Fig. 5. In contrast to the variance and Jacobian variance tests, these CN tests show the LE basis outperforming all of the other basis sets we have considered. Because it is smooth in the entire atomic neighborhood, even a small LE basis yields a uniform approximation to the atom density, and therefore small anisotropy in the sensitivity of the features to structural deformations, even though it provides a worse L² approximation to the Jacobian (see Fig. 4). The X-OPT basis has a comparable CN to the LE basis for the random methane data set, but a much higher CN for the more optimizable low-energy methane subset. This behaviour can be understood by noting that adapting the basis to the data set makes it less uniform, leading to a more anisotropic response to deformations. A similar reasoning explains why the J-OPT basis, which provides a better approximation to the density derivatives, yields an even higher CN (and therefore a worse result still in this respect) than the X-OPT basis.

C. Equivariant ν = 2 features
The density expansion coefficients (ν = 1 equivariant features) are the starting point of all density-correlation descriptors. In order to build accurate ML models, however, it is almost mandatory to increase the correlation order beyond ν = 1. To investigate the impact of the body-order iteration on the different metrics we have introduced, we have repeated the residual variance and Jacobian condition number tests using ν = 2 equivariants. For this we have confined our attention to the LE and X-OPT bases, because they perform well on the other tests and their associated truncation criteria readily generalize to higher body-order features.

Residual variance
The residual variances of the LE and X-OPT ν = 2 equivariant features for the random and low-energy methane structures are shown in Fig. 6. The PCA contraction retains more variance than the LE truncation, as is obvious from its construction. Contrary to the ν = 1 case, however, the retained variance gap is large, especially for the low-energy methane structures, indicating that the ν = 2 equivariants are significantly more compressible than the ν = 1 density expansion coefficients.

[Fig. 6: Residual variance of the ν = 2 equivariants. The LE-LE data corresponds to density correlations built on the LE basis, and truncated based on the LE eigenvalues. The X-OPT-PCA and X-OPT-IVO data correspond to a variance-optimal density expansion, with correlations further contracted using PCA or selected using the IVO scheme.]
From a numerical point of view, this comparison is not entirely fair, because while the X-OPT basis can be evaluated at no additional cost 13 compared to the LE basis, the PCA contraction of higher-order representations requires the evaluation of all features before contracting them. This makes it more expensive than using Eq. (29) to pre-select the relevant higher-order equivariant features for the LE basis, as we have done in computing the LE-LE curves in Fig. 6. The IVO scheme, which is based on the selection of the highest variance features, is more closely comparable: after having identified the optimal components, it only requires evaluating those that have been chosen. The resulting IVO equivariants in Fig. 6 have slightly higher residual variance than those from PCA, but they are still significantly lower than those from LE-LE.

Jacobian condition number
The Jacobian condition numbers of the ν = 2 equivariants from the X-OPT and LE methods are shown in Fig. 7. As in the ν = 1 case, the LE equivariants give the smallest CN, for both the random and the low-energy methane structures. The CN of the X-OPT equivariants increases significantly on going from the random to the low-energy structures, regardless of the feature reduction method employed (PCA contraction or IVO selection). This reinforces the notion that adapting the basis to focus on the correlations that are most common in a given data set increases -rather than decreases -the anisotropy of the Jacobian, at least in the small basis set limit. As more ν = 2 equivariants are included in the X-OPT calculation, the Jacobian CN decreases, and by the time there are 10^5 it appears to be approaching that of the LE calculation.

D. Learning from invariant features
While the metrics discussed above provide an "unsupervised" estimate of the quality of the basis set used for the density expansion, the ultimate goal of most ML models is to achieve accurate predictions for a target property. We take as a paradigmatic example the construction of models of the interatomic potential, using only energies for training, comparing the performance of the different radial basis sets we have considered. As in the previous section, we show results for some representative cases, and refer the reader to the SI for further examples and a more comprehensive discussion.
1. Invariants with ν ≤ 2

Fig. 8 compares the performance of (up to) ν = 2 invariants built from different bases as inputs to linear, kernel, and neural network (NN) models. Linear-based learning saturates quickly due to its inability to account for high body-order interactions, while kernel and NN-based models do not suffer from this issue. The relative performance of different bases is similar across different machine learning models, highlighting a high degree of "orthogonality" between the choice of basis and that of the fitting model, at least for the datasets we have considered.
There is comparatively little to choose between the X-OPT, GTO, and LE results in Fig. 8 for either methane dataset -random methane or low-energy methane -regardless of whether the potential is learned from a linear fit, with a non-linear kernel, or with a neural network. The only real outlier is the small X-OPT basis, which performs particularly poorly for the random methane dataset. The "optimization" of this basis to fit the density more accurately than a LE basis of comparable size clearly does not help with this learning exercise. Indeed, while the differences are often small, the LE results are slightly better than those of the other two basis sets in all 12 of the methane panels in Fig. 8.
The results of the carbon tests in Fig. 8 are another matter. This is a simpler problem that can be solved to "chemical accuracy" (a root mean square error of less than 1 kcal/mol) using any of the three basis sets, provided either a non-linear kernel or a neural network is used to extract higher body-order contributions from the invariant features with ν ≤ 2.
Once this is done, the results for all three basis sets are essentially the same, provided an appropriate cutoff function is applied to the density to make it go smoothly to zero at r_cut. [As we have done in these tests: for the LE basis, we used the cutoff functions f_1(x) and f_2(r_ij) defined in Appendix A. These were also used when constructing the primitive LE basis that was contracted to produce the X-OPT basis functions. For the GTO basis, f_1(x) is not needed, so we simply used f_2(r_ij).] The use of an aggressive cutoff function that emphasizes the short-range interactions that are dominant for this problem is essential to achieve the good performance seen in Fig. 8. As has been noted previously 7,17,33 , this type of feature engineering, which incorporates prior knowledge on the target function, is very effective in improving the performance of a ML model. On the other hand, it appears that in this case the choice of basis set plays only a very limited role. The most accurate results for the carbon dataset in Fig. 8 are seen for the X-OPT basis, which is optimized to fit the radially-weighted density once the cutoff function has been introduced. However, these results are only marginally better than those obtained with the LE basis, which is not optimized to fit the dataset at all.

2. Invariants with ν ≤ 3
Given the saturation of the linear learning curves in Fig. 8, the linear learning exercises for the methane datasets were repeated using (up to) ν = 3 invariants. The results of these tests are shown in Fig. 9, where it is clear that, when enough ν = 3 invariants are included, the curves do not saturate within the range of training set sizes considered. Note also that the use of the LE basis, together with the LE selection scheme for high-order equivariants (LE-LE), affords the lowest RMSEs in these tests for both methane datasets, by a substantial margin. This suggests that the optimum smoothness of the LE basis, which extends in a natural way to the smoothness of the higher body-order features obtained from the LE-LE scheme as described in Sec. II.F, is an important ingredient to successfully learn potential energies from high-ν invariants. Since this seems to be the direction of travel of much of the most recent work on machine learning potentials, 3,7,8 it could well be the most significant result in this paper, especially since the data-driven and ostensibly "optimized" X-OPT-IVO and X-OPT-PCA results in Fig. 9 are so poor by comparison.
It is important to keep in mind here, though, that the methane dataset, particularly when learning with only C-centered descriptors, is particularly challenging, and heavily dependent on the convergence and resolution of the basis set. Even though the advantages of the LE-LE scheme for other regression tasks, such as learning with both C- and H-centered descriptors, are not quite so pronounced, they are nonetheless significant (see the next Section, and especially Fig. 10). In particular, it is remarkable that a smoothness criterion that leads to a very high loss of information content in terms of ℓ_X (see Fig. 6) proves so effective when it comes to regression performance. This suggests that a complete description of the ν-order density correlations might not be necessary to converge ML models based on them, and that actually seeking the most complete description possible (as is done in X-OPT-PCA) might not be the best approach after all.

3. Application to body-ordered expansions
The LE-LE idea presented in the previous section is easily extensible to body-ordered expansion frameworks 7,8 . We will now show how the LE basis improves over previous basis functions when fitting interatomic potentials for the molecules in the rMD17 dataset. 40 Given that we will aim to compare with existing results based on the ACE scheme, 41 we will incorporate some of the heuristics that have been used in previous ACE work. Thus, we refer to the resulting model as LE-ACE.
Before proceeding to this comparison, we note that one of the challenges when benchmarking linear models is that validation errors depend strongly on regularization, and in particular they depend in a non-monotonic way on the size of the feature vector. For example, when using a small training set one often observes that aggressive truncation results in more transferable models. In the LE-LE context, this means that the E_max(ν) values should be optimized as a function of n_train. This follows because the maximum frequency component of the learned potential at each body order depends on E_max(ν). As more samples are included in the training set, the configuration space of the system is sampled more densely, and therefore higher-frequency components of the potential can be learned. Including basis functions whose frequency is too high leads to overfitting, whereas if E_max(ν) is smaller than optimal there will be underfitting.
In order to estimate how E_max(ν) should vary as a function of n_train, we note that E_max(ν) is proportional to the square of the maximum sampled frequency ω in the Ω^ν hypersphere. Since ω is proportional to the inverse of the typical distance d between configurations in the Ω^ν space, we have E_max(ν) ∝ ω² ∝ 1/d². If we consider one training point to represent one point in the Ω^ν hypersphere, the typical distance d scales as (V_ν/n_train)^{1/(3ν)}, where V_ν is the volume of Ω^ν. Hence, E_max(ν) ∝ n_train^{2/(3ν)}. We can illustrate this idea for the random methane dataset by extending the training set beyond the 10,000 configurations we have used in most of the present work. As shown in Fig. 10, the proposed scaling of E_max(ν) gives rise to a learning curve that decays linearly on a log-log scale. In contrast, all other learning curves are non-optimal. The ν = 3 NICE curve 8 shows limited accuracy and saturation due to its inability to describe 5-body interactions in the gas-phase methane molecules. The ν = 4 NICE and non-optimal LE-ACE curves show signs of saturation (underfitting) in the data-rich regime, as well as overfitting in the data-poor regime, due to their sub-optimal resolution. By contrast, the newly proposed E_max(ν) ∝ n_train^{2/(3ν)} truncation provides a good fit regardless of the amount of training data, suggesting that our frequency-based argument provides a valid heuristic to adapt the resolution of a linear model to the amount of data available.
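A sketch of this schedule, with the reference cutoffs and reference training-set size treated as tunable assumptions of the example:

```python
def e_max_schedule(E_max_ref, n_ref, n_train):
    """Scaling heuristic of this section: E_max(nu) ~ n_train^(2 / (3 nu)),
    anchored at a reference training-set size n_ref where the cutoffs
    E_max_ref[nu] were tuned (both reference values are assumptions)."""
    return {nu: E * (n_train / n_ref) ** (2.0 / (3.0 * nu))
            for nu, E in E_max_ref.items()}
```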
We will now compare the performance of this new model with that of a standard ACE implementation on the rMD17 dataset 40 , which is widely used as a benchmark, and allows us to compare LE-ACE with several recently-proposed models. This is however a somewhat problematic data set, as it contains highly-correlated structures generated from short, low-temperature molecular dynamics trajectories: as noted in Ref. 42, the most instructive comparisons can be obtained in the data-poor regime. Further details regarding our implementation and additional benchmarks are provided in Appendix E and Section 4 of the SI. We should emphasize here that, for fairness of comparison, we have incorporated a heuristic optimization in the form of a non-linear transformation of the radial coordinate [see Eq. (E1)], as is almost universally done when learning interatomic potentials in ACE 7,12,41 . Our LE-ACE results for the rMD17 dataset are shown in Table I, where they are compared with the results of a standard ACE implementation 41 and those of two state-of-the-art equivariant message-passing neural networks (NequIP 24 and MACE 42 ). While there might be other minor differences between the two implementations, it can be seen that the use of the LE basis allows for a significant and systematic improvement in accuracy over the basis functions used in ACE. In some cases, the LE-ACE model is also competitive with NequIP and MACE. It is reasonable to assume that the incorporation of the ideas that underlie the LE basis into NequIP and MACE would lead to similar improvements in the message-passing context to the improvements LE-ACE offers over ACE.

IV. CONCLUSIONS
In this paper we have investigated the use of the LE basis for the expansion of local atomic densities, explained why this basis, when truncated with the eigenvalue criterion E_nl ≤ E_max, provides the smoothest possible basis of a given size within a sphere Ω in the sense of minimising the Rayleigh quotient in Eq. (10), and described how this property extends in a natural way to the representation of density correlations of arbitrary order ν within the relevant hyperspheres Ω^ν. Several of these developments have precedents in the recent atomistic machine learning literature. For example, the truncation criterion E_nl ≤ E_max leads to the retention of different numbers of radial basis functions in each angular momentum channel l, as is increasingly done when using atomic cluster expansion (ACE) descriptors. 12,43 The key difference is that the E_nl ≤ E_max criterion tells us exactly how many radial basis functions to retain for each value of l, without the need to introduce any further parameters. We have also described how an analogous criterion can be used to truncate higher body-order features of the LE basis [see Eq. (29)], and provided a scaling argument that shows how to adapt the parameters in this more general truncation scheme to the training set size [E_max(ν) ∝ n_train^{2/(3ν)}; see Sec. III.D.3].

We began by arguing that a smooth basis is desirable for machine learning because it implies a smooth interpolation (and possibly extrapolation) of machine-learned properties to configurations that are not present in the training set. Our learning results in Figs. 8-10 certainly support this argument. In all cases, the LE basis gives results that are either comparable to or better than those obtained with data-driven basis sets obtained by optimizing the L² fit to the density, which we find results in a less uniform sensitivity of the density coefficients to atomic displacements.
The advantage of the LE basis is especially apparent in the learning results in Fig. 9, in which the ν ≤ 3 invariants are built from ν ≤ 2 equivariants that benefit from the smoothness of the LE representation of the density correlations in Ω 2 . The fact that this advantage is obtained when the LE ν = 2 equivariants only capture a small fraction of the variance in the ν = 2 density correlations present in the dataset (Fig. 6) is especially interesting, as it suggests that a high resolution representation of the density correlations may not be needed to learn the potential energy.
Our learning results are anti-correlated with the results of our residual variance (ℓ_X) tests in Figs. 3 and 6, but well correlated with the results of our Jacobian condition number (ℓ_CN) tests in Figs. 5 and 7. While this does not imply that the Jacobian condition number is the ultimate indicator of basis set quality, we do believe that it points to the importance of achieving a discretization of the density that has "uniform resolution", and thereby does not artificially distort the sensitivity of the features to atomic deformations.
A "signal-processing" view, in which features need to provide a basis that fully describes the density correlations, implies an exponential increase in the number of features with body order ν. If instead it suffices to retain enough features to guarantee that the Jacobian is full rank and well-conditioned for every configuration, then one may hope that a number of features proportional to the maximum number of neighbours within r cut would suffice. Our learning results in Figs. 8-10 suggest that, when it comes to building interatomic potentials, a basis and a truncation strategy that are optimal in a signal-processing sense are less effective than a basis and a truncation strategy aiming for "optimal smoothness", which is the approach we have taken in this paper.

SUPPLEMENTAL INFORMATION
See the supplemental information for a more comprehensive set of benchmarks and their analysis, and for a detailed account of the Rayleigh-quotient formulation of the optimized bases.

AUTHOR DECLARATIONS
The authors have no conflicts of interest to disclose.

DATA AVAILABILITY
The data that support the findings of this study are available in the paper and the supplementary material. The datasets we use have been published elsewhere and are available from Refs. 44-46.

Appendix A: Details of the LE expansion

The cutoff functions f_1(x) and f_2(r_ji), which we have used in our calculations on the carbon dataset, damp each neighbor contribution to the density smoothly to zero at the cutoff, and define the modified Gaussian smearing function g̃ of Eq. (A4). This modified smearing function can be evaluated in the LE basis as

⟨nlm|r_ji; g̃⟩ = ⟨nl|r_ji; g̃⟩ ⟨lm|r̂_ji⟩.   (A6)

The radial integrals are given by

⟨nl|r_ji; g̃⟩ ∝ ∫_0^a dx x² R_nl(x) exp[−(x² + r_ji²)/2σ²] i_l(x r_ji/σ²),   (A7)

where i_l(x) is a modified spherical Bessel function. These radial integrals can be pre-computed on a grid of r values and fit to cubic splines, making the sum over j in Eq. (4) cheaper to evaluate for each new arrangement of the atoms. A computer program that implements these equations is provided in the supplementary material.
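A sketch of the spline pre-computation described above, assuming a vectorized routine for the radial integrals of Eq. (A7) is available:

```python
import numpy as np
from scipy.interpolate import CubicSpline

def tabulate_radial_integrals(radial_integral, r_cut, nl_list, n_grid=600):
    """Pre-compute the radial integrals of Eq. (A7) on a grid of neighbor
    distances and fit cubic splines, so that the sum over neighbors in
    Eq. (4) needs only a spline lookup per retained (n, l) pair.
    `radial_integral(n, l, r)` is assumed to be vectorized in r."""
    r = np.linspace(0.0, r_cut, n_grid)
    return {(n, l): CubicSpline(r, radial_integral(n, l, r))
            for (n, l) in nl_list}
```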
In the methane tests considered in the text, no hydrogen atom is ever more than r_max = 3 Å away from the central carbon atom, so it suffices to dispense with the cutoff functions and set a = r_max + 2.5σ. The cutoff functions should be eliminated for the Jacobian variance tests in any case, so this is how we performed all our methane calculations.

Appendix B: Rayleigh-quotient formulation of optimized bases
In the main text we have constructed the X-OPT and J-OPT bases by diagonalising the covariance matrices in Eqs. (22) and (23) in a large primitive basis. It is however also possible to introduce a framework that reveals a closer connection between these data-driven bases and the LE construction. In fact, all orthogonal sets of basis functions can be thought of as solutions to an appropriate eigenvalue problem or, equivalently, as the orthogonal functions that optimize an appropriate Rayleigh quotient.
The X-OPT basis is designed to focus on regions where the densities ρ_i are usually large, and the J-OPT basis on regions where their gradients are large. To formalize the concept of "where the density is usually large", we can introduce the function

c(x, x') = Σ_i ⟨x|ρ_i⟩ ⟨x'|ρ_i⟩,   (B1)

where the sum runs over all atomic environments. This function measures how correlated the densities are at the locations x and x'. Similarly, we can introduce a gradient analogue

c'(x, x') = Σ_i Σ_{jα} ⟨x|∂_{jα}ρ_i⟩ ⟨x'|∂_{jα}ρ_i⟩   (B2)

that measures correlations between the derivatives of the neighbor density. Using these functions, we can define the Rayleigh quotients

Q_X[u] = ∫dx ∫dx' u(x) c(x, x') u(x') / ∫dx u(x)²   (B3)

and

Q_J[u] = ∫dx ∫dx' u(x) c'(x, x') u(x') / ∫dx u(x)²   (B4)

that are maximized by the X-OPT and J-OPT bases, respectively. (It is also possible to find linear operators L such that the X-OPT and J-OPT basis functions are solutions to

L u(x) = γ u(x),   (B5)

and to write the associated Rayleigh quotients as

Q[u] = ⟨u|L|u⟩ / ⟨u|u⟩,   (B6)

as discussed in more detail in the SI.) A key difference between these bases and the LE basis is that the X-OPT and J-OPT bases maximize these Rayleigh quotients, whereas the LE basis minimizes the Rayleigh quotient in Eq. (10). This difference can be used to shed some qualitative light on why it is that the Jacobian condition numbers obtained from the J-OPT basis in Fig. 5 are so large. An integration by parts in Eq. (B4) gives

Q_J[u] = ∫dx ∫dx' ∇u(x) · c(x, x') ∇u(x') / ∫dx u(x)².   (B7)

This has a similar form to the Rayleigh quotient in Eq. (10), but with a non-local (positive semi-definite) kernel c(x, x') rather than a delta function δ(x − x'). The LE basis minimizes the Rayleigh quotient in Eq. (10) to give smooth basis functions, whereas the J-OPT basis maximizes the Rayleigh quotient in Eq. (B7) to give basis functions that are considerably less smooth.

Appendix C: Iterative variance optimization
Here we describe a cheaper IVO alternative to the PCA contraction of high-ν equivariant features. This alternative is a variation on a theme of the iterative CUR selection, 34 which proceeds by selecting the feature with highest variance and then orthogonalizing the remaining features to it. In order to ensure that the equivariance of the features is preserved, we orthogonalize separately features with different λ (and σ) character. This can be achieved in practice by treating the different projection components µ as if they were separate samples.
Take X to be the [n_train (2λ + 1)] × n_feature "flattened" feature matrix with angular momentum character λ. Compute the column-wise norms, and select the index k corresponding to the largest norm. All remaining columns X_j are then orthogonalized with respect to column X_k, the column norms are re-computed, and the selection is repeated. By choosing a threshold on the residual norm, we can achieve a truncation that mimics a principal component selection, retaining only the features that contribute most to the variance for each equivariant block; a code sketch of this procedure is given below.
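A minimal sketch of this greedy selection (our illustration; the function name, the matrix layout, and the stopping rule are assumptions consistent with the description above):

```python
import numpy as np

def ivo_select(X, threshold):
    """Iterative variance optimization on a flattened feature matrix X of
    shape [n_train * (2*lam + 1), n_feature]. Returns the indices of the
    selected columns, in order of selection."""
    X = X.astype(float).copy()
    selected = []
    while True:
        norms = np.linalg.norm(X, axis=0)
        k = int(np.argmax(norms))
        if norms[k] < threshold:      # residual variance is exhausted
            break
        selected.append(k)
        v = X[:, k] / norms[k]        # unit vector along the selected column
        X -= np.outer(v, v @ X)       # orthogonalize every column against it
    return selected
```

The selected indices then point back into the original (un-orthogonalized) feature matrix, from which the reduced matrix X̃ is assembled.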
It is important to recognize that the orthogonalization removes duplicate or linearly dependent features. When using the selected features in a regression scheme, this does not entail any loss of information (e.g., the global feature-space reconstruction error, GFRE, 47 would be zero). However, it does affect our metrics ℓ_X and ℓ_J, because the reduced feature matrix X̃ that is obtained by assembling the ñ_feature highest-variance features will have a lower variance than X when duplicates have been removed. In other words, the discretized coefficients are not a unitary transformation of the real-space neighbor density (i.e., the compression leads to a large feature-space distortion, or GFRD 47 ), which is incompatible with our goal of obtaining features that yield a lossless encoding of the neighbor distribution.
Fortunately, this is easily remedied by computing a "weighting matrix" W such that
$$W = \tilde{X}^{+} X,$$
where (·)⁺ denotes the matrix pseudo-inverse. The weighted features can then be obtained as X̃W, and these recover the least-distorted compression for a given selection. To understand why, consider the case in which only two identical features (columns of X) are present, and one is removed. Even though no information is lost, the squared norm of the selected features is halved relative to the starting features, affecting both ℓ_X and ℓ_J. If the retained column is multiplied by √2, however, the squared norm of the starting features is recovered, and the procedure is lossless for ℓ_X and ℓ_J. The W matrix generalizes this idea to multiple features with arbitrary linear correlations to the features that have been removed.
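A tiny numerical check of the two-identical-features example, under the same reading of the pseudo-inverse construction as in the equation above:

```python
import numpy as np

x = np.random.randn(100, 1)
X = np.hstack([x, x])            # two identical features
X_tilde = X[:, :1]               # the selection retains only one of them
W = np.linalg.pinv(X_tilde) @ X  # weighting matrix
# The weighted features recover the squared norm of the full feature matrix.
print(np.linalg.norm(X_tilde @ W)**2, np.linalg.norm(X)**2)
```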

Appendix D: High-order Jacobian
Here we provide a simple (if cumbersome) proof that the Jacobian of any high-order set of equivariants built from a complete discretization of the neighbor density is diagonal and has condition number equal to one, i.e. that the result in Eq. (31) holds true.
To clarify the argument, we shall avoid the complications that arise from resolving the parity index σ, and work with Eq. (31) in its unresolved form, Eq. (D1). When σ is resolved, the final result is the same, but with an additional sum over the parity index in each of the terms in Eq. (D5). In this simplified notation, a generic application of the CG iteration used in Eq. (6) gives Eq. (D2). Combining two terms of this form, and exploiting the orthogonality of the CG coefficients, Eq. (D3), we find that the left-hand side of Eq. (D1) can be written as the sum of four terms, each of which is associated with a product of two lower-order contractions, Eqs. (D4) and (D5), where ν = ν_1 + ν_2. In the limit of sharp Gaussians and a complete basis, XX(1) is proportional to the number of neighbours n_i of atom i, and is independent of the positions of these neighbours. As a consequence, XJ_jα(1) is zero, and J_jα J_j′α′(1) is just the Jacobian of the density expansion coefficients, which is shown in Ref. 30 to be a constant times δ_jj′ δ_αα′. This proves Eq. (D1) for ν_1 = ν_2 = 1 and ν = 2. Considering that the density sum rule (30) implies that XX(ν_k) ∝ n_i^{ν_k}, and therefore XJ_jα(ν_k) = 0, the general case can be verified by induction.

Appendix E: LE-ACE model
The regression results shown in Sec. III D 3 were obtained by incorporating the LE basis into ACE 7 . In this section, we briefly describe the peculiarities of the resulting LE-ACE model.
As in the original ACE method, delta-like densities are employed and a radial transform is used to give more weight to short-range interactions 7,12,41 . In our model, the radial transform ξ(r) is given in Eq. (E1), where r is the original radial coordinate, x is the transformed coordinate, a is the radius of the sphere inside which the Laplacian eigenstates are defined, and f is a multiplicative factor that can be optimized. This radial transform, unlike others in the literature, can be applied directly to the LE basis. The resulting basis functions ψ(ξ(r)) have the desirable property of going to zero, along with all their derivatives up to infinite order, at r = a. This guarantees the continuity of the predicted target property and its derivatives of any order at the cutoff radius. Following Ref. 41, whose results are reproduced in Table I under the ACE label, we use different cutoff radii for the two-body (ν = 1) interactions and for the higher body-order interactions. These cutoff radii are set to 5.5 Å and 4.4 Å, respectively, as in Ref. 41. It should be noted that these choices of cutoff radii, while tuned for ACE and adopted here for consistency, may not be optimal for LE-ACE, which uses different basis functions and a different radial transform.
Tikhonov regularization is employed during the linear fitting stage. Following Ref. 41, the regularization matrix is chosen to be diagonal, with entries corresponding to estimates of the derivatives of the basis functions. However, while in Ref. 41 the degree of the estimated derivative is an optimizable parameter, here we always use an estimate of the first derivative, given by √E_b, where E_b is the eigenvalue of the (possibly high-order) LE basis function b. See Sec. II B for more details on the connection between E_b and the mean square gradient of the LE basis functions.
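As a concrete illustration, a regularized linear fit of this kind can be written as an augmented least-squares problem (a minimal sketch with hypothetical names; Phi, y, and E stand for the feature matrix, the targets, and the LE eigenvalues of the basis functions):

```python
import numpy as np

def fit_le_ace(Phi, y, E, reg):
    """Solve min_w ||Phi w - y||^2 + reg^2 ||Gamma w||^2 with the diagonal
    Tikhonov matrix Gamma_bb = sqrt(E_b), a first-derivative estimate."""
    Gamma = reg * np.sqrt(E)                  # diagonal entries
    A = np.vstack([Phi, np.diag(Gamma)])      # augmented design matrix
    b = np.concatenate([y, np.zeros(len(E))])
    return np.linalg.lstsq(A, b, rcond=None)[0]
```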
Given the points above, in order to fully define the LE-ACE model, only the maximum correlation order ν_max and the maximum Laplacian eigenvalues E_max(ν) for 1 ≤ ν ≤ ν_max need to be specified. ν_max is set to 3 for azobenzene, uracil, and paracetamol, and to 4 for all other molecules in the rMD17 dataset. These parameters, which ensure consistency with the reference ACE benchmarks 41 , are almost certainly not optimal. This needs to be taken into consideration when comparing LE-ACE with NequIP and MACE, which can theoretically describe interactions up to infinite body order. A more justifiable choice is ν_max = 4 for random methane, where higher-body-order contributions are not relevant.
Regarding the E_max(ν) thresholds, these should be adapted according to the importance of the body-ordered contributions in each specific chemical system and to the extent to which the training set samples the potential energy surface. While the details are relegated to the SI, we emphasize that we did not optimize these parameters separately for each molecule in the rMD17 dataset (as was done in the ACE benchmarks). Instead, we used a single set of coefficients for all molecules for which ν_max = 4, and a different set for those for which ν_max = 3 (see SI).

Supplemental Information

1 Low-energy methane structures
The nearly random nature of the random methane dataset does not leave much scope for optimization with a data-driven basis. In order to test structures where more room for optimization is available, the 10 000 lowest-energy structures were extracted from the 7 732 488 structures in the random methane dataset and subsequently used as an independent dataset. Figure S1 shows the distribution of the angular tetrahedral order parameter S_g presented in Ref. 34 for the low-energy subset and for a random 10k subset of the random methane dataset.

Figure S1: Angular tetrahedral order parameter S_g for 10k random structures in the random methane dataset and for the 10k lowest-energy structures. Each blue dot represents the S_g parameter for a single structure, while the red line represents the average value in the two sub-datasets.
The value of S_g varies from 0 to 1, and it is 0 for a perfect tetrahedron. The average S_g value for the random structures is 0.256823, which is very close to the theoretical average value of 1/4 for randomly distributed points [34]. In contrast, the average order parameter is 0.188703 for the low-energy structures, indicating that they are, on average, significantly closer to an ideal tetrahedral configuration.
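For concreteness, the following sketch evaluates an order parameter of this kind. The specific functional form is our assumption (a 3/32-normalized sum over the six H-C-H angles), chosen because it reproduces the two properties quoted above: it vanishes for a perfect tetrahedron and averages 1/4 for randomly distributed points.

```python
import numpy as np
from itertools import combinations

def s_g(c_pos, h_pos):
    """Angular tetrahedral order parameter of a single CH4 structure:
    (3/32) * sum over the 6 H-C-H angles of (cos(theta) + 1/3)^2."""
    bonds = [(h - c_pos) / np.linalg.norm(h - c_pos) for h in h_pos]
    return (3 / 32) * sum((np.dot(u, v) + 1 / 3) ** 2
                          for u, v in combinations(bonds, 2))
```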

2 Full results
This section contains the results of the tests comparing different bases and truncation criteria. In all cases, a LE basis truncated with the "rectangular" l_max = 40, n_max = 25 parameters was employed as the large basis from which all data-driven bases were built. This basis was found to be converged in all the performed tests, and it was also used to measure the variance and Jacobian variance of the structures in the complete-basis limit. The density smoothing parameter σ was set to 0.2 Å both for the methane dataset and for the AIRSS carbon dataset. Larger values result in a plateau of the condition numbers within the range of basis sizes considered, while smaller values of σ delay the convergence of the condition numbers to 1. In all cases, the a parameter of the LE expansion was set to r_cut + 2.5σ in order to guarantee the expansion of the vast majority of the density around each atom in the relevant neighborhood. This is an important consideration in the Jacobian condition number tests.

2.1 Density expansion coefficients
When comparing the density expansion coefficients across different bases and basis sizes, each basis is truncated by its own natural truncation criterion. Specific (n_max, l_max) pairs are also shown for the GTO and X-OPT bases, as this is the way these two bases had been truncated prior to this work. Due to their shape in an n_max - l_max plot (e.g., Fig. 1), these truncations will be referred to as "rectangular". For each test, three rectangular truncation parameter pairs are shown: (n_max, l_max) = (6, 4), (8, 6), (12, 9). These are represented by isolated square markers in the plots.

From Fig. S2, it can be seen that, naturally, all curves are monotonic: as the size of the basis increases, it captures more of the atomic density in the structures. The X-OPT basis achieves the lowest residual variance in all cases, as it is optimized to do so. The rectangular bases suffer from the exclusion of high-l contributions, except in the low-energy methane case, where the tetrahedral nature of the structures favors l ≤ 3 channels. In contrast, the natural truncation criteria used for the LE, X-OPT and J-OPT bases reduce the truncation of the basis to a single parameter (which unequivocally specifies the size of the basis), and they outperform the (n_max, l_max) truncations in the vast majority of cases. This phenomenon is not limited to the variance tests: it also holds true in the other tests considered. In particular, it is useful to note the difference in performance between X-OPT and Rect. X-OPT, which are the same basis truncated in different ways, and the similarity between GTO and Rect. X-OPT in all considered tests, which highlights that a poor truncation choice can almost nullify the effort made in optimizing the basis functions themselves.
It can be seen that the LE basis cannot adapt to the low-energy methane dataset: while the other bases capture more variance in the low-energy case, the LE curves are almost exactly the same as for random methane. In this context, it is worth noting that the GTO basis, despite not being data-driven, captures more variance in the low-energy structures because it focuses on the regions closest to the central atom, and this is beneficial when expanding the low-energy densities, where most of the H atoms are close to the C center (relative to the 3 Å cutoff radius). Despite its inability to adapt to specific datasets, the LE basis is always competitive, and, given an X-OPT basis of a certain size, only a slightly larger LE basis is needed to achieve the same residual variance. Moreover, the fact that the carbon tests reproduce the behavior of the random methane tests more closely than that of the low-energy ones indicates that realistic bulk structures behave more like random distributions than like highly organized, low-energy-methane-like ones.

2.2 Jacobian variance tests
Figure S3: Jacobian variance tests.

Fig. S3 confirms that J-OPT is the optimal basis on the Jacobian variance test (by construction), as it always outperforms all other considered bases. Similarly to the variance tests, and despite the fact that the LE basis cannot adapt to different datasets, it is nonetheless competitive with J-OPT and the other bases in all cases, with the possible exception of the low-energy methane structures. Once again, the rectangular bases suffer from the lack of high-l channels (but not in the low-energy methane case, where high-l contributions are less important due to the high symmetry of the structures). It is also interesting to note that the X-OPT basis performs almost as well as the J-OPT basis in this test, while the opposite is not true: J-OPT is noticeably worse than X-OPT in the variance tests. This perhaps suggests that, as an optimized basis, X-OPT is a better choice than J-OPT.

The Jacobian condition number tests in Fig. S4 reveal that the LE basis leads to a highly uniform and isotropic mapping of real space into feature (expansion coefficient) space. The fact that the data-driven bases perform worse than LE, especially when the room for optimization is larger (e.g., low-energy methane), follows from the fact that the Jacobian CN is a measure of the distortions caused by the transformation from real space to feature space. The GTO and Rect. X-OPT bases generally perform worse than the others: their arbitrary truncation leads to the exclusion of important spherical expansion components (in particular, the complete exclusion of high-l channels) and the inclusion of less important ones. Specifically, the sharp change in the number of selected basis functions as a function of l (from n_max at l ≤ l_max to 0 for l > l_max) results in a high degree of distortion, which translates to large condition numbers. As noted in the previous tests, the two bases truncated via the rectangular truncation criterion (GTO and Rect. X-OPT) achieve essentially the same performance, suggesting that the truncation criterion is a crucial feature of a given basis, and that it is possibly even more important than the choice of the basis functions themselves.

2.3 Equivariant ν = 2 features
Here, and in the ν = 3 learning exercises, only the LE and X-OPT bases are taken into consideration, as they both perform well in our ν = 2 learning exercises and generalize in a consistent way to higher body-orders. The generalizations of the LE and X-OPT bases are, respectively, the high-order LE selection and PCA-like contractions of the high-order equivariants, as detailed in the main text. IVO (presented in Appendix C) will also be tested as a computationally cheaper alternative to PCA.
All parameters in the density expansions were kept the same as in the ν = 1 tests above. For each LE-LE equivariant set size in the following plots, the ν = 1 LE basis was chosen as the smallest possible basis that enables the computation of all required LE-LE ν = 2 equivariants (i.e., the ν = 1 LE energy threshold is found from Eq. 29). Starting from this "minimal" ν = 1 LE basis, all the available ν = 2 equivariants were calculated (including those above the ν = 2 LE threshold). IVO and PCA were then used to contract the ν = 2 equivariant set to a similar size as the corresponding LE-LE equivariant set. These two contractions constitute the LE-IVO and LE-PCA ν = 2 bases. Finally, a ν = 1 X-OPT basis was calculated in the same manner as for the ν = 1 methane tests. The size of the X-OPT basis was chosen to be similar to that of the corresponding "minimal" ν = 1 LE basis. Starting from this X-OPT basis, all available ν = 2 equivariants were calculated, and the X-OPT-IVO and X-OPT-PCA ν = 2 bases were then obtained by performing IVO and PCA on the resulting equivariant set, yielding equivariant sets of a size similar to LE-LE. In this context, a "similar" equivariant count corresponds to a tolerance of ±25%.

Variance tests
The variance of the ν = 2 equivariants is calculated in the same way as that of the ν = 1 equivariants (Eq. 12).

Figure S5: ν = 2 variance tests.

Fig. S5 shows that, although the ν = 2 contractions do have an effect on the ν = 2 variance tests, the majority of the performance differences arise from the ν = 1 bases, at least with our "minimalist" ν = 1 truncation approach. Indeed, it should be noted that LE-PCA and X-OPT-PCA would converge to the same basis in the limit of complete ν = 1 expansions. Our choice, however, is closer to the scenario of a real application, in which minimizing storage requirements and computational cost would be desired. As noted in the ν = 1 tests, all bases perform similarly on the random methane dataset, while their performances differ more significantly on the low-energy methane structures. Unsurprisingly, X-OPT-PCA performs better than the other truncations on both datasets. However, the difference in performance between X-OPT-PCA and LE-LE is larger than in the ν = 1 case, especially for low-energy methane. This hints at the ν = 2 features being much more compressible than the ν = 1 features, a fact that can be rationalized by considering that a full set of ν = 2 features contains the same information as the starting ν = 1 equivariant set, while its size is formally the square of the latter's. The IVO selection scheme affords close to optimal (PCA) performance, indicating a high degree of orthogonality (within the dataset) of ν = 2 features with the same symmetry. Moreover, once the IVO-selected ν = 2 features are determined on a training set (which consists of the whole dataset in these tests, but not in the learning exercises), only the retained features need to be computed, similarly to the LE high-order truncation, and unlike PCA, where all equivariants need to be computed, and stored, before the contraction step.
The ν = 2 Jacobian condition number tests in Fig. S6 show that, similarly to the ν = 2 variance case, the choice of ν = 2 compression has little effect compared to the ν = 1 basis, possibly as a result of the strict ν = 1 thresholds, which leave relatively little scope for differences in the ν = 2 selection/contraction. It is also worth noting that the X-OPT-PCA and X-OPT-IVO contractions perform poorly for the low-energy methane structures, confirming that aggressive optimization of a feature set has a negative effect on the Jacobian condition number, which is a measure of the distortion introduced in the transformation from real space to feature space. This is particularly evident from the very high condition numbers generated by the X-OPT-IVO and X-OPT-PCA combinations on low-energy methane.

2.4 Learning: up to ν = 2 invariants
In all cases, the learning target was the potential energy of the structures. Each ν = 1 basis was employed in three sizes ("small", "medium" and "large"), truncated with the aim of achieving a very similar number of invariants to that obtained from the three (n max , l max ) truncation parameter pairs mentioned in Sec. 2.1 for GTO (and R-X-OPT). While impossible to match exactly, the number of invariants within bases of the same size was kept nearly the same for all different bases and truncations. For each basis size, all ν = 1 and ν = 2 invariants computable from the density expansion were employed, and the central atoms were not included in the density expansions. The adaptive bases were only allowed to learn from the training set. The learning was performed with linear regression, kernel ridge regression and neural networks in order to demonstrate that the relative performance of the different bases is mostly consistent across different fitting models. For the methane dataset, only carbon-centered features were employed in the learning process, and the feature-related hyperparameters were kept the same as those employed in the expansion coefficient tests. The test sets consist of 2000 structures for the methane sub-datasets and 2931 structures for the carbon dataset.
When learning from the carbon dataset, r_cut = a = 5 Å was chosen, and a shifted cosine cutoff function (0.5 Å wide) was applied in all cases if not otherwise specified. It must be noted, however, that this choice is not optimal for the LE basis and the optimized bases calculated from it, because the LE basis cannot expand finite densities at the edge of the sphere of radius a, where all of its basis functions are zero (Dirichlet boundary condition). As a result, and as a means of exploring the effect of more aggressive cutoff functions, the LE calculations were repeated with the f_1 and f_2 cutoff functions described in Appendix A. Similarly, the X-OPT basis was re-calculated starting from these new LE expansion coefficients. Finally, the GTO results were repeated with a shifted cosine cutoff as wide as r_cut, which is equivalent to f_2.
Linear regression. The LAPACK dgelsd subroutine was used to perform linear fitting in all instances. Within this subroutine, regularization is performed by excluding the singular values of the feature matrix that fall below a threshold, given by the rcond parameter times the largest singular value. The optimal rcond was found via a logarithmic one-dimensional grid search with the aim of minimizing the test-set error.
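For illustration, numpy's `lstsq` wraps the same LAPACK `gelsd` driver, and its `rcond` argument implements exactly this truncated-SVD regularization; the grid bounds below are our own illustrative choice:

```python
import numpy as np

def fit_with_rcond_search(Phi_train, y_train, Phi_test, y_test):
    """Truncated-SVD linear fit: singular values below rcond * s_max are
    discarded, with rcond scanned on a logarithmic 1D grid."""
    best_rmse, best_rcond = np.inf, None
    for rcond in np.logspace(-14, -1, 14):
        w = np.linalg.lstsq(Phi_train, y_train, rcond=rcond)[0]
        rmse = np.sqrt(np.mean((Phi_test @ w - y_test) ** 2))
        if rmse < best_rmse:
            best_rmse, best_rcond = rmse, rcond
    return best_rmse, best_rcond
```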
Kernel ridge regression. Full Gaussian process regression (GPR) was employed with the polynomial kernel k(x_i, x_j) = (x_i · x_j)², and the only scaling prior to the fitting is that of the mean potential energy, which was set to zero. The optimal Tikhonov regularization parameter was found via a logarithmic one-dimensional grid search with the aim of minimizing the test-set error.
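A compact sketch of the corresponding fit (our illustration; `lam` stands for the Tikhonov parameter found by the grid search):

```python
import numpy as np

def krr_fit_predict(X_train, y_train, X_test, lam):
    """Kernel ridge regression with the quadratic kernel k(x, x') = (x . x')^2."""
    K = (X_train @ X_train.T) ** 2
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)
    return ((X_test @ X_train.T) ** 2) @ alpha
```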
Neural networks. In all cases, a feed-forward MLP-type architecture was used. The NNs employed were three layers deep, with each layer consisting of 16 neurons and tanh activation functions. When learning from the AIRSS carbon dataset, a Behler-Parrinello construction with the same MLP architecture was used. Training of the neural networks was carried out with the Adam optimizer, starting from a 10^-3 learning rate and decreasing the learning rate by a factor of 10 after a patience of 50 epochs upon stagnation of the test-set error. The end point of the training was determined either by early stopping or at 10000 training epochs, whichever occurred first. For each point in the learning curves, in order to reduce noise, the results were averaged over 16 training runs starting from randomly initialized weights.
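A minimal PyTorch sketch of this architecture (our illustration; the input size is assumed, and the training loop, the early-stopping logic, and the Behler-Parrinello atomic-sum wrapper are omitted):

```python
import torch

n_features = 128  # illustrative input dimension
model = torch.nn.Sequential(
    torch.nn.Linear(n_features, 16), torch.nn.Tanh(),
    torch.nn.Linear(16, 16), torch.nn.Tanh(),
    torch.nn.Linear(16, 16), torch.nn.Tanh(),
    torch.nn.Linear(16, 1),   # predicted (atomic) energy
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# Decrease the learning rate tenfold after 50 epochs without improvement;
# scheduler.step(validation_error) would be called once per epoch.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, factor=0.1, patience=50)
```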
Results. Fig. S7 shows the full results of the learning exercises using up to ν = 2 invariants. It can be seen that linear models saturate within the n_train range considered, due to their inability to represent energy contributions stemming from four- or higher-body interactions. Conversely, kernel and NN models do not saturate, and they achieve lower RMSEs. While neither is specifically optimized, the performance of NNs and kernels is similar in this context, with GPR having a slight edge for small training sets and NNs achieving slightly lower RMSEs with large training sets. In most cases, the small bases considered here do not include enough features to fit the potential energies optimally, while the medium-sized bases are closer to the optimal resolution. Large bases show signs of overfitting for methane, and they do not improve over the medium-sized bases for carbon.
LE is the best basis when learning from random and low-energy methane, and it achieves small RMSEs even in its small version. In contrast, GTO clearly outperforms the other bases in the carbon learning exercises. The fundamental difference between the methane and carbon datasets in this setting is that, while the methane structures only include one coordination shell, the carbon structures with r_cut = 5 Å include at least three. It is therefore beneficial for the success of the carbon learning exercises to prioritize the closest regions of each atomic neighborhood, which have a more sizeable effect on the corresponding atomic energy than the more distant regions. All bases except for GTO treat the whole sphere of radius a equally, while the GTO basis functions sample the region closer to the center of the sphere more densely, resulting in better potential energy fits. It should however be noted that the LE basis is still the best among the bases other than GTO, i.e., those which treat the density in the whole sphere equally. This, and the fact that the LE basis outperforms all other bases on random and low-energy methane, suggests that basis function smoothness is a desirable property to achieve low RMSEs.
Hence, for larger systems, where the cutoff radius includes more coordination shells, taking into account the mostly short-ranged nature of the potential energy is also crucial. On the carbon dataset, the inclusion of more aggressive cutoff functions indeed leads to a marked improvement in the RMSEs. In this case, all the bases with aggressive cutoffs (LE, X-OPT, GTO) perform roughly equally well. This is true even though GTO, and now also X-OPT (due to its density modulation), sample the closer regions more densely than LE, whose higher smoothness likely compensates for its uniform sampling.
In contrast to the variance, Jacobian variance, and Jacobian CN tests, the truncation of the basis does not seem to have a large impact on the accuracies achieved within these supervised learning exercises. In particular, it is useful to compare X-OPT and R-X-OPT, which consist of the same basis functions truncated in different ways. Indeed, these two show overall similar performance in Fig. S7, with X-OPT performing slightly better on methane (as perhaps expected) and R-X-OPT being more successful on carbon. Once again, this may be explained by the short-ranged nature of the potential: R-X-OPT ignores high-l contributions and focuses on small-l angular channels, leading to partial cancellation of the atomic contributions of the outer coordination shells in each atomic neighborhood and to a prioritization of the closer regions.
The J-OPT basis shows the worst performance in most cases. This could be related to the very high condition numbers afforded by the J-OPT basis, which are rationalized in Appendix B.

2.5 Learning: up to ν = 3 invariants
We also tested the performance of LE and X-OPT (and their generalizations) on linear learning exercises where ν = 3 invariants were also included. For each basis set type, three invariant set sizes were tested: small (around 10^2.5 invariants), medium (10^3), and large (10^3.5). The invariant set sizes could not be matched exactly to these numbers; hence, a 10% tolerance is allowed in all cases. For each invariant set size and for each ν = 2 truncation type (LE-LE, LE-IVO, LE-PCA, X-OPT-IVO, X-OPT-PCA), a ν = 2 equivariant set is found such that, if combined with its corresponding ν = 1 equivariant set (whose truncation is performed in the fashion described for the ν = 2 equivariant tests), the size of the set of all calculated invariants (with ν = 0, 1, 2, 3) falls within the tolerance intervals described above. It is important to note that this truncation strategy (based on Eq. 29 for ν = 1, 2 and with no truncation for ν = 3 other than that provided by the previous body orders) is the most practically convenient and consistent across all truncation types, but it may not be the optimal choice to obtain low RMSEs.

Figure S8: Learning exercises using up to ν = 3 invariants.

Fig. S8 displays the results of the linear learning exercises using up to ν = 3 invariants. As also noted in the ν = 2 variance and Jacobian condition number tests, our minimalist choice for the ν = 1 truncation (for a given ν = 2 equivariant feature set size) leads to only minor differences between the ν = 2 truncation types. Indeed, the RMSE differences are largely accounted for by the ν = 1 LE/X-OPT comparison, which shows the LE basis yielding significantly lower RMSEs, most likely as a result of its smoothness.
Regarding the effect of the different ν = 2 contraction schemes, it can be seen that IVO performs as well as or better than PCA in all cases. While these tests do not show large differences between LE-LE, LE-IVO, and LE-PCA, LE-LE performs slightly better on low-energy methane, a dataset on which LE-IVO and LE-PCA deviate more from LE-LE due to the non-uniform distribution of the hydrogen atoms. Moreover, partial results with a non-minimalist choice for the ν = 1 bases, where LE-IVO and LE-PCA are allowed to differentiate themselves more significantly from LE-LE, clearly indicated the superiority of LE-LE in terms of RMSEs. Since the LE-LE combination also provides the cheapest way to select high-ν features, it is clearly the best way to use the LE basis for the machine learning of potential energies.

3 A unified picture of optimal basis functions
In this section, we present a unifying framework for the different optimal bases that are compared in the main text, namely the LE, X-OPT and J-OPT bases. On the one hand, all of these sets of basis functions can be thought of as solutions to an appropriate eigenvalue problem. On the other hand, all three can be defined as the smoothest set of basis functions according to a different metric of smoothness. The two viewpoints are closely related to one another, and we explore them in the first two subsections. Some mathematical details related to the specific implementation are addressed in the remaining subsection.

3.1 The eigenvalue picture
The first perspective is the generalization of the eigenvalue problem, where the LE basis functions are defined as solutions to
$$\Delta u = \lambda u. \tag{S1}$$
If we are restricted to a finite number of basis functions, the eigenfunctions corresponding to the largest eigenvalues (rather than the smallest, since we are omitting the minus sign in this version) are selected.
Both the X-OPT and J-OPT bases can be defined similarly, as solutions of an eigenvalue problem of the form
$$L u = \lambda u, \tag{S2}$$
with suitable linear operators L. As for the Laplacian, the eigenfunctions corresponding to the largest eigenvalues are selected here as well.
To provide a short summary for readers fluent in the bra-ket notation, the operators for the X-OPT and J-OPT bases are explicitly given as
$$C = \sum_i |\rho_i\rangle \langle\rho_i| \tag{S3}$$
and
$$K = \sum_i \sum_{\alpha \in \{x,y,z\}} |\partial_\alpha \rho_i\rangle \langle\partial_\alpha \rho_i|, \tag{S4}$$
where the sum over i runs over all atomic environments, i.e., all atoms in every structure of the data set.
To make this more explicit, recall that the atomic densities ρ_i(x) around a center atom i are just functions defined on R³ or, to be more precise, taking the finite cutoff radius into account, on the 3-dimensional ball of radius r_cut,
$$\rho_i : \Omega \to \mathbb{R}, \qquad \Omega = \{\mathbf{x} \in \mathbb{R}^3 : |\mathbf{x}| \le r_\mathrm{cut}\}. \tag{S5}$$
Using these densities, we can define two new functions (one each for the X-OPT and J-OPT bases) that depend on two variables x, x′ ∈ Ω as
$$c(\mathbf{x}, \mathbf{x}') = \sum_i \rho_i(\mathbf{x})\, \rho_i(\mathbf{x}'), \tag{S6}$$
where the sum again runs over all atomic environments, and the "gradient version"
$$k(\mathbf{x}, \mathbf{x}') = \sum_i \nabla\rho_i(\mathbf{x}) \cdot \nabla\rho_i(\mathbf{x}'). \tag{S7}$$
These functions in two variables can be used to define the two operators C and K as
$$(Cu)(\mathbf{x}) = \int_\Omega \mathrm{d}\mathbf{x}'\, c(\mathbf{x}, \mathbf{x}')\, u(\mathbf{x}') \tag{S8}$$
and
$$(Ku)(\mathbf{x}) = \int_\Omega \mathrm{d}\mathbf{x}'\, k(\mathbf{x}, \mathbf{x}')\, u(\mathbf{x}'). \tag{S9}$$
Using these, the first few LE, X-OPT and J-OPT basis functions can be defined as the eigenfunctions of the operators L_LE = Δ, L_X = C and L_J = K with the largest eigenvalues, respectively. Note that C is self-adjoint, and K is too, provided that the densities vanish at the boundary of the ball.

3.2 The smoothest-function picture
A different way to define the Laplacian eigenstate basis was to consider the Rayleigh quotient
$$Q = \frac{\int_\Omega |\nabla u(\mathbf{x})|^2\, \mathrm{d}\mathbf{x}}{\int_\Omega |u(\mathbf{x})|^2\, \mathrm{d}\mathbf{x}}. \tag{S10}$$
The basis functions used in the LE basis can be defined as the functions minimizing this quotient. This is an alternative method that allows us to see the LE basis functions as the smoothest (as in, varying the least) functions we could choose on which to expand the target density.

At the conceptual level, one might imagine that instead of maximizing the overall smoothness over the entire domain, it might make more sense to let the basis functions focus on regions in which the target densities ρ_i are actually present. This is precisely the idea behind the Rayleigh quotient formulation of the X-OPT and J-OPT bases, which together with the LE basis can all be understood as functions maximizing (rather than minimizing, for convenience) a suitable Rayleigh quotient of the form
$$Q = \frac{\langle u|L|u\rangle}{\langle u|u\rangle}, \tag{S11}$$
where
$$\langle u|u\rangle = \int_\Omega |u(\mathbf{x})|^2\, \mathrm{d}\mathbf{x} \tag{S12}$$
is a more compact way to write the denominator of the quotient, and the operator L is chosen as in the previous subsection, L_LE = Δ, L_X = C and L_J = K for the LE, X-OPT and J-OPT bases, respectively. At the intuitive level, choosing the basis functions that maximize the Rayleigh quotient of the operator C means picking the functions that have the maximal descriptive capability where the densities are significant. An analogous interpretation can be made for the J-OPT basis using the operator K.

In the rest of this subsection, we try to provide some further insight into this picture. We begin by showing how the Rayleigh quotient introduced in the main text can be brought to this form. The connection to the eigenvalue picture can be seen by explicitly writing out
$$|\nabla u(\mathbf{x})|^2 = (\partial_x u)^2 + (\partial_y u)^2 + (\partial_z u)^2 \tag{S13}$$
and integrating the three terms by parts, which allows us to write the numerator as
$$\int_\Omega |\nabla u(\mathbf{x})|^2\, \mathrm{d}\mathbf{x} = \int_\Omega u(\mathbf{x})\, \left(-\Delta u(\mathbf{x})\right) \mathrm{d}\mathbf{x} \tag{S14}$$
$$= \langle u|{-\Delta}|u\rangle. \tag{S15}$$
This results in the previously introduced Rayleigh quotient (with inverted sign for convenience),
$$Q_\mathrm{LE} = \frac{\langle u|\Delta|u\rangle}{\langle u|u\rangle}. \tag{S16}$$
Similarly, the X-OPT and J-OPT bases can be interpreted as functions maximizing
$$Q_X = \frac{\langle u|C|u\rangle}{\langle u|u\rangle} \tag{S17}$$
and
$$Q_J = \frac{\langle u|K|u\rangle}{\langle u|u\rangle}. \tag{S18}$$
The interpretation is that these are the functions capturing the maximal amount of the total variation in the atomic densities ρ_i (X-OPT) or their gradients (J-OPT), respectively. Explicitly writing out the integrals, the X-OPT Rayleigh quotient becomes
$$Q_X = \frac{1}{\langle u|u\rangle} \int\!\!\int_\Omega \mathrm{d}\mathbf{x}\, \mathrm{d}\mathbf{x}'\; u(\mathbf{x})\, c(\mathbf{x}, \mathbf{x}')\, u(\mathbf{x}'), \tag{S19}$$
while the J-OPT Rayleigh quotient can be rewritten as
$$Q_J = \frac{1}{\langle u|u\rangle} \int\!\!\int_\Omega \mathrm{d}\mathbf{x}\, \mathrm{d}\mathbf{x}'\; u(\mathbf{x})\, k(\mathbf{x}, \mathbf{x}')\, u(\mathbf{x}') \tag{S20}$$
$$= \frac{1}{\langle u|u\rangle} \int\!\!\int_\Omega \mathrm{d}\mathbf{x}\, \mathrm{d}\mathbf{x}'\; c(\mathbf{x}, \mathbf{x}')\, \nabla u(\mathbf{x}) \cdot \nabla u(\mathbf{x}'). \tag{S21}$$