Alchemical and structural distribution based representation for improved QML

We introduce a representation of any atom in any chemical environment for the generation of efficient quantum machine learning (QML) models of common electronic ground-state properties. The representation is based on scaled distribution functions explicitly accounting for elemental and structural degrees of freedom. Resulting QML models afford very favorable learning curves for properties of out-of-sample systems including organic molecules, non-covalently bonded protein side-chains, (H$_2$O)$_{40}$-clusters, as well as diverse crystals. The elemental components help to lower the learning curves, and, through interpolation across the periodic table, even enable"alchemical extrapolation"to covalent bonding between elements not part of training, as evinced for single, double, and triple bonds among main-group elements.


I. INTRODUCTION
Ground-state properties of chemical compounds can generally be estimated with acceptable accuracy using methods such as ab initio quantum chemistry or density functional theory (DFT) 1 . However, these can be computationally expensive and therefore have a limited applicability, especially for larger systems. Alternatively, inductive quantum machine learning (QML) models can infer properties directly, or even predict the electron density which in turn can be used to calculate properties 2 , by training on a large data sets of reference property/compound pairs. ML models can have an exceptional trade off between predictive accuracy and computational cost. For example, in 2017 we showed that QML models can estimate hybrid DFT atomization energies, as well as several other properties, of medium sized organic molecules with prediction errors lower than chemical accuracy (∼0.04 eV)-multiple orders of magnitude faster than hybrid DFT 3 .
The system variables defining the ground-state properties of a given compound are its external potential, a simple function of interatomic distances and nuclear charges. However, when using this information directly to measure similarity results in QML models with rather disappointing predictive power. This can be mitigated by transformation of system variables into "representations". Such transformations can either be designed by human intuition, or be included in the learning problem, e.g. when using neural networks (NN) which include representation learning in the supervised learning task. Letting a NN find the representation has proven to yield models with low out-of-sample prediction errors [4][5][6] . This approach, however, has the drawback that representation and model are intermingled within the NN, making it less amenable to human understanding, interpretation, adaptation, and further improvement. Furthermore, such machine designed representations do not necessarily lead to better QML performance than human design based representations (vide infra).
There are many ways of manually encoding the 3D structure and chemical composition of a compound into a suitable representation. For example, we can represent a compound as a list of interatomic potentials [7][8][9] . Another approach consists of creating a fingerprint of the compound, transforming internal coordinates into a fixed set of numbers. For example, this can be done by projecting the coordinates on to a set of basis functions 10 , or by creating a "fingerprint" from the topology of the structure 11 . Distributions of internal coordinates represent another systematic approach, shown to yield well performing QML models applicable throughout chemical space 12,13 . Additional use of bags containing angular and dihedral distributions has led to further improvements in resulting QML models 3,9,14,15 . Bagging based on atom types, however, severely hinders resulting QML models from transferring what has been learned from one atom type to another-a desirable feature for chemically diverse systems.
In this work we introduce a new atomic environ-arXiv:1712.08417v1 [physics.chem-ph] 22 Dec 2017 ment representation, with two key differences to previous distribution-based work. (i) The representation is not binned by atomic types. Instead, compositional information is encoded directly into the distributions. This allows measuring not only structural differences, but also "alchemical" differences between elements in the atomic environments. The idea of computational alchemy, amounting to continuous interpolation of Hamiltonians of two different systems, is well established in quantum chemistry and statistical mechanics and can be exploited for virtual exploration campaigns in chemical space with increased efficiency 16 . Recently, it has been shown that alchemical estimates of covalent bond potentials can even surpass generalized gradient approximated DFT accuracy 17 . The foundation of a continuous chemical space has been reviewed previously 12 . Alchemical distance measures in the context of QML were already exploited previously when using the Coulomb matrix 7 , Fourier series distribution based representations 18 , the Faber, Lindmaa, Lilienfeld, Armiento (FLLA) crystal representation 19 , and within smooth overlap of atomic potentials (SOAP) representations 20 . For this work we have identified a new functional form with improved performance due to alchemical contributions to the distance measure. (ii) We use a set of multidimensional distributions of interatomic many-body expansions scaled by simple powerlaws, rather than several 1D bins of internal coordinates. The distributions are built recursively, so that an m-body distribution contains the same information as the (m − 1)-body distribution plus additional m-body information. This particular combination combines similarity to the potential energy target function and compliance with many known (translational, rotational, permutational) invariances.

II. THEORY
In this section, we first motivate the ideas which have led to this study. Thereafter, we discuss the functional form and the variational degrees of freedom which we have introduced, as well as the resulting compound distances. Then, an analysis of the functional form is performed using the molecule water as an example. Finally, numerical results for parameters optimization runs are discussed.

A. Kernel ridge regression
In order to profit from robustness, ease of error convergence, computational efficiency, and simplicity, we base our studies preferably on kernel ridge regression (KRR) models [21][22][23][24] . However, we consider this rather a question of taste, and believe that other regressors, such as neural networks, will produce similar results if properly converged.
KRR estimates property p of query compound C as a weighted sum of kernel basis functions placed on each of N training compounds {C k }, where the solution for the weights {α k } are obtained through linear regression with regularizer λ (typically negligibly small because of absence of noise in quantum training data). Note that throughout this work we rely on atomistic (scalable) Gaussian kernels, K(C, C ) = I∈C J∈C k(∆(A M (I), A M (J))), as already used in 14,[25][26][27] . As such, KRR renders the selection of a functional form which represents an atom in its chemical environment mandatory. Obviously, this choice is fundamentally related to our understanding of chemistry, and is known to dramatically affect the performance of resulting QML models, see e.g. 3,9 . It is for this reason that we draw our inspiration from the fundamental laws of quantum mechanics which specify the definition of system (Hamiltonian) and property (Observable), and which spell out the numerical recipe which links the two 1 .
The genesis of this study is due to the fact that the total potential energy, the expectation value of a compound's electronic Hamiltonian, constitutes the central figure of merit for convergence towards the wavefunction by virtue of the variational principle. When considering Eq. (2) it should be obvious that kernel (and thereby representation) are independent of the specific property, units and property dependence are introduced through the regression weights only. This has also already been demonstrated numerically for multiple properties using the same kernel 28 . As such, the role of the kernel is reminiscent of the wavefunction which can be used to predict arbitrarily many observables by evaluating the expectation values of the corresponding operators, always using the same wavefunction: Once the kernel is inverted, arbitrarily many sets of regression coefficients can easily be generated provided that their corresponding property reference values have been provided. The potential electronic energy being the central property in quantum mechanics which defines the wavefunction it is therefore plausible to assume that a representation, optimized for energy predictions only, is fundamentally more advantageous than representations obtained by minimizing prediction errors of alternative observables. Consequently, the focus on this study has been to identify a representation which is inspired by the energy changes occurring due to changes in chemical composition and covalent and non-covalent bonding. The accuracy quantum mechanics when predicting other properties (observables) as expectation values of operators depends crucially on the quality of the wavefunction. Here, we follow a similar argument: The better the representation the better the energy prediction, implying that energy prediction errors can be minimized in the functional space of the represen-tation, enabling systematic convergence towards an ideal representation.

B. Representation
We use a set of interatomic M -body expansions A M (I) = {A 1 (I), A 2 (I), A 3 (I), . . . , A M (I)} which contain up to M -body interactions to represent the structural and chemical environment of an atom I in compound C. A m (I) is a weighted sum that runs over all m-body interactions. Each element in the sums consists of Gaussian basis functions, placed on structural and elemental degrees of freedom, and multiplied by a scaling function ξ m . Structural values encode geometrical information about the system, such as interatomic distances or angles. As elemental parameters we use the period P and group G from the periodic table. The scaling functions ξ m are used to weigh the importance of each Gaussian, based on internal system coordinates. We now consider only the first three distributions in A M (I) for an atom I. We have also derived, implemented and tested the 4-body A 4 (I) distributions. We refer to the supporting information (SI) for the derivation. The predictive accuracy improvements of resulting QML models, however, were found to be negligible in comparison to the 3-body expansion. As such, we Also, the computational cost for generating large kernel matrices increases substantially when going from third to fourth order terms.
The first-order expansion A 1 (I) accounts for chemical composition (stoichiometry) and is modeled by a Gaussian function placed on period P I and group G I of element I: where x (1) I = {P I , σ P ; G I , σ G }, with respective widths σ P and σ G . σ P and σ G can be seen as elemental smearing parameters, which control the near-sightedness of elements in the periodic table. χ 1 and χ 2 represent dummy variables for period and group, to be integrated out when evaluating the Euclidean distance (see Eq. (4)). For A 1 (I), the scaling function is set to unity, since stoichiometry is geometry independent. We are not aware of other representations in the literature which employ similar distribution functions in the periodic table.
A 2 (I) is a product of A 1 (I) and a sum that runs over all neighboring atoms i: where d iI and σ d correspond to the interatomic distance at which a Gaussian is placed, and its width, respectively. ξ 2 corresponds to the 2-body, interatomic distance dependent, scaling function which takes the form of the power laws discussed below. Note that letting σ P and σ G approach zero is equivalent to using a radial distribution function (RDF) for each element pair.
This attribute of the representation holds for any of A m (I). I.e., σ P , σ G → 0 is equivalent to creating a separate distribution for each chemical element m−tuple in A m (I). While atom pair-wise distribution functions are rampant as representation choice, especially for fitting potential energy surfaces of systems with fixed chemical composition, to the best of our knowledge, combining them with scaling functions is novel.
A 3 (I) is the logical extension from A 2 (I), it has a different scaling function with an additional summation, running over all neighboring atoms j: . P j and G j , similarly to P i and G i , corresponds to the period and group of atom j. Again, is the (three-body) scaling function, and θ I ij the principal angle between the two distance vectors r Ii and r Ij which span from I to i and I to j, respectively. σ θ is the width of the Gaussian placed at θ I ij . Letting σ d go to infinity in A 3 is equivalent to using a type of angular distribution function (ADF), which in one form or another has already been used in several representations 3,14,15 . A 3 can therefore be seen as a generalized ADF containing more structural information. Fig. 1 illustrates how A 3 (I) looks for a hydrogen, carbon, and the oxygen atom in ethanol. Three-body distributions are less frequent as representation choice, and again, to the best of our knowledge, combining them with scaling functions is novel.
The scaling functions ξ we have chosen for this work correspond to simple power laws. They have been modified from the leading order two-and three-body dispersion laws by London, 1/r 6 , and Axilrod-Teller-Muto 29,30 , 1/r 9 . Such dispersion expressions were previously already used by some of us 14 . Our scaling functions, however, use different exponents for the radial decay, and set the C 6 and C 9 coefficients to unity, as early tests indicated better performance for this choice. For periodic systems, however, a very large cutoff radius would be needed in order to converge the distances between two atomic environments, when using the optimized exponents. We have therefore augmented the scaling functions by a previously used soft cutoff function 31 , which goes to zero at 9Å.

C. Distances and scalar products
In order to train and evaluate the KRR model in Eq. (1), proper distance measures must be specified. We have found good performance when using as a distance between two atomic environments A M (I) and A M (J) a weighted sum of the distances between each m-body expansion: Here, β m is another hyperparameter, which weighs the importance of each expansion order.
The distances between each distribution term are eval-uated as Euclidean (L 2 ) norms, as shown in Eq.4. ς m are normalization factors, which ensures that all individual basis functions integrates to 1 in the L 2 -norm. All in-tegrals can be solved analytically since they consist of a sum of Gaussian products. The explicit form of the integrals can be found in SI.
Note that third and fourth order terms become prohibitively expensive to calculate directly. However, this can to a large extent be circumvented by slightly modifying the distributions, and solving the angular integrals in Fourier space. Further details about the corresponding equations and derivations can also be found in the SI.

D. Comparison to other distribution based representations
Probably the largest difference in how A represents nuclear configurations, when compared to many of the previously published distribution based representations, lies in the 3-body term (since A 2 (·) is a radial distribution functions if σ P and σ G go to zero). In this subsection, we highlight the differences between A 3 (·), or conventional ADF or RDF for representing the structure of the water molecule.
As ADF, we use A 3 (·) with the limit σ d → ∞, and we model RDF by A 2 (·). Furthermore, no scaling function (ξ 2 = ξ 3 = 1) is used and we let σ P and σ G go to zero, since we only examine how representations distinguish structural differences among different geometries of the water molecule. This results in A 3 (·) and ADF being i =I N (d iI , σ d ) j =i,I N (θ I ij , σ θ ) and i =I j =i,I N (θ I ij , σ θ ) for each element triplet, and RDF being i =I N (d iI , σ d ) for each element pair. Fig. 2 shows how the distance measure changes as one distorts the geometry away from its equilibrium structure. Both, RDF as well as ADF result for oxygen as well as for H in large configurational domains with substantially zero distance to the minimum, implying a severe lack of uniqueness. A 3 , by contrast produces a qualitatively meaningful picture with a single well defined well around the minimum. We have also studied the performance for modeling the energy of the water molecule. In Fig. 3, the training error for atomization energies is shown for a linear kernel KRR model with A 3 (·), ADF, RDF, or RDF + ADF as representations. The linear kernel is used as a difficult test in how far representations can model a nonlinear property, such as the energy, in terms of linear basis functions. The errors are significantly lower when using A 3 instead of the other representations, including RDF + ADF. Generally, potential energy surfaces of a three-atom system cannot be decomposed into many-body terms each as a function of only one internal coordinate (internuclear distance d or angle θ). That is, E(d, θ) = E(d) + E(θ). Using a ADF, RDF or a linear combination of the two however would result precisely in such a model, as well as most force-fields. This also explains the relatively large errors for these representations, as well as unreliable performance of pair-wise potentials when it comes to distorted molecules. A 3 on the other hand does not decouple distances and angles, and can, by construction, model any three-body potential.
These observations give insight as to why our new representation performs better than the other distribution FIG. 2. Heat maps of normalized L2 distances for three representations (RDF, ADF, and our new representation). The color code from black to white indicates a distance range from 0 to 1, respectively. The distances are measured between oxygen (LEFT) and hydrogen atoms (RIGHT) in two different water molecules. One water molecule is being distorted by uniform stretching of both OH bonds (dOH1 = dOH2 = l) and bending (φ). The other water molecule is kept fixed at its equilibrium geometry (cross).
based representations: Using ADF's and RDF's as representations might be able to capture slices of the manybody picture, the fact that there is a linear mapping between A n (·) and a n-body potential energy surface, however, appears to make it easier to improve the performance also for non-linear kernels.
E. Optimization

Hyper-parameters
The use of our representation in combination with KRR yields multiple hyperparameters. While one could, in principle, attempt to optimize all of them, using several data sets, and efficient optimizers, such as gradient, Monte Carlo, genetic or simplex methods, we have found that the problem is sensitive only to a small subset of parameters. As such, the exact choice of many hyperparameters is not critical for the out-of-sample errors, and resulting models perform typically well as long as values are used which have similar order of magnitude. Unless otherwise specified, the following hyperparameter values have been used: σ P = σ G = 1.6, σ d = 0.2, σ θ = π, β 1 = 1, β 2 = √ 8, β 3 = 1.6. For the water cluster and the SSI data set there is no to little variation in chemical composition, and no alchemical smearing has been used.

Scaling powerlaw parameters
We have screened screened radial exponents for the , using atomization energies for a subset of the QM9 dataset in order to identify the optimal exponents. Corresponding learning curves are shown in Fig. 4. First, we have screened ξ 2 , using A 2 as representation, yielding the lowest off-set for n2 = 4. Keeping this exponent for ξ 2 fixed, we then proceeded to screen the exponent ξ 3 in A 3 We found that n3 = 4 corresponded to the best exponents for ξ 3 . We have used these values throughout this work, and unless something else is specified, the optimal scaling functions read,

Alchemical smearing
Parameters associated with the elemental smearing have also turned out to have a strong effect on the predictive power of the QML models. We have therefore screened the corresponding values of σ P and σ G using energy prediction errors for the OQMD and QM9 data set for different training set sizes. These two datasets have been used due to their (relatively) high (OQMD) and low (QM9) chemical diversity in terms of number of differing elements in the the data set. The optimal alchemical Gaussian widths varies only slightly across the two sets, as shown in Fig. 5. A circular Gaussian with width σ P = σ G =∼1.6, which amounts to ∼90% overlap between neighboring elements, corresponds in a relatively deep well with minimal MAE for the OQMD dataset, no matter the training set size. The fact that the optimal width stays constant with respect to training set size is beneficial: The elemental smearing can be optimized using relatively small training sets, and can then be applied to larger training sets. Comparing the MAE from a model with σ P = σ G = 0.1 (which in practice is equivalent zero overlap between different atomic types), using the optimal σ P = σ G lowers the MAE by ∼9.9% for the OQMD data set at 100 training samples, which increases up to ∼34% when 1k training samples are used. Prediction errors for the QM9 data set indicate similar behavior, yet much less pronounced. For the largest training set (1000 molecules), the optimization well becomes very shallow, consistent with the lack of compositional diversity in QM9.
Unsurprisingly, datasets with higher chemical diversity benefit more from using the optimized elemental widths. It may therefore not always be beneficial to include any elemental overlap, especially for datasets with low elemental diversity, as it is computationally more expensive to do so.

III. DATA SETS
We have used multiple datasets to benchmark outof-sample accuracy of energy predictions of our model. These datasets includes organic molecules, crystals, biomolecular dimers, water clusters, and main-group diatomics. Some of the datasets are high-quality, have already been published and are in widespread use. Additional low quality data sets have been generated, merely in order to accumulate additional evidence for the relative improvement of the new representation. Since test set predictions are always close to zero by construction, we exclusively report prediction errors as out-of-sample errors (averaged through cross-validation) with respect to reference validation numbers. All errors reported correspond to at least 10 cross-validation runs for each training set size.

A. Organic molecules: QM9
The QM9 dataset 32 corresponds to hybrid DFT 33 based structures and properties of 134k organic molecules with up to nine atoms (C, O, N, or F), not counting hydrogen. SMILES strings of these molecules correspond to a subset of the GDB-17 dataset 34 . The 3k organic molecules, which fail SMILES consistency tests 32 , were removed before use.
A random subset of 22k molecules was selected from QM9 for training and testing. 2k molecules were used for testing, and up to 20k for training.

B. Organic molecules: QM7b
Due to widespread use we also included the more established QM7b dataset 35 . QM7b was also derived from GDB 36 . It contains hybrid DFT (PBE0 37,38 ) structures and properties of ∼7k organic molecules with up to seven atoms (C, O, N, S or Cl), not counting H. We have drawn at random up to 5k molecules for training, and 2k for testing.

C. Biomolecular dimers: SSI
For intra-molecular and non-equilibrium interactions we used a subset of 2356 neutral dimers from recently published protein-sidechain-sidechain interaction (SSI) dataset Burns et al. 39 . The SSI dataset is a collection of dimers mimicking configurations of interacting aminoacid sidechains as observed in a set of 47 high-resolution protein crystal structures. The energies correspond to the DW-CCSD(T**)-F12 level of theory 40 .

D. Water cluster
We also include a dataset which we calculated for 4'000 snapshots drawn from a molecular dynamics trajectory of a water cluster consisting of 40 water molecules. For the molecular dynamics we used the NET-ensemble at 300K, the Tip3p potential 41 , as implemented in CHARMM C41a1 42 . The energies correspond to an optimized combination of basis-sets and DFT functional (PBEh-3c) 43 , as implemented in Orca 44 , obtained for each of the 4'000 geometries.

E. Solids: OQMD
We have used the Inorganic Crystal Structure Database 45,46 subset corresponding to the open quantum materials data base (OQMD) by Wolverton and coworkers 47,48 . This data-set has already been used to develop and benchmark random forest based QML model (Voronoi) Ward et al. 49 . The dataset consists of ∼30k crystal structures and formation energies, calculated using high-throughput DFT (GGA+U). We have used a random subset consisting of 3k structures with less than 40 atoms in the unit cell and formation energies lower than 5 eV/atom for training and testing. 1k crystals were used for testing, and up to 2k for training.

F. Solids: Elpasolites
We have also tested our representation for the Elpasolite crystal structure data set 19 . This data set consists of ∼ 10k Elpasolite structures and DFT (PBE 50 ) formation energies. The crystals correspond to quaternary main group elemental composition with all elements up to Bismuth (39 in total). We have used a random subset of 7k structures, with up to 6k and 1k for training and testing, respectively.

G. Maingroup diatomics
To test the predictive power for alchemical interpolation we have also included a set of previously published DFT (PBE 50 ) results for single, double, and triple bonds among main-group diatomics saturated with hydrogens 17 .

IV. RESULTS AND DISCUSSION
Using learning curves (always resulting in straight lines when recorded on log-log plots due to their inverse power law relationship ? ), we first present numerical results which indicate the predictive power of our QML model for atomization and formation energies in various data sets. When available for the same data set, we also compare to other QML models in the literature. Thereafter, the alchemical extrapolation capacity is demonstrated for predicting covalent bonds in molecules with elements that were not part of training. Finally, log-log plots of learning curves for nine electronic ground-state properties of organic molecules (QM9) are reported and discussed.
A. Energies of molecules, clusters, and solids Fig. 6 displays the performance overview for energy predictions on six different data sets (QM9, QM7b, SSI, water, elapsolites, OQMD). Mean absolute out-of-sample energy prediction errors are shown as a function of training set size. The results indicate remarkable performance for all data sets, indicating a well working QML model yielding systematic improvement with increasing training set size. The learning curves also indicate out-ofsample MAEs which are consistently lower, or similar, than previously published models in the literature. For QM9, the MAE reaches the highly coveted chemical accuracy threshold (1 kcal/mol or ∼ 0.043 eV for enthalpy of formation) with only 2k training points on the QM9 dataset. Previously published QML models had to include an order of magnitude more training molecules to reach such accuracy. This is similar to the amount of training molecules necessary when using the Coulomb matrix representation in conjunction with semi-empirical or DFT based baselines in order to estimate electron correlated energies, as demonstrated in 2015 with the ∆-ML model 51 .
For QM9, aSLATM 14 and SOAP multi kernel model 20,52 reach a performance nearly as good as our QML model. aSLATM, however, performs worse for the SSI and the Water cluster. The SOAP multi kernel QML model, however, performs an expansion in kernel function space acting on the distance for which all degrees of freedom have already been integrated out. As such it is, strictly speaking, not the same as as an improved representation, but rather an improved regressor. Note that single kernel based SOAP QML models perform significantly worse. The reader should take notice however that in the SOAP learning curve results presented in Fig 6, the ∼3k structures which had failed the SMILES consistency test, were included. As such, theses QML models are not exactly comparable, and the SOAP results are still likely to slightly improve if these faulty structures were to be removed. One should also note that the SOAP results shown for QM7b correspond to the multi-kernel SOAP kernel 20,53 .
The MAE of our new QML model is consistently the lowest for all data sets and large training sets. For the set of 4,000 non-equilibrium water clusters, there is a noticeable difference between the global (CM, BOB and SLATM) and the atomic representations (i.e., aSLATM and the new model we introduce in this work): The global models exhibit very little learning at first, only for larger N the learning curves begin to turn downward. The atomic models, however, our new representation based QML model as well as aSLATM, improve rapidly with increasing training data set size. We believe that sorting and crowding in the global representations makes it difficult to accurately account for the purely geometrical changes in structures that contribute to total energy variations.
Impressive predictive power is also observed for the OQMD dataset, a structurally and compositionally very diverse set of solids. Our new model has a lower out-ofsample MAE for all N when compared to the sine matrix representation on the OQMD dataset. The offset of the learning curve of our new model is larger compared to that of the Voroni-based random-forest model Ward et al. 49 . However, the learning rate of our QML model is significantly steeper, surpassing the Voronoi model already at just ∼ 250 training samples. Results for a solid state variant of the CM, designed for use in periodic systems, has also been included (SineMatrix) 54 . It has a similar slope as the Voronoi model, but a substantially larger off-set.
For the elpasolite data set, 19 , with large composition diversity but identical crystal structures, the learningcurve of the FLLA representation has a slightly higher off-set than our new QML model, yet exhibits a steeper learning curve. Our model converges towards the same slope for larger training set sizes. We can only speculate on the reasons for such behavior. The FLLA representation differs qualitatively from the other representations in this study: It does not include any explicit information about coordinates and only encodes periodic row and column of the elements which populate each crystal structure site. The QML model then learns to infer ground state energies without knowing the exact configuration. This leads to a very low dimensional model that is still unique for the system, which might be the cause of the lower slope. This however needs to be investigated more carefully before any conclusions can be drawn.

B. Alchemical predictions
Our new scaled many-body expansion explicitly accounts not only for distributions of interatomic distances and angles but also for elemental distributions in the periodic table. We have therefore studied its capability to predict covalent binding of molecules containing chemical elements which were not present in the molecules used for training. More specifically, we have investigated single, double, and triple bonds with one bonding atom coming from group (IV), i.e. C, Si, or Ge. In order to increase covalent bond order, we have varied the valency of their bonding partner as follows: For single bonds, group IV atoms are bound to halogens (group VII). For double bonds, group IV atoms atoms are bound to chalcogen atoms (group VI), and for triple bonds, group IV atoms are bound to group V atoms. Dangling valencies of group IV atoms have been saturated with hydrogen. Similar covalent bonding potentials have also recently been used in order to assess the predictive power of first and second order perturbation theory based alchemical predictions 17 .
In order to test the alchemical "extrapolation", we trained on the covalent bonds of all other compounds (16 curves) which did contain neither the group IV atom nor the corresponding bonding partner in question. The predictive power for the out-of-sample molecule, on display in Fig. 7, is impressive. Albeit not quantitative (chemical accuracy is not reached), the results are semiquantitative and certainly provide a physically very adequate picture of the covalent bonding in single, double, and triple bonds for main-group atoms in periods 2 to 4. The fact that predictions for the central elements H 2 SiS are more accurate (easier to interpolate) than others is consistent with this interpretation. We also note that the deviation is the worst for 2nd-row elements (due to lack of d-orbitals they differ substantially more from 3rd and 4th row than 3rd and 4th row differ from each other). Because of their poor performance we have not included other representations in this test.
These results clearly demonstrate that alchemical extrapolation is possible when interpolating elemental groups and periods in the periodic table through an appropriate representation. Since the representation is continuous in the corresponding compositional space, we also believe that indication is given that the calculation of alchemical derivatives is meaningful, similar in spirit to Ref. 55 .

C. Other ground state properties of molecules
Finally, we also investigated how well QML models based on our new representation, optimized for energies, performs for predicting other ground-state quantum properties part of the QM9 dataset. More specifically, we have included atomization energies, HOMO, LUMOeigenvalues as well as gap, dipole moment, polarizability, zero point vibrational energy, heat capacity, and the vibrational frequency of the highest lying fundamental (ω). Results are shown in Fig. 8, and provide overwhelming evidence that resulting models enable predictions systematically improving with training set size, no matter what property. For comparison, we have also included results for the aSLATM model. aSLATM results are typically worse when dealing with extensive properties, such as energies, polarizability, or heat-capacity. When dealing with intensive properties, such as eigenvalues or dipolemoment, aSLATM is on par or even slightly better than our model, with the exception of ω. ω corresponds to frequency associated to the vibrational stretch of CH, NH, or OH bonds, a property with hardly any variance at all. Previously we have seen that this property is best predicted by a random forest model which has poor performance for all other properties 3 . The interpretation is that predicting this property is much more a classification problem, then a supervised learning task. Fig. 8 also includes learning curves for the root mean squared error, indicating the slightly higher offset than the mean absolute error, to be expected, and systematic improvement with training set size with similar slopes as the mean absolute error. This is an assuring result, indicating once again, that also predictions for outliers improve as training set size is increased 51 .
Furthermore, for Fig. 8 we have also distinguished between two and three body contributions (as well as fourbody for BAML). For all properties but for ω the trend meets the expectation, as also already confirmed previously for BAML 9 : Addition of the higher order term systematically lowers the learning curves by a significant amount. CONCLUSION We have introduced a universal representation of an atom in a chemical compound for use in QML models. An atom is represented by a sum of multidimensional Gaussians, each term corresponding to elemental, atom-pairwise, and angular distributions and scaled by respective power laws. For the compounds and properties studied we have found four-body contributions to be insignificant. System-independent hyperparameters, such as exponents in scaling functions and Gaussian widths have been optimized using the out-of-sample prediction error for the energy as a penalty. Analytical expressions have been derived for corresponding distances between arbitrary chemical compounds. These distances can directly be used within kernel ridge regression based QML models of electronic ground state properties. For energies of organic molecules, water clusters, amino-acid side chains, and crystalline solids the resulting QML models lead to learning curves with very low off-set and steep learning rate. For compositionally diverse systems chemical accuracy (∼1 kcal/mol) can now be reached using only thousands of training instances. We have also studied the effect of explicitly accounting for inter-elemental distances in the periodic table: Our new QML model can produce semi-qualitatively accurate covalent bonding potentials for single, double, and triple bonds which include chemical element-pairs which were not part of training. For various electronic ground state properties of organic molecules, numerical results indicate that our new QML model has remarkable predictive power. While the reference data used in this study has mostly been obtained at the hybrid DFT level of theory, the steep learning curves of our QML models suggest that it has now become a realistic possibility to obtain a sufficiently large training set at post-Hartree-Fock level of theory (or from experiment), and to use it for the training of QML models which enable subsequent highthroughput screening efforts with similar accuracy.
Combining our new representation with the recently proposed amon approach will provide the possibility to obtain highly accurate QML models which are scale invariant, i.e. they can be applied to systems containing arbitrarily many atoms 14 . Future work will deal with forces and other properties.