Machine Learning of Free Energies in Chemical Compound Space Using Ensemble Representations: Reaching Experimental Uncertainty for Solvation

Free energies govern the behavior of soft and liquid matter, and improving their predictions could have a large impact on the development of drugs, electrolytes and homogeneous catalysts. Unfortunately, it is challenging to devise an accurate description of effects governing solvation such as hydrogen-bonding, van der Waals interactions, or entropy and conformational sampling. We present a Free energy Machine Learning (FML) model applicable throughout chemical compound space and based on a representation that employs computationally efficient Boltzmann averages for an approximated sampling of configurational space. Using the FreeSolv database, FML's out-of-sample prediction errors of experimental hydration free energies decay systematically with training set size, and experimental uncertainty (0.6 kcal/mol) is reached after training on 80\% (490 molecules). Corresponding FML model errors are on par or better than state-of-the art, physics based, legacy approaches. To generate the input representation for a new query compound, the FML requires approximate and short molecular dynamics runs. We showcase the usefulness of FML through analysis of predicted solvation free energies for 116k organic molecules (all force-field compatible molecules in QM9 database) identifying the most and least solvated systems, and rediscovering quasi-linear structure property relationships in terms of hydrogen-bond donors, number of NH or OH groups, number of oxygen atoms in hydrocarbons, and number of heavy atoms. FML's accuracy is maximal when the temperature used for the molecular dynamics simulation to generate averaged input representation samples in training is the same as for the query compounds. The sampling time for the representation converges with respect to the model's prediction error for both, training and testing.


I. INTRODUCTION
An accurate description of solvation free energy is fundamentally important to rationalizing reaction kinetics and product propensities. Therefore accurate models describing solvation have far reaching utility from drug design to battery development. Computational methods for predicting solvation free energies based on ab initio methods 1-4 , while accurate in principle, impose a substantial computational burden, and are therefore inherently limited when it comes to navigating chemical compound space (CCS). Conversely, more readily available methods based on parametrized forcefields (FFs), implicit solvent models (PCM 5,6 , GBSA 7 , SMD 8 , COSMO 9,10 , COSMO-RS 11 ) or hybrid models (3D-RISM [12][13][14] trade reduced computational expense for lower accuracy w.r.t. experiment. In particular, continuum solvation models exhibit several disadvantages including lack of locality in differentiating atomic environments 15,16 , poor modeling of hydrogen bonding, inaccurate estimates of entropy contributions 17 , as well as poor decoupling between short-range and long-range effects. Still in particular alchemical FF based ap-proaches have become a routine method for free energy calculations 18 . The recent success of machine learning (ML) in the domain of theoretical and computational chemistry due to unprecedented availability of calculated single-point geometry quantum data, has been manifested for challenging molecular problems, such as accurate prediction of molecular electronic properties such as atomization energies 19,20 , application to elpasolites 21 , carbenes 22 , excited states 23 or fragment based learning with AMONS 24 .
Starting with the work of Behler and Parrinello on high-dimensional neural network potentials 25,26 there has been growing interest in applying ML to MD simulations 27,28 e.g.
ML based free energy models, on the other hand, are much less established, and potential applications to explore CCS in terms of thermodynamic properties have largely remained unexplored except for some very recent publications [41][42][43][44] . Here, we introduce a new ML model capable of predicting ensemble averages, such as free energies of solvation ∆G sol . In particular, our free energy ML (FML) model is designed to deliver both, computational efficiency as well as prediction errors which systematically improve with training set size, thereby being able to reach experimental uncertainty levels. The FML model fills, to the best of our knowledge, an important gap in the field of ML for atomistic simulation by explicitly accounting for an ensemble of molecular conformations through Boltzmann averaged representations [45][46][47] , rather than through fixed geometry based representations. FML thus avoids the pitfall of neglecting the variance of ensemble properties when basing predictions on fixed geometries only.
Our paper is structured as follows: We begin by detailing the FML workflow in sec. II A with emphasis on the ensemble based representation. Next we present the results in sec. IV starting with a numerical demonstration of the necessity of the ensemble representation before assessing the accuracy of FML. Finally, we demonstrate the feasibility of the method for high throughput free energy predictions of 116k organic molecules (a subset of QM9 48 ) revealing trends between molecular structure and solubility.

II. THEORY
We propose a representation based on an ensemble of conformers generated through MD sampling. This gives rise to a unique and temperature dependent representation of the system state. The resulting machine learning framework FML constitutes a physics based approach, since this representation is rooted in statistical mechanics.
Furthermore a comparison with state-of-the art solvation models reveals that FML retains the promise of being faster, more transferable and extendable than solvation methods based on conventional fitting of model parameters. While we focus on free energies of solvation we note that the same methodology might open new pathways for ML applications to other ensemble properties such as protein binding or enthalpy and entropy.

A. Kernel Rigde Regression
We use kernel ridge regression 47 (KRR) a supervised ML method 49 , a support vector machine which can also be derived from Gaussian process regression and which non-linearly maps input into a high-dimensional feature space which renders the regression problem linear. The similarity between compounds i and j with representations X i , X j is measured by applying a Gaussian kernel function K, where σ is the kernel-width hyperparameter. The prediction of property p of query compound q is given by, where K(X train. i , X q ) are kernel functions, evaluated on the query and all training compounds with weight coefficients α. The unique and numerically exact solution vector for the optimal set of regression coefficients α is given by, with the vector p containing all values of the target property in the training set and regularization parameter λ.

B. Ensemble Based Representation
While quantum machine learning (QML) is commonly used as a surrogate model for approximate solutions to the electronic Schrödinger equation, i.e. they are associated with exactly one fixed configuration of atoms in a compound (single point), by contrast the free energy is a property of an ensemble of possible configurational states.
Hence, the intriguing question is how to define a physics based representation of a compound for an ensemble property, such as the free energy of solvation. Here, we have used averages of FCHL19 45,46 , a geometry dependent many body representation that includes two and three-body terms where the first term accounts for interatomic distances and the second term for relative orientations of triplets of atoms. In this spirit, representations for thermodynamic properties should be designed similarly, taking into account the ensemble of accessible configurations at a given temperature. We used a thermodynamic ensemble average allowing for a unique definition of a representation given a set of configurational snapshots obtained by short MD simulations at temperature T (s. sec. III). First the FCHL19 representation X({r i }) is computed for all snapshots i before calculating the ensemble average by numerical integration as follows, with N s uncorrelated gas-phase solute samples weighted by the respective Boltzmann factor e −βEi with β = 1

FML -Prediction
Ensemble Repr. and Z is the partition function. Note that the average representation X (T ) depends on T since both the integration domain Γ(T ) over phase space and the Boltzmann factors depend on T .
This representation is similar to what was recently presented in 50 , however, here we rather use experimental free energies for training because of the inherent accuracy deficiencies of implicit solvation models 17 employed in that work. In addition there is also a number of QSAR based approaches for representing conformers [51][52][53] .
Most specifically, we are referring to KRR using a given vacuum geometry as QML and to free energy machine learning using the ensemble averaged FCHL19 representation as FML. Note that unlike most applications of QML we train on experimental data rather than on calculated results from solving quantum mechanical solutions. The methodology of FML is fully transferable to free energies from any level of theory and atomistic and ab initio in the sense that solely atomic configurations are required as an input without need for molecular connectivities. The generation of the snapshots through use of molecular dynamics can be done using ab initio molecular dynamics, or force-field based molecular dynamics just as well. As illustrated in Fig. 1 the prediction of an ensemble property for a query compound q at temperature T depends on the phase spaces {Γ i } of all training compounds i. From this point of view FML infers predictions by combining the information of all the phase spaces by virtue of the average representation X , i.e. as an integral over the configurationally sampled space. In Fig. 2 we show the practical steps for training a FML model for free energies of solvation. First, a set of free energies is calculated, or experimental values are collected from the literature. Subsequently, and for all solute configurations the average over all representations of the trajectories is computed (s. Eq. 4). Finally, the regression weights α are determined (s. Eq. 3).

C. Uniqueness
To illustrate the importance of an ensemble based approach we use QML to predict a hypothetical value of ∆G sol for all possible fixed conformers of several stereoisomers. We emphasize that for an ML model to be consistent the predictions for free energies of solvation of a set of conformers should differ less than the experimental uncertainty. However, in the following we will show that QML based on a single molecular conformation can indeed lead to inconsistent predictions. For training the QML model we use experimental free energies and representation vectors X vac based on vacuum geometries provided by the FreeSolv 54 database.
We randomly select four isomers of C 7 H 10 O 2 and use their complete set of conformers resulting from systematic scanning of all dihedral angles of the isomers 55 to compare the FML and QML predictions. The distributions of the corresponding QML predictions of the free energy are shown in Fig. 3. They reveal that for different configurations of the same isomer I, the predicted free energy ∆G sol can vary by several kcal mol −1 , well above the experimental uncertainty. Since hundreds of conformational isomers may exist for any given medium sized constitutional isomer (the number of conformational isomers shown in Fig. 3 is N (I 1 ), . . . , N (I 4 ) = 976, 661, 761, 13 ) it should be obvious that using a single geometry as the representation may lead to considerable prediction errors. Interestingly, we find (s. sec. IV) that due to an error cancellation QML based on vacuum geometries with weight coefficients α vac can reach MAEs comparable with FML (s. sec. IV A). The reason is that the average of the QML predictions over a set of conformers, denoted by . c is close to the FML prediction, using the average representation over all conformers X c c . This can be seen by inspecting Fig. 3, where the FML prediction of the average representation of all conformers is shown as a dashed line and the average of the individual conformer QML free energies as a solid line. The FML weight coefficients α FML are obtained using the representation average X j (s. Eq. 4) over N train s = 300 MD samples (s. Fig. 2).

D. ∆-Machine Learning
Using ∆-ML 56 we promote solvation model (sm) free energies ∆G sm sol to experimental (exp) uncertainty as follows, using the ensemble representation X as defined in Eq. 4. Note that the FML correction is different for each compound accounting for too negative as well as too positive baseline free energies. The step for promoting free energies to a predicted experimental value ∆G exp sol (FML) is given by, Besides these modifications the workflow remains the same as in Fig. 2.

A. Molecular Dynamics
All MD simulations are performed with OpenMM 57 in vacuum and the NVT ensemble using a Langevin integrator with a friction coefficient of 1 ps −1 and the small molecule FF GAFF2 58,59 with a time-step of ∆t = 2 fs and total simulation time of 2 ns using SHAKE 60 . Partial charges are computed with antechamber 58,59 at AM1-BCC 61 level. An exception are the implicit solvent GBSA 62,63 simulations where we use AMBER 58 /GAFF 59 while all other simulation parameters are the same. MD samples were selected with 2 ps separation which is well beyond the correlation time of ∼0.5 ps.

B. Machine Learning
To optimize the hyper-parameters σ and λ (s. Eqns. 1, 3), we perform 10-fold cross validation on the training set only and validate the performance on the test set using the QML package 64 . The training set contains all FreeSolv 54 molecules except for those 146 test set molecules which are also part of the QM9 database 48 .
Histograms showing the free energy distribution and the size of all molecules in the FreeSolv database are shown in the SI. For numerous FreeSolv compounds no value for the experimental uncertainty is reported, but a default value of 0.6 kcal mol −1 corresponding to thermal energy fluctuations at 298.15 K is assumed which we use as our target accuracy, albeit the total average absolute error of all experimental values is slightly smaller (0.55 kcal mol −1 ).

A. Learning Curves and Free Energies of 116k Molecules
The performance of several ML models is assessed using learning curves i.e. the prediction error is reported as a function of training set size N (s. Fig. 4). Most notably we assess FML, the ensemble average representation (s. Eq. 4) and ∆-ML (s. Eq. 7) for predicting experimental free energies of the FreeSolv 54 database.
In addition we also test the RDKit 65 implementation of the extended connectivity fingerprint 66 (ECFP4) commonly used in cheminformatics as well as a custom QSAR based representation, named CQ (s. SI). To compare KRR with conventional fitting methods we also report the MAEs of a multilinear regression model with CQ. The error bars and colored areas in Fig. 4 show the standard deviation of the MAEs resulting from 10-fold cross validation over the FreeSolv database. We find that feature based models ECFP4 and CQ (s. Fig. 4a) perform worst reaching an MAE of only 0.8 kcal mol −1 for the maximal training set size of N max = 550. This might be due to the fact that feature based representations do not properly weigh the ensemble of molecular conformations which can have a large influence on predictions of ∆G sol (s. sec. II A).
QML on the other hand reaches the target accuracy, the experimentally relevant accuracy of 0.6 kcal mol −1 for about N max = 550 training molecules. The best FML model trained with 10 random MD samples per molecule with T = 350 K essentially results in the same MAE. While both models reach the experimentally relevant accuracy, we find that FML has a slightly smaller offset hitting 1 kcal mol −1 for N ≈ 50 while the QML model trained on the vacuum geometries has an MAE of about 1.5 kcal mol −1 . The otherwise similar performance of both models may to some extent be attributed to the error compensation, as discussed above.
The GBSA values have been computed using AMBER 58,59 (s. sec. III), the TI values are from the FreeSolv 54 database, the values for the two implicit solvation methods 3D-RISM 12,13 and SMD 8 are from the SI of 14 . We find that ∆-ML can lower the offset of the learning curves and thus may be useful if experimental data is scarce. FML requires about 100 training molecules to reach 0.8 kcal mol −1 whereas ∆-ML using TI or 3D-RISM only needs around 50. Note that the SMD baseline model has the lowest offset, since SMD has the smallest MAE of 0.93 kcal mol −1 versus an MAE of 1.12 kcal mol −1 for TI on the FreeSolv database. However, after inclusion of ∼150 compounds in training both approaches, direct FML and ∆-ML, perform similarly and converge to target accuracy.
Similar observations were also reported in Ref. 42 , but we find that the target accuracy can be reached using FML without additional solvation calculations for ∆-ML. We note that some caution is required for this comparison since the accuracy of ∆-ML relies heavily on systematic correlation between baseline and target.

B. Comparison to other models
A comparison of various solvation models with FML is shown in terms of a scatter plot (s. Fig. 5) displaying predictions versus experiment. The training set consists of all molecules in FreeSolv but not in QM9, the overlap (146 molecules) is used as a test set. The MAEs for FML and the tested solvation models are listed in Tab. I. We find that the best FML model with N train s = 10 and T train = 350 K results in an MAE of 0.57 kcal mol −1 reaching the experimentally relevant accuracy of solvation energy measurements and performing slightly better than SMD with 0.61 kcal mol −1 . Note that FML has some significant outliers such as methane with an error of about 2 kcal mol −1 . The largest outlier however is 2-hydroxybenzaldehyde with an error of 3.6 kcal mol −1 . It is worth emphasizing that SMD uses at least 79 of the test set molecules for parameterization which were not included in training of the FML model. Still FML has a slightly smaller MAE despite using fewer training points compared to SMD. Furthermore we find that FML clearly outperforms TI and 3D-RISM reaching an MAE of 1 kcal mol −1 after about N = 100 training molecules. We emphasize that a typical SMD (M06-2X 68 /Def2-TZVPP 69 ) calculation of a molecule in the FreeSolv database with Gaussian09 70 is computationally demanding while a force-field based MD simulation in vacuum to generate the averaged representation can be run on a single modern CPU in less than 10 minutes (s. Tab. I for timing benchmarks).
In addition FML clearly outperforms COSMO 9,10,71 with B-P86 72,73 /TZVP 74 resulting in an MAE of 1.94 kcal mol −1 with higher computational costs. For comparison with COSMO-RS as implemented in COSMOtherm 75 we refer to a benchmark test 11,76 with a set of 274 molecules resulting in an MAE of 0.52 kcal mol −1 reaching the accuracy of FML but at higher computational cost (s. Tab. I) and as for all other DFT methods with a worse scaling with the number of atoms compared to FFs.

C. Predicted solvation for 116k organic molecules
We use FML to calculate free energies of a large dataset of organic molecules, a subset of CCS, in order to identify trends between solubility and structure. More specifically, 116k molecules of the QM9 dataset   have been considered in order to predict the free energy distribution in Fig. 6a. For comparison we also show the corresponding distribution of ∆G sol for the Free-Solv database in the SI. The FML model was trained on the complete FreeSolv database with coincident QM9 molecules removed from training. We find that the free energy distribution is approximately Gaussian with a mean value of ∆G sol = −7.56 kcal mol −1 spanning a range of 25.1 kcal mol −1 . The molecules with the most negative and positive ∆G sol are QM9 compound with index 26712 with −21.63 kcal mol −1 and QM9 molecule with index 118570 with 3.47 kcal mol −1 respectively (s. Fig. 6b,c). The top 50 most soluble and least soluble molecules according to FML are also shown in the SI. We find that the most soluble molecules have a planar ring structure. On the other hand the 4907 hydrocarbons of QM9 occupy the right tail of the distribution. Molecules with many hetero atoms bonded in NH/OH groups in a planar ring structure tend to have a very negative ∆G sol while aliphatic linear molecules tend to be less soluble. This can be explained by the large dif-ference in electronegativity between the H and the N, O atoms leading to polar bonds eventually resulting in Hbonds, lowering the enthalpy of solvation. Alkanes on the other hand have a negligible polarity, and thus only interact with water though much weaker van der Waals interaction resulting in low solubility.
These simple rules for the solubility lead to trends for the average free energy ∆G sol such as approximately linear relations with the number of H-bond donors N N d or the number of OH and NH groups N NH/OH as shown in Fig. 6de. Furthermore ∆G sol decreases with the number of oxygen atoms for molecules with stoichiometry C nC H nH O nO and with the number of heavy atoms N h (s. Fig. 6fg). The former effect is due to enthalpic Hbond contributions from hydroxyl groups and the latter is due to weak van der Waals interactions which roughly scale with the molecule size. Such linear free-energy relationships (LFER) are well known and used for predictive models e.g. for solvation 79,80 or organic chemistry 81 .   Fig. 7b). We find that FML models usually perform best if T train ≈ T test , e.g. model FML(50 K) performs best for T train = T test = 50 K and the MAE increases for higher temperatures T test . On the other hand, FML(450 K) performs much worse at low temperature than at high temperature. This indicates, as suggested by Eq. 4, that the average representation indeed results in models specific for the temperature and phase space Γ(T ) spanned by the training MD samples.

V. CONCLUSION
We have studied the role of the representation for ML models of ensemble properties. As one would expect, numerical results confirm that also the representation should be rooted in statistical mechanics, since ML based on a single molecular geometry can lead to large and spurious prediction errors, as discussed for Fig. 3. The definition of a representation based on Boltzmann weighted averages does not only resolve this issue but also natu-  rally introduces temperature dependence.
In similar manner FML models could be constructed that depend on pressure, or chemical potential, to account for increasingly more realistic canonical and grandcanonical ensembles. The numerical performance of FML is encouraging: FML reaches experimental uncertainty for relatively small training sets and at low computational cost for new query estimates. As such, it is better or on par with state of the art models in the field, and it emerges as a viable alternative whenever sufficient training data is available. Furthermore, we find that ∆-ML can improve the predictions for small training set sizes, however FML can reach experimental relevant accuracy without requiring additional solvation model calculations. Furthermore we stress that it is straightforward to improve the accuracy and transferability of FML by adding more experimental data points. This is not necessarily true for other more conventional solvation models. To demonstrate the usefulness of a transferable FML model, we have predicted solvation energies for 116k organic molecules of the QM9 database (see Fig. 6). The results confirm known trends namely that molecules with a high solubility tend to have many hetero atoms and are arranged in planar ring structures while linear aliphatic molecules tend to have a lower solubility.
We have therefore demonstrated that FML can be used to study solvation throughout CCS, and that it might prove useful to identify additional non-trivial structureproperty relationships.
ACKNOWLEDGEMENT O.A.v.L. acknowledges support from the Swiss National Science foundation (407540 167186 NFP 75 Big Data) and from the European Research Council (ERC-CoG grant QML and H2020 projects BIG-MAP and TREX). This project has received funding from the European Union's Horizon 2020 research and innovation programme under Grant Agreements #952165 and #957189. This result only reflects the author's view and the EU is not responsible for any use that may be made of the information it contains. This work was partly supported by the NCCR MARVEL, funded by the Swiss National Science Foundation.