Towards the design of chemical reactions: Machine learning barriers of competing mechanisms in reactant space

While sophisticated numerical methods for studying equilibrium states have well advanced, quantitative predictions of kinetic behaviour remain challenging. We introduce a reactant-to-barrier (R2B) machine learning model that rapidly and accurately infers activation energies and transition state geometries throughout chemical compound space. R2B enjoys improving accuracy as training sets grow, and requires as input solely molecular graph information of the reactant. We provide numerical evidence for the applicability of R2B for two competing text-book reactions relevant to organic synthesis, E2 and SN2, trained and tested on chemically diverse quantum data from literature. After training on 1k to 1.8k examples, R2B predicts activation energies on average within less than 2.5 kcal/mol with respect to Coupled-Cluster Singles Doubles (CCSD) reference within milliseconds. Principal component analysis of kernel matrices reveals the hierarchy of the multiple scales underpinning reactivity in chemical space: Nucleophiles and leaving groups, substituents, and pairwise substituent combinations correspond to systematic lowering of eigenvalues. Analysis of R2B based predictions of ~11.5k E2 and SN2 barriers in gas-phase for previously undocumented reactants indicates that on average E2 is favored in 75% of all cases and that SN2 becomes likely for nucleophile/leaving group corresponding to chlorine, and for substituents consisting of hydrogen or electron-withdrawing groups. Experimental reaction design from first principles is enabled thanks to R2B, which is demonstrated by the construction of decision trees. Numerical R2B based results for interatomic distances and angles of reactant and transition state geometries suggest that Hammond's postulate is applicable to SN2, but not to E2.


I. INTRODUCTION
To accelerate robotic experimental materials synthesis, design, and discovery 2,3 a reliable operating system is necessary which can deploy robust virtual models of alternative chemical reaction channels. Rapid yet accurate predictions of the kinetic control of reaction outcomes for given reactants and competing reaction channels, however, are still an unsolved problem. Considerable efforts in quantum chemistry were already directed at the development of automated transition state (TS) searches and chemical reaction paths. However, calculation of the relevant parts of potential energy surfaces remains a difficult challenge under active research 4 . To this end, many TS search algorithms have been introduced which can be grouped into single or double ended methods 5,6 . An example of the former is the single-ended growing string method 7 , which uses only the reactant as starting point and then searches minimum energy paths and transition states. Double-ended methods such as nudged elastic band 8,9 or the two-sided growing string method 10 employ both reactant and product geometries, to obtain a TS geometry. While successful, both approaches are computationally demanding, and in practice often limited to small systems with mostly single step reactions 11 . Recent advances in synthesis planning and modern machine learning techniques hold the promise for dramatic acceleration of such numerical challenges 12,13 . Already several artificial neural networks to predict reaction outcomes were introduced (see 14 for a recent review), including work based on molecular orbital interactions of reactive sites 15 , molecular fingerprints (template based) 16 , reaction site identifiers (template free) 17,18 , scoring functions in search trees 19 , sequence to sequence maps 20 , and multiple fingerprint features 21 . However, all these machine learning models rely on experimental records, meaning that they are agnostic of the underlying kinetics which are known to be crucial for reliably predicting reaction outcomes. Neglecting the energetics of chemical reactivity can be problematic, however, due to the reaction rate's exponential dependency on the activation energy (cf. Arrhenius equation).
To use machine learning to go beyond experimental data records and towards more reliable virtual predictions of reaction outcomes for new chemistries, reaction conditions, catalysts, or solvents, access to substantial and systematic relevant training data of fundamental energetics, e.g. encoding kinetic or thermodynamic effects, is required 22 . Very recent first steps in the direction of quantum machine learning applied to reactivity included the prediction of H 2 activation barriers of Vaska's complexes 23 , the effect of nucleophilic aromatic substitution to reaction barriers 24 , the temperature dependency of coupled reaction rates 25 , or the prediction of enantioselectivity in organocatalysts 26 .
In this work, we demonstrate how the reactant-tobarrier (R2B) model effectively unifies the two directions (yield vs. energy) in order to deliver robust predictions of reaction outcomes of competing mechanisms. We show how R2B can be used to predict and discriminate compet-  (1). In gas phase the energy of the transition state often lies lower than the energy of the reactants at infinite separation 1 . Bottom row: Product geometries at infinite separation (6 and 7) and reactant complexes (2 and 3). Properties of interest for this work are activation energies E E a and E S a , reactants, reactant complexes, and transition states. ing reaction channels among two of the most famous text book reactions in chemistry, S N 2 vs. E2 27 (See Fig. 1) using a quantum data set from the literature encoding thousands of transition states obtained from high-level quantum chemistry 28 . Using our R2B model, we complete the data set for undocumented combinations for which transition state optimizers did not converge. We also demonstrate how decision trees based on R2B give actionable suggestions for experiments on how to control which reaction channel dominates, and thus the reaction outcome. On the synthetic chemistry side, an analysis of the predicted activation energies, as well as transition state and reactant complex geometries based on our models suggests that Hammond's postulate is not applicable for E2.

A. Kernel Ridge Regression
Ridge regression belongs to the family of supervised learning methods where the input space is mapped to a feature space within which fitting is performed. The transformation to the feature space is unknown a priori and computationally expensive. To circumvent this problem, the "kernel trick" 29 is applied where the inner product x i , x j of the representations of the two compounds i and j are replaced by the so-called kernel func-tion k(x i , x j ). This results in kernel ridge regression (KRR). A kernel is a measurement of similarity between two input vectors x i and x j . In this work, we used the Gaussian kernel: with the length scale hyperparameter σ and representation x. Using the representation of a molecule as input space, KRR learns a mapping function to a property y est q (x q ), given a training set of N reference pairs (x i , y i ).
The property y est q (x q ) can be expanded in a kernel-basis set series centered on all the N training instances i, where {α i } is the set of regression coefficients which can be obtained as follows: with the regularization strength λ, the identity matrix I and the kernel matrix K with kernel elements k(x i , x j ) for all training compounds. The kernel (K) within a representation stays the same for both reactions and the difference in the R2B models (α) enters in the change of the label (y) 30 .
BoB uses the nuclear Coulomb repulsion terms from the Coulomb matrix representation (CM 36 ), and groups them into different bins (so-called bags) for all the different elemental atom pair combinations. SLATM 37 uses London dispersion contributions as two body term (rather than coulomb repulsion) and Axilrod-Teller-Muto potential as three body term. While the FCHL18 parameterization accounts for one-body effects in terms of the position of the element in the periodic table (group and period) 38 , FCHL19 limits itself to two-and three-body terms for the sake of computational efficiency 35 . Its twobody terms contain interatomic distances R scaled by R −4 , and the three-body terms account for the angular information among all atom triples scaled by R −2 .
All three geometry-based representations have been tested extensively on close-to-equilibrium structures. Since reactive processes, by definition, deal with out of equilibrium structures, we have also included a simple geometry free representation, namely one-hot encoding. This representation has also been used to encode amino acids in peptides for artificial neural networks 39,40 . In one-hot encoding, the representation is a vector of zeros and ones (i.e. a bit vector), where only one entry is non zero per feature. To describe the molecules, we used a bit vector for every substitution site (Ri ∈ {1, 2, 3, 4}, and one for the nucleophiles (Y) and the leaving group (X), respectively. This results in a combined vector containing 6 bit vectors of total length of 27 bits.

C. Training & Testing: Learning curves
To train our R2B models, the data set was split into a training set and a test set to optimize the hyperparameters and evaluate the model, respectively. To get the optimal hyperparameters, we used k-fold cross validation 29 . We divide the training data into k folds and for each fold, we trained on all but one fold which was used for evaluating the model. This procedure was done in an iterative fashion over all the folds. We then calculated the averaged error over these folds. This was done for different combinations of hyperparameters σ and λ.
The input for all the geometry based R2B models was the reactants at infinite separation ( Figure 1 compound 1). For each reaction, different reactant conformers (yielding different reactant complexes, Figure 1 compound 2 and 3) have been reported in the data set 28 .
To obtain a uniquely defined problem for the ML models, we canonicalized the reactant complexes by always choosing the lowest-lying one from the source data base. Using compound 1 the kernel for both reaction channels is the same (K tot ), which contains 2 kernels: one for the molecule (M and M' ) and one for the attacking group (Y and Y' ) as shown in equation 4. Therefore, for both reactions, the same kernel can be used, and the difference in the training enters by the activation energy (y) in equation 3.
Since one-hot encoding does not depend on the geometry, the kernel can be calculated directly for the entire system. In order to measure the accuracy of our R2B models, we picked the best set of hyperparameters and trained the model using different training set sizes N and plotted the mean absolute errors (MAE) vs. N , resulting in learning curves. Using learning curves allowed us to see the learning behavior of our R2B models and compare different representations. The error of a consistently improving ML model should decrease linearly for increasing training set sizes N 41 : where a is the offset (an indicator of how well the selected basis functions fit reality) and b the slope of the learning curve which describes the speed of which the accuracy increases using larger training set sizes. HOT stands for higher order terms which were neglected in this work, as commonly done.

D. Data & Scripts
The data extracted from QMrxn20 28 are available on github 42 . The scripts used to optimize the hyperparameters and to generate the learning curves are also available in the same git repository.

A. Learning Barriers
Conventionally, the first principles based prediction of activation energies requires the use of sophisticated search-algorithms which iteratively converge towards relevant transition state geometries which satisfy the potential energy saddle-point criterion 8,10,55 . The activation energy is then obtained as the energy differences between reactant and transition state geometry. By contrast, our R2B models solely rely on reactant information as input. We trained them using aforementioned geometry based representations BoB 31 , SLATM 37 , FCHL19 35 , as well as one-hot-encoding, to predict activation energies solely based on reactants at infinite separation as input geometries (compound 1 in Figure 1). Resulting learning curves in Figure 2 indicate systematically improving activation energy predictions with increasing training set size N for E2 and S N 2. For both mechanisms, the most data-efficient R2B models (one-hot-encoding) reach prediction errors of 3 kcal/mol with respect to CCSD reference, i.e. on par with the deviation of MP2 from CCSD, already for less than 300 training instances. For 2'000 training instances, the prediction error approaches would 2 kcal/mol. Moreover, the lack of convergence suggests that chemical accuracy (1 kcal/mol) could be reached if several thousand training data points had been available. Insets in Figure 2 show true (E ref a ) vs. predicted (E est a ) activation barriers for both reactions. Barriers in the range of zero to fifty kcal/mol are predicted with decent correlation coefficients (0.89 and 0.94 for E2 and S N 2, respectively). In short, after training on reference activation energies obtained for explicit transition state geometries (taken from QMrxn20 data set 28 ), the learning curves in Figure 2 amount to overwhelming evidence that it is possible to circumvent the necessity for explicit transition state structural search when predicting activation energies for out-of-sample reactants.
The trends among learning curves in Figure 2, are consistent with literature results for equilibrium structures: The accuracy improves when going from BoB to SLATM and FCHL19 for a given training set size 56 . Most surprisingly, however, all R2B models based on geometry dependent representations are less accurate than one-hot encoding. While still unique (a necessary requirement for functional R2B models 57,58 ) one-hot encoding is devoid of any structural information, and its outstanding performance is therefore in direct conflict with the commonly made conclusion that a physics inspired functional form of the representation is crucial for the performance of R2B models 56,59,60 . Relying only on the period and group information in the periodic table to encode composition, other geometry-free representations have also been applied successfully to the study of elpasolite 61 , or perovskite 62 crystal structures. Here, by contrast, onehot encoding provides the compositional information for a fixed scaffold.
One can speculate about the reasons for the surprising relative performance of one-hot encoding. Due to its inherent lack of resolution which prohibits the distinction between reactant and transition state geometry it could be that one-hot encoding represents a more efficient basis which effectively maps onto a lower dimensionality with superior learning performance. In particular, the inductive effect (practically independent of specific geometric details) is known to dominate barrier heights for the types of reactions under consideration 63 , and it is explicitly accounted for through one-hot encoding without imposing the necessity to differentiate it from the configurational degrees of freedom. To get an idea of the inner workings of the one-hot encoding model, we performed a principal component analysis (PCA) of the kernel matrix of the predictions which can go either way, i.e. E2 or S N 2. For this subset it is the difference in activation energy which will determine the kinetically stabilized product. Color coding the first two components by the difference in reference activation barrier labels for the two reactions results in the graphic featured in Fig. 3. Confidence ellipsoids of the covariance using Pearson correlation coefficients encode intuitive clusters corresponding to leaving-group/nucleophile combinations, and suggest that substituents have less significant effect on trends in activation energies. However, the eigenvalue spectrum of the PCA in

B. New barrier estimates
Using one-hot encoding (leading to the most performing model) we have trained two models, corresponding to the 1'286 and 2'361 activation energies of E2 and S N 2 transition state geometries, respectively. Subsequently, these two models were used to predict 11'353 E2 and S N 2 activation barriers for which conventional transition state search methods had failed within the protocol leading up to the training data set 28 . A summary of the difference in these predicted activation barriers is presented in Figure 4, where the x-axis corresponds to the nucleophiles Y, the y-axis to the leaving groups X. For every combination of X and Y, there are 5·5 squares for the functional groups at position R1 and R2. Within these, there are again 5·5 squares belonging to R3 and R4. Each of the squares represents one reaction for a given combination of R1-4, X, and Y. Simple heuristic reactivity rules emerge from inspection of these results: If the nucleophile and the leaving group are Cl, the preferred reaction is S N 2. If the nucleophile and the leaving group are F, the preferred reaction is E2. The functional groups at positions R1 and R2 favour the E2 due to their electron donating properties which disfavour a nucleophilic back side attack in the S N 2 reaction. A comprehensive overview is shown in Fig. 4. The same rules can be observed in Figure 5 which shows the distribution of the differences in activation barrier (∆E a ) of the training, predicted and total data set. The molecules of the extreme cases, largest difference in activation energies, are shown for both reactions, E2 (left) and S N 2 (right). Figure 5 shows a favourization of the E2 reaction of a rate of roughly 75%. These results have to be taken with caution, since this shift in E2 can also have occurred due to the composition of the molecules in the training set, as well as the choice of small functional groups that minimizes steric effects. A more detailed discussion of the training, the data set completion with the R2B model, and trends can be found in the SI.

C. Design rule extraction
So far, most studies based on artificial neural networks aimed at predicting chemical reactions using experimental data do not account for the kinetics of reactions. It is well known, however, that activation barriers are crucial for chemical synthesis and retrosynthesis planning. This is exemplified by a decision tree for the competing reactions E2 and S N 2 in Figure 6. The goal of such trees is to improve the search for better reaction pathways (lower activation barriers), by showing the estimated change in energy when changing functional groups, leaving groups, or nucleophiles. To extract such rules for the design problem, a large and consistent reaction data set is needed. After completing the data set 28 , we are now able to identify (given a desired product) the estimated changes in the activation barrier, when subtituting specific functional groups, leaving groups, or nucleophiles. This way, the yield of chemical reactions can be optimized by getting insights of the effects that functional groups have on a certain molecule. Furthermore, this insight could be used to direct reactions towards the desired product. Figure 6 shows such a possible decision tree to determine the change in barriers while exchanging substituents. Starting from the total data set (left energy level), the first decision considers the functional group NH 2 at position R1. Going down the tree means accepting the suggested change and the respective compounds, while going up means declining and removing these compounds from the data. Depending on which product is sought after, hints to improve the energy path can be found while constantly accepting (going down) or declining (going up) the tree. For example, if the desired reaction is E2, then the best way is to go down on the tree (decision accepted) which adds electron withdrawing groups to the R3 and R4 position, as well as electron donating groups to R1 and R2. In Figure 6 the first decision redirects the barrier towards E2 about ∼8 kcal/mol by adding an electron withdrawing group (NO 2 ) on the α-carbon. On the other hand, electron donating group at the β-carbon favour the E2 reaction because they facilitate the abstraction of the leaving group, which is shown in the second and the third decision, where NH 2 was added in both positions, R1 and R2. In addition to the R2B predictions, which tell you the outcome of a specific combination of one reaction, a decision trees gives simple rules as an coarsened aggregation that can be used in reaction design to achieve a desired outcome.

D. Estimates of reactant and transition state geometries
Additionally to barriers, we analysed the geometries of the transition states as well as the geometries of the reactant complexes 28 . Choosing key geometrical parameters, such as distances, angles, and dihedrals, we were able to train R2B models to learn these properties using the one-hot encoding as representation. These parameters were extracted from the ethylene scaffold defining the key positions of the substituents, leaving groups, and nucleophiles shown in Figure 7 compounds 2 and 3 for the E2 and S N 2 reaction, respectively.
The parameters for the E2 reaction are the C-X distance d x , the C-Y distance d y , the X-C-C angle α, the C-C-Y angle β, and the X-C-C-Y dihedral θ. Similarly for S N 2, we have the C-X distance d x , the C-Y distance d y , and the X-C-Y angle α. For every parameter, a separate model was trained using the one-hot representation. Although this representation does not contain any geo-metrical information, learning was achieved for every parameter. Figure 7 shows the learning curves and as horizontal dashed lines the null model which uses the mean of the training set for predictions. In the same way as for the transition state geometries, we also trained a model for the reactant complexes. Figure 7 shows the learning curves for both, transition states and reactant complexes. The results for both geometries are similar except for the dihedral of the reactant complexes. The poor performance results from the conformer search of the reactants. Compared to bond distances, dihedrals have multiple local minima which leads to larger differences between the reactant and transition state structures. The variance of the dihedrals are significantly higher which makes the learning task much harder. The one-hot representation does not contain any geometrical information and therefore is not able to learn the different geometries only using information about the constitution (R's, X's, and Y's) of the reactant complexes.

E. Hammond's postulate
To investigate Hammond's postulate we took the difference in the predicted geometries for all 7,500 reactions for the five and three parameters for the E2 and the S N 2 reaction, respectively. Then we plotted these values against the activation energies of both reactions E E a and E S a (Figure 8). The distances ∆d x correlate well with the energies. This is explained by the leaving group that is bonded to the carbon atom in the reactant complex and only small changes in distance happens moving towards FIG. 6. Decision tree using extracted rules and design guidelines. Decision tree using the R2B estimated activation barriers to predict changes in barrier heights by starting at all reactions (first energy level on the left) and subsequently apply changes by substituting functional groups, leaving groups and nucleophiles with E2 as an example. Blue dotted lines refer to an accepted change meaning only compounds containing this substituents at the position are considered. Orange dotted lines refer to substitution declined, meaning all compounds except the decision are kept. Vertical lines on the right of energy levels denote the minimum first (lower limit), and the third (upper limit) quartile of a box plot over the energy range. Numbers above energy levels correspond to the number of compounds left after the decision. Lewis structures resemble the decision in question.
the transition state geometry. For the S N 2 reaction, the backside attack of the nucleophile does not allow a broad distribution of angles in the reactant complex and the transition state. Moreover, the changes in geometry between the reactant complex and the transition state are modest. Therefore, the parameter ∆d y for the S N 2 correlates well with the activation energy E S a . The attack of the nucleophile on the hydrogen atom (E2 reaction) allows for a much broader distribution of the position of the nucleophile in the transition state. This makes the learning problem more difficult, especially for a representation not including geometrical information. Therefore predictions for the dihedrals contain large errors.
Angles and dihedrals correlate very poorly with the activation energies because of the low barriers between the different minima along a dihedral for a molecule. This leads to larger differences in these parameters comparing reactant complexes and transition state geometries.
Hammond's postulate typically holds for the end points of an intrinsic reaction coordinate (IRC) calculation 64-66 which leads to a local minimum close to the transition state. Therefore, the reactant only needs a few reorganisations towards the transition state. For geometries that are farther away from the transition state (such as in our E2), Hammond's postulate cannot hold anymore. This means that even though more reorganization steps towards a transition state have to be made, the activation energy is not affected anymore. As a consequence, Hammond's postulate is no longer applicable.

IV. CONCLUSION
We have introduced a new machine learning model dubbed Reactant-To-Barrier (R2B) to predict activation barriers using reactants as input only. This approach renders the model practically useful, as the dependency on the transition state geometry is only implicitly obtained at the training stage, and not explicitly required for querying the model. We find that one-hot-encoding, the trivial geometry free based representation, yields even better results than geometry based representations designed for equilibrium structures. As such, our results indicate that accounting only for the combinations of functional groups, leaving group, and nucleophile of the reaction is sufficient for promising data-efficiency of the model. Using R2B predictions, we completed the reaction space of QMrxn20 28 . Future work could include delta ML 67 to improve these results even further, as corroborated by preliminary results in Ref. 28 , further improvements on the representation (as recently found to lead to improved barrier predictions for enantioselectivity in metal-organic catalysts 26 ), or the inclusion of catalytic or solvent effects 68 .
Using R2B predicted activation barriers, we have also introduced the notion of a decision tree, enabling the design and discrimination of either reaction channel encoded in the data. Such trees systematically extract the information hidden in the data and the model regarding the combinatorial many-body effects of functional groups, leaving groups, and nucleophiles which result in one chemical reaction being favoured over the other. As such, they enable the control of chemical reactions in the design space spanned by reactants. Finally, we also report on geometries of the reactant complexes consisting of different conformers, as well as on R2B based transition state geometry predictions. Using these results, we discuss the limitations of Hammond's postulate which does not hold for the E2 reactant complexes stored in the QMrxn20 data set 28 .

ACKNOWLEDGEMENT
This project has received funding from the European Union's Horizon 2020 research and innovation program under Grant Agreements #952165 and #957189. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement No. 772834). This result only reflects the author's view and the EU is not responsible for any use that may be made of the information it contains. This work was partly supported by the NCCR MARVEL, funded by the Swiss National Science Foundation.