Deep Coarse-grained Potentials via Relative Entropy Minimization

Neural network (NN) potentials are a natural choice for coarse-grained (CG) models. Their many-body capacity allows highly accurate approximations of the potential of mean force, promising CG simulations at unprecedented accuracy. CG NN potentials trained bottom-up via force matching (FM), however, suffer from finite data effects: They rely on prior potentials for physically sound predictions outside the training data domain, and the corresponding free energy surface is sensitive to errors in transition regions. The standard alternative to FM for classical potentials is relative entropy (RE) minimization, which has not yet been applied to NN potentials. In this work, we demonstrate for benchmark problems of liquid water and alanine dipeptide that RE training is more data efficient due to accessing the CG distribution during training, resulting in improved free energy surfaces and reduced sensitivity to prior potentials. In addition, RE learns to correct time integration errors, allowing larger time steps in CG molecular dynamics simulations while maintaining accuracy. Thus, our findings support the use of training objectives beyond FM as a promising direction for improving CG NN potential accuracy and reliability.


I. INTRODUCTION
Molecular dynamics (MD) simulations are a popular tool for studying bio-physical processes at the nanoscale. For atomistic (AT) simulations, the time and length scales of many processes of interest are still out of reach on currently available computational hardware. Coarse-graining [1][2][3][4][5][6][7][8] (CG) - grouping AT particles into effective interaction beads - is a common approach to model these systems, as larger spatiotemporal scales can be reached due to a reduced number of interactions and an increased time step size 7 .
The fidelity of CG simulations strongly depends on the employed CG potential energy function that defines particle interactions. In the classical CG literature, potentials follow simple functional forms 3,7 . Recent years have seen an increased use of neural network (NN) potentials [9][10][11][12][13][14][15][16][17][18][19] for CG models -both for bottom-up learning to match properties of AT models [20][21][22][23][24][25][26] and for top-down learning to match experimental data 27 . In the following, we focus on the bottom-up learning case with the aim to obtain a CG model that is consistent with an existing AT model. Consistency is achieved if the distribution of CG states sampled from the CG model equals the distribution generated by the AT model when mapping the AT states to CG coordinates 4 . In this case, the CG potential equals the many-body potential of mean force (PMF) 4 . Consequently, NN potentials are a natural choice for CG potential energy functions: Their many-body nature 28 allows for a more accurate approximation of the PMF than classical CG models, promising CG simulations at unprecedented accuracy.
So far, most bottom-up CG NN potentials have been trained via force matching (FM) [20][21][22][23][24] . FM minimizes the difference between CG force predictions and corresponding target AT forces for a given data set 4,21 , typically generated by an AT MD simulation. FM training, while computationally inexpensive and straightforward to implement, suffers from two problems caused by the low availability of high energy states 29 . First, reproducing the ratio of different meta-stable states proves difficult for CG NN potentials trained via FM 24,26 : The CG potential is thought to be susceptible to errors in rarely sampled transition regions that affect the global accuracy of the free energy surface (FES) 26 . Second, NN potentials are physics-free universal function approximators. Thus, they heavily rely on prior potentials that enforce qualitatively correct force predictions outside the training data distribution to avoid unphysical states, e.g. particle overlaps 21,23 . Both of these problems can cause erroneous results in subsequent CG MD simulations, but critically, their extent is not reflected in the FM validation error during training 26 .
Given these drawbacks of FM in practice, recent efforts focused on training schemes beyond FM, including noise-contrastive estimation 25,30 and flow-matching 26 . Another alternative to FM is relative entropy (RE) minimization 5 , which has been frequently used to optimize classical CG models 31-34 , but has not yet been applied to CG NN potentials. The main conceptual difference between RE and FM lies in the availability of molecular states during training: While FM trains exclusively on states provided by the AT model, RE additionally samples states from the CG model 5,35 . Sampling the CG model at each update step is computationally expensive, but it gives direct access to the CG distribution during training. Thus, deviations from the AT distribution can be assessed and subsequently corrected via gradient descent optimization - subject to the functional form of the CG model and the statistical sampling error. Hence, in the context of CG NN potentials, RE counters suboptimal global FESs and sensitivity to prior potentials.
In this work, we demonstrate the effectiveness of optimizing CG NN potentials via RE minimization. To this end, we train the CG DimeNet++ 13,14 graph NN potential for the benchmark problems of liquid water and alanine dipeptide. For liquid water, both FM and RE yield highly accurate CG potentials, but RE allows larger time steps in subsequent CG MD simulations without compromising accuracy. For alanine dipeptide, the RE method results in a more accurate FES and is more robust to the choice of prior potential compared to FM. Finally, we showcase that pretraining via FM reduces the computational cost of RE training. Hence, the exploitation of training targets beyond FM is a promising path towards next generation CG NN potentials.

II. METHODS
We reiterate fundamentals of FM 4,36-38 and RE minimization 5,35,[39][40][41][42] theory, based on which we discuss specific properties of both methods in the context of CG NN potentials. The starting point for CG modeling is the selection of a mapping function M which maps AT coordinates r ∈ R^3n onto a lower-dimensional set of CG coordinates R ∈ R^3N with N < n. In the following, we assume canonical (NVT) ensembles in equilibrium and M to be a linear function, even though generalizations to non-equilibrium systems 43 exist. Mapping the AT ensemble onto CG coordinates yields the reference distribution

p^AT(R) = ⟨δ(M(r) − R)⟩_AT ,   (1)

where ⟨...⟩_AT indicates an AT ensemble average. The CG model generates the Boltzmann distribution

p^CG_θ(R) = exp(−βU^CG_θ(R)) / ∫ exp(−βU^CG_θ(R)) dR ,   (2)

where β = 1/(k_B T) with Boltzmann constant k_B and temperature T. p^CG_θ(R) depends on model parameters θ via the CG potential U^CG_θ(R). The CG model is consistent with the AT model if p^CG_θ(R) = p^AT(R), which holds if the CG potential equals the many-body potential of mean force (PMF) 4,5

U^PMF(R) = −(1/β) ln p^AT(R) + C ,   (3)

where C is an arbitrary constant that we omit in the following. To approximate the PMF, the most popular methods are the FM 4,36,37 and the RE minimization 5 method.
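As a concrete sketch of a linear mapping function M (a hypothetical two-molecule toy system with water-like masses, not the simulation setup studied below), the center-of-mass mapping can be written as a matrix acting on the AT coordinates:

```python
import numpy as np

# Toy system: 2 molecules x 3 atoms each (n = 6 AT particles), mapped to
# N = 2 CG beads located at the molecular centers of mass.
masses = np.array([16.0, 1.0, 1.0, 16.0, 1.0, 1.0])  # e.g. O, H, H per molecule
r = np.random.default_rng(0).uniform(0.0, 1.0, size=(6, 3))  # AT coordinates, nm

# Linear mapping matrix M (N x n): each row holds mass-normalized weights
# for the atoms belonging to one CG bead, and zeros elsewhere.
M = np.zeros((2, 6))
for mol, atoms in enumerate([(0, 1, 2), (3, 4, 5)]):
    w = masses[list(atoms)]
    M[mol, list(atoms)] = w / w.sum()

R = M @ r  # CG coordinates, shape (N, 3)

# Consistency check: each CG bead is the center of mass of its molecule.
com0 = (masses[:3, None] * r[:3]).sum(0) / masses[:3].sum()
print(np.allclose(R[0], com0))  # True
```

Because M is linear, the same matrix also maps AT forces to instantaneous CG forces in the FM setting below.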

A. Force Matching
FM - also known as multiscale coarse-graining 4,36,37 - aims to match the CG force predictions −∇_R U^CG_θ(R) to the instantaneous forces F^AT acting on the CG particles, computed from the AT system. Thus, FM minimizes the mean squared error (MSE) loss function

χ^2(U^CG_θ) = (1 / 3N) ⟨ ||F^AT(r) + ∇_R U^CG_θ(M(r))||^2 ⟩_AT ,   (4)

where ||...|| is the Frobenius norm. In practice, the AT ensemble average is approximated by the mean over a reference data set of AT configurations, typically generated by an AT MD simulation 37 . Minimizing the loss in eq. 4 represents a standard supervised learning problem, which is solved by computing the gradient ∇_θ χ^2(U^CG_θ) via automatic differentiation for a minibatch of AT configurations and updating θ via a stochastic optimizer 21 .
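The supervised nature of the FM problem can be illustrated with a deliberately minimal 1D sketch (all values synthetic, not from any simulation): for a CG potential that is linear in a single parameter, the minimizer of the FM loss has a closed form, and the mapping noise enters only as an irreducible offset:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 1D CG coordinate: reference "AT" forces from a harmonic PMF with
# k_true = 2, plus noise standing in for the non-injective mapping.
k_true = 2.0
x = rng.normal(0.0, 0.5, size=10_000)               # states near the minimum
f_ref = -k_true * x + rng.normal(0.0, 1.0, x.size)  # noisy instantaneous forces

# CG model U_theta(x) = 0.5 * theta * x**2  ->  predicted force -theta * x.
# The FM loss <(f_ref + theta * x)**2> is quadratic in theta, so its
# minimizer is the least-squares solution (no stochastic optimizer needed).
theta = -(x @ f_ref) / (x @ x)
print(theta)
```

The fitted theta recovers k_true up to statistical error, while the residual loss stays finite: that residual is the noise floor corresponding to χ^2(U^PMF) in eq. 5.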
To connect U^CG_θ to the PMF, eq. 4 can be reformulated 4 as

χ^2(U^CG_θ) = (1 / 3N) ⟨ ||∇_R U^PMF(M(r)) − ∇_R U^CG_θ(M(r))||^2 ⟩_AT + χ^2(U^PMF) .   (5)

Note that χ^2(U^PMF) depends on the CG mapping M, but cannot be optimized via θ. Thus, FM minimizes the first term, causing the force predictions of the CG potential to approximate the forces of the PMF. For infinite data and model capacity, U^CG_θ therefore converges to U^PMF (up to an additive constant). From a ML perspective, the second term in eq. 5 corresponds to the irreducible noise of the supervised learning problem, which stems from the non-injective CG mapping: Many AT configurations with different instantaneous forces map onto the same CG configuration.

B. Relative Entropy Minimization
RE minimization 5 determines θ by minimizing the relative entropy, i.e., the Kullback-Leibler divergence, between the mapped AT distribution and the CG distribution,

S_rel(U^CG_θ) = ∫ p^AT(R) ln [ p^AT(R) / p^CG_θ(R) ] dR ≥ 0 .   (6)

S_rel vanishes if and only if p^CG_θ(R) = p^AT(R), i.e., if U^CG_θ equals U^PMF up to an additive constant 5,35 . Since the gradient ∇_θ S_rel(U^CG_θ) contains an ensemble average over the CG model, each update step requires sampling the CG model, typically via a CG MD simulation 5,35 .

C. Linking Force Matching and Relative Entropy Minimization
A large body of literature has studied the relationship between FM and RE 35,42,44,46 , which we reiterate in parts in the following. Defining the quantity 45

Φ_θ(R) = U^CG_θ(R) − U^PMF(R)

allows reformulating the optimization objectives of RE (eq. 6) and FM (eq. 5) 46 . Hence, both FM and RE minimize a functional of Φ_θ(R). Differences in the learned CG potential result from minimizing an average of Φ_θ(R) in RE compared to an average of ||∇_R Φ_θ(R)||^2 in FM.
Thus far, the comparison of FM and RE has generally focused on the case of finite model capacity and infinite AT data: FM reaches the minimum of χ 2 (U CG θ ) (eq. 5) if U CG θ is the projection of U PMF onto the function space spanned by the CG potential basis set 37,38 . However, the resulting U CG θ is not guaranteed to reproduce any AT correlation function mapped to CG coordinates 46 . In contrast, the CG potential that minimizes S rel (U CG θ ) is guaranteed to reproduce all mapped AT structural correlation functions that are conjugate to basis functions of the CG potential 35 . For example, if the CG potential includes a flexible parametrization of pairwise interactions, the radial distribution function of the mapped AT system will be matched.

D. Finite data size effects
In the following, we compare FM and RE in the context of NN potentials, where we expect a reduced impact of the finite functional basis set, but a larger contribution from finite data size effects. We assume the common case of a data set generated by an equilibrium AT MD simulation.
Consequently, the data set primarily contains states in potential energy minima, but rarely high energy states 29 . This gives rise to two issues in training CG NN potentials via FM: the difficulty of obtaining a globally accurate FES 24,26 and the reliance on prior potentials 21,23 (discussed in the next section).
FIG. 1. Coarse-graining thought experiment. If the atomistic data set (brown crosses) contains no states within the transition region T, a candidate potential U^CḠ_θ, whose shape only differs from the potential of mean force U^PMF within T, yields the same force matching validation loss as U^PMF despite resulting in a different coarse-grained distribution p^CḠ_θ(R) ≠ p^AT(R).
Inaccurate FESs can be caused by sensitivity of the learned potential to errors in sparsely sampled transition regions, as recently hypothesized 26 . We illustrate this idea through a thought experiment in which a system is coarse-grained to a 1D CG coordinate X (fig. 1). We consider a specific CG potential U^CḠ_θ that differs from U^PMF within the transition region T. Outside T, U^CḠ_θ is only shifted with respect to U^PMF. If we assume that the AT data set does not contain any states within T, the validation FM loss of U^CḠ_θ is identical to the validation FM loss of U^PMF, given that the forces outside T are identical. However, the probabilities of samples generated by both potentials differ, e.g. U^CḠ_θ preferentially samples the left minimum. FM needs to infer the free energy difference between the minima by integrating the mean force along the transition path, which is unavailable in this thought experiment. In real-world applications, this mean-force integral is determined by few and noisy 21 transition states in the AT data set, which explains the reported difficulty in reproducing the correct relative sampling probabilities of different meta-stable states 24,26 . Since transition states are comparatively rare in the training data, they only have a small impact on the FM validation loss 29 . Therefore, the FM validation loss is not a useful metric to assess the global quality of the FES 24 .
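This thought experiment can be reproduced numerically (illustrative 1D potentials in units of k_B T, not fitted to any molecular system): two potentials that differ only inside a transition region T yield identical forces, and hence an identical FM loss, on any data set avoiding T, yet produce clearly different well populations:

```python
import numpy as np

beta = 1.0
x = np.linspace(-3.0, 3.0, 6001)

# Double-well "PMF" with degenerate minima at x = -1 and x = +1.
u_pmf = (x**2 - 1.0) ** 2

# Candidate potential: add a ramp that is 0 left of the transition region
# T = (-0.5, 0.5), 1 right of it, and linear inside. Outside T this is a
# constant shift, so the forces there are unchanged.
u_cand = u_pmf + np.clip(x + 0.5, 0.0, 1.0)

# Analytic forces f = -dU/dx.
f_pmf = -4.0 * x * (x**2 - 1.0)
f_cand = f_pmf - (np.abs(x) < 0.5).astype(float)

# FM validation loss restricted to data outside T: exactly zero.
outside_T = np.abs(x) >= 0.5
fm_loss_outside = np.mean((f_cand[outside_T] - f_pmf[outside_T]) ** 2)

def p_right(u):
    # Boltzmann probability of finding the system in the right well.
    w = np.exp(-beta * u)
    return w[x > 0.0].sum() / w.sum()

print(fm_loss_outside, p_right(u_pmf), p_right(u_cand))
```

The PMF populates both wells equally, whereas the candidate suppresses the right well by roughly a factor e^(-1); an FM loss evaluated only outside T cannot distinguish the two.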
In contrast, the incorrect CG distribution p^CḠ_θ(R) generated by U^CḠ_θ results in a large S_rel (eq. 6). Thus, RE minimization will adjust the potential such that both meta-stable configurations are sampled equally, matching U^PMF where AT data is available. This is consistent with the interpretation that optimizing S_rel(U^CG_θ) minimizes the difference between the potential energy surfaces of the AT and CG models 40 , with a constant offset between the two surfaces captured by the term ⟨∆_θ(r)⟩_AT. In sum, RE is better suited to reproduce the global FES, especially if the phase space is resolved inhomogeneously by the training data, making RE minimization more data efficient 26 .

E. Prior Potentials
Classical CG potentials typically use physics-based functional forms 3,7 that enforce qualitatively correct behavior irrespective of the specific parameter values at hand; for example, Lennard-Jones interactions encode the Pauli exclusion principle at short distances and van der Waals forces at longer distances. To encode physically meaningful behavior in a similar way, the flexible functional form of NN potentials can be combined with a physics-informed prior potential U^prior(R) 21,23,24,26,27 ,

U^CG_θ(R) = U^NN_θ(R) + U^prior(R) .   (12)

In this formulation, training the NN potential U^NN_θ(R) can be interpreted as ∆-learning 47-49 with respect to U^prior(R) 21 .
Note that the role of U^prior(R) differs significantly for FM compared to RE minimization: Since the data set is obtained via physically sound principles in an AT MD simulation, it does not contain any unphysical configurations such as overlapping particles. In such unphysical regions of phase space, the CG NN potential therefore operates in the extrapolation regime and can easily predict short-range attraction instead of physically sound repulsion. For FM, a well-chosen U^prior(R) therefore enforces qualitatively correct predictions outside the training data to drive the CG MD simulation back into the AT data distribution, where U^NN_θ(R) is accurate. Hence, FM requires careful selection of U^prior(R), given that weak choices can lead to unphysical CG MD simulation results 50 .
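A minimal numerical sketch of this ∆-learning composition (the "NN" term is a stand-in Gaussian well and the prior parameters are placeholders, not a trained model or fitted constants): even if the learned correction extrapolates to spurious short-range attraction, a repulsive prior keeps the total potential physically sound:

```python
import numpy as np

eps, sigma = 1.0, 0.3  # placeholder prior parameters

def u_prior(d):
    # Repulsive part of the Lennard-Jones potential, eps * (sigma / d)**12.
    return eps * (sigma / d) ** 12

def u_nn(d):
    # Stand-in for a learned correction: by construction it extrapolates
    # to spurious attraction at short distances outside the data region.
    return -0.5 * np.exp(-(((d - 0.2) / 0.05) ** 2))

d = np.array([0.15, 0.20, 0.35])
u_total = u_nn(d) + u_prior(d)

# The bare "NN" term is attractive at d = 0.2, but the composed potential
# remains strongly repulsive at short range.
print(u_nn(0.2) < 0.0, u_total[d < 0.25].min() > 0.0)
```

In a CG MD simulation, the prior term thus pushes overlapping particles apart even where the learned correction alone would pull them together.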
In contrast, a strong deviation of p^CG_θ(R) from p^AT(R) caused by an unphysical trajectory leads to a large S_rel(U^CG_θ), which can be corrected by the optimizer during training. Rather than stabilizing the production CG MD simulation, the role of U^prior(R) in RE minimization is to speed up training convergence. Without a prior, physical principles need to be learned from the AT reference data, which significantly increases the number of update steps until convergence 27 .

F. Finite time step effects
So far, we have implicitly assumed ideal sampling of the Boltzmann distribution corresponding to a specific potential, i.e., an infinitesimal MD simulation time step ∆t. In practice, however, the AT distribution p^AT_∆t_AT(r) results from the time step-dependent shadow Hamiltonian 51 of the reference AT simulation 26 (with the shadow temperature 52 representing the conserved quantity in the NVT ensemble). Thus, RE learns a CG potential whose shadow Hamiltonian yields a CG distribution p^CG_θ,∆t_CG(R) that approximates p^AT_∆t_AT(R). Consequently, assuming infinite data and model capacity, the optimal RE potential differs from the true PMF as a function of ∆t_AT and ∆t_CG. FM models, on the other hand, train on a data set of forces computed from the true AT potential. Hence, the ideal FM potential equals the true PMF, but the resulting CG distribution p^CG_θ,∆t_CG(R) will differ from the analytic p^AT(R) as a function of the production time step ∆t_CG.
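The time step dependence can be made concrete with a velocity Verlet sketch for a 1D harmonic oscillator in reduced units (a generic integrator demonstration, unrelated to the systems studied here): the sampled energy deviates from the true Hamiltonian by a term of order ∆t^2, consistent with the trajectory conserving a shadow Hamiltonian instead:

```python
def max_energy_error(dt, n_steps=2000):
    # Velocity Verlet for a 1D harmonic oscillator (m = k = 1, force -x).
    x, v = 1.0, 0.0
    e0 = 0.5 * v**2 + 0.5 * x**2
    err = 0.0
    for _ in range(n_steps):
        v += 0.5 * dt * (-x)
        x += dt * v
        v += 0.5 * dt * (-x)
        err = max(err, abs(0.5 * v**2 + 0.5 * x**2 - e0))
    return err

# Halving the time step reduces the instantaneous energy error roughly
# fourfold, i.e. the deviation from the true Hamiltonian scales as O(dt^2).
r = max_energy_error(0.1) / max_energy_error(0.05)
print(r)
```

An RE-trained potential sees this perturbed, ∆t-dependent distribution during training and can absorb it, whereas an FM potential is fitted against forces of the unperturbed Hamiltonian.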

G. Neural Network Potential
We use our previously published implementation 27 of the graph NN DimeNet++ 13,14 as U^NN_θ with graph cut-off radius r_cut = 0.5 nm. We set all hyperparameters to the default values of the original implementation 14 , except for the embedding sizes, which we reduce by a factor of 4. With the default of 4 interaction blocks, DimeNet++ captures up to 8-body correlations 28 , as angles are a direct input quantity that already captures 3-body properties. This high-body interaction capacity promises highly accurate approximations to the PMF.

III. RESULTS

A. Liquid Water
We choose the classical benchmark problem of CG liquid water to test FM and RE in a setting where AT reference data is abundantly available. We generate a 10 ns AT reference trajectory and select a CG mapping where each water molecule is mapped to a CG particle located at its center of mass. Following the ∆-learning ansatz (eq. 12), we add to the DimeNet++ U^NN_θ(R) the pairwise repulsive part of the Lennard-Jones potential as prior,

U^prior(R) = Σ_i=1..N_pair ε (σ / d_i)^12 ,   (13)

where we sum over all N_pair pairs with distance d_i < r_cut.

We evaluate the quality of force predictions of both trained models on the held-out test data set (fig. 2). Compared to AT NN potentials, the predicted forces exhibit larger errors due to the noise resulting from the non-injective CG mapping (eq. 5). The FM model yields slightly better force predictions (R^2 = 0.695) than the RE model (R^2 = 0.670). Apart from possible overfitting of the FM model onto forces, this result likely stems from the finite time step effects discussed above 26 : The reference forces are computed from the true AT potential, which is consistent with the optimization objective of the FM method. In contrast, the RE potential needs to account for the shadow Hamiltonians of the AT and CG simulations. Overall, these results suggest that the difference between models obtained via FM and RE tends to increase for decreasing adequacy of the functional basis set for a given system - in line with previous computational studies 61 .
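A repulsive pair prior of this form can be sketched as follows (placeholder values for ε, σ and r_cut; the minimum-image convention for a cubic periodic box is an assumption for illustration, not a statement about the setup used in this work):

```python
import numpy as np

def repulsive_prior(R, box, eps=1.0, sigma=0.3, r_cut=0.5):
    """Pairwise repulsive LJ-type prior: sum of eps * (sigma / d)**12 over
    all pairs with d < r_cut, using the minimum-image convention in a cubic
    box of edge length `box`. All parameter values are placeholders."""
    n = len(R)
    i, j = np.triu_indices(n, k=1)    # all unique pairs
    dr = R[i] - R[j]
    dr -= box * np.round(dr / box)    # minimum image
    d = np.linalg.norm(dr, axis=1)
    d = d[d < r_cut]                  # apply the cutoff
    return np.sum(eps * (sigma / d) ** 12)

# Two beads at distance sigma contribute eps * (sigma/d)**12 = 1.0;
# a third bead beyond r_cut adds nothing.
R = np.array([[0.0, 0.0, 0.0], [0.3, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(repulsive_prior(R, box=3.0))
```

For production use, the O(n^2) pair enumeration would be replaced by a neighbor list restricted to r_cut.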
Given that computational speed-up is the principal motivation for CG modeling, we evaluate the trained FM and RE models for the real-world case of larger production CG MD time step sizes ∆t_CG (fig. 4). Because RE optimizes the empirical CG distribution p^CG_θ,∆t_CG(R), we presume that the RE potential learns to correct for the time step-dependent terms in the CG shadow Hamiltonian. To test this hypothesis, we apply the RE models trained with ∆t_CG = 10 fs in CG MD simulations with ∆t_CG = 2 fs. In line with our hypothesis, this combination yields larger MSE values than RE models with consistent time steps (fig. 4): If the RE model learns to correct large time step-dependent terms in the shadow Hamiltonian, this biases CG simulations that exhibit only small time integration errors. Hence, RE minimization provides a means to mitigate the accuracy degradation of larger production time steps ∆t_CG.

B. Alanine Dipeptide
Alanine dipeptide 62,63 is a standard problem to benchmark CG methods in reconstructing a FES with multiple meta-stable states. We generate a 100 ns AT reference trajectory at T_ref = 300 K, from which a state is retained every 0.2 ps, resulting in 5·10^5 data points. The training data set consists of the first 80 ns, the FM validation set of the subsequent 8 ns, and the final 12 ns form the test set. We select a CG mapping that retains all 10 heavy atoms of alanine dipeptide, but drops hydrogen atoms and water molecules. The CG particles representing CH_3, CH and C are encoded as different particle types. Following the ∆-learning ansatz in eq. 12, we select the prior potential

U^prior(R) = Σ_i=1..N_bonds (k_b,i / 2)(b_i − b_0,i)^2 + Σ_j=1..N_angles (k_α,j / 2)(α_j − α_0,j)^2 + Σ_k=1..N_dihedrals k_ω [1 + cos(n ω_k − ω_0)] ,   (14)

where we sum over all N_bonds harmonic bonds with bond lengths b_i, all N_angles harmonic angles with triplet angles α_j, and all N_dihedrals proper dihedral angles ω_k. The dihedral force constant k_ω, the multiplicity n, and the phase constant ω_0 are taken from the AMBER03 64 force field.
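The three bonded terms of such a prior can be sketched per interaction as follows (placeholder force constants, not the AMBER03 values); each term vanishes at its reference geometry:

```python
import numpy as np

def harmonic_bond(r1, r2, k_b, b0):
    # (k_b / 2) * (b - b0)**2 with b the current bond length.
    b = np.linalg.norm(r2 - r1)
    return 0.5 * k_b * (b - b0) ** 2

def harmonic_angle(r1, r2, r3, k_a, a0):
    # (k_a / 2) * (alpha - a0)**2 with alpha the angle at the central atom r2.
    u, v = r1 - r2, r3 - r2
    cos_a = u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    return 0.5 * k_a * (np.arccos(np.clip(cos_a, -1.0, 1.0)) - a0) ** 2

def proper_dihedral(r1, r2, r3, r4, k_w, n, w0):
    # k_w * (1 + cos(n * omega - w0)) for the torsion around the r2-r3 bond.
    b1, b2, b3 = r2 - r1, r3 - r2, r4 - r3
    n1, n2 = np.cross(b1, b2), np.cross(b2, b3)
    m1 = np.cross(n1, b2 / np.linalg.norm(b2))
    omega = np.arctan2(m1 @ n2, n1 @ n2)
    return k_w * (1.0 + np.cos(n * omega - w0))

# Sanity checks: each term vanishes at its reference geometry.
o = np.zeros(3)
e_bond = harmonic_bond(o, np.array([0.15, 0.0, 0.0]), k_b=1000.0, b0=0.15)
e_ang = harmonic_angle(np.array([1.0, 0.0, 0.0]), o, np.array([0.0, 1.0, 0.0]),
                       k_a=100.0, a0=np.pi / 2)
e_dih = proper_dihedral(np.array([0.0, 1.0, 0.0]), o, np.array([1.0, 0.0, 0.0]),
                        np.array([1.0, -1.0, 0.0]), k_w=5.0, n=1, w0=0.0)
print(e_bond, e_ang, e_dih)
```

The trans test geometry (ω = 180°) gives zero dihedral energy for n = 1, ω_0 = 0, matching the periodic form in eq. 14.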
We train the FM model for 100 epochs with a batch size of 500 states and select the model with the smallest validation error.

Assuming that insufficient resolution of transition regions caused the suboptimal FES of the FM model, increasing the amount of training data should improve the FES: We generate a 1 µs AT trajectory, increasing the amount of training data by a factor of 10. Using this data set, the error of the FM model decreases as expected, but remains inferior to the RE model (fig. 7). Note that numerical errors of the CG simulation also contribute to the FM MSE, which cannot be reduced by enlarging the training data set. We additionally probe the data requirement limits of RE by reducing the 100 ns training data set.

Finally, we test the robustness of both methods with respect to prior potentials by considering only harmonic bonds in U^prior(R) (eq. 14). In this case, the FM potential significantly oversamples α_L configurations (fig. 9, supplementary fig. 8), despite superior validation force predictions (R^2 = 0.762) compared to the reference RE model above. Conversely, when optimizing the resulting FM model via 300 additional RE update steps, the dihedral density is in close agreement with the AT reference. Hence, RE minimization also helps to correct weak choices of prior potentials.

IV. DISCUSSION AND CONCLUSION
In this work, we have demonstrated the effectiveness of training CG NN potentials via the RE minimization scheme: For water, the difference between FM and RE minimization is significantly reduced when training CG NN potentials compared to classical 2-body CG potentials 60 . This is expected, as the learned potential converges towards the PMF for increasing model capacity with both methods, given a sufficient amount of data 4,35,46 . For alanine dipeptide, RE results in a more accurate FES than FM. The discrepancy in the FES increases for decreasing quality of the prior, to the point that the FM CG NN potential is no longer competitive with classical CG models. Sampling the CG model during training probes its robustness with respect to data generated on-the-fly. As a consequence, the RE training scheme can recognize and correct undesired model properties, which reduces the sensitivity to the prior potential.
For liquid water, RE allows larger time steps in subsequent CG MD simulations without compromising accuracy. We presume that RE is able to learn to correct the time integration error of the underlying CG MD simulation by training directly on the sampled CG distribution. Given that computational speed-up is the primary objective of CG modeling, a larger simulation time step is as important as an accurate approximation of the PMF. Consequently, CG NN potentials trained via RE may reach larger time scales in production CG simulations with less impact from time integration errors.
The advantages of RE minimization come at the cost of increased computational effort during training, which can, however, be reduced: As demonstrated in this work, pre-training via FM is a computationally efficient way to reduce the number of necessary RE updates through a better parameter initialization. Furthermore, histogram reweighting 31,42,[66][67][68][69] , frequently used in RE minimization 5,32,35 , is also applicable to NN potentials 27 . Reweighting allows previously generated trajectories to be reused, increasing the number of gradient descent steps per trajectory computation. Additionally, it seems reasonable to increase the trajectory length over the course of the optimization: In the beginning of training, where the model error significantly exceeds the statistical error of trajectories, short trajectories save compute; towards the end of training, longer trajectories with reduced statistical noise allow fine-tuning of the model. This scheme matches well with reweighting: Initial trajectories cannot be used for reweighting, irrespective of their length, due to large changes in the potential, while expensive trajectories towards the end of training may be reused for multiple updates. Moreover, we argue that in the realistic scenario of an expensive AT model with a, by design, orders of magnitude cheaper CG model, the computational bottleneck is AT training data generation rather than CG model optimization. Finally, NN potential architectures optimized for computational efficiency, such as the Ultra-Fast Force Fields 70 , are well-suited for CG applications. These architectures can capture many-body features of the PMF while avoiding the computational overhead of the more expensive DimeNet++ 14 model used in this work. Evaluating the computational cost-accuracy trade-off between different computationally efficient CG NN potentials and classical CG models is an interesting avenue of future research.
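The reweighting idea mentioned above can be sketched in a generic 1D example (unit β, synthetic Gaussian data; a conceptual sketch, not the implementation used in this work): samples generated under a previous potential U_old are reused to estimate averages under an updated potential U_new via importance weights:

```python
import numpy as np

rng = np.random.default_rng(2)
beta = 1.0

# Samples from the "old" CG model: Boltzmann distribution of
# U_old = x**2 / 2, i.e. a standard normal for beta = 1.
x = rng.normal(0.0, 1.0, size=200_000)

u_old = 0.5 * x**2
u_new = 0.5 * (x - 0.5) ** 2  # updated potential with a shifted minimum

# Importance weights for reusing old samples under the new potential.
log_w = -beta * (u_new - u_old)
w = np.exp(log_w - log_w.max())  # subtract max for numerical stability
w /= w.sum()

mean_new = np.sum(w * x)     # reweighted estimate of <x> under U_new
n_eff = 1.0 / np.sum(w**2)   # effective sample size of the reused data

print(mean_new, n_eff)
```

The reweighted mean recovers the shifted minimum without new sampling; in practice the effective sample size n_eff indicates when the potential has changed too much and a fresh trajectory is required.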
Additionally, the merits of CG NN potentials should be examined for more complex systems than considered in this work.
The presented results can also be interpreted in terms of data efficiency. In CG applications, an accurate representation of the FES is usually of higher interest than accurate force predictions in energy minima. RE is well suited to reproduce the FES in practice by directly minimizing the difference between the potential energy surfaces of the AT and CG models (eq. 11). By contrast, FM requires a sufficient resolution of transition areas to learn a globally accurate FES 26 . This requires a large amount of AT training data, which is expensive to obtain. Long AT MD trajectories do not seem efficient in this regard due to repetitive sampling of energy minima and sparse sampling of high energy states 29 . Accordingly, enhanced sampling schemes such as Metadynamics 29,71,72 or Normal Mode Sampling 73,74 may improve data efficiency of FM by spreading sampling more evenly across the phase space.
In line with literature on simulation-based optimization schemes for classical CG models 2,6 , our results suggest that including MD simulations in the training process can be considered a means to improve the reliability and accuracy of NN potentials, allowing recent concerns about their stability 75,76 to be addressed. Active learning NN potentials 77,78 , recognized as a major building block in achieving stable and transferable models 22,79,80 , can be similarly interpreted as an incorporation of MD simulations into the FM training scheme: Performing MD simulations and screening visited molecular states for high-uncertainty configurations allows the data set to be augmented iteratively in phase space regions that are reachable by the NN potential but still sparsely represented in the data set. Alternatively, MD simulations can also be inserted directly into the training pipeline 27,[81][82][83][84] using auto-differentiable MD codes 82,84 . We expect that the benefits of using ML in simulations and, inversely, simulations for ML training will continue to drive the ongoing synthesis of ML and physical simulations in molecular modeling and beyond.

ACKNOWLEDGMENTS
The authors thank Dominik Blechschmidt for contributions to initial feasibility studies.