Targeted free energy estimation via learned mappings

I. INTRODUCTION
Free energy estimation is of central importance in the natural sciences. Accurate estimation of free energies, however, is challenging, as many systems are out of reach of experimental methods and analytic theory. Computer-based estimation has thus emerged as a valuable alternative. Successful application areas of in-silico free energy estimation span industry and scientific research, including drug discovery 1 , condensed matter physics 2 , materials science 3 , structural biology 4 , and the effects of mutagenesis 5 . Because of its importance and wide-ranging applications, computer-based free energy estimation has been an active field of research for decades 6 .
At the core of many state-of-the-art estimators 7 lies the free energy perturbation (FEP) identity introduced by Zwanzig 8 in 1954:

E_A [e^{−β∆U}] = e^{−β∆F}. (1)

Here ∆F = F_B − F_A is the Helmholtz free energy difference between two thermodynamic states A and B, each connected to a thermal reservoir at inverse temperature β and specified by a setting of external parameters λ_A and λ_B. We denote by x a point in the system's configuration space, and define U_A = U(x; λ_A) and U_B = U(x; λ_B) as the energy functions associated with A and B. By E_A [· · ·] we denote an expectation under the equilibrium distribution ρ_A ∝ e^{−βU_A} (and similarly for B). A procedure for computation of ∆F via Eq. (1) is then as follows: a number of samples are first drawn from ρ_A, for example via a Monte Carlo chain or a Molecular Dynamics simulation. The change in energy, ∆U, associated with instantaneously switching λ_A → λ_B is then computed, from which an exponential average is taken. From a statistical point of view, FEP is an application of Importance Sampling (IS), where ρ_A serves as the proposal distribution [9][10][11] .
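As a minimal illustration of Eq. (1), the sketch below estimates ∆F for a pair of one-dimensional harmonic wells, a toy system chosen purely for illustration (it is not the solvation system studied later); the analytic answer ∆F = (2β)^{−1} ln(k_B/k_A) is available for checking. A log-sum-exp is used to stabilize the exponential average.

```python
import numpy as np

def fep_estimate(delta_u, beta=1.0):
    """Zwanzig's FEP identity, Eq. (1): exp(-beta*dF) = E_A[exp(-beta*dU)].
    The exponential average is computed with a log-sum-exp for stability."""
    a = -beta * np.asarray(delta_u)
    m = a.max()
    return -(m + np.log(np.mean(np.exp(a - m)))) / beta

# Toy system (illustrative only): two 1D harmonic wells U_k(x) = k*x**2/2.
rng = np.random.default_rng(0)
beta, k_A, k_B = 1.0, 1.0, 2.0
x = rng.normal(0.0, 1.0 / np.sqrt(beta * k_A), size=100_000)  # samples from rho_A
dF_est = fep_estimate((k_B - k_A) * x**2 / 2, beta)
dF_true = 0.5 * np.log(k_B / k_A) / beta  # analytic free energy difference
```

For these two wells the configuration-space overlap is decent, so the plain FEP average converges; making the states more dissimilar (e.g. k_B ≫ k_A) quickly degrades the estimate, which is exactly the failure mode discussed next.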
a) These two authors contributed equally and are correspondence authors: {pewi, aybd}@google.com.
While Eq. (1) is exact, the convergence of this estimator for a finite number of samples strongly depends on the degree to which A and B overlap in configuration space 12 . Indeed, the dominant contributions to the above expectation come from samples of A that are typical under B, and such contributions become increasingly rare with decreasing overlap 13 .
There exist multiple strategies for mitigating the overlap requirement. Arguably the most common is a multi-staged approach, also known as stratification, in which a sequence of intermediate thermodynamic states is defined between A and B (Fig. 1a). Here, improved convergence is facilitated by demanding that neighboring pairs of states be chosen to have sufficient overlap. The quantity of interest, ∆F, is then recovered as a sum over the pairwise differences ∆F_{i,i+1} 6 . The Multistage Bennett Acceptance Ratio 7 (MBAR) estimator is a prominent example of an estimator that follows this strategy. However, multi-staged approaches require samples from multiple states as well as a suitable order parameter to define intermediate stages. Furthermore, it is unclear a priori how best to discretize the order parameter (for example, how many stages to use).
An alternative, elegant strategy for increasing overlap is to incorporate configuration-space maps. Jarzynski developed Targeted Free Energy Perturbation 14 (TFEP), a generalization of FEP whereby an invertible mapping defined on configuration space transports points sampled from A to a new distribution, A′ (see Fig. 1b). Jarzynski showed that a generalized FEP identity applies to this process (Eq. (7) below), from which the free energy difference can be recovered. Importantly, if the mapping is chosen wisely, the effective overlap can be increased, leading to quicker convergence of the TFEP estimator. Hahn and Then extended TFEP to the bidirectional setting 15 , whereby the mapping and its inverse are applied to samples from A and B, respectively. Lower-error free energy estimates can then be obtained via the statistically optimal BAR estimator 16 (Eq. (11) below).
Whether in the unidirectional or bidirectional case, the main challenge for targeted approaches is crafting a mapping that is capable of increasing overlap. Unfortunately, for most real-world problems the physical intuition needed to develop such a mapping is simply lacking. Modern-day machine learning (ML) techniques, however, seem perfectly suited for this task.
Since the introduction of TFEP in 2002, research in ML has made remarkable progress in fields such as image classification 17,18 , playing video 19 and board games 20,21 , and generative modelling of images 22,23 . ML has also enabled advances in the natural sciences, including state-of-the-art protein structure prediction 24 , neural-network based force fields for water 25 , generative modelling of lattice field theories 26 , new paradigms for sampling equilibrium distributions of molecules, and variational free energy estimates 27,28 .
In this work, we turn targeted free energy estimation into a learning problem. In lieu of a hand-crafted mapping, we represent our mapping by a deep neural network whose parameters are optimized so as to maximize overlap. Once trained, the free energy can then be computed by evaluating the targeted estimator with our learned mapping. Below we will consider both unidirectional and bidirectional settings, and will refer to them as Learned Free Energy Perturbation (LFEP) and Learned Bennett Acceptance Ratio (LBAR), respectively.
The rest of our manuscript is structured as follows. In §II below we summarize the previously-developed targeted free energy estimators that we will be making use of. This is followed by the development of suitable training objectives to maximize overlap (§III). We then demonstrate our method by applying it to a solvation system. In §IV, we describe the experimental setup and discuss inherent symmetries that are exploited to devise a model with the correct inductive biases (§V). Finally, we present experimental results in §VI and discuss our findings in §VII.

II. TARGETED ESTIMATORS
In the following we will refer to A and B as the thermodynamic states, defined by the equilibrium densities ρ_A(x) = e^{−βU_A(x)}/Z_A and ρ_B(x) = e^{−βU_B(x)}/Z_B, where Z_A and Z_B are the normalization constants (partition functions). In the targeted scheme, configurations x are drawn from A and mapped to new configurations y = M(x) via an invertible, user-specified mapping M. The set of mapped configurations can be thought of as samples from a new state A′. Similarly, we also consider the reverse case where configurations are drawn from B and mapped to B′ via the inverse M^{−1}. We refer to this pair of prescriptions as the "forward" and "reverse" process, respectively. For each process we define generalized energy differences

Φ_F(x) = U_B(M(x)) − U_A(x) − β^{−1} ln J_M(x),
Φ_R(x) = U_A(M^{−1}(x)) − U_B(x) − β^{−1} ln J_{M^{−1}}(x),

where J_M and J_{M^{−1}} = J_M^{−1} are the Jacobian determinants associated with the mappings. As originally shown by Jarzynski, an identity exists which relates ∆F to an ensemble of realizations of Φ_F:

E_A [e^{−βΦ_F}] = e^{−β∆F}. (7)

Eq. (7) can be regarded as a generalization of FEP, as it holds for any invertible M, and reduces to Eq. (1) if M is the identity. An analogous equation holds for the reverse process. Derivations of Eq. (7) can be found in Refs. 14 and 15 or Appendix A.
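To make Eq. (7) concrete, the sketch below evaluates Φ_F for the same illustrative pair of 1D harmonic wells used above, together with a linear map M(x) = s·x. For s = sqrt(k_A/k_B) this map carries ρ_A exactly onto ρ_B, so every realization of Φ_F collapses to the constant ∆F and the targeted estimator converges with a single sample. All names here are our own illustration, not the paper's system.

```python
import numpy as np

beta, k_A, k_B = 1.0, 1.0, 2.0
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0 / np.sqrt(beta * k_A), size=10_000)  # samples from rho_A

# Linear map M(x) = s*x, with s chosen so that rho_A is carried exactly onto rho_B.
s = np.sqrt(k_A / k_B)
y = s * x
log_jac = np.log(s)  # log Jacobian determinant J_M (constant for a linear map)

# Generalized energy difference: Phi_F = U_B(M(x)) - U_A(x) - (1/beta)*ln J_M
phi_F = k_B * y**2 / 2 - k_A * x**2 / 2 - log_jac / beta

dF_true = 0.5 * np.log(k_B / k_A) / beta
# With this perfect map, phi_F equals dF_true for every sample (zero variance).
```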
Hahn and Then extended the above result to the bidirectional case 15 , showing that a fluctuation theorem (FT) exists between the forward and reverse processes:

p_F(φ) / p_R(−φ) = e^{β(φ−∆F)}. (8)

The functions p_F(φ) = E_A [δ(φ − Φ_F(x))] and p_R(φ) = E_B [δ(φ − Φ_R(x))] can be thought of as generalized work distributions associated with the mapping processes. With these bidirectional estimates, Bennett's Acceptance Ratio (BAR) method 16 can be employed as an alternative estimator of ∆F 15 . BAR estimation of ∆F can be formulated as a self-consistent iteration of the equation

Σ_{n=1}^{N} f(β(Φ_F(x_n) − ∆F)) = Σ_{n=1}^{N} f(β(Φ_R(x′_n) + ∆F)), (11)

where f(x) = 1/(1 + e^x) is the Fermi function, the x_n are samples from ρ_A, and the x′_n are samples from ρ_B. For simplicity, in Eq. (11) we restrict ourselves to the case where the numbers of samples in the forward and reverse directions are equal, but more general formulations exist. BAR has a statistical advantage over FEP, as it has been shown to be the minimum-variance free energy estimator among asymptotically unbiased methods 29 . Because of this property, BAR is generally the method of choice when samples from both A and B are available.
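A minimal BAR solver can be written as a bisection on the self-consistency condition of Eq. (11): the left-hand side increases with ∆F while the right-hand side decreases, so the root is unique. The toy work values below reuse the illustrative harmonic-well pair from above (an assumption of this sketch, not the paper's system).

```python
import numpy as np

def fermi(x):
    """Fermi function f(x) = 1/(1 + exp(x))."""
    return 1.0 / (1.0 + np.exp(x))

def bar_estimate(phi_F, phi_R, beta=1.0, iters=100):
    """Solve mean f(beta*(Phi_F - dF)) = mean f(beta*(Phi_R + dF)) for dF
    by bisection; the difference of the two sides is monotone in dF."""
    lo, hi = -50.0, 50.0
    for _ in range(iters):
        dF = 0.5 * (lo + hi)
        gap = (np.mean(fermi(beta * (phi_F - dF)))
               - np.mean(fermi(beta * (phi_R + dF))))
        if gap < 0:   # left side too small -> dF lies above the midpoint
            lo = dF
        else:
            hi = dF
    return 0.5 * (lo + hi)

# Toy work values for the harmonic-well pair with the identity mapping.
rng = np.random.default_rng(2)
beta, k_A, k_B = 1.0, 1.0, 2.0
xA = rng.normal(0.0, 1.0 / np.sqrt(beta * k_A), size=50_000)
xB = rng.normal(0.0, 1.0 / np.sqrt(beta * k_B), size=50_000)
phi_F = (k_B - k_A) * xA**2 / 2   # forward work values
phi_R = (k_A - k_B) * xB**2 / 2   # reverse work values
dF_est = bar_estimate(phi_F, phi_R, beta)
```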
Crucially, the targeted estimators above hold for every invertible mapping. That is, given an infinite number of samples, any invertible choice of M will produce a consistent estimate of ∆F. Of course, the finite-sample convergence properties are of more practical importance and will strongly depend on the choice of M.

III. TRAINING OBJECTIVE
In a distributional sense, the forward and reverse processes act to transform ρ_A and ρ_B into ρ_A′ and ρ_B′, as depicted in Fig. 1. In what follows we will refer to the distributions of mapped configurations as the "images" (i.e. ρ_A′ and ρ_B′), and the distributions we want them mapped towards as the "targets" (i.e. ρ_B (ρ_A) for the forward (reverse) process). Due to the deterministic mapping, the bases and images are related by the change of variable formula,

ρ_A′(y) = ρ_A(M^{−1}(y)) J_{M^{−1}}(y),  ρ_B′(x) = ρ_B(M(x)) J_M(x).

The crucial consideration for convergence of our estimators (Eq. (7) and Eq. (11)) is the overlap between the image and target distributions 14 . Indeed, in the limit that images and targets coincide, Jarzynski showed 14 that p_F(φ) → δ(φ − ∆F). This implies that the convergence of Eq. (7) is immediate (i.e. only one sample is needed). In this limit, it is also the case that p_R(φ) → δ(φ + ∆F), implying that the expectation values on either side of Eq. (11) converge immediately. This overlap argument is further reinforced in Appendix A, where we show that the TFEP estimator can be interpreted as a FEP estimator between A′ and B.
We now turn our attention to the construction of a loss function that accurately judges the quality of M. Guided by the considerations of overlap, we consider the Kullback-Leibler (KL) divergence between the image and target. For the forward process we have:

D_KL [ρ_A′ ‖ ρ_B] = ∫ ρ_A′(y) ln [ρ_A′(y)/ρ_B(y)] dy = β E_A [Φ_F] − β∆F. (14)

In the above derivation, we invoked the change of variable formula in going from the first to the second expression, and used the identity −β∆F = ln Z_B − ln Z_A to arrive at the last. An analogous equation can be derived for the reverse process, yielding

D_KL [ρ_B′ ‖ ρ_A] = β E_B [Φ_R] + β∆F. (15)

While from Eqs. (14)-(15) it is clear that the KL cannot be accurately estimated unless ∆F is known, for the purposes of optimization ∆F and β can be disregarded, as they are constants.
Below we consider two separate training regimes for our model. In the unidirectional case, the model was trained only on the forward process, with the loss function

L_LFEP = E_A [Φ_F]. (16)

In the bidirectional case, the model was trained using both forward and reverse processes, with the loss

L_LBAR = E_A [Φ_F] + E_B [Φ_R]. (17)

When samples from both states are available, bidirectional training is preferable. Unlike the unidirectional loss, L_LBAR explicitly encourages both ρ_A′ and ρ_B′ to be mass-covering 30 , which is important for good performance of importance-sampling estimators 31 .
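The shape of the bidirectional objective can be visualized on the same illustrative harmonic-well pair: for a linear map M(x) = s·x, a Monte-Carlo estimate of L_LBAR = E_A[Φ_F] + E_B[Φ_R] is minimized at the perfect map s* = sqrt(k_A/k_B), where the loss attains its lower bound of zero (the ±∆F terms of Eqs. (14)-(15) cancel in the sum). The grid search below stands in for the gradient-based training used in the paper; all names are ours.

```python
import numpy as np

beta, k_A, k_B = 1.0, 1.0, 2.0
rng = np.random.default_rng(3)
xA = rng.normal(0.0, 1.0 / np.sqrt(beta * k_A), size=50_000)  # samples from rho_A
xB = rng.normal(0.0, 1.0 / np.sqrt(beta * k_B), size=50_000)  # samples from rho_B

def lbar_loss(s):
    """Monte-Carlo estimate of L_LBAR = E_A[Phi_F] + E_B[Phi_R]
    for the linear map M(x) = s*x (log-Jacobian: log s)."""
    phi_F = k_B * (s * xA)**2 / 2 - k_A * xA**2 / 2 - np.log(s) / beta
    phi_R = k_A * (xB / s)**2 / 2 - k_B * xB**2 / 2 + np.log(s) / beta
    return phi_F.mean() + phi_R.mean()

# The constant +dF and -dF contributions cancel in the sum, so the loss is
# minimized at the perfect map, where its value is zero.
grid = np.linspace(0.5, 1.0, 201)
s_best = grid[np.argmin([lbar_loss(s) for s in grid])]
s_star = np.sqrt(k_A / k_B)  # perfect map for two harmonic wells
```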

IV. EXPERIMENTAL SETUP
To test our method, we consider a system similar to the one used by Jarzynski 14 , consisting of a repulsive solute immersed in a bath of N = 125 identical solvent particles. The task is to calculate the free energy change associated with growing the solute radius from R_A to R_B (see Fig. 2). In contrast to the hard-sphere solute used in Ref. 14, we modeled our solute as a soft sphere. This is because any finite particle overlap would lead to infinite forces, which our training method cannot handle.
FIG. 2. Illustration of the simulation box. A solute particle (pink), located at the center, is grown from radius R_A to radius R_B, thereby compressing the space accessible to the N = 125 solvent particles (grey). Opposite faces of the cubic simulation box with edge length 2L are associated with each other, such that a particle is inserted again on the opposite side as it tries to leave the reference box (blue). The reference box is therefore topologically equivalent to a torus.
Intuitively, an effective mapping should push solvent particles away from the center to avoid high-energy steric clashes with the expanding solute. Jarzynski followed this intuition, defining a mapping that uniformly compresses the solvent particles amidst an expanding repulsive solute. Although this mapping gave significant convergence gains when applied to a hard solute 14 , it is not directly applicable to soft solutes. This is because the phase-space compression results in a transformed density whose support is not equal to that of the target density, violating the assumption of invertibility.
Below we demonstrate that we no longer need to rely on physical intuition to hand-craft a tractable mapping; this process can be fully automated using the general framework proposed in this work. In order to learn an effective mapping, however, it is crucial that the model be compatible with the inherent symmetries of the underlying physical system.

A. Energy
Periodic boundary conditions (PBCs) confine the N-particle system to a 3N-dimensional torus, x = (r_1, . . ., r_N) ∈ T^{3N}, where each coordinate of the position vector r_i ∈ T^3 of particle i lives in the interval [−L, L] (see Fig. 2). The total energy can then be decomposed into a sum of pairwise contributions according to

U_α(x) = Σ_{i<j} u(|r̄_ij|) + Σ_i v_α(|r̄_i|), (18)

where the subscript α ∈ {A, B} denotes the state and r_ij = r_j − r_i. The quantity r̄_ij = r_ij − 2L round(r_ij/(2L)) is a difference vector whose components correspond to the shortest signed distance in the respective dimension, and the function round is applied element-wise, giving the nearest integer. The radially symmetric functions u(r) and v(r) represent the Lennard-Jones 32 (LJ) and Weeks-Chandler-Andersen 33 (WCA) potentials. In practice, we truncate LJ interactions using a radial cutoff of L and shift the potential to be zero at the cutoff 34 .
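The minimum-image displacement r̄_ij used above is a one-liner in practice; the sketch below (helper names ours) applies it component-wise and checks a case where two particles sit near opposite faces of the box.

```python
import numpy as np

def min_image(r_ij, L):
    """Shortest signed displacement under PBCs with box edge 2L:
    r_bar = r - 2L*round(r/(2L)), applied component-wise."""
    r_ij = np.asarray(r_ij, dtype=float)
    return r_ij - 2 * L * np.round(r_ij / (2 * L))

L = 1.0
r_i = np.array([0.9, 0.0, 0.0])
r_j = np.array([-0.9, 0.0, 0.0])
d = min_image(r_j - r_i, L)  # naive difference is -1.8; shortest image is +0.2
```

Every component of the result lies in [−L, L], so pairwise distances computed from it respect the toroidal geometry of the box.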

B. Training data
The data used for training and evaluation was generated via molecular dynamics (MD) simulations using the package LAMMPS 35 . Simulation frames were generated via Langevin dynamics and were saved infrequently enough to ensure decorrelated samples (as judged by the potential energy decorrelation time), giving a total of 5 × 10^4 frames per simulation. A total of 20 independent simulation trajectories were generated for each thermodynamic state, starting from randomly-initialized configurations and momenta. Ten independent training and evaluation datasets were then constructed from these simulations by first concatenating and shuffling configurations, and then partitioning them into N_train = 9 × 10^4 training and N_test = 1 × 10^4 test samples for each dataset. The value of N_train was chosen such that the baseline BAR estimator, when evaluated on N_train data points, yields a reasonably accurate free energy difference. Below we train and evaluate our model on these datasets so as to compute statistical convergence properties of the estimators across independent runs. Further simulation details are summarized in Appendix B.

C. Symmetries
The system under consideration exhibits properties that are widely encountered in the atomistic simulation community: PBCs and permutation invariance. PBCs are usually employed to reduce finite-size effects. Permutation invariance arises as a consequence of the energy being invariant to particle permutations, a condition that is satisfied by the solvent particles as they are all identical.
PBCs are a choice of geometry for the space in which particles live: a 3D torus. Without them, the system (including the solute) would admit rigid rotation, reflection and rigid translation symmetries. With them, only the translation symmetries and a discrete set of rotations/reflections remain. The original rigid translations in R^3 become translations by elements of the torus T^3, leaving the energy invariant. Because of this symmetry, we can fix the solute at the origin of the simulation box without affecting the ratio Z_B/Z_A (as shown in Fig. 2). The remaining set of discrete rotations/reflections is the group of symmetries of a cube (the octahedral group), which contains only 48 elements.
Our choice of model architecture respects PBCs and permutation invariance, as we will discuss in detail in the next section. We also tested other architectures, and found that respecting PBCs and permutation invariance led to the best performance (data not shown). Our chosen model architecture does not obey the octahedral symmetries, but we do not expect this to have a big impact because of the octahedral group's relatively small size compared to T^3 (infinite) and to the group of particle permutations (of size N!).

V. MODEL
We implement the mapping M using a deep neural network, parameterized by a set of learnable parameters θ .In designing the architecture of the network, we take into account the following considerations.
(a) The mapping M must be bijective, and the inverse mapping M −1 should be efficient to compute, for any setting of the parameters θ .
(b) The Jacobian determinant J M should be efficient to compute for any setting of θ .
(c) The network should be flexible enough to represent complex mappings.
(d) The transformed distributions ρ_A′ and ρ_B′ should respect the boundary conditions and symmetries of the physical system.
The first three requirements are satisfied by a class of deep neural networks known as normalizing flows 36 , which are invertible networks with efficient Jacobian determinants. Since bijectivity is a closed property under function composition, multiple normalizing flows (or "layers") can be composed into a deeper flow, yielding a model with increased flexibility. We implement M as a normalizing flow composed of K invertible layers, that is,

M = M_K ∘ · · · ∘ M_1. (19)

Each layer M_k : T^{3N} → T^{3N} is of the same architecture but has its own learnable parameters θ_k, and the learnable parameters of M are simply θ = (θ_1, . . ., θ_K).
Our implementation of M_k is based on the architecture proposed by Dinh et al. 37 , often referred to as the coupling layer. Let r_i^ν, ν ∈ {1, 2, 3}, be the spatial coordinates of the particle with position vector r_i ∈ T^3. To simplify notation, we will also use r_i^ν to denote the inputs to layer k (that is, the particles transformed by the first k − 1 layers), and drop the dependence on k. Let I_k be a subset of the indices {1, 2, 3} associated with layer k. Then, M_k is defined as follows:

M_k(r_i)^ν = r_i^ν for ν ∈ I_k,
M_k(r_i)^ν = G(r_i^ν; ψ_i^ν) for ν ∉ I_k, (20)

where

ψ_i^ν = C_i^ν(r_1^{I_k}, . . ., r_N^{I_k}). (21)

By r_i^{I_k} we refer to the set of coordinates of r_i that are indexed by I_k, and C_i^ν is the output of C for coordinate ν of particle i. The functions G and C are implemented by neural networks. The parameters of G associated with coordinate ν of particle i are denoted by ψ_i^ν, and are computed by C. The parameters of C are the learnable parameters of the layer, θ_k.
In simple terms, M_k works as follows. We partition the coordinates into two sets; the coordinates indexed by I_k are left invariant, whereas the remaining coordinates are transformed element-wise. Each transformed coordinate undergoes a different transformation, depending on the value of ψ_i^ν, which is itself a function of all the coordinates that remain invariant. To ensure that each coordinate gets the opportunity to be transformed as a function of every other coordinate, each layer M_k uses a different partition I_k, and we cycle through the partitions {1}, {2}, {3}, {1, 2}, {2, 3} and {1, 3} across layers.
For the mapping M to be bijective, a sufficient condition is that G(·; ψ) : [−L, L] → [−L, L] be strictly increasing for any setting of ψ. In that case, the inverse M_k^{−1} is obtained by simply replacing G with G^{−1} in Eq. (20), and the Jacobian determinant can be computed efficiently as

J_{M_k} = ∏_i ∏_{ν∉I_k} ∂G(r_i^ν; ψ_i^ν)/∂r_i^ν. (22)

Finally, the inverse and Jacobian determinant of the composite mapping M can be computed by

M^{−1} = M_1^{−1} ∘ · · · ∘ M_K^{−1} and J_M = ∏_{k=1}^{K} J_{M_k}. (23-24)

To ensure that the transformed distributions ρ_A′ and ρ_B′ obey the required boundary conditions, the implementation of G must reflect the fact that r_i^ν = −L and r_i^ν = L are identified as the same point. For this to be the case, a sufficient set of conditions is the following:

G(−L; ψ) = −L and G(L; ψ) = L, (25)
∂G/∂r (−L; ψ) = ∂G/∂r (L; ψ), (26)

for any setting of the parameters ψ. To satisfy the above conditions, we implement G using circular splines, which were recently proposed by Rezende et al. 38 and are based on the rational-quadratic spline flows of Durkan et al. 39 . Our implementation of the circular splines is detailed in Appendix C.
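The mechanics of a periodic coupling layer can be demonstrated with a much simpler stand-in: instead of the circular splines used in the paper, G below is a circular shift G(r; t) = ((r + t + L) mod 2L) − L, which satisfies the boundary conditions of Eqs. (25)-(26) trivially (unit slope everywhere) and has unit Jacobian. The conditioner is a fixed periodic function rather than a learned network; everything in this sketch is our own simplification, not the paper's architecture.

```python
import numpy as np

L = 1.0  # coordinates live on the circle [-L, L)

def conditioner(r_inv):
    """Stand-in for the network C: maps the invariant coordinates to one
    shift parameter per particle via periodic (cos/sin) features."""
    feats = np.concatenate([np.cos(np.pi * r_inv / L),
                            np.sin(np.pi * r_inv / L)], axis=-1)
    return 0.3 * feats.sum(axis=-1, keepdims=True)

def coupling(x, inv_idx, inverse=False):
    """One coupling layer on coordinates x of shape (N, 3): coordinates in
    inv_idx are left unchanged and parameterize a circular shift of the
    remaining coordinates. The shift is volume-preserving (Jacobian = 1)."""
    y = x.copy()
    act_idx = [d for d in range(3) if d not in inv_idx]
    t = conditioner(x[:, inv_idx])
    sign = -1.0 if inverse else 1.0
    y[:, act_idx] = (x[:, act_idx] + sign * t + L) % (2 * L) - L
    return y

rng = np.random.default_rng(4)
x = rng.uniform(-L, L, size=(5, 3))
y = coupling(x, inv_idx=[0])                      # forward pass of one layer
x_back = coupling(y, inv_idx=[0], inverse=True)   # exact inverse
```

Because the invariant coordinates are untouched, the inverse pass can recompute exactly the same shift parameters, which is what makes coupling layers invertible by construction; the paper replaces the shift by a strictly increasing circular spline to gain flexibility while keeping a tractable Jacobian.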
In addition to the above, we also need to make sure that the parameters ψ_i^ν in Eq. (21) are periodic functions of the invariant coordinates r_i^{I_k}. This can be easily achieved by the following feature transformation:

r_i^{I_k} → (cos(π r_i^{I_k}/L), sin(π r_i^{I_k}/L)), (27)

where cos and sin act element-wise. The above feature transformation is injective, so no information about r_i^{I_k} is lost. We apply this feature transformation to each particle i at the input layer of the network C.
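A quick numerical check that the cos/sin embedding respects the identification of r = −L with r = +L (helper name ours):

```python
import numpy as np

L = 1.0

def periodic_features(r):
    """Embed each coordinate r in [-L, L] as (cos(pi*r/L), sin(pi*r/L)).
    The two box faces r = -L and r = +L map to the same feature vector."""
    r = np.asarray(r, dtype=float)
    return np.stack([np.cos(np.pi * r / L), np.sin(np.pi * r / L)], axis=-1)

f_left = periodic_features(-L)
f_right = periodic_features(L)
# Both faces yield the feature vector (-1, 0), so any network fed these
# features is automatically a periodic function of the raw coordinates.
```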
Finally, to ensure that the transformed distributions ρ_A′ and ρ_B′ are invariant to particle permutations, it is necessary that the Jacobian determinant J_M also be invariant to particle permutations. In our architecture, this can be achieved by taking C to be equivariant to particle permutations. Specifically, let σ be a permutation of the set {1, . . ., N}. We say that C is equivariant with respect to σ if

C_i^ν(r_{σ(1)}, . . ., r_{σ(N)}) = C_{σ(i)}^ν(r_1, . . ., r_N), (28)

that is, if permuting the particles has the effect of permuting the parameter outputs (ψ_1^ν, . . ., ψ_N^ν) in exactly the same way. From Eq. (22), we can easily see that the above property implies that σ leaves J_{M_k}, and hence J_M, invariant, because J_{M_k} combines the contributions of all particles symmetrically, which is permutation-invariant. Previous studies have made similar observations 40,41 . Our implementation of C is based on the architecture proposed by Vaswani et al. 42 , often referred to as the transformer, which we use in a permutation-equivariant configuration. The implementation details of our transformer architecture are given in Appendix C.
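The equivariance property of Eq. (28) is easy to test numerically. The toy conditioner below is a deep-set-style network (our stand-in, much simpler than the transformer used in the paper): each particle's output is its own embedding plus a symmetric sum over all particles, so permuting the inputs permutes the outputs.

```python
import numpy as np

rng = np.random.default_rng(5)
W_own, W_pool = rng.normal(size=(3, 8)), rng.normal(size=(3, 8))

def conditioner(r):
    """Permutation-equivariant toy conditioner: a per-particle term plus a
    permutation-invariant pooled term broadcast to every particle."""
    own = np.tanh(r @ W_own)                   # shape (N, 8), permutes with r
    pooled = np.tanh(r @ W_pool).sum(axis=0)   # shape (8,), order-independent
    return own + pooled

r = rng.normal(size=(6, 3))
perm = rng.permutation(6)
lhs = conditioner(r[perm])      # permute inputs, then apply C
rhs = conditioner(r)[perm]      # apply C, then permute outputs
```

Any architecture with this property (a self-attention transformer among them) yields a permutation-invariant Jacobian, since Eq. (22) combines the per-particle factors symmetrically.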

VI. RESULTS
In this section, we evaluate the performance of our method for the solvation system illustrated in Fig. 2. We focus on the bidirectional BAR and LBAR estimators in the main text, due to their advantages over unidirectional approaches as discussed in §II. We refer to Appendix D for a discussion of the unidirectional counterparts.
To capture statistical variation, our training and analysis procedure was performed 10 times, each time using independent training and evaluation datasets. Our loss profiles and free energy estimates below report averages as well as statistical variation across these runs.
We first report training results in Fig. 3, where the full-batch loss is plotted as a function of the number of training steps. We observe a pattern commonly encountered in ML: after an initial decrease of both the training and test loss, the latter develops a minimum. Around the minimum, the model stops generalizing and starts to overfit to the training data. We therefore employ a technique called early stopping 43 and use the model parameters corresponding to the minimum test loss for all further evaluations. It is worth emphasizing, however, that the precise location of the minimum does not have an appreciable effect on the quality of the free energy estimates reported for the bidirectional estimator (results not presented). We also note the small variation among the independent runs, suggesting that there is no significant dependence of the performance on a particular dataset.
Next, we probe the overlap resulting from our learned mapping. In Fig. 4a we plot the distributions of learned work values for the forward and reverse directions, compared against their unmapped counterparts for which the mapping is the identity. These distributions are typically unimodal 6 and satisfy the inequalities

E_A [Φ_F] ≥ ∆F ≥ −E_B [Φ_R]

for any invertible mapping. The free energy difference ∆F corresponds precisely to the point of intersection of the forward and reverse distributions. Both of these facts are a consequence of the fluctuation theorem [Eq. (8)]. Intuitively, it is clear then that an effective mapping should increase the overlap between the distributions to facilitate locating the intersection. This is precisely the enhancement we observe for the mapped work values. The two modes are shifted towards each other and share a significantly larger overlap than the unmapped distributions. We also see that the mapping strongly reduces the variance of the distributions. In fact, with a perfect mapping we would expect the forward and reverse distributions to collapse onto a single delta distribution located at ∆F.
We now turn to the statistical convergence of the free energy estimates. In Fig. 4b we plot a running average of the estimate as a function of the number of evaluated samples per stage. The solid lines report averages of the estimate over the independent runs, and shaded regions represent one standard deviation across the runs. To test the correctness of our implementation, we first compare our estimates against a converged multistage MBAR estimator, which employed 15 stages and 7.5× more samples in total. From the figure we see that the variation of our estimate overlaps nicely with the MBAR error estimates. When comparing to the baseline BAR estimator, we see in Fig. 4b a clear variance reduction of LBAR across a wide range of evaluated sample sizes. The full-batch LBAR standard deviation we observe is approximately 18% of that reported by BAR. Moreover, training and evaluation of the model occurred on the same dataset, demonstrating that an effective mapping can be learned in a data-efficient manner. This is an important practical consideration, but not at all obvious a priori. We could, in principle, even combine samples from the training and test datasets for estimation of ∆F, but have used the test set only to detect overfitting.

VII. DISCUSSION
In this work, we turn TFEP into a learning problem by combining it with state-of-the-art ML techniques. TFEP previously required hand-crafting a tractable mapping on configuration space, a significant challenge for many realistic systems. We proposed representing the mapping by a suitable neural network, and identified training objectives for the unidirectional and bidirectional cases. We then tested their performance on a fully periodic system. This type of boundary condition is routinely used in atomistic simulations, but poses a challenge for ML models, as it requires careful design of the flow architecture. Our experimental results suggest that both the LFEP and LBAR estimators lead to a significant variance reduction compared to their respective baselines. Furthermore, we found that the LBAR method is even competitive with a multi-staged MBAR estimator while requiring only a small fraction of the data. Our results therefore clearly highlight the potential of this approach and suggest that ML could eliminate the need for staging altogether.
Improving TFEP via learned mappings relates to the general idea of improving importance sampling by learning the proposal distribution, which has been explored substantially in machine learning and statistics. For instance, recent works in machine learning have proposed training a flexible deep-learning model of the proposal distribution to improve importance sampling 44 or more sophisticated variants such as bridge sampling 45 and sequential Monte Carlo [46][47][48] . In turn, these approaches can be traced back to methods for adaptive importance sampling 49 and adaptive sequential Monte Carlo 50 in statistics. One recent instance of these approaches that relates closely to our work is Neural Importance Sampling 44 , which uses expressive normalizing flows to learn good proposal distributions for importance sampling. Many of the above works have noted that the choice of loss function is important, with the forward KL divergence and the chi-squared divergence being standard choices. These observations are in line with our observations of the differences between the unidirectional and bidirectional training losses. Finally, we note that the ideas presented in this work can also be combined with Boltzmann Generators 28 . Since our estimators are asymptotically unbiased, they could yield much more accurate free energy estimates than the variational approximations reported in Ref. 28. Merging both ideas would be an interesting extension for future work.
We implement G using circular splines 38 so that G satisfies the boundary conditions in Eqs. (25)-(26). Briefly, G is a piecewise function that consists of S segments. The parameters ψ are a set of S + 1 triplets (x_s, y_s, d_s), where [x_{s−1}, x_s] defines the domain of segment s, [y_{s−1}, y_s] defines its image, and d_{s−1}, d_s define its slopes at the endpoints (all slopes are required to be positive). Each segment is a strictly increasing rational-quadratic function, constructed using the interpolation method of Gregory and Delbourgo 52 . The conditions in Eqs. (25)-(26) are satisfied by setting x_0 = y_0 = −L, x_S = y_S = L and d_0 = d_S > 0. We can increase the flexibility of G by increasing the number of segments S. With a large enough S, circular splines can approximate arbitrarily well any strictly increasing function from [−L, L] to itself that satisfies Eqs. (25)-(26).

We implement C using the transformer architecture of Vaswani et al. 42 so that C satisfies the permutation-equivariance condition in Eq. (28). Briefly, the network architecture is as follows. Each input r_i^{I_k} undergoes the feature transformation in Eq. (27), and is then mapped to a fixed-sized vector h_i via an affine transformation, identically for each i. Then, (h_1, . . ., h_N) is processed by a sequence of transformer blocks, each of which is composed of two residual layers 18 . The first residual layer is

h_i ← h_i + MHA_i(h_1, . . ., h_N),

where MHA is a multi-headed self-attention layer as described by Vaswani et al. 42 , and MHA_i is its i-th output. The second residual layer is

h_i ← h_i + MLP(h_i),

where MLP is a standard multi-layer perceptron 53 that is applied identically for each i. After repeating the transformer block a specified number of times (each time with different learnable parameters), each vector h_i is mapped to the output (ψ_i^ν)_{ν∉I_k} via an affine transformation, identically for each i. To help with generalization and stability, we applied layer normalization 54 to the inputs of MHA and MLP. Because MHA is permutation-equivariant and every MLP and affine transformation is applied identically for each i, it follows that C is permutation-equivariant, and therefore the transformed distribution is permutation-invariant, as desired.
All models were trained using the Adam optimizer 55 . A summary of the hyperparameters is provided in Table I.
While the loss estimate keeps decreasing, the KL already exhibits a pronounced minimum due to a drift of the ∆F estimate towards lower values. This is further illustrated in Fig. 5b, which compares the convergence of ∆F for three different mappings corresponding to specific training steps (colored symbols in Fig. 5a). We see that the variance is reduced significantly in all three cases. However, only the first mapping (training step 2 × 10^4) agrees with the MBAR baseline, while the other two mappings exhibit a noticeable bias. This is problematic in that the minimum of the KL would be a natural point to stop training, but it does not yield the lowest bias. This is also in contrast to our findings for the bidirectional case, where the quality of the mapping was fairly insensitive to the stopping point. One possible explanation for these observations is the well-known zero-forcing property 30 of L_LFEP. That is, L_LFEP does not encourage the transformed distribution to be mass-covering, which is known to negatively impact the performance of importance-sampling estimators 31 .

FIG. 1. The overlap problem. Efficient estimates of free energy differences rely on a decent configuration-space overlap between equilibrium distributions. Each panel represents a set of distributions on configuration space, with the circles indicating regions of appreciable mass. (a) In the commonly-used multi-staged approach, a chain of intermediate states is defined between A and B such that a decent overlap exists between each neighboring pair. Free energy differences are computed with respect to each neighboring pair, and summed to obtain the difference of interest. (b) In the targeted approach, an invertible mapping M transports the equilibrium distribution A to a new distribution A′. Convergence is improved if the mapping results in an increased overlap with B.

FIG. 3. Loss profile for bidirectional training. The bidirectional loss [Eq. (17)] is evaluated for training (solid lines) and test (dashed lines) datasets. Thin lines correspond to the individual runs, each trained on an independent dataset, and thick lines correspond to their respective averages. The loss keeps decreasing for the training set but exhibits a minimum for the test set at around 5 × 10^5 steps (vertical line). This is roughly the point where the model starts to overfit to the training data.

FIG. 4. Enhanced convergence of the learned LBAR estimator. (a) Normalized histograms of forward and reverse work (βφ) values with (blue, solid lines) and without (red, dashed lines) the mapping. The vertical line indicates the ground-truth free energy estimate computed by MBAR. (b) Running averages of the ∆F estimate as a function of the number of evaluated samples per stage for the BAR (red, dashed lines), LBAR (blue, solid lines) and MBAR (green, dash-dotted lines) estimators. The lines and shaded regions of the BAR and LBAR estimates correspond to the average and one standard deviation over the 10 independent runs. The vertical green bars report the statistical error of the MBAR estimator 7 .

FIG. 5. Enhanced convergence of the learned LFEP estimator. (a) Estimates of the loss L_LFEP (solid lines) and D_KL (dashed lines), averaged across 10 different datasets, as a function of training progress for test (grey) and training (blue) datasets. Shaded regions correspond to one standard deviation estimated across the independent runs. Colored diamonds highlight selected training steps for evaluation of ∆F. (b) Running averages of the ∆F estimate as a function of the number of evaluated samples per stage for the FEP (dotted line), MBAR (solid line) and LFEP (dashed lines) estimators. For the latter, the numbers within parentheses correspond to the training steps at which we evaluated the mapping. The lines and shaded regions of the FEP and LFEP estimates correspond to the average and one standard deviation over the runs.

TABLE I. Model and training hyperparameters.