On the accuracy of one- and two-particle solvation entropies

Evaluating solvation entropies directly and combining with direct energy calculations is one way of calculating free energies of solvation and is used by Inhomogeneous Fluid Solvation Theory (IFST). The configurational entropy of a fluid is a function of the interatomic correlations and can thus be expressed in terms of correlation functions. The entropies in this work are directly calculated from a truncated series of integrals over these correlation functions. Many studies truncate all terms higher than the solvent-solute correlations. This study includes an additional solvent-solvent correlation term and assesses the associated free energy when IFST is applied to a fixed Lennard-Jones particle solvated in neon. The strength of the central potential is varied to imitate larger solutes. Average free energy estimates with both levels of IFST are able to reproduce the estimate made using the Free energy Perturbation (FEP) to within 0.16 kcal/mol. We find that the signal from the solvent-solvent correlations is very weak. Our conclusion is that for monatomic fluids simulated by pairwise classical potentials the correction term is relatively small in magnitude. This study shows it is possible to reproduce the free energy from a path based method like FEP, by only considering the endpoints of the path. This method can be directly applied to more complex solutes which break the spherical symmetry of this study.


I. INTRODUCTION
The ability to estimate the free energy difference between two defined states is a useful tool in computational chemistry. Direct applications are predicting the free energy of a solvation process, whether it may be testing a small molecule in a solvent to see if the pair is likely to be miscible 1 or a larger system, for example, a peptide, protein, 2 or protein-ligand complex. 3 The latter has a direct implication to in silico drug designs. Being able to quantitatively measure such a change then allows the relative comparison of ligands for a given protein. 4 For the protein-ligand complex, if the free energy of the bound state is less than the unbound state then the equilibrium will favour the bound state.
There exist a number of methods of estimating changes in the free energy with computational simulations. These include the Free Energy Perturbation (FEP) 5 and Thermodynamic Integration (TI). 6 Both these methods rely on a well defined path from the reference state to the new state, but are very generally applicable and work with different levels of theory for the description of the physical system. 7 Another method which has seen growing success is Inhomogeneous Fluid Solvation Theory (IFST), [8][9][10] which does not need a well-defined path between the reference state and the new state; however, IFST can only calculate changes in the free energy in the a) Electronic mail: djh210@cam.ac.uk. URL: http://huggins-lab.tcm.phy.cam. ac.uk/.
context of a solvation process. IFST performs this by using a direct computation of the change in entropy due to solvation and combining this with a direct energy measurement 11,12 to calculate the free energy. There are numerous examples of calculating and estimating such a change in free energy. [13][14][15][16][17][18] This work attempts to expand on one method of measuring the solvation entropy change directly, namely the Mutual Information Expansion (MIE), 19 and uses a k-Nearest Neighbours (KNN) 20,21 estimator to evaluate an approximation to the change in the solvation entropy. This work is different from the majority of previous works, in that we truncate the MIE at a higher order, including an additional term. This additional term represents correlations between two solvent molecules in the presence of a solute, and such terms have been measured before in the context of water in protein binding pockets 22 using Grid Inhomogeneous Solvation Theory (GIST). 23, 24 We seek to find quantitatively, the change in free energy associated with the extra information, and whether it is necessary to include this term in the MIE. The system under study in this work is fundamental and simple, a Lennard-Jones neon solvent with a fixed Lennard-Jones atom in the centre of the simulation cell which represents the solute in a solvation process. The changes in free energy will be compared to an equivalent FEP simulation. The comparison of FEP and IFST in this work is independent of the forcefield used.
First, we will discuss the theory used in all calculations, then review the computational methods used, and input simulation parameters. The results will then be discussed and then analysed.

II. THEORY
Our calculations of the solvation free energy associated with fixing a Lennard-Jones atom at the centre of a neon simulation box (see Fig. 1) are based on the following: Step 1. A change in free energy in the canonical ensemble is given by a change in the Helmholtz free energy, where ∆U is the change in the internal energy of the system, T is the temperature of the system, and ∆S is the change in the entropy of the system. The changes here are the differences between the solute-liquid system and the bulk liquid system.
Step 2. By using a molecular dynamics (MD) simulation with a parameterized forcefield, it is possible to estimate ∆U as whereŪ 1 is the equilibrium expectation energy of a system of N neon atoms in a periodic cell of volume V at temperature T, andŪ 2 is the equilibrium expectation energy of a system of N neon atoms in the presence of a fixed Lennard-Jones atom at volume V and temperature T.
Step 3. It is possible to write the total change in solvation entropy, ∆S in Eq. (1), as an expansion over correlations of one-body, two-body, three-body, and so on, as discussed by Baranyai and Evans 25 for the homogeneous fluid and by Lazaridis 8,9 for the inhomogeneous fluid. The theory for homogeneous fluids was extensively studied by Kirkwood,6,26 Nettleton, 27 Raveche, 28 and Wallace. [29][30][31] For the system of N ideal atoms with coordinates {r} N = {r 1 , . . . , r N } and momenta {p} N = {p 1 , . . . , p N }, we have the N body distribution in positions and momenta and the total entropy of a bulk fluid is given by with R the gas constant and h the Planck constant. Then, the separability of the momentum can be exploited to give which is written explicitly as The momentum terms are the same as those of an ideal gas where the one-body distribution of momenta is given by where k is the Boltzmann constant, ρ is the number density of the equivalent ideal gas, and m is the mass of the atom. Then, with λ the thermal wavelength of an atom in the liquid. For a vector of random variables X = (X 1 , . . . , X n ), we may write 32 where H is an information entropy and the notation H(X|Y ) is the conditional entropy of X given Y. In a similar fashion, we may write giving S liquid = S momentum + S 2 liquid − I 3 liquid + I 4 liquid − · · · (11) This expansion was performed for a homogeneous fluid by Wallace. It applies to the canonical ensemble and is non-local, meaning that it is only exact when integrated over the entire system volume. 25,29 An ensemble invariant and a local form of the expansion are discussed by Baranyai and Evans. 25 In the conditions of this study, the distinguishing terms between the local and non-local forms cancel, so here the non-local form is used for simplicity. In this case, the expansion has the following terms: N (r 1 , r 2 ) ln g (2) N (r 1 , r 2 ) dr 1 dr 2 , (13) We can write the part of the three-body correlation function which cannot be expressed multiplicatively by its marginal distributions as where each of these N particle correlation functions for k bodies can be written in terms of the k body density as 33 g N (r 1 , r 2 , . . . , r k ) = 1 ρ k ρ N (r 1 , r 2 , . . . , r k ).
Finally, we may write the excess entropy of the liquid as we have Therefore, where terms with an S denote an entropy and terms with an I denote a mutual information term. The expansion with these terms is true for a homogeneous fluid. For the simulations used in this work, we include the presence of the central solute. 8 The central solute in our calculations is spherically symmetric, and our end goal is to generalise to any solute and solvent, so we must consider an inhomogeneous system. This was analytically performed by Lazaridis. 8,9 Eqs. (12)- (14) are the relevant terms for the liquid, and for a system with a solute we may write where the |s argument indicates the presence of a solute. We can write the part of the two-and three-body correlation functions which cannot be expressed by its marginal distributions as Some slight differences exist between the homogeneous and inhomogeneous formulations, S 2 liquid is not a mutual information term (denoted with an I rather than an S), as the marginal distributions vanish. We may then write and subtracting the ideal gas terms, Eq. (18), S excess solute = S 1 solute − I 2 solute + I 3 solute − · · · − R. (26) To find the change in entropy from the addition of the solute, we calculate the excess entropy of solution, ∆S exc soln . This is then the excess entropy of the solute system, Eq. (26), minus the excess entropy of the neat liquid, Eq. (19), where we see the factors of R from S excess solute and S excess liquid cancel. It is at this point we choose a truncation of ∆S exc soln . The most severe truncation generates what we will call the conditional one particle entropy (C1PE), This is achieved by removing all integrals with a subscript greater than 1. If we include the two-body terms, then we have the conditional two particle entropy (C2PE), It is ∆S 2 |s that this work is concerned with calculating.
Step 4. Instead of directly integrating numerically the terms in Eqs. (29) and (30), we can instead convert the integral into a sum which converges to the integral asymptotically in the limit of infinite data. We can then extrapolate toward that infinite limit by measuring the sum for finite quantities of data and fitting an appropriate extrapolating function. The estimators we use in this study are KNN estimators and are formulated as follows.
Inserting the explicit integral terms into Eq. (30) gives we may separate out the denominators of the logarithm in the second term and then swap the coordinates r 1 and r 2 in one of those new integrals to give and it is then possible to integrate over r 2 in the third term giving It is convenient to construct estimators to measure three distinct quantities N (r 1 , r 2 |s) ln g (2) N (r 1 , r 2 |s) dr 1 dr 2 , (35) giving an expression for the conditional one-and two-particle entropies in terms of these estimators In general, we may state the information entropy of a probability density function over p random variables as the p-dimensional integral as the expectation However, in Eqs. (34)-(36), we have correlation functions in the logarithm rather than densities. We may then instead calculate This quantity can then be approximated by a finite sum which is performed in Sec. II A.

A. k-Nearest Neighbours (KNN) estimators
To efficiently calculate the integrals in Eqs. (29) and (30), we used the k-nearest neighbours method which has shown previous success in calculating first order solvation entropies, 10,13,23 dihedral entropies in small molecules, 34 and entropies of water in protein binding pockets. 14,22 The general estimator for N p , p-dimensional objects across F frames of data is given by where Γ(x) is the Euler gamma function, ψ(k) is the Euler digamma function, and the symbol denotes an asymptotic equivalence in the limit of an infinite number of frames F. This estimator was initially worked on for the first nearest neighbour by Leonenko 20 and extended for the kth nearest neighbour by Singh 21 and has been used by various studies. 10,13,14,19,34,35 This estimator is a limiting case of the adaptive anisotropic elliptic kernel estimators. 36,37 The estimator for an H 1 term, given F sufficiently uncorrelated MD frames containing N neon atoms, is given by the expression which is used in Refs. 13 and 14. The next order expression is then In previous work, Eqs. (43) and (44) have been shown to fit a power law for increasing values of F 13,14,19 with a k and b k some constants for the kth neighbour selected, and H ∞ the asymptotic value of the entropy. b k is a negative constant such that To extract this value, a power law can be fitted to H(F).

III. SIMULATION DETAILS
FEP and IFST were used to calculate the free energy change associated with adding a fixed Lennard-Jones particle to the centre of a neon simulation box in the canonical ensemble. The IFST free energy with only ∆S 1 |s and with both ∆S 1 |s and ∆S 2 |s were calculated. All of the simulation data used in this study came from the same forcefield and MD simulations were all carried out in NAMD. 38 Thus, the resulting free energies are directly comparable.

A. Systems setup
The neon MD parameters were taken from the CHARMM27 forcefield. 39 There are no electrostatic interactions for this solvent-solute system, so all potential energy terms are in the form of a Lennard-Jones potential where r ij is the distance between atoms i and j, and R k is the radius of minimum potential energy for atom k, R k = 2 1/6 σ k . 4 different solutes were taken which had varying Lennard-Jones ε s parameters as shown in Table I. All solutes had the same R k parameter as the neon solvent. Each solute was initially solvated with 900 neon atoms in a cubic box of edge length 27.5 Å. Where reduced units are given for reference they are calculated as the temperature T * = k B T /ε, pressure P * = pσ 3 /ε, length L * = L/σ, and number density ρ * = ρσ 3 . For the 12-6 LJ fluid, the triple point density of both the liquid ρ * tl and solid ρ * ts phases have been measured by previous studies to be in the ranges ρ * tl = 0.818-0.864 and ρ * ts = 0.96-0.978, and the triple point has approximate temperature and pressure T * = 0.661 and p * = 0.0018, respectively. 40 Thus, the simulated neon was in the liquid phase. 16 replicate equilibrations per solute type were performed in the NpT ensemble for 4 ns to find the natural densities of the simulations cells with each solute present. The resulting edge lengths are shown in Table I. A 2 ns equilibration was then performed for each replicate in the NVT ensemble at T = 25 K (T * = 0.578) and 1 atm (p * = 0.003 44) using the Langevin temperature control. An MD timestep of 2.0 fs was used. Electrostatic interactions were turned off as they were not present in the forcefield. van der Waals interactions were removed for separations over 10.5 Å, and switching was used between 9.5 Å and 10.5 Å. The simulations took place with cubic periodic boundary conditions. This process was repeated for each of the 4 solutes and one bulk system with no solute.

IFST production
The production simulations were carried out in the NVT ensemble for 60 ns at 25 K (T * = 0.578) with the same MD parameters as the NVT equilibration. NAMD produces a trajectory (dcd) file, which is a set of system coordinate snapshots. A trajectory frame was saved every 500 fs such that neighbouring frames in the dcd file were not too strongly correlated; this is important for the KNN estimator when used later to extract entropies as commented in the previous work by Huggins. 13 The energies ∆U =Ū solute −Ū bulk were calculated by taking the average energy across all 16 repeats of the 60 ns production simulation, and these are shown in Table II along with the standard deviation across 16 repeats, σ ∆U .

IFST free energy calculations
To calculate the change in free energy for IFST simulations, we need to calculate each component on the right hand side of ∆A 1|s = ∆U − T ∆S 1 |s (48) ∆U was calculated using Eq. (2), and therefore the quantities ∆S 1 |s and ∆S 2 |s must be calculated using Eqs. (37) and (38). This then requires collecting a set of nearest neighbour distances and nearest pair distances from the respective MD data for the entropy estimators.

C. K-Dimensional (K-D) trees
Previous studies using IFST or GIST have used a grid of voxels or cut-cell approach to find nearest neighbour distances. 13,14 For the pair nearest neighbour distances required for Eq. (44), this method was not practical. In order to increase the efficiency and allow the calculation of the pair terms, a K-Dimensional (K-D) tree method was used. This is covered in more detail in the supplementary material.

FEP protocol
The FEP simulation measures the change in free energy of removing a fixed solute from a box of neon. This simulation is parameterized by a variable λ such that when λ = 0, the solute is fully present and when λ = 1, the solute is fully annihilated.
In this study, each forward and backward FEP simulation had N = 64 "λ-windows," and each window has a different value of lambda which is labeled as degree of annihilation in the schedule displayed in Fig. 2. For the nth window out of a total of N windows, the value of λ in that window was generated with the expressions which satisfy λ f (0) = 0,λ f (N) = 1,λ b (0) = 1, and λ b (N) = 0, where λ f (n) and λ b (n) are the forward and backward schedules, respectively. These curves were picked to sample the endpoints of the FEP simulation more heavily to "avoid the end point catastrophe." 41 The endpoints are when the solute is close to full annihilation in the forward simulation (n ≈ N) or when the solute is just being created in the backward simulation (n ≈ 0). A van der Waals soft core potential parameter of 5.0 was used. One energy reading was stored every 100 steps which equates to every 0.2 ps.

FEP equilibration
The end points of the NVT equilibrations for the IFST simulations were used as starting points for the FEP simulations. Each solute had 16 replicates. Each simulation underwent a further 0.2 ns equilibration at T = 25 K (T * = 0.578) from this point to adjust to their individual λ values.

FEP simulation
For each λ-window, 0.469 ns of simulation was performed in the NVT ensemble at T = 25 K (T * = 0.578). Then, considering the 64 windows in both directions the overall time was 60.1 ns, which is comparable to the 60 ns used for the IFST production run.

FEP calculations
The final change in the free energy from the FEP simulations is found by summing the changes in free energy between each pair of neighbouring λ windows. Then, the forward change is and the backward change is To calculate the changes in free energy between the λwindows, the Bennett Acceptance Ratio (BAR) method was used, 42 which is included in the ParseFEP Plugin 43 for Visual Molecular Dynamics (VMD) 44 in which all FEP results for this study were calculated. The BAR estimator provides a calculated statistical error, and these errors were less than 0.007 kcal/mol for all of the 64 simulations.

A. Results
Fig . 3 shows the solute-fluid radial distribution functions for all solutes and bulk neon. Fig. 4 shows a comparison of the free energy estimates from both levels of the IFST results and from FEP.  4. The free energy estimates for the two IFST approximations and the FEP result at ε s values of 2.5ε n , 5.0ε n , 7.5ε n , and 10.0ε n . Different levels of theory have been spaced for clarity. C1PE is the conditional one-particle entropy correction. C2PE is the conditional two-particle entropy correction. The error bars are the spread in the 16 repeats of each result. Table II shows the comparison of the average free energies from FEP and IFST calculations, and the IFST calculations show the conditional one-particle and conditional two-particle entropy values.
The IFST free energies in Table II were calculated by Eqs. (48) and (49), the entropy terms were calculated by extrapolating Eq. (45) for the conditional one-particle entropies every 1000 frames from 1000 to 12 000, and Eq. (45) for conditional two-particle entropies, 16 repeats were taken at intervals of 1000, 2000, and 3000 frames and 2 repeats were extended to 1000, 2000, 3000, 4000, 5000, 7500, and 10 000 frames. Fig. 5 shows an example of the conditional two-particle entropy extrapolation for all solutes. The extrapolation routine used was from the gnuplot software, when fitting it weighted data points by the standard deviation of the 16 repeats of that point. For the conditional two-particle entropy fitting, where no standard deviation was available it was estimated to be 10/F kcal/mol, where F is the number of frames for that data point. Fig. 3 shows the radial distribution functions (RDFs) from the central solute to all solvent atoms. The troughs become deeper and the peaks become higher for increased solute potential ε s parameter. Also plotted is the RDF for the bulk solvent, which has the shallowest troughs and lowest peaks. There is a slight distortion in the second solvation shell, which becomes more pronounced with higher ε s . This is likely to FIG. 5. A plot demonstrating how the asymptotic values for the H s,2 and H l ,2 estimators were extracted. A power law is fitted to the three data points for each solute. The constant in the limit of infinite frames is used as in Eq. (45). be a gentle crowding effect as the solvent molecules in the first solvation shell draw close to the solute, and the closest locations for second shell atoms correspond to the pits in the already tightly packed first solvation shell. The magnitude of the solvent-solvent correlation entropy was expected to be small in the system of monatomic species used in this study. It has been shown that in the Lennard-Jones system, the primary source of the solvent structure comes from packing, 45 and this will correspond to an entropy associated with the volume exclusion of the central solute which is captured fully by the C1PE term. For more complex solvents, namely water, evidence already exists for the solvent-solvent correlations. 46,47 Fig . 4 shows the calculations of the free energy of solute annihilation from all three methods of evaluation with the standard deviations of 16 repeats used as error bars. The average values of these estimates are displayed in Table II along with the standard deviations of the 16 repeats. The average values for all methods are in good agreement. The IFST results including the conditional two-particle correction have the same averages and standard deviations of the first order IFST. This demonstrates that the C2PE has no clear contribution to the free energy of this system. The FEP results have the lowest standard deviation of all results even though a comparable length of MD run was used. The increased uncertainty associated with the IFST results likely arises from fitting power laws to extract the asymptotic entropy. The standard deviation in the free energy is at least 3 times greater than that of the MD energies used, which indicates the added uncertainty is from the entropy. This could potentially be remedied by taking more data points during the extrapolation process at the expense of data processing time. Both the average IFST results are within 0.16 kcal/mol of the FEP results. These results indicate that it is possible to reconstruct a fairly accurate measurement of the free energy change of this kind of process by only using the start and end points of an equivalent FEP path. The results indicate that the entropy term contributes relatively less to the free energy of solute annihilation for solutes with larger ε.

V. CONCLUSIONS
The translational entropy associated with the solutesolvent correlation and the solvent-solvent correlation can be used to evaluate the free energy of solvation in the IFST framework for a system of neon atoms solvating a fixed central Lennard-Jones potential. Both IFST estimates reproduce the estimate by an equivalent FEP simulation to within around 0.16 kcal/mol and appear to consistently overestimate slightly. The IFST method has an advantage over FEP and can give a free energy estimate without having to define a path between the two systems. However, the accuracy associated with FEP estimates is greater, so it is a better tool for computation. We conclude that IFST in its current state is the better tool for physical interpretation as it avoids the nonphysical lambda states utilized by FEP. We note that FEP is an already well-developed method, and IFST may yet have future improvements to increase its optimization. The conditional two-particle entropy term did not contribute a noticeable change in free energy for this system for any of the strengths of the solute potential tested.
The IFST framework can give the spatial contributions of configurational entropy when analysed with voxel based methods (as shown in Fig. 3 of the supplementary material). This allows areas of interest around the solute to be highlighted. Although the two-particle terms in the MIE do not appear to be significant in this simple system, they may be significant in a system with a liquid water solvent with hydrogen bonding and charges. The methods used in this work may be extended to such a system.
If the solvent-solvent interactions become very strong in a particular solvation process, it may be necessary to calculate the conditional three-particle entropy ∆S 3|s = S 1 solute − I 2 solute + I 3 solute − S 2 liquid + I 3 liquid (54) and the methodology used in this work could be extended to such a calculation.

SUPPLEMENTARY MATERIAL
See supplementary material for a description of the implementation of K-D trees in this work, and for a spatial illustration of the C2PE contributions around a fixed solute. The data and codes used for calculations in this work will be available at https://doi.org/10.17863/CAM.9335.