Training Algorithm Matters for the Performance of Neural Network Potential: A Case Study of Adam and the Kalman Filter Optimizers

One hidden yet important issue for developing neural network potentials (NNPs) is the choice of training algorithm. Here we compare the performance of two popular training algorithms, the adaptive moment estimation algorithm (Adam) and the Extended Kalman Filter algorithm (EKF), using the Behler-Parrinello neural network (BPNN) and two publicly accessible datasets of liquid water [Proc. Natl. Acad. Sci. U.S.A. 2016, 113, 8368-8373 and Proc. Natl. Acad. Sci. U.S.A. 2019, 116, 1110-1115]. This is achieved by implementing EKF in TensorFlow. It is found that NNPs trained with EKF are more transferable and less sensitive to the value of the learning rate, as compared to Adam. In both cases, error metrics of the validation set do not always serve as a good indicator for the actual performance of NNPs. Instead, we show that their performance correlates well with a Fisher information based similarity measure.


I. INTRODUCTION
Neural network potentials (NNPs) are one category of machine learning potentials 1-4 which approximate potential energy surfaces (PES) and allow for large-scale simulations with the accuracy of reference electronic structure calculations but at only a fraction of the computational cost 5 .
One prominent architecture of NNPs is the Behler-Parrinello neural network (BPNN) 6 , which introduced the idea of partitioning the total potential energy of the system into effective atomic contributions. BPNNs have been applied to a wide range of molecules and materials [7][8][9][10][11][12] . Despite these successes, the BPNN architecture relies on the selection of a set of symmetry functions before the training in order to describe the local chemical environments. On the contrary, features in deep learning 13 are automatically learned via hierarchical filters, rather than handcrafted. In particular, graph convolutional neural networks (GCNNs), which consider the atoms as nodes and the pairwise interactions as weighted edges in the graph, have emerged as a new type of architecture for constructing NNPs for both molecules and materials [14][15][16] .
Innovations regarding new architectures certainly drive the improvement of the performance of NNPs. However, one hidden and less discussed issue is the impact of training algorithms on the construction of NNPs. As a matter of fact, earlier implementations of BPNNs 17-21 used either the Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS) 22 or the Extended Kalman Filter algorithm (EKF) 23 as the default optimizer, while recent implementations of GCNN architectures [14][15][16]24 almost exclusively chose the adaptive moment estimation algorithm (Adam) 25 instead. This contrast is partly due to the convenience that efficient implementations of Adam are available in popular frameworks such as TensorFlow 26 and PyTorch 27 . Nevertheless, the practical equivalence among optimization algorithms for training NNPs is assumed rather than verified.
In principle, comparing training algorithms for the performance of NNPs is a highly non-trivial task. This largely comes from the fact that the final performance of NNPs is affected by many different factors which could convolute the comparison. These include the construction of the dataset, the setup of the loss function and batch schedule, the choice of neural network architecture, the hyperparameters of the training algorithm, not to mention human errors in implementing these training algorithms and the corresponding neural network architectures. Even if everything is properly controlled, the ground truth needed to judge the effect of the training algorithm on the performance of NNPs in real-application scenarios may simply not be available.
To carry out this daunting task, our strategy is to use published datasets of liquid water where credible reference values in real-application scenarios are available 8,9 . This allows us to just focus on the effect of training algorithm, while the rest of hyperparameters, i.e. the NNP architecture (setups of the BPNN and symmetry functions), the loss function and the batch schedule, in addition to the dataset, can be inherited and kept fixed. To minimize the impact due to differences in implementation, we implement EKF in the TensorFlow-based PiNN code 16 , where the BPNN architecture was previously implemented and tested. The technical soundness of our implementation of EKF is verified against the EKF implementation in the RuNNer code 19 , which was used to generate NNPs in the reference works with the same datasets 8,9 . In this way, we can compare EKF and Adam on an equal footing, where the native implementation of Adam in Tensorflow is used.
Before showing these results, we will outline the details of our comparative study, including the training algorithms, datasets, the network architecture, and the molecular dynamics simulations used for the density prediction, as described in the next section.

II. METHODS

A. Loss Function and Batch Schedule
As the aim of this study is to elucidate the effect of the training algorithm on the performance of NNPs, the definition of the loss function and the choice of batch schedule need to be held consistent between the EKF and Adam algorithms.
We denote ŷ(w, x) as the neural network prediction given the weight vector w and input vector x, and y as the training-data labels. Here, we define ξ = √(c/n) (ŷ − y) as the scaled error vector, where n is the size of the vector y and c is a weighting factor (for balancing the number of energy and force labels). This leads to the L₁ loss function L₁ = Σᵢ |ξᵢ|. The corresponding L₂ loss function, L₂ = Σᵢ ξᵢ², is then the mean squared error scaled by c. Since EKF minimizes the error vector ξ and Adam minimizes the L₂ loss, this setup ensures that the two algorithms optimize the same objective as long as the training data are the same.
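As a concrete illustration, the scaled error vector and the two losses can be sketched as follows (a minimal NumPy sketch; the √(c/n) scaling is our reading of the definition, chosen so that L₂ equals c times the mean squared error):

```python
import numpy as np

def scaled_error(y_pred, y_true, c):
    """Scaled error vector xi = sqrt(c/n) * (y_pred - y_true).

    With this scaling, L2 = sum(xi**2) equals c times the mean squared
    error, so EKF (minimizing xi) and Adam (minimizing L2) target the
    same objective on the same training data.
    """
    n = y_true.size
    return np.sqrt(c / n) * (y_pred - y_true)

y_pred = np.array([1.2, 0.8, 1.1])
y_true = np.array([1.0, 1.0, 1.0])
xi = scaled_error(y_pred, y_true, c=1.0)
L1 = np.abs(xi).sum()   # L1 loss, minimized (elementwise) by EKF
L2 = (xi ** 2).sum()    # L2 loss = c * MSE, minimized by Adam
```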
Another distinction between the two algorithms (without diving into the details, which are elaborated in the next section) is the batch schedule. EKF, having originated from the fields of control theory and signal processing 28 , is designed for on-line training (i.e. one training datapoint at a time), in contrast to the mini-batch (a group of randomly selected training data) often used in Adam. This difference becomes blurred when the multi-stream variant of EKF is employed. 20 Nevertheless, we followed the practice of previous studies and based each weight-update step on either energy or force labels, to avoid the complication of combining energy and force data in one step. 20 In the RuNNer setup of EKF, the weights are updated once a given number of errors and their corresponding Jacobians have been computed. As the number of force labels overwhelms that of energy labels, only a small fraction of the force labels is typically used 20 . In addition, the labels can be filtered according to the magnitude of the error, which potentially improves the efficiency. An illustration of the batch schedule used in RuNNer is shown in Fig. 1a.
Here, we choose to apply a simplified batch schedule in which the weight update is based on randomly selected force or energy labels in each iteration, without filtering, which is computationally more efficient in TensorFlow. In our particular instance, it is one energy label or ten force labels per iteration, as illustrated in Fig. 1b. As shown in Results and Discussion, this simplification of the batch schedule does not incur an inferior performance of the NNPs optimized with EKF. Subsequently, the same batch schedule is used in Adam for the sake of consistency.
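The simplified batch schedule can be sketched as follows (a sketch, not the actual PiNN implementation; the probability of drawing an energy batch, `p_energy`, is a hypothetical parameter chosen for illustration):

```python
import random

def next_batch(energy_ids, force_ids, p_energy=0.5, n_force=10):
    """Simplified batch schedule: each weight update uses either one
    randomly chosen energy label or `n_force` randomly chosen force
    labels, with no error-based filtering.
    """
    if random.random() < p_energy:
        return ("energy", [random.choice(energy_ids)])
    return ("force", random.sample(force_ids, n_force))
```

Each call returns the label type together with the label indices, so the training loop can build the corresponding error vector and Jacobian from energies or forces only.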
Further details regarding the loss function and batch schedule, including the estimated number of weight updates based on energy and forces in each scheme and the weighting factor c thereof, are listed in the Supplementary Material (Section B).

B. Algorithms
In the following, we first introduce the common notations used for the optimization algorithm, and then briefly state the Adam and EKF algorithms compared within this work.
Continuing from the previous section, we denote g as the gradient of the L₂ loss function with respect to the weight vector w, and J as the Jacobian matrix of ξ with respect to w.
The first optimization algorithm used in this work is Adam, which is a popular algorithm for the training of neural networks. 25 The algorithm can be considered as an extension of stochastic gradient descent in which the first moment m(t) and second moment v(t) are estimated at each step, and used as a preconditioner to the gradients.
The algorithm is shown in Algo. 1.

Algorithm 1: The Adam optimizer, where η is the learning rate, ϵ is a small number to avoid division by zero, β₁ and β₂ are the exponential moving-average factors, m and v are the first- and second-moment estimates, and m̂(t) and v̂(t) are the bias-corrected moment estimates. Here, β₁ = 0.9 and β₂ = 0.999, following the original publication. 25

Algorithm 2: The extended Kalman filter (EKF) optimizer, where ξ is the error vector at each iteration, J is the corresponding Jacobian matrix, A is termed the scaling matrix, K is the Kalman gain, P is the error covariance matrix, and the observation-noise covariance matrix R is set to (1/η)I, with η being the learning rate. 20 Following the settings in RuNNer 19 , the initial P(0) is set to the identity matrix and the process-noise covariance matrix Q is set to zero in this work.
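For reference, the Adam update of Algo. 1 can be written out as a minimal NumPy sketch (the native TensorFlow implementation is what is actually used in this work):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update following Algo. 1.

    w: weight vector; g: gradient of the L2 loss at step t (1-based);
    m, v: running first- and second-moment estimates.
    """
    m = beta1 * m + (1 - beta1) * g           # first moment m(t)
    v = beta2 * v + (1 - beta2) * g ** 2      # second moment v(t)
    m_hat = m / (1 - beta1 ** t)              # bias-corrected m(t)
    v_hat = v / (1 - beta2 ** t)              # bias-corrected v(t)
    w = w - eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```

The inverse square root of the second-moment estimate acts as a diagonal preconditioner on the gradient, a point we return to in Section III C.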
The other optimization algorithm used in this work is EKF, which estimates the internal state of a system given a series of observations over time. In the context of neural network training, it can be interpreted as updating the weights of the neural network according to the gradients of the errors with respect to the weights in past samples. 23 The algorithm is summarized in Algo. 2. Note that the notation used here follows Ref. 20, where the learning rate η is controlled by the observation-noise covariance matrix R. This formulation is more transparent as compared to earlier studies 29 .
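The EKF weight update of Algo. 2 can be sketched as follows (a minimal dense NumPy sketch under the settings stated above, R = (1/η)I and Q = 0; not the TensorFlow implementation, and the sign convention assumes ξ = ŷ − y as defined earlier):

```python
import numpy as np

def ekf_step(w, xi, J, P, eta=1e-3):
    """One global EKF weight update following Algo. 2.

    xi: scaled error vector at this iteration, shape (n_obs,).
    J:  Jacobian of xi w.r.t. w, shape (n_obs, n_w).
    P:  error covariance matrix, shape (n_w, n_w), with P(0) = I.
    """
    n_obs = xi.size
    # Scaling matrix A with observation noise R = (1/eta) I
    A = np.linalg.inv(np.eye(n_obs) / eta + J @ P @ J.T)
    K = P @ J.T @ A        # Kalman gain
    w = w - K @ xi         # minus sign because xi = prediction - label
    P = P - K @ J @ P      # covariance update (process noise Q = 0)
    return w, P
```

Note that P is an n_w × n_w matrix, which is the origin of the quadratic scaling of EKF with the number of trainable parameters discussed in the Conclusion.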

C. Dataset Description
Two datasets containing structures, energies and forces of liquid water (and ice phases) were used to train the NNPs in this work. 6324 structures in the BLYP dataset, covering both the liquid phase and ice phases, were taken from Ref. 8 and re-computed with the CP2K suite of programs. 30 The revPBE0-D3 dataset was taken from Ref. 9.

In this work, the NNPs were constructed using the BPNN architecture, 6 with the symmetry functions taken from Ref. 8. Specifically, 30 symmetry functions for oxygen and 27 for hydrogen were used to describe the local environment of each atom. The exact set of symmetry functions used is given in the Supplementary Material (Section A). This local description was fed to an element-specific feed-forward neural network containing 2 hidden layers, each comprised of 25 nodes with the tanh activation function, and a linear output node to give atomic energy predictions, from which the total energy and force predictions are derived. The weight parameters were initialized using the Nguyen-Widrow method in RuNNer, 33 and the Xavier method in PiNN. 34

D. Algorithm Hyperparameters
The PiNN-Adam models were trained for 5×10⁶ steps, with a learning rate η that decays by a factor of 0.994 every 10⁴ steps. Notably, the gradient clipping technique was used with the Adam optimizer to alleviate the vanishing and exploding gradient problems. 35 Specifically, gradient vectors g with ‖g‖ > 0.01 are scaled by a factor of 0.01/‖g‖ during training.
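This clipping rule amounts to the following (a minimal NumPy sketch; in TensorFlow the same operation is available as `tf.clip_by_norm` with `clip_norm=0.01`):

```python
import numpy as np

def clip_gradient(g, threshold=0.01):
    """Rescale g to norm `threshold` whenever its norm exceeds it;
    gradients with a smaller norm are left unchanged.
    """
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g
```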
The PiNN-EKF models were trained with a learning rate η for 5×10⁵ steps. The RuNNer models were trained for 20 epochs (passes over the entire dataset), which corresponds to a total of around 3×10⁵ steps. As shown in Algo. 2, the learning rate η of EKF is embedded in the covariance matrix R and implicitly decays according to the inverse of the iteration number (see Ref. 36 or Eq. 2 below). η is set to a constant in the PiNN setup; its counterpart asymptotically approaches unity in the RuNNer setup, following earlier studies. 19,29 In both cases, 80% of the dataset was used for training and the remaining 20% was left out as a validation set. For each setup, 10 models were trained with different random seeds used to split the dataset and initialize the weights of the neural network. Whenever applicable, the standard deviations of the predictions across the models are used as error estimates.

E. Molecular Dynamics Simulation
The molecular dynamics (MD) simulations for all the models were carried out with the ASE code. 37 The PiNN code supports calculations with ASE, while the RuNNer code 19 was interfaced to ASE through the LAMMPS 38 calculator implemented in ASE. The timestep for all simulations was chosen to be 0.5 fs. For constant particle number, constant pressure and constant temperature (NPT) simulations, the Berendsen barostat and thermostat 39 were used to keep the pressure at 1 bar and the temperature at 330 or 320 K. For constant particle number, constant volume and constant temperature (NVT) simulations, the Berendsen thermostat was used to keep the temperature at 300 K, and the density was fixed at 1 g/mL. Each MD simulation was run for 100 ps, from which densities or radial distribution functions were computed using the final 50 ps.

III. RESULTS AND DISCUSSION
A. The extrapolation regime: the case of the BLYP dataset

FIG. 2. 2D histogram of the BLYP dataset in terms of the total energy per atom and the bulk density. We excluded a small fraction of the ice-phase structures at very high densities, as compared to the original dataset 8 (shown as black outline).
The energy-density distribution of the BLYP dataset is shown in Fig. 2. This dataset contains structures from both the liquid phase and different ice phases, with a peak centered at 1.0 g/mL. 8 Given that the equilibrium density of BLYP water at ambient conditions is below 0.8 g/mL, 8 the isobaric-isothermal density at 1 bar and 330 K serves as an instructive case study in which the NNP is stretched into the extrapolation regime. This motivates us to discuss the results for the BLYP dataset first, where Adam and EKF show qualitative differences. Results shown in this section were generated using a learning rate of 10⁻⁴ for PiNN-Adam and a learning rate of 10⁻³ for PiNN-EKF, without loss of generality (see Section C in the Supplementary Material for details).
Before presenting those results, it is necessary to first discuss the convergence speed of the NNP training. The EKF optimizer is known to converge much faster than Adam (by approximately one order of magnitude in terms of the number of weight updates needed to achieve the desired accuracy) 20,40 . This speed-up is also evident in our training of NNPs, as shown in Fig. 3. However, the actual speed-up is compromised by the higher computational cost of EKF. In practice, the training takes about 2 h with the EKF optimizer and about 5 h with the Adam optimizer on a 28-core computer to achieve similar levels of accuracy in terms of forces and energies. It should be noted that the per-iteration cost of EKF relative to Adam increases drastically as the total number of weights grows.
When it comes to the performance of NNPs, one would expect that all the trained potentials with comparable error metrics should yield a consistent water structure at room temperature and experimental density, given the dataset shown in Fig. 2. Indeed, this is borne out, as demonstrated with the O-O radial distribution function in Fig. 4a. However, this agreement breaks down quickly in isobaric-isothermal simulations using the same NNPs, as shown in Fig. 4b, which suggests that the corresponding densities should also differ from each other.
We then proceed to see how well Adam and EKF estimate the isobaric-isothermal density at ambient conditions for the BLYP dataset. As shown in Fig. 5, at a pressure of 1 bar and a temperature of 330 K, only 2 out of 10 models trained with Adam manage to predict a stable density, while most of the models trained with EKF lead to excellent agreement with the previously reported density of around 0.76 g/mL for BLYP water. 8 In addition, the EKF implementation in PiNN reproduces the RuNNer results well for the same dataset.
The qualitative difference between NNPs trained with Adam and EKF seen in Fig. 5 is striking, in light of the comparable force and energy errors shown in Fig. 3 and the radial distribution functions shown in Fig. 4a. One may argue that extrapolation is a difficult task for NNPs, because it requires them to step out of their comfort zone. Thus, it is also relevant to consider how the two training algorithms fare within an interpolation regime. This leads to our second example, the revPBE0-D3 dataset.

B. The interpolation regime: the case of the revPBE0-D3 dataset
In the following experiment, we used a dataset that was constructed to cover a wide range of the configuration space uniformly, 9 as seen in its energy-density distribution (Fig. 6). With this dataset, both Adam and EKF yield physical densities at the given temperature. All 10 models trained with Adam, regardless of the learning rate, lead to stable NPT simulations, in stark contrast to the case of the BLYP dataset (see Fig. S1 in the Supplementary Material for comparison). However, as shown in Fig. 7, the density prediction in the case of Adam depends strongly on the learning rate used to train the model.
It is tempting to conclude that the models trained with larger learning rates are the better performing ones, owing to their smaller force and energy errors. However, similar to the observation made for the BLYP dataset, the error metrics do not correspond to the actual prediction performance of the trained NNPs. Indeed, we notice that the density has not converged even when the error metrics (Fig. 7) have reached lower values. Given that the reference density for this dataset at 1 bar and 320 K is 0.92 g/mL 9 , this suggests that one has to rethink the common wisdom of performing model selection based on error metrics such as the RMSE. 41,42 Regardless of this implication, the present case study demonstrates that the performance of NNPs can be sensitive to the learning rate, in particular for Adam. As a consequence, this can lead to a tangible difference in the density prediction even within the interpolation regime, where the dataset already has good coverage.

C. Differences in models trained using Adam and EKF
In the previous sections, we found that NNPs trained with EKF are less sensitive to the learning rate and more generalizable. Meanwhile, it is also clear that model selection based on the error metrics of energy and force may not always lead to a sound choice. While counterintuitive, this observation is not completely unexpected. The error metrics are typically evaluated on randomly left-out validation sets, which underestimate the generalization error due to their correlation with the training set. Since the training set has to cover a sufficient region of the configuration space, it is hard to construct a test set that is independent of the training set.
Then, the questions are: i) can we distinguish "good" and "bad" NNPs by looking into the weight space alone (rather than running actual MD simulations for the density prediction)? and ii) why does EKF do better than Adam for training NNPs?
To answer the first question, we have analysed the models presented in the earlier sections to shed some light on the characteristics of the models trained with Adam and EKF.
One possible measure to distinguish different models is the Euclidean distance of the optimized weights from their initialization, which characterizes the implicit regularization effect of stochastic gradient descent. 43 Denoting the vector of all weight variables in the neural network as w, we compared the evolution of the weight distance from initialization, Δw = ‖w − w_init‖, as well as the initialization-independent distribution of the individual weights w_i of the final trained networks.
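Both quantities are straightforward to compute from weight snapshots; a minimal sketch:

```python
import numpy as np

def weight_distance(w, w_init):
    """Euclidean distance of the current weights from their
    initialization, ||w - w_init||.
    """
    return np.linalg.norm(np.asarray(w) - np.asarray(w_init))
```

The initialization-independent counterpart is simply the histogram of the entries of w itself, e.g. via `np.histogram(w)`.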
As shown in Fig. 8a, different NNPs trained on the BLYP dataset display similar weight distances from initialization. Moreover, the weight distributions of NNPs trained with Adam and EKF are almost indistinguishable, as shown in Fig. 8b. This suggests that the norm-based measure also fails to distinguish models obtained from Adam and EKF. It further hints that L₂ norm-based regularization may not improve the performance of NNPs.
Another class of similarity measures is related to the local information geometry of the neural network. 44 Here, we characterized it with the Fisher information matrix I:

I = E[(∇_w L₁)^⊗2],    (1)

where ⊗2 denotes the 2nd tensor power (outer product) of the gradient vector. Here we used the L₁ loss instead of the L₂ loss to construct the Fisher information matrix, for reasons that will become clear later. Fig. 9 shows the distribution of eigenvalues λ_I of the Fisher information. For the BLYP dataset, EKF found local minima with much larger eigenvalues as compared to Adam, where the peak of the distribution differs by about one order of magnitude (note the logarithmic scale used here). Therefore, the Fisher information provides a similarity measure which can effectively distinguish the performance of NNPs.
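The empirical Fisher information of Eq. 1 and its eigenvalue spectrum can be estimated from per-label gradients of the L₁ loss, e.g. (a minimal NumPy sketch, assuming the per-label gradients have already been collected into a matrix with one row per label):

```python
import numpy as np

def fisher_eigenvalues(per_label_grads):
    """Eigenvalue spectrum of the empirical Fisher information
    I = mean_i (grad_w |xi_i|)^(x2), estimated as the average of the
    outer products of per-label gradients of the L1 loss.
    """
    G = np.asarray(per_label_grads)   # shape (n_labels, n_w)
    I = G.T @ G / G.shape[0]          # average of outer products
    return np.linalg.eigvalsh(I)      # I is symmetric by construction
```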
A similar trend was observed for the revPBE0-D3 dataset. The eigenvalue distribution of the NNPs trained with Adam and a small learning rate (10⁻⁵) is closest to those trained with EKF. The deviation becomes larger as the learning rate increases (10⁻⁴ and 10⁻³). In contrast, the eigenvalue distributions of NNPs trained with EKF are almost identical. These results correlate well with the density predictions from the MD simulations shown in Fig. 7.
The usefulness of the Fisher information for model selection is also linked to the second question posed at the beginning of this section. Despite different motivations, both Adam and EKF can be understood in terms of natural gradient descent (NGD), where the inverse of the Fisher information matrix I is used as a preconditioner.
In Adam, I is approximated as a diagonal matrix, with the diagonal terms being the second-moment vector v̂(t). The inverse square root of the diagonal elements is used as a conservative preconditioner.
On the other hand, the connection between EKF and NGD has been shown recently 36 , where the inverse of I is effectively estimated by the error covariance matrix P in EKF. In fact, one can show that, when the observation-noise covariance matrix R in Algo. 2 is chosen to be a scaled identity matrix 20,36,40 , i.e. R(t) = (1/η)I, EKF amounts to an online estimate of the full Fisher information I with respect to the L₁ loss, combined with a 1/t decay of the learning rate. We think this feature leads to both a good similarity measure of models based on the Fisher information of Eq. 1 and the superior performance of EKF for training NNPs, as demonstrated in the two case studies shown in this work.
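A sketch of this connection (not a rigorous derivation; it uses the settings of Algo. 2, i.e. R(t) = (1/η)I, Q = 0 and P(0) = I, and the matrix inversion lemma applied to the covariance update):

```latex
% Covariance update of Algo. 2 with Q = 0, inverted via the
% matrix inversion (Woodbury) lemma with R = (1/\eta) I:
P(t)^{-1} = P(t-1)^{-1} + \eta\, J(t)^{\top} J(t)
          = P(0)^{-1} + \eta \sum_{s=1}^{t} J(s)^{\top} J(s).
% Since \nabla_w |\xi_i| = \mathrm{sign}(\xi_i)\, J_i, each term
% J^{\top} J = \sum_i \left(\nabla_w |\xi_i|\right)^{\otimes 2}
% is an empirical estimate of the Fisher information of Eq. 1
% with respect to the L1 loss, so that, up to the P(0) term,
P(t) \approx \frac{1}{\eta\, t}\, \hat{I}^{-1},
% i.e. EKF preconditions the update with the inverse Fisher
% information and an implicit 1/t decay of the learning rate.
```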

IV. CONCLUSION AND OUTLOOK
To sum up, we have compared the performance of two optimization algorithms, Adam and EKF, for training NNPs of liquid water based on the BLYP and revPBE0-D3 datasets. It is found that NNPs trained with EKF are more transferable and less sensitive to the choice of the learning rate, as compared to Adam. Further, we show that the Fisher information is a useful similarity measure for the model selection of NNPs when the error metrics and the norm-based measure become ineffective.
Established practice in other neural network applications using Adam 45 indicates that practical training performance can be improved if an architecture is reparametrized so that the weights fall in a common numerical range, without strong correlations. Another future avenue would thus be to preserve the expressive power of existing NNPs while expressing the weights in such a way that the off-diagonal elements of the resulting Fisher information matrix are minimized. This can be understood in terms of the limitations of the diagonal preconditioner in Adam.
On the other hand, EKF scales quadratically with the number of trainable parameters in the network, while Adam scales linearly. This makes EKF a less favorable choice when more heavily parameterized models like GCNNs [14][15][16] are of interest. Approximations to the estimation of I, such as those based on Kronecker factorization, 46 are therefore essential to transfer the present observations to those models.
Before closing, it is worth noting that the present issue of training algorithms may be alleviated with more data (as shown in Fig. 9 with the augmented dataset from our previous work 16 ), for which a number of active-learning approaches tailored for NNPs have been proposed. 11,[47][48][49][50] However, we would argue that a better training algorithm not only improves the performance of NNPs but also improves the data efficiency of active learning.
This work focuses on the practical role of the training algorithm in the performance of NNPs, where the different approximations to the Fisher information (in Adam and EKF) and the choice of learning rate are the factors primarily considered. Nevertheless, as mentioned at the very beginning, other factors, such as the loss function and batch schedule, the neural network architecture, and the decay factor and decay steps of the training algorithm, will bring new dimensions into the discussion. In addition, the examples shown here are based on published datasets of liquid water, and the real-application scenario is density prediction. Therefore, the conclusions drawn in this work are yet to be confirmed for more complex systems and other physico-chemical properties, where credible reference values in real-application scenarios can be obtained. Overall, this calls for more systematic investigations to shed light on this topic and for the practice of open science to facilitate similar studies in the community. With these efforts, the development of NNPs will become more transparent and reproducible, which in turn will provide reliable insights into interesting chemical physics and physical chemistry problems.

SUPPLEMENTARY MATERIAL
See the Supplementary Material for the symmetry function parameters, the weighting factor in the batch schedule and loss function, and the learning rate dependency of the error metrics and density for the BLYP dataset.

ACKNOWLEDGMENTS
CZ thanks the Swedish Research Council (VR) for a starting grant (No. 2019-05012). We also acknowledge the Swedish National Strategic e-Science program eSSENCE for funding. Part of the simulations were performed on the resources provided by the Swedish National Infrastructure for Computing (SNIC) at UPPMAX and NSC.

DATA AVAILABILITY STATEMENT
Datasets used in this work are publicly accessible from DOI: 10.5281/zenodo.2634097 and DOI: 10.24435/materialscloud:2018.0020/v1. The PiNN code is available from https://github.com/Teoroo-CMC/PiNN and its EKF implementation will be released in the next version and provided upon request in the interim.

C. Learning rate dependency of error metrics and density for the BLYP dataset
The effect of the learning rate on the error metrics and density prediction for the BLYP dataset is shown in Fig. S1. In the Main Text, the learning rate for Adam is set to 10⁻⁴ to represent the best hyperparameter as determined by the error metrics; the learning rate for EKF is set to 10⁻³.

Figure S1: RMSE and density predictions for NNPs trained with the Adam and EKF algorithms at different learning rates: a) energy RMSE; b) force RMSE; and c) density predictions at 330 K, 1 bar. The error bars are estimated using the standard deviation between models trained with the same setup whenever possible. The number of models (out of 10 instances) that predict stable densities is annotated in parentheses in panel c.