Facilitating ab initio configurational sampling of multicomponent solids using an on-lattice neural network model and active learning

We propose a scheme for ab initio configurational sampling in multicomponent crystalline solids using Behler-Parrinello-type neural network potentials (NNPs) in an unconventional way: the NNPs are trained to predict the energies of relaxed structures from the perfect lattice with configurational disorder, instead of the usual way of training to predict energies as functions of continuous atom coordinates. An active learning scheme is employed to obtain a training set containing configurations of thermodynamic relevance. This enables bypassing of the structural relaxation procedure, which is necessary when applying conventional NNP approaches to the lattice configuration problem. The idea is demonstrated on the calculation of the temperature dependence of the degree of A/B site inversion in three spinel oxides, MgAl2O4, ZnAl2O4, and MgGa2O4. The present scheme may serve as an alternative to cluster expansion for 'difficult' systems, e.g., complex bulk and interface systems.


I. INTRODUCTION
Configurational order/disorder determines many properties of functional materials, including mechanical strength, catalytic activity, and ion/electron/phonon conductivity. Configurational disorder can be predicted using Monte Carlo (MC) methods based on statistical mechanics, but MC methods usually require a huge number of energy evaluations. Because of this, most works in the past have relied on effective low-cost models that are fitted to density functional theory (DFT) calculations [1-3]. For solid-state systems that can be mapped onto a lattice, the cluster expansion method [4-11] is often considered the de facto standard for obtaining such effective models. This method describes configuration energies using an expansion based on effective cluster interactions (ECIs):

$$E(\boldsymbol{\sigma}) = \sum_{\alpha} m_{\alpha} V_{\alpha} \langle \Phi_{\alpha}(\boldsymbol{\sigma}) \rangle, \tag{1}$$

where $\alpha$ denotes empty, single-site, and multisite clusters, $m_{\alpha}$ is the cluster multiplicity, $V_{\alpha}$ is the ECI, and $\boldsymbol{\sigma}$ is the occupation vector. $\Phi_{\alpha}$ is the cluster basis function, which for a binary system takes the simple form

$$\Phi_{\alpha}(\boldsymbol{\sigma}) = \prod_{p \in \alpha} \sigma_{p}, \tag{2}$$

where $p$ runs over the site indices in cluster $\alpha$ and $\sigma_{p} = \pm 1$ depending on which component occupies site $p$. Appropriate basis functions for systems with more components can also be generated according to, e.g., Ref. 8. We note that the validity of, and the formal approximations involved in, mapping a realistic alloy model with vibrations and intra-atomic excitations onto an on-lattice model has been explored rather thoroughly by Ceder [5].
The ECIs are usually fitted to energies of relaxed structures from small-scale DFT calculations, then used to calculate energies of a much larger supercell in the course of MC sampling. Although this cluster expansion approach has seen much success, especially for metallic alloys, certain limitations and difficulties have been pointed out over the years. For example, it is known that the accuracy of the cluster expansion energy prediction degrades when there is significant lattice relaxation [12]. In fact, cluster expansion with constant ECIs is, in a way, a zeroth-order approximation and formally cannot provide an exact fitting including relaxation effects [9]. Another difficulty lies in choosing the finite number of clusters that will give the best predictions. For example, Seko et al. reported a spurious prediction of a discontinuous order-disorder transition in an oxide system with long-range interactions when choosing clusters that perform best within limited supercell sizes [13]. Still another issue is how to avoid undesirable training set bias. These issues are, in fact, highly nontrivial, and developing robust methods for cluster and training set selection is still an area of active research [7,11,13-15]. Also, cluster expansion becomes computationally demanding in a combinatorial way as the number of components and sublattices increases, and this limits the number of clusters that can be included in the expansion [10].
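As a minimal sketch of how a fitted cluster expansion (Eqs. 1-2) is evaluated on a binary occupation vector, consider the following; the multiplicities are folded into the ECIs here, only empty, point, and pair clusters are kept, and all names are illustrative rather than taken from any actual CE code:

```python
# Toy binary cluster-expansion evaluation: sigma_p = +/-1 on each site.
# E(sigma) = V_0 + V_1 * sum_p sigma_p + V_2 * sum_{(p,q)} sigma_p sigma_q,
# with multiplicities absorbed into the effective cluster interactions.

def ce_energy(sigma, v_empty, v_point, v_pair, pairs):
    e = v_empty
    e += v_point * sum(sigma)                                  # point clusters
    e += v_pair * sum(sigma[p] * sigma[q] for p, q in pairs)   # pair clusters
    return e
```

In an MC simulation only the clusters touching a flipped site need re-evaluation, which is what makes CE energy differences so cheap compared to DFT.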
In view of these issues, we as well as some other workers have opted to bypass fitted models and sample directly on DFT energies [16-19]. Using more sophisticated MC schemes suited for parallel computation, such as Wang-Landau [20] or replica exchange Monte Carlo (RXMC) [21] sampling, sufficient statistical sampling has been achieved on calculation models with up to a few hundred atoms. However, some of these works required weeks of calculations on up to 100 supercomputer nodes, and much acceleration is necessary if this type of approach is to be used in a more widespread manner on a variety of materials systems. Towards this end, we consider, in this work, the so-called "active learning" approach, which has been developed to accelerate first-principles molecular dynamics simulations or MC calculations [22-30]. The basic idea is to use machine-learning potentials (MLPs) that have been fitted to DFT results to accelerate the calculations. Since MLPs are usually good at interpolating but not at extrapolating, relearning is performed when the system wanders into a previously unlearned region of structure space, and the simulation is then restarted with the newly tuned potential. Here, we apply this idea to the lattice configuration problem.
Many forms of MLPs have been proposed in the literature; in this work, we employ the high-dimensional neural network potential (NNP) scheme proposed by Behler and Parrinello [31-34]. This scheme assumes that the total energy can be decomposed into atomic energies determined by the environment around each atom, and uses neural networks to fit these atomic energies. That is, the total energy is expressed as

$$E(\boldsymbol{\sigma}) = \sum_{i} E_{i} = \sum_{i} \mathrm{NNP}_{t_{i}}\!\left( f\!\left[\boldsymbol{\sigma}_{R_{c}}^{i}\right] \right), \tag{3}$$

where the configuration vector $\boldsymbol{\sigma}$ is represented by continuous atom coordinates, in contrast to the cluster expansion (Eq. 1) where $\boldsymbol{\sigma}$ represents occupations on a discrete lattice. $t_{i}$ is the atom type, meaning that a unique NNP is fitted for each atom type. $\boldsymbol{\sigma}_{R_{c}}^{i} \subset \boldsymbol{\sigma}$ represents the coordinates of atoms within a cutoff radius $R_{c}$ of atom $i$, and $f$ is a fingerprint function that transforms $\boldsymbol{\sigma}_{R_{c}}^{i}$ into a multidimensional descriptor of the atomic environment. To be physically meaningful and transferable, $f$ must be chosen so that the resulting descriptor is invariant with respect to translation and rotation of the system. To this end, several fingerprinting schemes have been devised, including symmetry functions [31-33], smooth overlap of atomic positions (SOAP) [35], and Deep Potential [36,37]. In this work, we employ the Chebyshev polynomial-based fingerprint proposed by Artrith et al. [38]; the effectiveness of this fingerprint has been demonstrated especially for multicomponent systems. The effectiveness of other fingerprints or MLP forms for our purposes is certainly of interest and may be considered in future works.
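The decomposition in Eq. 3 can be expressed compactly in code; `nnp_by_type` and `fingerprint` below are hypothetical callables standing in for the trained per-type networks and the descriptor function $f$:

```python
# Eq. (3) as code: the total energy is a sum of per-type atomic networks
# evaluated on local-environment fingerprints.

def total_energy(atoms, nnp_by_type, fingerprint):
    # atoms: iterable of (atom_type, local_environment) pairs
    return sum(nnp_by_type[t](fingerprint(env)) for t, env in atoms)
```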
The main distinguishing point of this work compared to the literature on high-dimensional NNPs is that we train the NNP model to predict the total energies of relaxed structures from unrelaxed ideal lattice structures with configurational disorder. This is quite different from the usual approach of learning the total energies as a function of continuous atom coordinates (the only exception we are aware of is Ref. 39, which used the NNP scheme to describe the relaxation energy when a single Cu interstitial is dissolved in amorphous Ta2O5). The obvious merit compared to conventional NNP approaches is that we can bypass structural relaxation; the demerits are that we lose access to ion dynamics and that we cannot expect transferability to other lattice structures. Since relaxation calculations in a moderately sized supercell can sometimes take hundreds of steps, the speedup attained in this manner can be significant. On the other hand, the abovementioned demerits are not an issue when using the NNP model for MC sampling on a lattice (it should be noted that MC sampling in continuous coordinate space using random trial displacements is quite inefficient, because atoms are more or less close-packed and such trial steps will almost never lead to exchange of atom sites [18]).
In a sense, this is a "lazy" alternative to the cluster expansion method without the need to explicitly consider optimized bases, i.e., clusters; there is also no need to perform any explicit basis reduction based on lattice symmetry. This circumvents the various issues, mentioned above, associated with choosing the finite number of clusters to be used for prediction. Also, promising atom-environment descriptors have been proposed for treating multicomponent and multisublattice systems without combinatorial explosion [38], which remains an unsolved issue for cluster expansion. We can also use the same fingerprint functions for bulk, surface, and interface systems, while cluster expansion requires more clusters for nonbulk systems [40]. Another merit is that the nonlinearity of the NNP model can lead to better convergence in practice compared to linear models such as the cluster expansion [38]. This may lead to a more accurate description of systems with significant lattice relaxation, where cluster expansion exhibits some difficulties [12]. We will explore this aspect in the following sections. A remaining issue is the choice of input structures for training, which is just as problematic as in other approaches. In fact, we demonstrate a rather spectacular failure when training only on randomly generated configurations, and then show how well the active learning approach can solve this issue.
As a side note, we point out that the NNP can be trained directly to predict the energy as a function of the configuration vector σ, as was done in Ref. 41. However, the transferability of the NNP to larger supercells in such a scheme is nontrivial, and thus we advocate the use of atom-centric fingerprint functions. It is also possible to perform nonlinear cluster expansion, i.e., fitting the total energy as a function of cluster correlations using a neural network [42]. Yet another successful model is the low-rank potential, which expresses the interatomic potential as a low-rank decomposition of a tensor whose indices specify the local atomic environment [27,43]. We do not claim that the approach presented in this work is an overall better method, as it is difficult to compare overall performance, including calculation speed, accuracy, and ease-of-use, on an equal footing. However, as noted above, the cluster selection or explicit mapping to a tensor-like expression required in other schemes can become intractable with increasing complexity of the system. Thus, we believe that the current approach will enable configurational sampling in complex bulk or interface systems with many components and sublattices, i.e., those systems where it is technically difficult to employ previously proposed approaches.

II. METHODS

A. Benchmark system: spinel oxides
We demonstrate the above idea on the calculation of the degree of A/B-site inversion in a few oxides with spinel-type structures. Spinel oxides share the general formula AB2O4, with A (B) representing a divalent (trivalent) cation, and are of interest in mineralogy as well as in technological applications including magnets, photocatalysis, and battery materials [44-46]. In an "ordered" or "normal" spinel, all divalent cations occupy the tetrahedrally coordinated sites, and all trivalent cations occupy the octahedral sites. However, spinels with various divalent/trivalent cation combinations can exhibit significant deviation from the "normal" occupation, and this is often quantified by the degree of inversion (DOI), defined as the fraction of A sites (tetrahedral) occupied by B (trivalent) cations. The DOI corresponding to a completely random occupation is 2/3, while a nearly "inverse" occupation with a DOI of nearly unity is also known for some compounds. Prediction of the DOI as a function of temperature is of interest because of its potential impact on the magnetic, optical, thermal, and electrochemical properties of the material.
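The DOI definition above can be made concrete with a small helper; the site labels are purely illustrative:

```python
# Degree of inversion (DOI): the fraction of tetrahedral A sites that are
# occupied by trivalent (B) cations.

def degree_of_inversion(tet_site_occupants):
    """tet_site_occupants: list of 'A' or 'B' labels, one per tetrahedral site."""
    return tet_site_occupants.count('B') / len(tet_site_occupants)
```

A fully random cation arrangement gives DOI = 2/3, since two thirds of all cations in AB2O4 are trivalent.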
In this work, we calculate the degree of inversion in MgAl2O4, ZnAl2O4, and MgGa2O4 and compare to a series of cluster expansion works by Seko et al. [7,13,14,47]. The former two are known to be nearly normal, while significant inversion has been reported for MgGa2O4. Spinels constitute a prototypical multivalent system (i.e., A2+ and B3+ share lattice sites) where a naive cluster expansion can lead to large qualitative errors due to overfitting in a small calculation cell; augmentation with a screened point charge model (CE-SPCM) was found to be necessary in Ref. 13 to obtain accurate predictions without considering rather long-range pair clusters. Also, the feasibility of avoiding cluster expansion fitting and sampling directly on DFT-relaxed energies was demonstrated in combination with RXMC sampling in Ref. 18 with a 48-cation model. Here, we consider a 192-cation model, which is completely beyond the reach of direct DFT sampling for two reasons: (1) the longer time required for each DFT calculation, and (2) the much larger configurational space that has to be sampled. Regarding (2), the number of configurational degrees of freedom for the 48-cation model is $\binom{48}{16} \sim 2 \times 10^{12}$, while that for the 192-cation model is $\binom{192}{64} \sim 7 \times 10^{51}$. This explosion of the configuration space attests to the curse of dimensionality that we are fighting against, although there is actually no need to perform calculations on all of these configurations if good sampling schemes (such as RXMC) are available.
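The configuration counts quoted above are straightforward to verify with Python's standard library:

```python
import math

# Number of ways to distribute the "inverted" cations over the tetrahedral
# sublattice: C(48, 16) for the 48-cation model, C(192, 64) for the
# 192-cation model.
n_small = math.comb(48, 16)    # on the order of 2e12
n_large = math.comb(192, 64)   # on the order of 7e51
```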

B. DFT calculations
The reference DFT calculations were performed using the VASP code [48,49]. We employed the projector augmented-wave method [50] to describe electron-ion interactions. A plane-wave basis set with an energy cutoff of 400 eV was used to expand the wave functions. The Brillouin zone was sampled only at the Γ point. The GGA-PBE functional [51] was used to approximate the exchange-correlation energy. In this work, we relaxed the lattice vectors as well as the internal coordinates until the forces fell below 0.04 eV/Å. It should be noted that lattice vector relaxation leads to changes in the calculation mesh density, which means that the effective cutoff energy deviates from the preset value. To alleviate this issue, the relaxations were repeated several times so that the final structures and energies are consistent with the cutoff energy of 400 eV.
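For illustration, a VASP INCAR consistent with the settings described above might look as follows; this is a hypothetical reconstruction, not the authors' actual input file:

```
PREC   = Normal
ENCUT  = 400      ! plane-wave cutoff (eV)
GGA    = PE       ! PBE exchange-correlation
IBRION = 2        ! conjugate-gradient ionic relaxation
ISIF   = 3        ! relax ions, cell shape, and cell volume
EDIFFG = -0.04    ! stop when all forces fall below 0.04 eV/Angstrom
NSW    = 100      ! maximum number of ionic steps
```

together with a Γ-only KPOINTS file. Because ISIF = 3 changes the cell, rerunning the relaxation until the structure stops changing (as the authors describe) keeps the basis consistent with ENCUT.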

C. NNP training and evaluation
The NNP training and subsequent evaluation were performed using the Atomic Energy NETwork (aenet) code [38,52]. As noted above, the NNP model is trained to predict relaxed energies. This may be represented using a slightly modified version of Eq. 3:

$$E_{\mathrm{rel}}(\boldsymbol{\sigma}) = \sum_{i} \mathrm{NNP}^{\mathrm{rel}}_{t_{i}}\!\left( f\!\left[\boldsymbol{\sigma}_{R_{c}}^{i}\right] \right), \tag{4}$$

where $E_{\mathrm{rel}}(\boldsymbol{\sigma})$ represents the energy after structural relaxation from a starting structure $\boldsymbol{\sigma}$, $f[\boldsymbol{\sigma}_{R_{c}}^{i}]$ represents the atomic environment descriptors for the starting structure, and $\mathrm{NNP}^{\mathrm{rel}}_{t_{i}}$ represents the atomic energy contributions to the relaxed total energy. The training and use of the NNP are restricted to $\boldsymbol{\sigma} \in \{\boldsymbol{\sigma}_{\mathrm{lattice}}\}$, where $\{\boldsymbol{\sigma}_{\mathrm{lattice}}\}$ is the set of ideal lattice structures with configurational disorder. Equation 4 states that the relaxed energies can be calculated as a sum of atomic contributions, which, in turn, can be calculated from coordinates corresponding to the ideal lattice. This is an ansatz that we are making. In general, it is impossible to prove the exactness of a certain machine learning model in reproducing a physics model, and our approach is no exception. This is in contrast to cluster expansion, for which a rigorous proof exists for mapping fully relaxed energies to a lattice-based Hamiltonian [5,9]. The ansatz will need to be verified for each system under study, and active learning is an ideal approach in this regard.
We used the fingerprinting scheme of Artrith et al. [38], where the radial and angular distribution functions (RDF and ADF) with an appropriately chosen cutoff are expanded in an orthogonal basis set based on Chebyshev polynomials, and the expansion coefficients are fed into the neural network as the input descriptors. A key point in their scheme is that the structural and compositional descriptors are given separately: the structural descriptor is given by the usual RDF and ADF, while the compositional descriptor is also given by an RDF and ADF but with atomic contributions weighted differently for each chemical species. In this work, we expand the RDF with a cutoff of 8.0 Å and expansion order 16, and the ADF with a cutoff of 6.5 Å and expansion order 4. The employed neural network thus has an input layer with 44 nodes [53], and we chose to use 2 hidden layers with 15 nodes each and tanh activation (there is actually an additional bias node to accommodate arbitrary references for the energy [52]). The training epochs were monitored, and training was terminated when the test set error began to increase. We note that it is necessary to validate the model on an additional data set that is separate from the testing data used for stopping the training epochs. This is done, in a sense, through the active learning procedure detailed below.
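The radial part of such a fingerprint can be sketched as follows. This is written in the spirit of the Chebyshev expansion of Artrith et al., but the normalization and cutoff-function details of the actual aenet implementation are omitted, so treat it as an illustration only:

```python
import math

def chebyshev_rdf_descriptor(distances, r_cut, order):
    """Sum of Chebyshev polynomials T_k evaluated at scaled neighbour
    distances, each weighted by a smooth cutoff function. Returns the
    expansion coefficients c_0..c_order used as network inputs."""
    coeffs = [0.0] * (order + 1)
    for r in distances:
        if r >= r_cut:
            continue                                   # outside the sphere
        x = 2.0 * r / r_cut - 1.0                      # map [0, r_cut) -> [-1, 1)
        fc = 0.5 * (math.cos(math.pi * r / r_cut) + 1.0)  # smooth cutoff
        for k in range(order + 1):
            coeffs[k] += math.cos(k * math.acos(x)) * fc  # T_k(x) = cos(k acos x)
    return coeffs
```

Because the coefficients depend only on interatomic distances, the descriptor is automatically invariant under translations and rotations, as required of $f$.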

D. Replica exchange Monte Carlo sampling combined with active learning
The configurational sampling was performed using the replica exchange Monte Carlo (RXMC) method combined directly with NNP evaluation using aenet. The RXMC method employs several copies, or replicas, of the system under study. Each replica is sampled using the usual Metropolis MC algorithm at a different temperature. At preset intervals, temperatures are swapped between replicas according to the Metropolis criterion with the following probability:

$$P\!\left[(\beta_{i}, E_{i}), (\beta_{j}, E_{j}) \rightarrow (\beta_{j}, E_{i}), (\beta_{i}, E_{j})\right] = \min\!\left[1, \exp\!\left\{(\beta_{i} - \beta_{j})(E_{i} - E_{j})\right\}\right]. \tag{5}$$

Here, $i, j$ are the replica indices, $\beta$ is the inverse temperature $1/(k_{\mathrm{B}}T)$, and $E$ is the energy of the replica. Higher-temperature replicas are given the task of global surveillance of the energy landscape, while lower-temperature replicas basically perform local optimization. The swapping of temperatures occurs when a higher-temperature replica finds a new energy minimum; local optimization is then performed at the lower temperature. The detailed theory behind RXMC is given in Ref. 21. Our previous works [18,19] may also be helpful for understanding the basic concept. A parallelized code for RXMC calculations in combination with DFT codes as well as aenet is available as a part of the ab-Initio Configuration Sampling Toolkit (abICS), https://github.com/issp-center-dev/abICS. In this work, 15 replicas of the system were sampled in parallel with temperatures ranging from 400 K to 1900 K, and temperature exchange was attempted every 4 steps to speed up the global sampling. Equilibration was performed for 80000 steps, then the DOI, which is calculated as the ratio of Al ions on Mg sites, was averaged over 400000 steps to obtain the temperature dependence. The NNPs used in the RXMC sampling were refined iteratively using the active learning scheme shown in Fig. 1. We note that a very similar process using an effective pair interaction model was reported recently [54]. First, we train an NNP set with randomly generated samples, then use those NNPs to perform RXMC sampling. In subsequent iterations, we sample a preset number of
configurations from the previous RXMC run and perform DFT relaxations on those configurations. Comparison of the DFT energies with the NNP predictions serves as the validation procedure for the machine-learning model, although the validation data set will be biased somewhat, as we discuss below. Discrepancies between DFT energies and NNP predictions signify that the simulation has wandered into previously unlearned regions of configuration space. We then add those DFT results to the training set, retrain the NNPs, and run RXMC sampling again. This process is repeated until convergence is achieved in the physical quantities of interest. In this work, we sampled 300 structures from the first 6400 RXMC steps in each active learning iteration. The final DOIs were calculated from a separate, longer RXMC run with 480000 RXMC steps (80000 for equilibration and 400000 for thermodynamic averaging), as mentioned above.
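The temperature-swap criterion of Eq. 5 reduces to a few lines of code. This is a sketch of the acceptance rule only, not the abICS implementation:

```python
import math
import random

def accept_swap(beta_i, e_i, beta_j, e_j, uniform=random.random):
    """Metropolis acceptance for exchanging temperatures between two
    replicas: P = min[1, exp((beta_i - beta_j)(E_i - E_j))]."""
    delta = (beta_i - beta_j) * (e_i - e_j)
    if delta >= 0.0:              # always accept when the exponent is non-negative
        return True
    return uniform() < math.exp(delta)
```

Note that when the colder replica (larger β) holds the higher energy, the exponent is positive and the swap is always accepted, which is exactly the mechanism by which new minima found at high temperature migrate down to low-temperature replicas.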
We note that the above procedure biases the training data set towards the thermodynamically relevant configurations predicted by the NNP model in that active learning step. The thermodynamic biasing, in itself, is usually not a bad thing, since we are often only concerned with thermodynamically relevant, i.e., realistic configurations. However, the resulting NNPs cannot be expected to perform well when used to predict energies of configurations with both low entropy and high energy, since such configurations are thermodynamically unstable at all temperatures and rarely visited in the RXMC procedure. An example in the case of the AB2O4 spinels considered in this work would be a phase-separated configuration with "A" cations in one half of the cell and "B" cations in the other half. If, for some reason, we want to correctly reproduce the energy of such configurations, we would have to add them to the dataset, or we may accumulate training data using active learning with varying compositions.
On the other hand, the fact that we are relying on an inaccurate NNP model to generate the training set during the active learning iterations may pose issues with uncontrolled biasing of the training data. In the case where the model underestimates the energy of certain configurations, the RXMC algorithm at lower temperatures will bias the sampling towards those configurations, so the underestimation is quickly corrected in the subsequent iteration. The biggest foreseeable problem is when the model overestimates the energy. This is covered to some extent by setting a high enough maximum temperature for the RXMC sampling so that configurations with higher predicted energies are still sampled. It is also noted that RXMC in each iteration is initialized with random configurations, i.e., with an effective temperature of infinity. Ultimately, it is necessary to evaluate whether such issues are severe through computational experiments, which we present in the following sections.
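The control flow of the active-learning cycle described above can be summarized as a skeleton. The callables (`oracle` = DFT relaxation plus energy evaluation, `train` = NNP fitting, `sample_new` = an RXMC run followed by configuration picking) are injected parameters, so this sketches the loop structure only and is not the abICS API:

```python
def active_learning_loop(initial_configs, oracle, train, sample_new, n_iter):
    # Iteration 0: label the random starting set and fit a first model.
    data = [(c, oracle(c)) for c in initial_configs]
    model = train(data)
    for _ in range(n_iter):
        picked = sample_new(model)                  # sample the RXMC trajectory
        data += [(c, oracle(c)) for c in picked]    # relax + label with DFT
        model = train(data)                         # retrain on the enlarged set
    return model, data
```

The loop terminates after a fixed number of iterations here; in practice, iteration continues until the physical quantities of interest (e.g., the DOI) stop changing.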
We should also mention that we are using the term "active learning" in a rather broad sense, in that the training set configurations are not given a priori but are chosen according to the given calculation procedure. In our case, this is done through thermodynamic importance sampling, as explained above. On the other hand, many works in the active-learning MLP literature employ various uncertainty estimates to determine whether a structure encountered during simulation is in an extrapolation region and should be added to the training set [22,30]. The efficacy of using such uncertainty indicators to further limit the number of expensive ab initio calculations may be explored in the near future.

III. RESULTS

A. Active learning
We started the active learning process with 300 randomly chosen configurations. DFT relaxation and energy calculations were performed on these configurations, then an NNP model was trained to predict relaxed energies from the input configurations. The oxygen sublattice is neglected in the training process, since the input structures all have identical oxygen coordinates and thus do not provide any useful information for the prediction. We used 90% of the input structures for training and 10% for testing. For the spinel systems studied in this work, we typically obtained mean absolute errors (MAEs) and root mean squared errors (RMSEs) of less than a few meV/cation in the training and test sets, which is close to the accuracy limitations of the DFT calculations themselves. However, such low fitting error does not guarantee high accuracy in predicting the energies of structures obtained in the RXMC sampling. This is checked by performing DFT relaxations on the structures encountered during RXMC sampling with the NNP. As shown in Fig. 2, the NNP trained on random samples (indicated by blue circles in Fig. 2) shows poor accuracy, especially in predicting the energies of the lower-energy configurations. This is because the model was trained on random configurations with higher energies, and the training set did not include lower-energy structures with less disorder (please note that the splitting of samples between higher and lower reference DFT energies seen here and in subsequent figures originates from our sampling procedure and is not an inherent property of the system [55]). Increasing the number of random samples does not improve the results, as we show later. The importance of having configurations with varying degrees of order in the training set has already been pointed out in the cluster expansion literature. A suggested solution was to group input structures based on correlation functions (which roughly correspond to the degree of disorder) and to perform cross validation on each group during the fitting process [14]. Here, we take an alternative approach, i.e., active learning, as detailed in Sec. II D. The predictions at lower energies progressively improve with the number of active learning iterations because of the thermodynamic biasing mentioned above; the prediction RMSE for the three spinel systems studied here converged to within 0.5 meV after the third iteration. This is clearly smaller than the ∼10 meV/cation reported in Ref. 13 using cluster expansion, although it may not be a fair comparison because of the difference in the data sets used to calculate the prediction errors.
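The two error metrics quoted throughout this section are computed between NNP predictions and reference DFT energies in the usual way (per-cation normalization is left to the caller):

```python
# Mean absolute error and root mean squared error between predicted and
# reference energies.

def mae_rmse(pred, ref):
    errors = [p - r for p, r in zip(pred, ref)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = (sum(e * e for e in errors) / len(errors)) ** 0.5
    return mae, rmse
```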
Convergence is also attained in the physical quantity of interest, i.e., the DOI, as shown in Fig. 3. For the three spinel oxides treated here, 2 or 3 active learning iterations with 300 DFT relaxation calculations each were sufficient to obtain a converged DOI plot (Fig. 3, purple diamonds), which looks completely different from that predicted using NNPs trained on randomly generated configurations (Fig. 3, blue circles). The final DOI plots are in good agreement with most of the cluster expansion literature as well as our previous work employing direct RXMC sampling on DFT energies (Fig. 4). There is, however, a sizeable deviation for MgGa2O4, and there is one cluster expansion work for MgAl2O4 that deviates from the others and shows a discontinuous transition around 600 K [7] (denoted by "Seko2009" in Fig. 4). As for MgGa2O4, the current work predicts a DOI that is about 0.1 smaller than the cluster expansion result of Ref. 47. Our result is within the range of most of the available experimental data, which report 0.75 [56], 0.67 [57], 0.84-0.90 [58], and 0.81-0.84 [59], although the first work that put forth the idea of an "inverse spinel" reported unity, i.e., complete inversion [60]. The discrepancy may come from two sources apart from the difference in the employed models (i.e., cluster expansion vs. NNP): the supercell size used in the training stage and that used in the MC sampling stage. The cluster expansion literature performed ECI fitting in supercells with 24 cation sites or fewer, while MC simulations were performed on supercells with as many as 48000 cations. Here, both the training and the sampling were performed on 192-cation supercells. The impact of these supercell sizes on the results is discussed in more detail in Sec. III B. As for "Seko2009" [7], the discontinuous transition was reported in a later work to be a spurious result due to truncation of the range of pair clusters in the cross validation process [13].
To confirm that the active learning procedure is indeed more efficient than random sampling, we performed DFT relaxation on 1200 randomly generated configurations of MgAl2O4, trained the NNP on those configurations, and compared the results to those from active learning with the same number of DFT calculations (1 random iteration + 3 active learning iterations, which was enough to converge the DOI results, as shown in Fig. 3a). The training on the random dataset was performed independently twice with different training/test data separations and initial neural network weights. As shown in Fig. 5a, the energy predictions from NNPs trained on random configurations deviate from the DFT values for configurations with lower energies, and they are also clearly less accurate than the NNP trained through the active learning procedure. Again,

this is not surprising, because lower-energy configurations are seldom included in the randomly generated training set. Moreover, the deviation behavior is rather uncontrolled, as indicated by the different predictions of the energies and the DOIs (Fig. 5b) for the two independent training runs; thus, it is virtually impossible to obtain the correct temperature-dependent results with only random generation of the training dataset. The savings in computer time achieved by the current approach turned out to be quite significant compared to sampling directly on DFT energies. On a Xeon E5-2680v3 CPU, the NNP prediction using 1 CPU core is a few hundred to a few thousand times faster than DFT relaxation using 216 cores, depending on the number of relaxation steps necessary for convergence. We also note that the number of necessary DFT calculations turned out to be orders of magnitude smaller than the number of NNP evaluations, i.e., RXMC steps, necessary for converging the DOI. A converged DOI was obtained for ZnAl2O4 and MgGa2O4 at the first active learning iteration, while the DOI for MgAl2O4 converged at the third active learning iteration (Fig. 3), corresponding to 600 and 900 DFT relaxations, respectively. On the other hand, ∼7 200 000 NNP evaluations (= 480 000 MC steps/replica × 15 replicas) were necessary to obtain a converged DOI vs.
temperature.The ∼ 1000 DFT calculations can be completed within a few hours on a modern supercomputer system since the calculations can be performed completely in parallel.The RXMC calculations took roughly one day using 15 CPU cores, which would fit in one node of a modern workstation.We also note that the RXMC simulations were initialized with random configurations, and yet they still found the single ground state ordered spinel configuration out of ∼ 10 51 possible configurations for the normal spinels MgAl 2 O 4 and ZnAl 2 O 4 .This attests to the effectiveness of modern sampling methods in battling the curse of dimensionality.
The fact that fewer than 1000 reference samples were necessary to obtain an accurate NNP model might be surprising, as previous works on NNPs usually report reference sets at least one order of magnitude larger (see, e.g., Refs. 37 and 52). This is most likely because the input configuration space is restricted to $\boldsymbol{\sigma} \in \{\boldsymbol{\sigma}_{\mathrm{lattice}}\}$, and this is a clear merit of our approach. It should be pointed out, however, that each of our reference samples is the result of a DFT relaxation calculation, which comes at an additional cost compared to conventional NNP approaches that can be trained on structures without relaxation. Still, it should be much easier to attain sufficient sample coverage in this restricted configuration space. The restriction also makes the problem easier for the neural network, as it needs to distinguish between far fewer structures. These facts, combined with the ability to bypass lattice relaxation, make the present approach more suitable for the lattice configuration problem than conventional continuous-coordinate NNP methods. That is not to say that those methods cannot be applied to systems with configurational disorder. In fact, the Deep Potential approach has been applied to randomly generated high-entropy alloys [37,61] with promising results, and it can provide vibrational properties, which the present approach cannot. However, the computational expense for training and simulation of configurational order/disorder behavior will be much larger than for our lattice-restricted approach, and such methods are yet to be applied to this type of calculation.

[Figure caption fragment: The results are also compared to the cluster expansion (CE) literature [7,13,47] and our previous work using DFT calculations directly in combination with RXMC [18] in a 48-cation model (lines).]

B. Convergence vs. supercell size used for training and Monte Carlo sampling
The usual approach in the NNP and cluster expansion literature is to use a set of many small-scale DFT calculations to fit the model and then use that model to accelerate calculations in an expanded supercell. In doing so, one must be careful that (1) the fitting procedure does not lead to overfitting in the small cells at the cost of transferability to larger cells not contained in the training set, and (2) thermodynamically relevant correlations are contained within the training-set supercell size. These points are especially important here since spinel oxides are multivalent ionic oxides where significant long-range interactions are expected.
In the case of cluster expansion, cross-validation is usually performed within the training set to truncate the number of clusters to a computationally tractable amount, but for systems such as MgAl 2 O 4 , this procedure can lead to large prediction errors for structures with longer periods than those in the training set 13 . This led to a spurious prediction of a discontinuous order-disorder transition in MgAl 2 O 4 at ∼600 K in Ref. 7 (denoted by "Seko 2009" in Fig. 4); it was found to be necessary to either forgo the truncation procedure and use as many long-distance pair clusters as possible, or to use fewer pair clusters and augment the cluster expansion Hamiltonian with a screened Coulomb term, which led to the prediction of a more continuous transition in Ref. 13 ("Seko 2014" in Fig. 4). It is difficult to see whether similar issues may arise in our approach. Although there is no explicit truncation procedure, neural networks are prone to overfitting to the available data; the network might decide that interactions beyond the training-set supercell dimensions are unimportant and effectively truncate the range of interactions. Even if it does not, we would have to check whether the cutoff radii of 8 Å for the radial descriptor and 6.5 Å for the angular descriptor are long enough for sufficient precision.
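For context, the radial and angular descriptors referred to here are Behler-Parrinello-type symmetry functions, whose explicit interaction range is set by a cutoff function. A minimal sketch of the cosine cutoff and a radial (G2-type) function follows; the η and R_s values are hypothetical, and only the 8 Å radial cutoff comes from the text.

```python
import math

def f_cutoff(r, r_c):
    # Behler-Parrinello cosine cutoff: decays smoothly to zero at r = r_c,
    # so no interaction beyond r_c enters the descriptor explicitly.
    if r >= r_c:
        return 0.0
    return 0.5 * (math.cos(math.pi * r / r_c) + 1.0)

def g2_radial(distances, eta=0.5, r_s=0.0, r_c=8.0):
    # G2 radial symmetry function: a Gaussian-weighted, cutoff-damped
    # count of neighbor distances around a central atom.
    return sum(
        math.exp(-eta * (r - r_s) ** 2) * f_cutoff(r, r_c) for r in distances
    )
```

Because every descriptor component carries the factor f_cutoff(r, r_c), the network simply cannot see correlations beyond the cutoff sphere, which is why the adequacy of the 8 Å / 6.5 Å radii has to be checked against larger cells.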
To check for such issues, we examined the convergence behavior against the supercell size used for training and MC sampling as follows. First, we performed active learning of the NNP model in a conventional cell of cubic spinel having 24 cation sites, in addition to the 192-cation model examined above. Two active learning iterations were performed to converge the results for the 24-cation model. Then, we performed RXMC sampling calculations on the 24-cation, 192-cation, and 648-cation models using those NNP models. This means that we performed six RXMC calculations, resulting in DOI data that can be indexed by the number of cation sites in the training and sampling steps as DOI(N train , N sample ). Figure 6 shows the results for MgAl 2 O 4 and MgGa 2 O 4 , where the left-hand plot shows the dependence on N sample at N train = 192 and the right-hand plot shows that on N train with N sample = 648. It is clear that we have attained convergence against the RXMC sampling cell size, as the results for N sample = 192 and 648 are virtually identical, and N sample = 24 also results in a decent prediction in the case of MgAl 2 O 4 . This is surprisingly small compared to the supercell sizes employed for MC calculations in the cluster expansion literature (48000 as mentioned above), but we have obtained essentially identical results for MgAl 2 O 4 and ZnAl 2 O 4 as seen in Fig. 4.
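The degree of inversion (DOI) itself is a simple site statistic: the fraction of tetrahedral (A) sites occupied by B-type cations. A minimal sketch follows; the occupant lists, element labels, and averaging over RXMC samples are hypothetical illustrations rather than our actual data structures.

```python
def degree_of_inversion(a_site_occupants, b_element="Al"):
    # Fraction of tetrahedral (A) sites occupied by B-type cations,
    # e.g. Al on Mg sites in MgAl2O4 (labels here are illustrative).
    return sum(1 for s in a_site_occupants if s == b_element) / len(a_site_occupants)

def mean_doi(samples, b_element="Al"):
    # Thermal average over a list of sampled configurations
    # (e.g. snapshots from one RXMC temperature).
    return sum(degree_of_inversion(s, b_element) for s in samples) / len(samples)
```

Indexing such averages by the training and sampling cell sizes gives the DOI(N train , N sample ) data discussed above.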
On the other hand, the convergence with respect to the training supercell size N train is more difficult to confirm. Sizeable differences are seen between N train = 24 and 192, especially for MgGa 2 O 4 . This shows that the NNP trained in the 24-cation supercell of MgGa 2 O 4 is not capable of reproducing the correct energetics in larger supercells. Now, the ideal way to check whether we have achieved convergence at N train = 192 would be to perform active learning on a larger supercell, but that would be prohibitively expensive in terms of computer resources. As a compromise, we decided to test the accuracy of the N train = 192 NNP model by performing DFT relaxations. These results show that regardless of the model used (NNP or cluster expansion), the training set needs to include structures with periods that are long enough for the system under study. Also, if the model includes hyperparameters pertaining to the range of interactions, those must be set long enough as well. For the spinel oxide compounds examined here, the periods were found to be rather long but still tractable with the help of modern-day supercomputers. This may be surprising because our NNP model (or cluster expansion, for that matter) does not consider truly long-range Coulomb interactions along the lines of, e.g., Ref. 62, as all explicit interactions are truncated by the cutoff in the descriptors. However, our results suggest that the cutoff is long enough, especially for the lower-energy or higher-entropy structures encountered during the Monte Carlo procedure, which should have relatively small polarization. The structural relaxation should also reduce the required cutoff due to dielectric screening.

C. Robustness against relaxation
One point worth stressing is that the success of the present approach in predicting the relaxed energies from ideal lattice structures is not a trivial consequence of negligibly small relaxation energies in this system. As shown in Fig. 8 for MgAl 2 O 4 , the relaxation energies are roughly one order of magnitude larger than the energy differences between relaxed configurations (notice the difference in scales of the horizontal and vertical axes). Also, although the relaxation energies are correlated with the energies before relaxation, the correspondence is not one-to-one and the scatter is as large as 50 meV/cation. This is clearly significant when comparing against relaxed energy differences between distinct configurations.
We also examine the magnitude of relaxation, as it has been shown to be correlated with prediction errors in the cluster expansion literature 12 . To quantify the amount of relaxation in the three spinel systems on an equal footing, we use the normalized displacement (ND), which we define as ND = ∆d / V 0 ^(1/3), where ∆d is the atom displacement upon relaxation and V 0 is the volume per atom after volume relaxation; V 0 ^(1/3) can be interpreted as an average lattice parameter, so the normalized displacement defined in this manner is an indicator of local strain. As shown in Fig. 9, the relaxation in the three spinel oxides is far from negligible. A significant number of atoms show NDs larger than 5%, and the maximum ND reaches ∼19% for ZnAl 2 O 4 . The correlation between relaxation and prediction errors is shown in Fig. 10, where the prediction RMSEs for the three systems are plotted vs. the mean normalized displacement (MND), which is calculated by averaging the normalized displacements over all atoms from all configurations in the dataset (Fig. 9). Active learning calculations without relaxation were also performed on MgAl 2 O 4 to examine the prediction errors not attributable to relaxation. As a result, we see an increasing trend in the errors vs. MND, suggesting that relaxation does lead to higher errors. However, the RMSEs are all less than 0.5 meV per atom, which is as good as can reasonably be expected considering numerical noise in the relaxation process.
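The per-atom normalized displacement defined above is straightforward to evaluate; a minimal sketch follows, where the displacement vector and volume value in the example are hypothetical.

```python
import math

def normalized_displacement(disp, volume_per_atom):
    # ND = |Δd| / V0^(1/3): the displacement magnitude divided by an
    # "average lattice parameter", so values compare across compounds.
    dd = math.sqrt(sum(c * c for c in disp))
    return dd / volume_per_atom ** (1.0 / 3.0)
```

Averaging this quantity over all atoms of all configurations in a dataset gives the MND used in Fig. 10.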
We also note that the normalized mean-squared displacement (NMSD) was suggested in Ref. 12 as a relaxation metric that correlates somewhat with prediction errors, and the authors heuristically categorized NMSD values into "Good", "Maybe", and "Bad" regions in terms of the expected reliability and quality of a cluster expansion fit. The NMSDs for MgAl 2 O 4 , ZnAl 2 O 4 , and MgGa 2 O 4 calculated from the active learning procedure are 0.158%, 0.156%, and 0.147%, respectively, and they would all fall into the "Maybe" category. The fact that we were able to obtain RMSEs of less than 0.5 meV/atom for those systems indicates that our neural network scheme is at least as robust as cluster expansion against relaxation. Ultimately, we would need to accumulate experience by applying this approach to a wider variety of systems, but that is beyond the scope of this work.

D. Robustness with respect to input precision
As noted in Sec. II C, we designed our NNP model to be trained and used only on coordinates of the ideal lattice σ ∈ { σ lattice }. However, there is no way to specify the exact lattice coordinates due to the nature of the floating-point numbers used to represent non-integer values in computers. Thus, to see how robust our method is against deviations in the input coordinates, we examined the predictions of our final NNP model for randomly perturbed lattices (Fig. 11). We find that the predictions are virtually indistinguishable from predictions on the corresponding perfect-lattice structures (that is, perfect within the numerical precision of our software stack) when each atom is perturbed by 0.01 Å. The NNP looks usable up to a perturbation of ∼ 0.05 Å, while at 0.1 Å, we start to observe sizable deviations. A tolerance of 0.01 Å is not especially strict, and it should not be too difficult to ensure that the errors in the input coordinates fall within this range. The robustness may originate from the fact that the NNP maps similar inputs to similar energies, and there are no similar structures in the training-set space other than the corresponding ideal lattice structure.
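This perturbation test can be reproduced generically as follows. The sketch only generates perturbed coordinates with a fixed displacement magnitude per atom (the energies would then be predicted by the NNP); the coordinate list in the usage example is hypothetical.

```python
import math
import random

def perturb(coords, magnitude):
    # Displace every atom by exactly `magnitude` (in Å) along a uniformly
    # random direction, as in the robustness test of Fig. 11.
    out = []
    for x, y, z in coords:
        # Random unit vector from a normalized Gaussian triple.
        g = [random.gauss(0.0, 1.0) for _ in range(3)]
        norm = math.sqrt(sum(c * c for c in g)) or 1.0
        out.append((x + magnitude * g[0] / norm,
                    y + magnitude * g[1] / norm,
                    z + magnitude * g[2] / norm))
    return out
```

Running the trained model on `perturb(lattice_coords, 0.01)` and comparing against the ideal-lattice predictions quantifies the tolerance discussed above.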

E. Applicability to hard-to-relax or frustrated systems

Finally, we note that for truly complicated systems, local frustrations may lead to several local minima when starting the relaxation from ideal lattice structures. This means that the implicitly assumed one-to-one correspondence between lattice configuration and relaxed energy is broken. In such cases, it may be advisable to perform the active learning scheme for a few iterations, then use the final NNP in nested Monte Carlo [63][64][65][66][67][68] or upsampling schemes 69 , where MC or MD steps are carried out with low-cost potentials and then augmented with DFT calculations at preset intervals. These schemes are guaranteed to converge to the same result as sampling based only on DFT energies, and the speedup will depend on the quality of the low-cost potential. The frustration problem mentioned above can be solved by nested DFT relaxations during the MC sampling: by randomly perturbing the structure slightly before relaxation, the system has a chance of visiting any of those local minima. This is in the spirit of basin-hopping MC 70 . We plan on implementing and testing such nesting schemes in the near future.
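The basin-hopping-style step alluded to here (perturb, relax, then accept or reject on relaxed energies) can be illustrated on a toy one-dimensional double well; the potential, the gradient-descent "relaxer", and all parameters below are hypothetical stand-ins for DFT relaxations of a frustrated lattice configuration.

```python
import math
import random

def toy_potential(x):
    # Double-well stand-in for a DFT energy surface: minima at x = ±1.
    return (x * x - 1.0) ** 2

def relax(x, lr=0.01, steps=500):
    # Gradient-descent "relaxation" into the nearest local minimum.
    for _ in range(steps):
        grad = 4.0 * x * (x * x - 1.0)
        x -= lr * grad
    return x

def basin_hopping_step(x, beta=5.0, step_size=0.8):
    # Perturb the current minimum, relax the trial, then apply a
    # Metropolis test on the *relaxed* energies. The random perturbation
    # is what lets the walk hop between competing local minima.
    trial = relax(x + random.uniform(-step_size, step_size))
    de = toy_potential(trial) - toy_potential(x)
    if de <= 0 or random.random() < math.exp(-beta * de):
        return trial
    return x
```

In the nested schemes discussed above, `relax` would be a DFT relaxation and the Metropolis test would correct the low-cost-potential trajectory at preset intervals.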

IV. SUMMARY
In this work, we proposed a new method to obtain lightweight models that can be used to sample millions of configurations in the ab initio lattice configuration problem: Behler-Parinello NNPs are trained to predict relaxed energies from perfect-lattice structures with configurational disorder, and the NNPs are iteratively refined using an active learning scheme involving several RXMC runs. The idea was tested successfully in calculating the degree of A/B site inversion in three spinel oxides: MgAl 2 O 4 , ZnAl 2 O 4 , and MgGa 2 O 4 . Merits compared to conventional continuous-coordinate NNP models include the ability to bypass structural relaxation, as well as the restriction on the input coordinate space that makes machine learning much easier. This approach may also be considered a 'lazy' alternative to cluster expansion; no explicit cluster selection is necessary, and it avoids the combinatorial explosion in the amount of computation that makes cluster expansion practically unusable for many-component systems with multiple sublattices. We also demonstrated that the scheme is at least as robust as cluster expansion with respect to the magnitude of local relaxation, and also in treating systems with long-range interactions. Due to the merits outlined above, we believe that this approach will enable configurational sampling in complex bulk or interface systems where it is technically difficult, if not impossible, to apply previous approaches.

FIG. 1. A flowchart of our configurational sampling scheme using active learning.

FIG. 2. Improvement in NNP predictions vs. DFT reference energies with the number of active learning iterations in the three spinel oxides, (a) MgAl 2 O 4 , (b) ZnAl 2 O 4 , and (c) MgGa 2 O 4 . RXMC runs are performed using the NNP, then DFT calculations are performed on structures from the same RXMC run to evaluate the prediction accuracy. The zero reference for the energy is taken to be the DFT energy of the ordered spinel configuration. Data on the diagonal line correspond to high prediction accuracy. The dashed circles indicate the energy regions of randomly generated configurations in the first steps of each active learning RXMC iteration 55 .

FIG. 3.

FIG. 4. The degree of inversion calculated in the present work for the three spinel oxides MgGa 2 O 4 (green), MgAl 2 O 4 (blue), and ZnAl 2 O 4 (orange) in the 192-cation cell whose neural network model was trained in the 192-cation cell (•). The results are also compared to the cluster expansion (CE) literature 7,13,47 and our previous work using DFT calculations directly in combination with RXMC 18 in a 48-cation model (lines).

FIG. 5. (a) Comparison of NNP predictions vs. DFT reference energies when the NNP was trained on 1200 configurations generated by the active learning procedure (blue circles) or random sampling (green squares and orange crosses corresponding to two independent training runs on the same randomly generated dataset). The comparison is performed on the dataset taken from the third active learning iteration (same data as Fig. 2a, red +). The inset shows all data points, while the main figure shows a magnified version for clarity. (b) The degree of inversion calculated for MgAl 2 O 4 using the three NNPs compared in (a).

FIG. 6. The dependence of the calculated degrees of inversion (DOIs) in (a) MgAl 2 O 4 and (b) MgGa 2 O 4 on the training cell size (denoted by the number of cations in the training cell, N train ) and the MC sampling cell size (denoted by the number of cations in the sampling cell, N sample ).

FIG. 9. Histograms of the normalized displacements of atoms upon relaxation in the spinel oxides under study. The displacements are calculated by performing DFT relaxation on 300 structures from the fourth (final) active learning iteration, and the histograms are generated from all atoms in those structures.

FIG. 10.

FIG. 11. Predicted NNP energies when the input coordinates of all atoms are perturbed in random directions with a magnitude of 0.1 Å (blue circles), 0.05 Å (orange crosses), and 0.01 Å (green triangles). The results are plotted against the predictions using the unperturbed ideal lattice coordinates as input.