GPU-Accelerated Approximate Kernel Method for Quantum Machine Learning

Conventional kernel-based machine learning models for ab initio potential energy surfaces, while accurate and convenient in small data regimes, suffer immense computational cost as training set sizes increase. We introduce QML-Lightning, a PyTorch package containing GPU-accelerated approximate kernel models, which reduces training times by several orders of magnitude, yielding trained models within seconds. QML-Lightning includes a cost-efficient GPU implementation of FCHL19, which together with the approximate kernel models yields energy and force predictions with competitive accuracy on a microsecond-per-atom timescale. Using modern GPU hardware, we report learning curves of energies and forces, as well as timings, as numerical evidence for select legacy benchmarks from atomistic simulation, including QM9, MD-17 and 3BPA.


I. INTRODUCTION
Data-driven approximate machine learning (ML) methods have become increasingly prominent in theoretical chemistry in recent years [1][2][3][4][5]. In particular, supervised learning can be used to augment accurate, but computationally prohibitive, electronic structure calculations. For tasks such as ab initio molecular dynamics (AIMD), approximate variants of the electronic Schrödinger equation are solved for every coordinate update. Consequently, surrogate ML models which can partially substitute for the quantum calculations are extremely beneficial for reducing the overall computational burden (and carbon footprint). In general, these ML models first transform atomic coordinates into an intermediate symmetry-preserving representation, which is then passed on to a non-linear machine learning model, usually one based on neural networks (NNs) 6 or kernel methods 7.
In particular, kernel methods, while accurate and straightforward to train, are typically marred by inferior computational efficiency. This is mainly due to the explicit dependence of the interpolation on every item in the entire training data set, which results in a large number of matrix-vector and matrix-matrix products. The issue becomes particularly cumbersome when gradients or higher-order derivatives are to be included in the model's loss function. While some efforts have been made to reduce the computational cost of kernel-based QML models [8][9][10][11], there still remains the formally cubic scaling with the training data set size itself, implying an inherent numerical limitation.
In this work, an approximate kernel method, Random Fourier Features (RFF) 12, is briefly discussed, and a more computationally efficient variant, termed Structured Orthogonal Random Features (SORF) 13, is introduced.
These methods do not rely on all training points as basis functions when performing inference. Instead, they use a lower-dimensional feature map to approximate shift-invariant kernels, which significantly reduces the computational cost of evaluation.
We also provide a software package to perform training and prediction with the resulting Quantum Machine Learning (QML) models, termed QML-Lightning. It includes GPU implementations of both RFF and SORF models, as well as a GPU implementation of FCHL19 9, an accurate atom-centered representation. QML-Lightning is built upon the PyTorch software package, with additional CUDA C implementations of critical components to improve computational throughput. A thorough benchmark of the predictive accuracy of QML-Lightning has been performed on several established datasets of chemical compounds from the literature, comparing against existing kernel- and neural-network-based models. To assess the models' performance across chemical compound space, we benchmark against the QM9 dataset 14, which contains 134k small organic molecules with the elements C, H, O, N and F. For dynamics and structural relaxation applications, we benchmark against both the MD-17 15-17 and revised rMD17 18 datasets, which contain trajectories of 10 small organic molecules. Finally, to probe the models' extrapolative performance, QML-Lightning has been benchmarked against the challenging 3BPA dataset 19, which contains three sets of MD trajectories at temperatures of 300 K, 600 K and 1200 K. We note that we have focused on energetic properties in this work, as these are the most critical for AIMD applications; however, other QM properties can be used within the QML-Lightning framework straightforwardly. Finally, we provide training times for a variety of systems, and prediction times for small molecules as well as periodic systems with up to ∼17k atoms. Note that while finalising work on this paper, Dhaliwal et al. 20 published randomised feature-based interatomic potentials for molecular dynamics with promising results for CPU-based applications.

II. SOFTWARE AVAILABILITY
The QML-Lightning software is provided under an MIT licence at https://github.com/nickjbrowning/QMLightning.

III. THEORY
This section summarises a subset of kernel methods used to learn quantum mechanical properties. First, Gaussian Process Regression (GPR) 21,22 is introduced to learn both energies and forces of chemical compounds.
Then, operator quantum machine learning (OQML) 8,9 is discussed, and finally the approximate kernel methods random Fourier features (RFF) 12 and structured orthogonal random features (SORF) 13 are introduced.
In the following, upper case indices denote the index of chemical compounds in a dataset, while lower-case indices denote the index of atomic centres in each chemical compound.

A. Gaussian Process Regression (GPR)
In GPR kernel models of quantum mechanical properties, one constructs a linear model using a basis of kernel functions k(ρ_j, ·). For example, to learn potential energies, a suitable functional form is a sum of kernel evaluations over pairs of atomic environments, where ρ_J and ρ_K are the atomic representations of atoms J and K. We note that these atomic representations are functions of the set of local atomic charges and coordinates {z, r}, which has been omitted for brevity.
The sets J and K contain the atoms of the training and query molecules j and k, respectively. The coefficients α_i are obtained from a regularised minimisation problem, whose solution in matrix form involves inverting the regularised kernel matrix, where the hyperparameter λ is a small number introduced to regularise the problem and ensure numerical stability upon kernel inversion 23. For learning potential energies and atomic forces simultaneously, one can construct an expression for the potential energy in matrix notation, where ∂/∂r is used as a shorthand to stack the derivatives with respect to the atomic coordinates, and l indexes the coordinate components of the query atom J. The Hessian in equation 5 then contains the second derivatives of the kernel, where k indexes the coordinate components of training atom I.
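For concreteness, the referenced expressions in their conventional kernel-ridge form — a sketch assuming the standard GPR/KRR conventions, which may differ from the paper in normalisation and in the exact regularisation term:

\hat{E}_k = \sum_{j \in \mathrm{train}} \alpha_j \sum_{K \in \mathcal{K}} \sum_{J \in \mathcal{J}_j} k(\rho_K, \rho_J),
\qquad
\boldsymbol{\alpha} = \operatorname*{arg\,min}_{\boldsymbol{\alpha}} \lVert \mathbf{K}\boldsymbol{\alpha} - \mathbf{E} \rVert_2^2 + \lambda\, \boldsymbol{\alpha}^{\mathsf{T}} \mathbf{K} \boldsymbol{\alpha}
\;\Rightarrow\;
\boldsymbol{\alpha} = (\mathbf{K} + \lambda \mathbf{I})^{-1} \mathbf{E}.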
B. Operator Quantum Machine Learning (OQML)

In OQML, the potential energy is instead expanded in a basis of kernel functions placed on every atomic environment in the training set, where the index I runs over all atoms in the training set. This extends the number of regression coefficients to the number of atoms in the training set, rather than the number of chemical compounds as for GPR models.
Atomic forces can be included in the training scheme, resulting in a corresponding matrix equation. We note that, unlike GPR models, the basis does not include gradient kernels when training on gradients is required; these derivatives only appear in the loss function. This loss function is minimised directly using a singular-value decomposition (SVD), in which singular values below a given threshold are discarded from the solution. By contrast to GPR, which has O(36 N^2 M^4) scaling, the heaviest term in the OQML kernel scales as O(6 N^2 M^3). Note that both models scale as N^2 with the amount of training data, but differ in pre-factor and in their scaling with system size.
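Similarly, a compact sketch of the OQML construction just described, again assuming the conventional formulation (the relative weighting of the energy and force residuals is an assumption here):

\hat{E}_k = \sum_{I} \alpha_I \sum_{K \in \mathcal{K}} k(\rho_K, \rho_I),
\qquad
\mathcal{L}(\boldsymbol{\alpha}) = \left\lVert \begin{pmatrix} \mathbf{K} \\ -\,\partial \mathbf{K}/\partial \mathbf{r} \end{pmatrix} \boldsymbol{\alpha} - \begin{pmatrix} \mathbf{E} \\ \mathbf{F} \end{pmatrix} \right\rVert_2^2,

with the minimiser obtained from a truncated-SVD pseudoinverse of the stacked kernel matrix.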

C. Random Fourier Features (RFF)
In order to further reduce the explicit dependence of the model on the amount of training data when performing inference, Rahimi et al. 12 introduced a lower-dimensional lifting function z(x) to approximate the inner product that defines the kernel. As a consequence of Bochner's theorem, a shift-invariant kernel can be written as the Fourier transform of a probability distribution p(w), and can therefore be approximated by Monte-Carlo sampling, where N_F is the number of independent vectors w drawn from p(w). For different kernels, the distribution p(w) takes different forms; for the Gaussian kernels used here, p(w) is also Gaussian.
This formalism readily yields a low-dimensional feature map z(x), from which the feature matrix Z ∈ R^{N_train × N_F} corresponding to the N_train training observations is constructed.
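To make the lifting explicit, a minimal NumPy sketch of random Fourier features for a Gaussian kernel follows; this is illustrative only and not the QML-Lightning implementation (function names, the seed and the toy regression target are ours):

import numpy as np

def sample_rff(d, n_features, sigma, seed=0):
    """Draw random frequencies w ~ N(0, 1/sigma^2) and phases once, for a Gaussian kernel."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=1.0 / sigma, size=(d, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return W, b

def rff_features(X, W, b):
    """Feature map z(x) such that z(x) . z(y) ~= exp(-||x - y||^2 / (2 sigma^2))."""
    n_features = W.shape[1]
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Ridge regression in feature space, (Z^T Z + lambda I) alpha = Z^T y,
# whose cost is linear in the number of training points.
rng = np.random.default_rng(1)
X = rng.random((500, 32))
y = np.sin(X).sum(axis=1)
W, b = sample_rff(d=32, n_features=2048, sigma=1.5)
Z = rff_features(X, W, b)
alpha = np.linalg.solve(Z.T @ Z + 1e-8 * np.eye(Z.shape[1]), Z.T @ y)
print(np.abs(Z @ alpha - y).mean())   # small in-sample error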
The N_F weights α are the solution to the following regularised normal equation,

(Z^T Z + λI) α = Z^T E

where the coefficients are obtained by an LU decomposition of Z^T Z + λI. To include forces in the training scheme, the derivatives of the feature vectors with respect to the atomic coordinates are computed, where l and k are the feature and coordinate component indexes, respectively, and stored in a derivative feature matrix ∂Z ∈ R^{3N_total_atom × N_F}. The following regularised normal equation is then solved,

[(Z, ∂Z)^T (Z, ∂Z) + λI] α = (Z, ∂Z)^T (E, F)    (17)

where the notation (Z, ∂Z) indicates the concatenation of the feature matrix Z with the derivative features ∂Z. The dominant term in constructing the normal equations is the ∂Z^T ∂Z matrix product, which scales as O(3 N_train M N_F^2), where N_train is the number of training molecules and M is the average number of atoms per molecule in the training set. Note that the cost of constructing the normal equations is now linear in N_train for both energy-only and combined energy and force learning.

D. Structured Orthogonal Random Features (SORF)

The above formulation revolves around computing the linear transformation Wρ_I. Storing and computing this linear transformation has O(N_F d) space and time complexity, where N_F is the number of features and d is the size of the atomic representation vector ρ_I. To reduce this space-time complexity, Yu et al. 13 introduced structured orthogonal random features (SORF). In this method, the matrix W is replaced by a special structured matrix consisting of products of random binary diagonal matrices and Walsh-Hadamard matrices. The resulting linear transformation has O(N_F log d) time complexity and O(d) or O(1) space complexity, depending on the implementation. Briefly, in the case of Gaussian kernel approximation, one can replace the transformation Gρ_I, where G ∈ R^{N_F×d} is a random Gaussian matrix, with the transformation SQρ_I, where Q is a uniformly distributed random orthogonal matrix (e.g. obtained via QR decomposition of G) and S is a diagonal matrix with entries sampled i.i.d. from the χ-distribution with d degrees of freedom. The resulting matrix SQ is an unbiased estimator of the Gaussian kernel with low variance 13. While this construction still has O(N_F d) time complexity, as well as the additional cost of computing the QR decomposition in a preprocessing step, one can further approximate this transformation as,

W_SORF = √d (H D_1 H D_2 H D_3)    (20)

where S has first been replaced by the scalar √d and the random orthogonal matrix Q has been replaced by a special type of structured matrix; the product of H D_i factors is repeated N_transform times (N_transform = 3 in equation 20). The matrices D_i ∈ R^{d×d} are diagonal sign-flipping matrices, where each entry is sampled from a Rademacher distribution, and H is the Walsh-Hadamard matrix. When the number of features N_F > d, the operation is simply repeated N_F/d times, with the resulting vectors concatenated into a length-N_F vector. Crucially, the product W_SORF ρ_I now has time complexity O(N_transform N_F log d), since multiplication with H can be efficiently implemented via the fast Hadamard transform using in-place operations in O(d log d) time. Finally, since the Walsh-Hadamard matrix is only defined in R^{2^n × 2^n}, ρ_I must also be projected into 2^n dimensions. This is achieved via an SVD decomposition of a subset of the atomic environment representations for each element e, concatenated into the matrix Z̃_e,

Z̃_e = U_e S_e V_e^T    (21)

and the atomic representations are projected into a lower dimension via the following matrix product,

ρ_I^proj = ρ_I^T U_e^{N_PCA}    (22)

where only the first N_PCA columns of the matrix U_e are used. The subscript e indicates that the matrix Z̃_e is built using only atomic representations of atom type e, hence each element has its own projection matrix U_e^{N_PCA}. Here we've found N_PCA = 128 or 256 to be sufficient.
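A compact NumPy sketch of the structured transform in equation 20, using an in-place fast Walsh-Hadamard transform; this is illustrative only, and the scaling convention and handling of N_F > d in QML-Lightning may differ:

import numpy as np

def fwht(x):
    """Unnormalised fast Walsh-Hadamard transform of a length-2^n vector, in O(d log d)."""
    x = x.copy()
    d = x.shape[0]
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x

def sorf_transform(rho, signs):
    """Apply W_SORF rho = sqrt(d) * prod_i (H D_i) rho, with H the normalised
    Walsh-Hadamard matrix and D_i the diagonal Rademacher (+/-1) matrices in `signs`."""
    d = rho.shape[0]
    v = rho
    for s in signs:                       # N_transform blocks, typically 3
        v = fwht(s * v) / np.sqrt(d)      # normalised H applied to D_i v
    return np.sqrt(d) * v

d = 128                                   # must be a power of two, hence the PCA projection
rng = np.random.default_rng(0)
signs = [rng.choice([-1.0, 1.0], size=d) for _ in range(3)]
rho = rng.standard_normal(d)
w_rho = sorf_transform(rho, signs)
# For N_F > d, the transform is repeated N_F/d times with independent sign matrices
# and the outputs are concatenated before applying the cosine/sine feature map.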
Finally, we note that there are a number of other approximate kernel methods which aim to reduce computational complexity, including other RFF-type approximations [24][25][26], as well as those based on the Nyström method 27, which relies on low-rank structure in the kernel matrix. Here, however, we've opted to use SORF due to its simplicity, computational efficiency and accuracy 26.

E. Representation
In this work we use FCHL19 9 as the permutationally and rotationally invariant atomic environment featurisation layer. FCHL19 is an atom-centered representation consisting of two- and three-body elemental bins, similar in construction to the atom-centered symmetry functions (ACSFs) of Behler 28,29. The functional form is briefly summarised here.
For every unique combination of two elements X, Y, the representation for each atom i is constructed from the set {Z, R}_{X,Y} of atomic charges and coordinates belonging to atoms of element X or Y. The two-body function is built on n_2 radial grid centres R_s, linearly distributed between 0 and r_cut, with µ(r_ij) and σ(r_ij) the parameters of a log-normal distribution controlled by the hyperparameter η_2. A cutoff function smoothly decays the representation to zero at r_cut. The three-body term G^3-body combines a radial part, in which η_3 controls the width of the n_3 radial distribution functions located at the R_s grid points, with a three-body scaling function ξ_3, the Axilrod-Teller-Muto term 30,31 with modified exponents 32, where θ_kij is the angle between atoms k, i, j, with i at the centre, c_3 is a weight term and N_3 is a three-body scaling factor. Finally, the angular term is given by a Fourier expansion in cosine and sine terms, where ζ is a parameter describing the width of the angular Gaussian function and n > 0 is the expansion order. Similarly to previous work, only the two n = 1 cosine and sine terms are used.
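To make the structure of such atom-centred features concrete, a schematic NumPy sketch of a two-body radial channel with a smooth cosine cutoff follows; this is deliberately simpler than FCHL19 (which uses log-normal radial weights and the scaling terms described above), and the grid size, width and function names are ours:

import numpy as np

def two_body_channel(r_i, neighbour_positions, r_cut=6.0, n_bins=24, width=0.3):
    """Schematic two-body radial channel for one central atom and one element pair:
    each neighbour distance deposits a Gaussian-smeared weight on a radial grid,
    damped by a cosine cutoff. FCHL19 itself uses log-normal radial weights and
    additional scaling terms."""
    grid = np.linspace(0.0, r_cut, n_bins)                   # radial grid centres R_s
    feat = np.zeros(n_bins)
    for r_j in neighbour_positions:
        r_ij = np.linalg.norm(r_j - r_i)
        if r_ij == 0.0 or r_ij >= r_cut:
            continue
        f_cut = 0.5 * (np.cos(np.pi * r_ij / r_cut) + 1.0)   # smooth decay to zero at r_cut
        feat += f_cut * np.exp(-(grid - r_ij) ** 2 / (2.0 * width ** 2))
    return feat

# One such vector per (central atom, element pair) is concatenated with the
# three-body (angular) channels to form the atomic representation rho_i.
positions = np.random.rand(8, 3) * 4.0
rho_2body = two_body_channel(positions[0], positions[1:])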
IV. RESULTS

A. QM9

In figure IV A, the predictive accuracy of several explicit kernel models and of the SORF models for the atomisation energies of molecules in the QM9 dataset 14 is compared. These models include atomic SLATM 34, FCHL18 32 and FCHL19 9, all using the OQML 9 regressor. For the SORF models, learning curves using both N_F = 16384 and N_F = 32768 are displayed.
For energy learning on the revised MD-17 dataset, the SORF/FCHL19 models are compared against ACE 39 and NequIP, with rotation orders l = 0 and l = 3. We note that FCHL19 is a comparatively simplistic atomic featurisation layer, and consequently it is expected not to perform as well as state-of-the-art equivariant many-body neural networks [40][41][42] such as NequIP. For a more reasonable comparison, the l = 0 channel NequIP model, which, similarly to FCHL19, contains at most 3-body terms, has been included here. We note that while the force errors are similar to the MD-17 results, the energy errors are significantly lower across all models, which is consistent with the observation that the noise floor of the original MD-17 data is higher for the energies.

C. Flexible 3BPA Dataset
The 3BPA dataset 19,39 contains both ambient- and high-temperature configurational samples of the small drug-like molecule 3-(benzyloxy)pyridin-2-amine (3BPA). This molecule has 3 central rotatable dihedral angles (α, β, γ), leading to a complex dihedral potential energy surface with many local minima, which can be challenging for both classical and ML-based potentials 43. In particular, at ambient temperature the accessible phase space is small; however, the dataset also contains 600 K and 1200 K configurational samples, which cover increasingly large phase-space volumes. A crucial test of whether an ML model extrapolates well is therefore whether the model, when trained on the 300 K samples, can accurately predict the 600 K and 1200 K samples. Consequently, we have trained the SORF/FCHL19 model on the 300 K subset of the 3BPA dataset in order to compare against results from ACE, two kernel models, sGDML and GAP 21 using SOAP 11,44, as well as two related neural-network architectures, ANI 45,46 and ANI-2x 47.
We note that these results have been summarised directly from the ACE publication 39 .

D. Timings
Table IV shows the total time spent, in seconds, when training various CPU- and GPU-accelerated models on both energies and forces, given a fixed amount of training data (N_train = 1k) from the MD-17 database. The device used to train each model is listed for reference. Kernel models shown include OQML and sGDML. We note that OQML and GPR models use the same representation as the GPU variant implemented here. We additionally compare against NequIP; however, the training times shown are an upper bound, as models with good predictive accuracy can likely be obtained with early stopping. The timings for the SORF model with both 16384 and 32768 features are shown, using a consumer RTX3080 GPU and a scientific A100 GPU. It should be noted that there is significant cumulative round-off error when constructing the Z^T Z matrix in single floating-point precision (FP32), which leads to ill-conditioning. However, Kahan summation 48 has been implemented here to minimise this error, allowing the normal equations to be constructed in FP32 format, yielding a 3-fold reduction in training time compared to FP64 precision. For the A100 GPU, only FP64 performance is shown, as the peak FP64 throughput of this card is the same as its FP32 performance, owing to its support of FP64 matrix multiplication via its Tensor-Core architecture. Using the RTX3080 GPU, SORF models are several orders of magnitude faster to train than both GPR models and neural networks. Furthermore, they are on average ≈3 and ≈210 times faster to train than OQML and sGDML models, respectively, across the dataset shown when using FP64 matrix multiplication, and ≈10 and ≈700 times faster to train when using FP32 with Kahan summation. On the A100 GPU, remarkably, trained models can be obtained in seconds.

We note that this work focuses on models which fit into GPU global memory, i.e. both the Z^T Z matrix and the normal equations are constructed and solved on the GPU. However, QML-Lightning also supports an out-of-core approach, where the Z^T Z matrix is tiled, with each tile copied to the host device and summed; the normal equations are then solved on the CPU. This second variant is necessary when the number of features yields matrices which exceed the memory available to both construct the Z^T Z matrix and solve the normal equations. In these cases, the choice of CPU(s) and the out-of-core CPU/GPU implementation will significantly determine the model training time; however, this has not been investigated here.

Calculations were performed at the sciCORE (http://scicore.unibas.ch/) scientific computing center at the University of Basel.
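Returning to the FP32 accumulation of Z^T Z discussed above, a minimal NumPy sketch of Kahan (compensated) summation applied to a batched accumulation; the QML-Lightning CUDA kernels are the authoritative implementation, and the batch sizes and names here are ours:

import numpy as np

def kahan_accumulate_ztz(batches, n_features):
    """Accumulate Z^T Z over batches in FP32 using Kahan (compensated) summation,
    reducing the round-off error that otherwise ill-conditions the normal equations."""
    total = np.zeros((n_features, n_features), dtype=np.float32)
    comp = np.zeros_like(total)                  # running compensation (lost low-order bits)
    for Z in batches:
        term = Z.T @ Z                           # per-batch contribution, FP32
        y = term - comp
        t = total + y                            # low-order digits of y are lost here...
        comp = (t - total) - y                   # ...and recovered into the compensation
        total = t
    return total

# The FP32 Kahan sum tracks an FP64 reference far more closely than naive FP32 accumulation.
rng = np.random.default_rng(0)
batches = [rng.random((256, 64)).astype(np.float32) for _ in range(200)]
ref = sum(b.astype(np.float64).T @ b.astype(np.float64) for b in batches)
print(np.abs(kahan_accumulate_ztz(batches, 64) - ref).max())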

F. Computational Details
1. Optimisation of Representation Parameters

The optimal parameters used to generate the FCHL19 representation differ from those of the original implementation 9. While the energy + force parameters are the same, albeit with a lower cutoff of r_cut = 6.0 Å, we have found improved energy-only parameters. To fit these parameters, we employed a subset of 576 distorted geometries of small molecules with up to 5 atoms of the type CNO, saturated with hydrogen atoms, for which forces and energies have been obtained from DFT calculations 8,33. This dataset is identical to that used in the original FCHL19 publication. It is randomly divided into a training set of 384 geometries and a test set of 192 geometries. Models are fitted to the training set, and predictions on the test set are used to minimise a cost function with respect to the parameters.
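The fitting procedure can be sketched with a self-contained toy, in which a one-dimensional kernel ridge model and a single width hyperparameter stand in for the SORF/FCHL19 model and its representation parameters; the actual cost function used in the paper is not reproduced here:

import numpy as np
from scipy.optimize import minimize_scalar

# Toy stand-in for the procedure above: pick a hyperparameter by minimising the
# test-split error of a model fitted on the training split (384/192, as in the text).
rng = np.random.default_rng(0)
X = rng.uniform(-3.0, 3.0, size=(576, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.standard_normal(576)
train, test = np.arange(384), np.arange(384, 576)

def test_mae(log_sigma):
    """Fit Gaussian KRR on the training split with width exp(log_sigma); return test MAE."""
    sigma = np.exp(log_sigma)
    K = np.exp(-(X[train] - X[train].T) ** 2 / (2.0 * sigma ** 2))
    alpha = np.linalg.solve(K + 1e-8 * np.eye(len(train)), y[train])
    K_test = np.exp(-(X[test] - X[train].T) ** 2 / (2.0 * sigma ** 2))
    return np.mean(np.abs(K_test @ alpha - y[test]))

best = minimize_scalar(test_mae, bounds=(-3.0, 3.0), method="bounded")
print("optimal width:", np.exp(best.x), "test MAE:", best.fun)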

FIG. 1. Left: Convergence of out-of-sample errors for RFF and SORF models with N_F, given a fixed budget of 1000 aspirin configurations, training on both energies and forces. Right: Convergence of out-of-sample errors for SORF models with N_transform = 1, 2 or 3. The QM9 dataset is used to measure model performance, using 1k, 10k and 75k training samples. Energy units are meV; force units are meV/Å.

Figure IV B reports the energy and force MAE as a function of the number of training samples, using 7 molecules from the MD-17 dataset 15-17. To be consistent with previous literature 9, we use the unrectified MD-17 dataset, which is known to contain significant noise in the energy values 18. The learning curves for SORF/FCHL19 with both N_F = 16384 and 32768 are reported. We compare against OQML models based on FCHL19 9, as well as GDML 36 and sGDML 10 models. Additionally, SchNet 37 and state-of-the-art NequIP 38 neural networks have been included.

FIG. 4. Single-configuration force evaluation time and mean-absolute errors for a variety of models trained on 1K azobenzene configurations, on a log-log scale. Parenthesised numbers show the number of features used to approximate the kernel for the SORF model, and, for ACE, the number of basis functions.

Timings for a metal-organic framework (16848 atoms) (refcode: VUJBEI) and liquid water (12000 atoms) are included, where periodic boundary conditions have been implemented using the minimum image convention. For small-molecule, single-configuration systems the GPU is significantly under-utilised: since each block handles a single atom, a significant number of streaming multiprocessors (SMs) remain idle. For comparison, timings for simultaneously computing energies and forces for 1000 ethanol and aspirin configurations are presented. Here, for the SORF-16384 model, the energy prediction time per configuration reduces from 3.2 ms to 0.022 ms for ethanol, and from 3.1 ms to 0.053 ms for aspirin, respectively, clearly showing the effect of increasing atom counts on GPU utilisation. It should be noted that these timings include all CPU and GPU operations: the creation of temporary matrices, host-device and device-host transfers, the device execution time itself, as well as CPU and GPU overhead. For the MOF system with 16848 atoms, excellent prediction times are obtained, requiring only 85.1 ms and 118.3 ms for N_F = 16384 and N_F = 32768, respectively, to compute energies and forces. This corresponds to a cost of 5 µs and 7 µs per atom, respectively. For liquid water with 12000 atoms, the total time increases to 136.4 ms and 155.3 ms for N_F = 16384 and N_F = 32768, while the force computation times increase to 11.3 µs and 12.9 µs per atom, respectively.
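For the periodic systems, the minimum image convention amounts to wrapping each displacement component back into the primary cell; a minimal sketch for an orthorhombic box follows (QML-Lightning's CUDA implementation and treatment of general cells may differ in detail):

import numpy as np

def minimum_image_displacements(r_i, positions, box_lengths):
    """Displacements from atom i to all atoms under the minimum image convention,
    assuming an orthorhombic periodic box with edge lengths `box_lengths`."""
    d = positions - r_i                              # raw displacements, shape (n_atoms, 3)
    d -= box_lengths * np.round(d / box_lengths)     # wrap each component into [-L/2, L/2)
    return d

# Neighbours within the representation cutoff for one atom of a periodic system:
box = np.array([20.0, 20.0, 20.0])
positions = np.random.rand(12000, 3) * box
disp = minimum_image_displacements(positions[0], positions, box)
dist = np.linalg.norm(disp, axis=1)
neighbours = np.where((dist > 0.0) & (dist < 6.0))[0]   # r_cut = 6.0 Å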

FIG. 5. Average GPU inference (on-device) timings of component functions for energy and force evaluation for A) a single aspirin configuration, B) a single Cu225Zn20 brass cluster configuration and C) a configuration of myoglobin, with r_cut = 6.0 Å and N_F = 32768.

TABLE II. Energy and force MAEs for models trained on 1k configurations from the revised MD-17 dataset. Errors are reported in meV and meV/Å for energies and forces, respectively.
However, in all other cases, the SORF models display similar or better accuracy. Both sGDML and GDML perform worse than OQML and SORF models in general. We note that these models outperform SchNet in all cases, while NequIP outperforms the SORF models in all cases. FCHL19, which has not been optimised here beyond reducing the cutoff, contains only up to 3-body terms; therefore, one would expect ACE to outperform the SORF models in this setting.

TABLE IV. Training time in seconds for various kernel (CPU) and neural network (GPU) models, using N_train = 1k and training on both energies and forces. The device used to train each model is indicated where appropriate. For NequIP, training times are approximate. For the QML-Lightning models, both double precision (FP64) and single precision (FP32) timings are given for the RTX3080 device, whereas only FP64 is given for the A100 (see discussion for details).