Symplectic Gaussian Process Regression of Hamiltonian Flow Maps

We present an approach to construct appropriate and efficient emulators for Hamiltonian flow maps. Intended future applications are long-term tracing of fast charged particles in accelerators and magnetic plasma confinement configurations. The method is based on multi-output Gaussian process regression on scattered training data. To obtain long-term stability, the symplectic property is enforced via the choice of the matrix-valued covariance function. Building on earlier work on spline interpolation, we observe derivatives of the generating function of a canonical transformation. A product kernel produces an accurate implicit method, whereas a sum kernel results in a fast explicit method from this approach. Both correspond to a symplectic Euler method in terms of numerical integration. These methods are applied to the pendulum and the Hénon-Heiles system, and the results are compared to a symmetric regression with orthogonal polynomials. In the limit of small mapping times, the Hamiltonian function can be identified with a part of the generating function and thereby learned from observed time-series data of the system's evolution. Besides comparable performance of implicit kernel and spectral regression for symplectic maps, we demonstrate a substantial increase in performance for learning the Hamiltonian function compared to existing approaches.

a) Electronic mail: katharina.rath@ipp.mpg.de

I. INTRODUCTION
Many models of dynamical systems in physics and engineering can be cast into Hamiltonian form. This includes systems with negligible dissipation found in classical mechanics, electrodynamics, continuum mechanics and plasma theory 1-3, as well as artificial systems created for numerical purposes such as hybrid Monte Carlo algorithms 4 for sampling from probability distributions. A specific feature of Hamiltonian systems is their long-term behavior with conservation of invariants of motion and the lack of attractors to which different initial conditions converge. Instead, a diverse spectrum of resonant and stochastic features emerges that has been extensively studied in the field of chaos theory 5. These particular properties are a consequence of the symplectic structure of phase space together with equations of motion based on derivatives of a scalar field, the Hamiltonian H.
Numerical methods that partially or fully preserve this structure in a discretized system are known as geometric or symplectic integrators 6 . Most importantly, such integrators do not accumulate errors in energy or momentum and remain long-term stable at relatively large time-steps compared to non-geometric methods. Symplectic integrators are generally (semi-)implicit and formulated as (partitioned) Runge-Kutta schemes that evaluate derivatives of H at different points in time.
The goal of this work follows a track to realize even larger time-steps by interpolating the flow map 7-11 describing the system's evolution over non-infinitesimal times. For this purpose, some representative orbits are integrated analytically or numerically for a certain period of time. Then the map between the initial and final state of the system is approximated in a functional basis. Once the map is learned, it can be applied to different states to traverse time in "giant" steps. Depending on the application, this can substantially reduce computation time. When applied to data from measurements, this technique allows one to learn the dynamics, i.e. the Hamiltonian of a system under investigation 12 .
Naively interpolating a map in both position and momentum variables destroys the symplectic property of the Hamiltonian flow. In turn, all favorable properties of symplectic integrators are lost and subsequent applications of the map become unstable very quickly. This problem is illustrated in Fig. 1, where the flow map of a pendulum is interpolated in a symplectic and a non-symplectic manner, respectively. If one enforces symplecticity of the interpolated map by some means, structure preservation and long-term stability are again natural features of the approximate map. Here this will be realized via generating functions, introduced in this context by Warnock et al. 8,11 . This existing work relies on a tensor-product basis of Fourier series and/or piecewise spline polynomials. This choice of basis has two major drawbacks: a rapid decrease of efficiency in higher dimensions and the limitation to box-shaped domains. One possibility to overcome these limitations would be the application of artificial neural networks with symplectic properties 13-16 .
Here we rather introduce a kernel-based method as a new way to construct approximate symplectic maps via Gaussian process (GP) regression and radial basis functions (RBFs). The results will show that this method requires considerably less training data in the presented test cases.
GP regression 17 , also known as Kriging, is a flexible method to represent smooth functions based on covariance functions (kernels) with tunable parameters. These kernel hyperparameters can be directly optimized in the training process by maximizing the marginal likelihood. Predictions, in particular posterior mean and (co-)variance for function values, are then made via the inverse kernel covariance matrix. Observation of derivatives required to fit Hamiltonian flow maps is possible via correlated multi-output Gaussian processes [18][19][20][21] . During the construction, the close relation to (symmetric) linear regression 22 and non-symmetric mesh-free collocation methods will become apparent 23,24 .
The paper is structured as follows: First, Hamiltonian systems and canonical transformations that preserve the symplectic structure of phase space are introduced. Then, general derivations of multi-output Gaussian processes with derivative observations are given in the context of symmetric meshless collocation and of non-symmetric collocation with radial basis functions or orthogonal polynomials, followed by a presentation of two algorithms to construct and apply symplectic mappings using Gaussian processes. Finally, the presented methods are tested on a simple pendulum and the more complex Hénon-Heiles system and compared to non-symmetric collocation using radial basis functions as well as linear regression using an expansion in Hermite polynomials combined with a periodic Fourier basis.

II. DYNAMICAL HAMILTONIAN SYSTEMS AND SYMPLECTIC FLOW MAPS
Hamiltonian mechanics describes the motion of a dynamical system in phase space, which has the structure of a symplectic manifold 2 . In the Hamiltonian formulation the equations of motion are of first order, so trajectories in phase space can never intersect each other. For perturbation theory and to understand the general character of motion, the Hamiltonian point of view can yield important insights and keep the underlying symplectic phase space structure.

A. Hamiltonian systems
An f-dimensional classical mechanical system is fully characterized by its Hamiltonian function H(q, p, t), which is a function of f generalised coordinates q, f generalised momenta p and time t. The time evolution of an orbit (q(t), p(t)) in phase space is given by Hamilton's canonical equations of motion,

\dot{q} = \frac{\partial H(q, p, t)}{\partial p}, \qquad \dot{p} = -\frac{\partial H(q, p, t)}{\partial q}. \quad (2)

The Hamiltonian flow ϕ_H intuitively describes the map that transports the collection of all phase points along their respective orbits, which are uniquely defined by initial conditions. More precisely, the derivatives of H define a Hamiltonian vector field X_H(q, p) = J ∇H(q, p) on the cotangent bundle T*Q of configuration space 2 , where the Poisson tensor in canonical representation,

J = \begin{pmatrix} 0 & I \\ -I & 0 \end{pmatrix},

is a 2f × 2f antisymmetric block matrix with I the f × f unit diagonal matrix. Evolving a system along X_H(q, p) over a finite time interval yields the Hamiltonian flow map,

(q(t_0), p(t_0)) \mapsto (q(t_1), p(t_1)) = \varphi_H(q(t_0), p(t_0)), \quad (3)

which preserves the symplectic structure of phase space 2 . The integral curves of X_H(q, p) are solutions to the equations of motion given in Eq. 2. Important properties follow directly from the preservation of the symplectic structure of T*Q: conservation of invariants such as energy and momentum, and volume preservation (as proven by Liouville's theorem) in phase space 2 . The latter means that, if some region Ω is evolved according to the Hamiltonian flow ϕ_H, the volume of ϕ_H(Ω) remains constant.
In a differential sense this means that the flow in phase space is divergence-free:

\nabla \cdot X_H = \frac{\partial}{\partial q} \cdot \frac{\partial H}{\partial p} - \frac{\partial}{\partial p} \cdot \frac{\partial H}{\partial q} = 0.

B. Canonical transformations
One is usually interested in the temporal evolution according to Eq. 3, that is, position Q and momentum P of a system at time t = t_1 that has been initialized with position q and momentum p at time t = t_0. The map (q, p) → (Q, P) is symplectic if its Jacobian M satisfies

M^T J M = J. \quad (5)

This property is closely related to divergence- or curl-freeness of vector fields. Similar to using a scalar or vector potential to guarantee such properties, symplecticity (Eq. 5) can be automatically fulfilled by introducing a generating function F(q, P). This function links old coordinates (q, p) to new coordinates (Q, P) via a canonical transformation. For a type 2 generating function 1 this canonical transformation is given by

p = \frac{\partial F(q, P)}{\partial q}, \qquad Q = \frac{\partial F(q, P)}{\partial P}.

As the kernel regression of a linear term is less favorable, the generating function that maps several timesteps in the evolution of the Hamiltonian system is split into a sum,

F(q, P) = q \cdot P + \tilde{F}(q, P). \quad (8)

It is easy to check that the first part q · P in Eq. 8 describes the identity transformation q → Q, p → P. The relation between (q, p) and (Q, P) can then be written as

P = p - \frac{\partial \tilde{F}(q, P)}{\partial q}, \qquad Q = q + \frac{\partial \tilde{F}(q, P)}{\partial P}. \quad (9)

As any transformation via a generating function is canonical by definition 1 , the mapping created using a generating function also preserves the symplectic structure of phase space.
In the limit of small mapping times, the Hamiltonian H can be identified (up to a constant) with the (differential) generating function F̃, as time evolution is considered to be an infinitesimal canonical transformation. Namely, from Eq. 9, we obtain the following expressions for (Q, P):

Q = q + \frac{\partial \tilde{F}(q, P)}{\partial P}, \qquad P = p - \frac{\partial \tilde{F}(q, P)}{\partial q}.

This can be compared with the equations of motion in Eq. 2, where a first order approximation yields a symplectic Euler integration step,

Q = q + \Delta t \, \frac{\partial H(q, P)}{\partial P}, \qquad P = p - \Delta t \, \frac{\partial H(q, P)}{\partial q}.

Comparing those sets of equations yields the relation F̃ = H Δt up to an irrelevant constant shift, where Δt = t_1 − t_0 is the mapping time.
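The identification F̃ = H Δt can be made concrete in code. The following sketch uses the standard pendulum Hamiltonian H = p²/2 − cos q as an illustrative stand-in and applies the type-2 transformation P = p − ∂F̃/∂q, Q = q + ∂F̃/∂P with F̃ = Δt H, which is exactly a symplectic Euler step:

```python
import numpy as np

def symplectic_euler_step(q, p, dt):
    """One symplectic Euler step for the pendulum H = p^2/2 - cos(q).

    Equivalent to the type-2 generating function F(q, P) = q*P + dt*H(q, P):
    P = p - dt * dH/dq(q, P)  (explicit here, since dH/dq = sin(q))
    Q = q + dt * dH/dP(q, P)
    """
    P = p - dt * np.sin(q)   # P = p - dF~/dq, with F~ = dt*H
    Q = q + dt * P           # Q = q + dF~/dP
    return Q, P

# Energy error stays bounded over many steps (symplectic property)
q, p = 1.0, 0.0
H0 = 0.5 * p**2 - np.cos(q)
for _ in range(10000):
    q, p = symplectic_euler_step(q, p, 0.01)
H = 0.5 * p**2 - np.cos(q)
print(abs(H - H0))  # bounded oscillation, no secular drift
```

A non-symplectic first-order scheme with the same step size would instead show a steady drift of the energy away from H0.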

A. Derivative observation in non-symmetric collocation
Collocation and regression via basis functions approximate observed function values g(x) ∈ R^D for x ∈ R^d by fitting a linear combination of the chosen basis ϕ_i(x),

g(x) \approx \sum_{i=1}^{n} \alpha_i \varphi_i(x), \quad (14)

where α_i are the weights and n is the number of basis functions. Suitable bases are e.g. orthogonal polynomials, splines, trigonometric functions (Fourier series) or radial basis functions with kernels ϕ_i(x) = ϕ(‖x − x_i‖). In order to compute the weights α_i, we solve

\Phi \alpha = g(X),

where Φ ∈ R^{N×n} with Φ_ij = ϕ_j(x_i) for N training points, and X is the d × N design matrix. The interpolant g* at any point x* is given by

g^*(x^*) = \sum_{i=1}^{n} \alpha_i \varphi_i(x^*).

When dealing with derivative observations, applying a linear operator L, e.g. differentiation, to Eq. 14 yields

L g(x) \approx \sum_{i=1}^{n} \alpha_i \, L\varphi_i(x),

which, when combining g(x) with Lg(x), results in the collocation matrix

\begin{pmatrix} \Phi \\ \Psi \end{pmatrix}, \qquad \Psi_{ij} = L\varphi_j(x_i).

In the case of radial basis functions (RBFs), L is applied to the kernel function ϕ(x_i, x_j) only once, in the first argument. The resulting linear system is usually overdetermined and has to be solved in a least-squares sense. When applied to partial differential equations the resulting method is called non-symmetric or Kansa's method 23,24 .
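As a minimal illustration of non-symmetric collocation with combined value and derivative observations, the sketch below fits sin(x) with Gaussian RBFs; the basis centers, length scale and data are illustrative choices, not taken from the paper:

```python
import numpy as np

# Gaussian RBF basis phi_j(x) = exp(-(x - c_j)^2 / (2 l^2)) and its derivative
centers = np.linspace(0.0, 2.0 * np.pi, 12)
l = 0.8

def phi(x):                      # N x n basis matrix
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * l**2))

def dphi(x):                     # L phi: derivative in the first argument
    return -(x[:, None] - centers[None, :]) / l**2 * phi(x)

# Observe values of g(x) = sin(x) and derivatives g'(x) = cos(x)
x = np.linspace(0.0, 2.0 * np.pi, 20)
Phi = np.vstack([phi(x), dphi(x)])          # stacked collocation matrix (Phi; Psi)
y = np.concatenate([np.sin(x), np.cos(x)])

# Overdetermined system, solved in a least-squares sense (Kansa's method)
alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)

x_star = np.array([1.3])
g_star = phi(x_star) @ alpha                # interpolant at a test point
print(g_star[0], np.sin(1.3))
```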

B. Derivative observation in (symmetric) linear regression
In order to obtain a directly invertible positive definite system, one may instead use a symmetric least-squares regression method in a product basis 22 . Multiplying Eq. 14 by ϕ_j(x_k) and subsequently summing over N observations, we arrive at a minimization problem corresponding to Aα = b, with

A_{ij} = \sum_{k=1}^{N} \varphi_i(x_k) \varphi_j(x_k), \qquad b_i = \sum_{k=1}^{N} \varphi_i(x_k) \, g(x_k).

When derivative observations are considered, the basis changes to ψ_i = Lϕ_i, resulting in an analogous system with ϕ replaced by ψ. Here, A is a symmetric positive definite matrix that is directly invertible.

C. Multi-output GPs and derivative observations
A Gaussian process (GP) 17 is a stochastic process with the convenient property that any finite marginal distribution of the GP is Gaussian. For x ∈ R^d, a GP with mean m(x) and kernel k(x, x') is written as

f(x) \sim \mathcal{GP}(m(x), k(x, x')),

or

F(x) \sim \mathcal{GP}(m(x), K(x, x')),

where we allow vector-valued functions 21 . In contrast to the single output case, where the random variables are associated to a single process for f(x) ∈ R, a multi-output GP for f(x) ∈ R^D consists of random variables associated to different and generally correlated processes. The covariance function is a positive semidefinite matrix-valued function whose entries (K(x, x'))_ij express the covariance between the output dimensions i and j of f(x). In case a linear model for the mean m with some functional basis ϕ_i and unknown coefficients is used, a modified Gaussian process follows according to Rasmussen & Williams 17 , chapter 2.7.
For regression via a GP we assume that the observed function values Y ∈ R^{D×N} may contain local Gaussian noise ε with covariance matrix Σ_n, i.e. the noise is independent at different positions x but may be correlated between components of y = f(x) + ε. The input variables are aggregated in the d × N design matrix X. After observing Y, the posterior mean F* ≡ E(F(X*)) and covariance evaluated at validation data X* are given analytically by

F^* = K(X^*, X) \left[ K(X, X) + \Sigma_n \right]^{-1} Y,
\mathrm{cov}(F^*) = K(X^*, X^*) - K(X^*, X) \left[ K(X, X) + \Sigma_n \right]^{-1} K(X, X^*),

where Σ_n ∈ R^{D×D} is the covariance matrix of the multivariate output noise, acting blockwise on the observations. In the simplest case it is diagonal with entries Σ_n^{ii} = σ_n². Estimation of kernel parameters and Σ_n given the input data is usually performed via optimization or sampling according to the marginal log-likelihood.
When a linear operator L, e.g. differentiation, is applied to the Gaussian process, this yields a new Gaussian process 18,19,25 ,

L f(x) \sim \mathcal{GP}(l(x), \, L_x k(x, x') L_{x'}^T).

Here the mean is given by l(x) = Lm(x), and a matrix-valued gradient kernel follows, where L_{x'}^T is applied from the right to yield an outer product 26 . As differentiation is a linear operation, in particular the gradient of a Gaussian process over scalar functions g(x) with kernel k(x, x') remains a Gaussian process. The result is a multi-output GP where the covariance matrix is the Hessian of k(x, x') containing all second derivatives in (x, x'). A joint GP describing both values and gradients is given by

\begin{pmatrix} f(x) \\ \nabla f(x) \end{pmatrix} \sim \mathcal{GP}\!\left( n(x), \begin{pmatrix} k(x, x') & \nabla_{x'}^T k(x, x') \\ \nabla_x k(x, x') & L(x, x') \end{pmatrix} \right), \quad (31)

with n(x) = (m(x), l(x))^T and L(x, x') = ∇_x ∇_{x'}^T k(x, x') as the lower-right block. In the more general case of a linear operator L, one may use the joint GP in Eq. 31 as a symmetric meshless formulation 23 to find approximate solutions of the corresponding linear (partial differential) equation.
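A joint GP over values and gradients can be assembled directly from derivatives of a scalar kernel. The sketch below (1D squared-exponential kernel, illustrative data) predicts function values from joint value and derivative observations:

```python
import numpy as np

# Joint GP over f and f' with k(x, x') = exp(-(x-x')^2 / (2 l^2)). Blocks:
# cov(f, f) = k, cov(f', f) = dk/dx, cov(f, f') = dk/dx', cov(f', f') = d2k/dxdx'
l = 1.0

def k(a, b):
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * l**2))

def k10(a, b):      # d/da k; note dk/db = -k10 for this stationary kernel
    return -(a[:, None] - b[None, :]) / l**2 * k(a, b)

def k11(a, b):      # d2/(da db) k
    d = a[:, None] - b[None, :]
    return (1.0 / l**2 - d**2 / l**4) * k(a, b)

x = np.linspace(0.0, 2.0 * np.pi, 8)
y = np.concatenate([np.sin(x), np.cos(x)])     # values and derivatives

# Joint covariance of (f(x), f'(x)); symmetric by construction
K = np.block([[k(x, x),  -k10(x, x)],
              [k10(x, x), k11(x, x)]])
K += 1e-10 * np.eye(2 * len(x))                # jitter for conditioning

x_star = np.array([1.0])
# Cross-covariance of f(x*) with the joint observations (f, f')
K_star = np.hstack([k(x_star, x), -k10(x_star, x)])
f_star = K_star @ np.linalg.solve(K, y)
print(f_star[0], np.sin(1.0))
```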

D. Symplectic GP regression
To apply GP regression to symplectic maps we use Eq. 31 for the joint distribution of the generating function F̃ and its gradients in Eq. 9. We cannot observe the generating function F̃(q, P) itself, but it is determined up to an additive constant via the predictor

\tilde{F}^* = \begin{pmatrix} \partial_q k(X^*, X) & \partial_P k(X^*, X) \end{pmatrix} \left[ L(X, X) + \Sigma_n \right]^{-1} Y, \quad (34)

where column i of X for the i-th training orbit is composed of rows (q_i, P_i), and similarly for X* and test points. Columns of Y contain the observed gradients,

y_i = \begin{pmatrix} p_i - P_i \\ Q_i - q_i \end{pmatrix} = \begin{pmatrix} \partial_q \tilde{F}(q_i, P_i) \\ \partial_P \tilde{F}(q_i, P_i) \end{pmatrix}.

The matrix L denotes the lower-right block of the joint covariance in Eq. 31, i.e. the Hessian of the kernel k with respect to (q, P) and (q', P'). This also allows one to learn the Hamiltonian H from Eq. 34, as for sufficiently small mapping times H can be approximated by F̃/Δt (up to a constant).
For further investigations of the temporal evolution of the Hamiltonian system and the construction of symplectic maps, we are interested in the gradients of F̃ via the block L. The predictive mean for this GP's output is given by

\begin{pmatrix} \Delta p^* \\ \Delta q^* \end{pmatrix} = L(X^*, X) \left[ L(X, X) + \Sigma_n \right]^{-1} Y. \quad (38)

Let us now check the symplecticity condition for the predictors P* = p* − Δp*(q*, P*) and Q* = q* + Δq*(q*, P*) according to Eq. 5. The derivatives of the linear terms p* and q* vanish, and by using Eq. 38 the remaining derivatives enter the upper and lower rows of L(X*, X), respectively,

\frac{\partial \Delta p^*}{\partial P^*} - \left( \frac{\partial \Delta q^*}{\partial q^*} \right)^{\!T} = 0, \quad (39)

since both terms consist of the same mixed second derivatives of the kernel. Due to the symmetry of partial derivatives, the expected value of the symplecticity condition in Eq. 39 is identically zero, so the predictive mean of Eq. 38 produces a symplectic map. Due to the mixing of initial and final conditions by such a map, we generally cannot predict Q* and P* for a given (q*, p*) right away. Depending on the choice of the kernel, two cases have to be considered: a. Semi-implicit method In the most general case with a generating function F̃(q, P), the equations for P* in Eq. 38 are implicit and have to be solved iteratively as indicated in Algorithm 1.
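The structure of the implicit application step can be sketched as follows. To keep the example self-contained, the GP posterior gradients ∂F̃/∂q and ∂F̃/∂P are replaced by closed-form placeholders (F̃ = Δt H for the pendulum); only the Newton solve mirrors the role of Algorithm 1:

```python
import numpy as np

# Apply the implicit map: solve P = p - dp(q, P) for P by Newton iteration,
# then evaluate Q = q + dq(q, P) explicitly. dp and dq stand in for the GP
# posterior gradients of the generating function.
dt = 0.1

def dp(q, P):         # dF~/dq for F~ = dt * H, H = P^2/2 - cos(q)
    return dt * np.sin(q)

def dp_dP(q, P):      # d(dp)/dP (zero for this placeholder, generally not)
    return 0.0

def dq(q, P):         # dF~/dP
    return dt * P

def apply_map(q, p, tol=1e-12, maxit=20):
    P = p                                   # initial guess (a separate GP in the paper)
    for _ in range(maxit):
        r = P - p + dp(q, P)                # residual of the implicit equation
        if abs(r) < tol:
            break
        P -= r / (1.0 + dp_dP(q, P))        # Newton update
    return q + dq(q, P), P

q, p = 1.0, 0.5
H0 = 0.5 * p**2 - np.cos(q)
for _ in range(1000):
    q, p = apply_map(q, p)
print(abs(0.5 * p**2 - np.cos(q) - H0))     # bounded energy oscillation
```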
This corresponds to the implicit steps of a symplectic Euler integrator in a non-separable Hamiltonian system 6 .
b. Explicit method When considering a generating function in separable form F̃(q, P) = V(q) + T(P), the resulting transformation equations reduce to

P = p - \frac{\partial V(q)}{\partial q}, \qquad Q = q + \frac{\partial T(P)}{\partial P},

resulting in a simplified lower block L(q, P, q', P') with off-diagonal elements equal to 0:

L(q, P, q', P') = \begin{pmatrix} \partial_q \partial_{q'} V(q, q') & 0 \\ 0 & \partial_P \partial_{P'} T(P, P') \end{pmatrix}.

This choice of generating function results in a simplified construction and application of the symplectic map, as explained in Algorithm 2. This corresponds to the case of a separable Hamiltonian where the symplectic Euler method becomes fully explicit. The hope is that such a separable approximation remains possible also for non-separable systems, resulting in a trade-off between performance and accuracy compared to the semi-implicit variant.
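A compact sketch of the explicit variant: with a sum kernel the two gradient regressions decouple, so the map can be applied without iteration. The training data here come from a single symplectic Euler reference step of the pendulum, an illustrative stand-in for the integrated orbits used in the paper:

```python
import numpy as np

# Sum kernel k = V(q,q') + T(P,P') decouples the gradient observations:
# dp_i = dV/dq(q_i) and dq_i = dT/dP(P_i) are fit by two independent GPs whose
# kernels are the mixed second derivatives of a squared-exponential kernel.
dt, l = 0.1, 1.0

def k11(a, b):  # d2/(da db) of the squared-exponential kernel
    d = a[:, None] - b[None, :]
    return (1.0 / l**2 - d**2 / l**4) * np.exp(-d**2 / (2 * l**2))

rng = np.random.default_rng(0)
q = rng.uniform(0, 2 * np.pi, 30)
p = rng.uniform(-2, 2, 30)
P = p - dt * np.sin(q)            # reference map (one symplectic Euler step)
Q = q + dt * P

# Decoupled gradient observations of the two generating-function parts
aV = np.linalg.solve(k11(q, q) + 1e-10 * np.eye(30), p - P)   # dV/dq data
aT = np.linalg.solve(k11(P, P) + 1e-10 * np.eye(30), Q - q)   # dT/dP data

def apply_explicit(q0, p0):
    P0 = p0 - (k11(np.array([q0]), q) @ aV)[0]   # P = p - dV/dq(q), explicit
    Q0 = q0 + (k11(np.array([P0]), P) @ aT)[0]   # Q = q + dT/dP(P), explicit
    return Q0, P0

Q0, P0 = apply_explicit(1.0, 0.5)
print(Q0, P0)
```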

IV. NUMERICAL EXPERIMENTS
The described symplectic regression methods for flow maps are benchmarked for two Hamiltonian systems: the pendulum and the Hénon-Heiles system. In addition, the Hamiltonian function H is predicted using GPs following Eq. 34. We compare
1. implicit symplectic GP (Algorithm 1),
2. explicit symplectic GP (Algorithm 2),
3. implicit symmetric regression with a spectral basis,
4. implicit non-symmetric collocation with RBFs.
To assess the quality and stability of the proposed mapping methods, two quality measures are used. The geometric distance compares the first application of the constructed map to the respective timestep of a reference orbit in phase space. This phase-space distance is given by

d = \sqrt{ (q - q_{\mathrm{ref}})^2 + (p - p_{\mathrm{ref}})^2 }, \quad (49)

where q_ref and p_ref denote the reference orbit and q and p the mapped orbit. The reference orbits are calculated using an adaptive step-size Runge-Kutta scheme.
As energy is a constant of motion and should be preserved, the normalized energy oscillation,

E_{\mathrm{osc}} = \frac{ \sqrt{ \langle (H - \bar{H})^2 \rangle } }{ \bar{H} }, \quad (50)

where H̄ is the mean energy, serves as a criterion for mapping quality. Here, the energy oscillations are averaged over the whole considered time: k = 300 subsequent applications of the map.
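The two quality measures can be written down directly; the exact normalization used in the paper may differ slightly, so this is one plausible reading of the text:

```python
import numpy as np

# Geometric (phase-space) distance to a reference orbit after one map
# application, and normalized energy oscillation over k applications.
def phase_space_distance(q, p, q_ref, p_ref):
    return np.sqrt((q - q_ref)**2 + (p - p_ref)**2)

def energy_oscillation(H_values):
    H_bar = np.mean(H_values)
    return np.sqrt(np.mean((H_values - H_bar)**2)) / np.abs(H_bar)

H = np.array([1.0, 1.01, 0.99, 1.0])
print(phase_space_distance(1.0, 0.5, 1.1, 0.5), energy_oscillation(H))
```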

A. Pendulum
The pendulum is a nonlinear oscillator with position q and momentum p and exhibits two kinds of motion, libration and rotation, which are separated by the separatrix in phase space (Fig. 2). The pendulum corresponds to a particle trapped in a cosine potential 5 . The Hamiltonian is given by

H(q, p) = \frac{p^2}{2} - \cos q, \quad (51)

from which the equations of motion follow,

\dot{q} = p, \qquad \dot{p} = -\sin q.

The underlying periodic topology suggests the choice of a periodic kernel function in q and a squared exponential kernel function in p:

V(q, q') = \exp\!\left( -\frac{2 \sin^2\!\big((q - q')/2\big)}{l_q^2} \right), \qquad T(p, p') = \exp\!\left( -\frac{(p - p')^2}{2 l_p^2} \right).

The hyperparameters l_q and l_p, which correspond to the length scales in q and p respectively, are set to their optimized maximum likelihood values by minimizing the negative log-likelihood 17 . For the product kernel k(q, q_i, P, P_i) = σ_f² V(q, q_i) T(P, P_i) used in the implicit method, the noise in observations σ_n² (Eq. 27) is set to 10⁻¹⁶, whereas for the sum kernel k(q, q_i, P, P_i) = σ_f² (V(q, q_i) + T(P, P_i)) used in the explicit method, σ_n² = 10⁻¹⁰. The scaling of the kernel σ_f², which quantifies the amplitude of the fit, is set in accordance with the observations and fixed at 2 max(|Y|)², where Y corresponds to the training output. For the spectral method, an expansion in Fourier sine/cosine harmonics in q and Hermite polynomials in p was chosen to account for the periodic phase space.
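The product of a periodic kernel in q and a squared-exponential kernel in p can be sketched as follows; the hyperparameter values are illustrative, not the optimized maximum-likelihood values:

```python
import numpy as np

# Product kernel respecting the pendulum's phase-space topology:
# periodic in q (period 2*pi), squared-exponential in p.
l_q, l_p, sigma_f2 = 1.0, 1.0, 1.0

def kernel(q1, p1, q2, p2):
    V = np.exp(-2.0 * np.sin(0.5 * (q1 - q2))**2 / l_q**2)   # periodic in q
    T = np.exp(-(p1 - p2)**2 / (2.0 * l_p**2))               # SE in p
    return sigma_f2 * V * T

# Shifting q by 2*pi leaves the kernel invariant (periodicity check)
print(kernel(0.3, 0.5, 1.0, -0.2), kernel(0.3 + 2 * np.pi, 0.5, 1.0, -0.2))
```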
The number of modes and the degree depend on the number of training points to obtain a fully or overdetermined fit.
The Hamiltonian function H given in Eq. 51 is learned from the initial state (q, p) and final state (Q, P) using Eq. 34. In contrast to earlier proposed methods 12 , derivatives of H are not needed explicitly, as the training data consist only of observable states of the dynamical system in time. We use a fully symmetric collocation scheme with a product kernel k(q, q_i, P, P_i) ∝ V(q, q_i)T(P, P_i). We obtain a training loss of 1.3 · 10⁻⁵ and a validation loss of 6.3 · 10⁻⁵. As evident in Fig. 3, the extrapolation ability is restricted to areas close to the range of the training data in p, whereas longer extrapolations in q are possible due to the periodicity of the system and the use of a periodic kernel function.
To evaluate the different mapping methods, we use training data points sampled from a Halton sequence within the range q ∈ [0, 2π] and p ∈ [−2.5, 2.5]. The 100 validation data points shown in Fig. 2 are chosen randomly within the range q ∈ [π − 2.8, π + 1.5] and p ∈ [−2.3, 1.8] and include several possibly unstable points close to the separatrix. Naturally, motion near the separatrix is unstable, as a small deviation from the true orbit can result in a change of the kind of motion. As this does not represent physical behaviour, those unstable points are removed from the final results.

B. Hénon-Heiles Potential
The Hénon-Heiles problem is a classical example of a nonlinear Hamiltonian system with f = 2 degrees of freedom and a 4D phase space 5 . The Hamiltonian for the Hénon-Heiles problem is given by

H(q, p) = \frac{1}{2}\left( p_1^2 + p_2^2 \right) + \frac{1}{2}\left( q_1^2 + q_2^2 \right) + \lambda \left( q_1^2 q_2 - \frac{q_2^3}{3} \right),

where we fix λ = 1. The equations of motion follow directly as

\dot{q}_1 = p_1, \quad \dot{q}_2 = p_2, \quad \dot{p}_1 = -q_1 - 2\lambda q_1 q_2, \quad \dot{p}_2 = -q_2 - \lambda \left( q_1^2 - q_2^2 \right).

The underlying potential varies continuously from a harmonic potential for small values of q_1 and q_2 to triangular equipotential lines at the edges. For energies lower than the limiting potential energy H_esc = 1/6, the orbit is trapped within the potential. For larger energies, however, three escape channels appear due to the special shape of the potential, through which the orbit may escape 27 . Therefore, the training and validation data are restricted to an area in phase space that keeps the motion bounded within the potential.
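For reference, the Hénon-Heiles equations of motion and the escape energy H_esc = 1/6 can be checked numerically. The sketch below uses a simple symplectic Euler integrator as a stand-in for the adaptive Runge-Kutta reference integrator:

```python
import numpy as np

# Henon-Heiles energy and force (lambda = 1); a trapped orbit below the escape
# energy H_esc = 1/6 stays bounded, and the symplectic integrator keeps the
# energy error bounded as well.
def hh_energy(q, p):
    return (0.5 * (p[0]**2 + p[1]**2) + 0.5 * (q[0]**2 + q[1]**2)
            + q[0]**2 * q[1] - q[1]**3 / 3.0)

def hh_force(q):
    return np.array([-q[0] - 2.0 * q[0] * q[1],
                     -q[1] - q[0]**2 + q[1]**2])

q = np.array([0.1, 0.0])
p = np.array([0.0, 0.3])
H0 = hh_energy(q, p)          # well below H_esc = 1/6
dt = 0.01
for _ in range(20000):
    p = p + dt * hh_force(q)  # symplectic Euler: momentum kick, then drift
    q = q + dt * p
print(H0, abs(hh_energy(q, p) - H0))
```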
Here, a squared exponential kernel function is used in all dimensions, where the hyperparameter l is set to its optimized maximum likelihood value. The noise in observations σ_n² (Eq. 27) is set to 10⁻¹⁶ for the implicit method and to 10⁻¹⁰ for the explicit method. As in the pendulum case, the kernel scaling σ_f² is fixed in accordance with the training output.

V. DISCUSSION
In Fig. 8 and Fig. 9, the four methods are compared for the one-dimensional pendulum and the Hénon-Heiles problem, respectively, using the quality criteria given in Eq. 49 and Eq. 50 for a fixed number of training points N but a variable mapping time t/τ_b, where τ_b is the linearized bounce time. As expected, the geometric distance d as well as the energy oscillation E_osc increase with increasing mapping time. Since no Newton iterations are needed, the explicit GP method is faster than the implicit method in its region of validity. As a separate GP is used for the first guess of Newton's method, as indicated in Algorithm 1, typically fewer than 5 Newton iterations are necessary in the implicit case.
As apparent in Fig. 5 for the 1D case, the orbits in phase space resulting from the explicit GP method are tilted, which explains the poor performance regarding the geometric distance.
For smaller mapping times, the tilt and therefore also the energy oscillation are reduced. This is in accordance with the similar behavior of explicit-implicit Euler integration schemes. More severely, the explicit GP loses long-term stability at increasing mapping time t/τ_b in the Hénon-Heiles system, as certain orbits escape the trapped state after a few tens to hundreds of applications of the map. This severely limits the applicability range of the explicit GP method in its current state.
Spectral linear regression produces very accurate results for very small mapping times, as the interpolated data (the change in coordinates) is almost 0, and the generating function inherits the polynomial structure of the Hamiltonian H that can be fitted exactly. At larger mapping times, implicit GP regression and spectral methods perform similarly, with somewhat better results of the implicit GP for the pendulum, and higher accuracy of the spectral fit for the Hénon-Heiles system.
To investigate the behavior with an increasing number of training points N, the quality measures are compared in Fig. 6 for the pendulum and Fig. 7 for the Hénon-Heiles system at fixed mapping time t/τ_b but with a variable number of training points. The explicit method does not improve with N due to the tilt in phase space, which is an inherent structural feature of the forced splitting into a sum kernel and cannot be fixed by adding more training data. The implicit methods improve considerably with N.
While the explicit GP method remains stable for the pendulum, it again becomes unstable in the Hénon-Heiles system. The results for the explicit method with GPs are cut off where the method becomes unstable for larger mapping times.

VI. CONCLUSION AND OUTLOOK
In this paper, we have presented a novel approach to represent symplectic flow maps of Hamiltonian dynamical systems using Gaussian Process regression. Intended applications are long-term tracing of fast charged particles in accelerators and magnetic plasma confinement configurations.
A considerable advantage compared to existing methods in a spline or spectral basis is the possibility of using input data of arbitrary geometry with GPs. Moreover, the presented method uses considerably less training data than neural networks in the presented test cases. The concept was validated on two Hamiltonian systems: the 1D pendulum and the 2D Hénon-Heiles problem.
An implicit approach was shown to yield similar accuracy to linear regression in a spectral basis, whereas an explicit mapping requires no iterations in the application of the map, at the cost of accuracy and stability. Observation of training data within a short period of time allows for accurate interpolation and even extrapolation of the Hamiltonian function H using substantially fewer training points compared to existing methods.
To increase the accuracy of the symplectic mappings as well as the prediction of H, higher-order implicit methods analogous to symplectic schemes such as the midpoint, Gauss-Legendre or higher-order Runge-Kutta schemes could be investigated in the future. Especially the explicit method in combination with a Verlet scheme seems promising to leverage fast computation together with possibly higher accuracy. This may also help to overcome the stability issues faced in the Hénon-Heiles system due to a reduced training error with less risk of overfitting. Another important next step is the investigation of maps between Poincaré sections. This will allow the mapping time to be increased even further, but requires prior separation of regions with different orbit classes in phase space. In the future, higher-dimensional systems as well as chaotic behavior could be investigated based on such Poincaré maps.