The $\textit{u}$-series: A separable decomposition for electrostatics computation with improved accuracy

The evaluation of electrostatic energy for a set of point charges in a periodic lattice is a computationally expensive part of molecular dynamics simulations (and other applications) because of the long-range nature of the Coulomb interaction. A standard approach is to decompose the Coulomb potential into a near part, typically evaluated by direct summation up to a cutoff radius, and a far part, typically evaluated in Fourier space. In practice, all decomposition approaches involve approximations---such as cutting off the near-part direct sum---but it may be possible to find new decompositions with improved tradeoffs between accuracy and performance. Here we present the $\textit{u-series}$, a new decomposition of the Coulomb potential that is more accurate than the standard (Ewald) decomposition for a given amount of computational effort, and achieves the same accuracy as the Ewald decomposition with approximately half the computational effort. These improvements, which we demonstrate numerically using a lipid membrane system, arise because the $\textit{u}$-series is smooth on the entire real axis and exact up to the cutoff radius. Additional performance improvements over the Ewald decomposition may be possible in certain situations because the far part of the $\textit{u}$-series is a sum of Gaussians, and can thus be evaluated using algorithms that require a separable convolution kernel; we describe one such algorithm that reduces communication latency at the expense of communication bandwidth and computation, a tradeoff that may be advantageous on modern massively parallel supercomputers.


I. Introduction
Evaluation of the electrostatic energy for a set of point charges in a periodic lattice is required for a number of applications, including molecular dynamics (MD) simulations, electronic structure calculations, crystallographic calculations, and hydrodynamic simulations. The long-range nature of the Coulomb interaction leads to well-known computational challenges arising from the inefficiency of the direct summation of all pairwise interactions. [1][2][3] For MD simulations of biophysical systems in aqueous solution with explicitly modeled solvent 4 -the application that motivates us here-the unscreened Coulomb interactions are non-negligible even at distances as large as 100 Å, and thus extend over the entirety of most simulated systems. Furthermore, when using spatial-domain decompositions 5 on massively parallel computers, the long range of the Coulomb interaction leads to communication latency that limits parallel scaling. 6,7 Many efficient approaches for evaluating the total electrostatic energy are based on Ewald's decomposition, 2,8 which divides the Coulomb potential function 1 / r into short-range and longrange parts, also referred to as the "near" and "far" parts, respectively. The near part is effectively range-limited, and is computed by direct summation up to a cutoff radius rc. The far part is typically computed as a sum in k-space (also known as reciprocal or Fourier space), where it too is effectively range limited, and can be efficiently computed by methods such as optimized Ewald summation, 9,10 the particle-mesh Ewald algorithm (PME), 11,12 the particle-particle particle-mesh method, 13 the fast Fourier Poisson method, 14 or Gaussian split Ewald (GSE). 15 The near interactions in the Ewald decomposition decay rapidly but do not vanish at rc, leading to various undesirable artifacts depending on the specific implementation. The typical approach of abruptly setting the near forces to zero beyond rc is equivalent to truncating the near electrostatic energy and shifting it downwards so that it vanishes continuously at the cutoff. This is known as the "cut-and-shift" approach, 16 and it introduces a small, constant error in the electrostatic energy at distances up to rc. More sophisticated techniques based on switching functions have been applied to the Ewald near term to increase its smoothness at rc; such techniques eliminate the error in the electrostatic energy up to the distance at which the switching function takes effect, thus localizing the error near rc, but have not been found to measurably reduce errors in actual biomolecular simulations. 17 There has also been some work exploring decompositions based on screening charge distributions, 18-20 which can be made continuous at rc, but which still exhibit errors at distances less than rc.
In the present work, we introduce the u-series, a new decomposition of the Coulomb potential into near and far parts that is constructed to be exact up to the cutoff and continuous at the cutoff.
The far part of the u-series is essentially the long-range portion of a bilateral series (a series whose summation index runs from  to ) presented by Beylkin and Monzón, 21,22 who observed that the function 1 / r can be well approximated with uniform relative error by a bilateral series of suitably scaled Gaussians. (The name "u-series" was chosen to reflect this property of uniform relative error). The near part is 1 / r minus the far part, truncated at rc (hence the near and far parts sum to exactly 1 / r at distances up to rc). We construct the decomposition in such a way that the near part vanishes and is continuously differentiable at rc, without recourse to approximations that perturb the total energy (e.g., shifting or switching). As a result, the magnitude of the Coulomb interaction error near rc is significantly decreased relative to the Ewald decomposition.
Our approach was motivated by the hypothesis that moving the electrostatic energy error from before rc (as in the Ewald decomposition) to beyond rc would improve overall accuracy due to self-cancellation of long-range errors in a charge-neutral system. This intuition is borne out empirically by numerical experiments in which we find a substantial improvement in the accuracy of simulation observables for a lipid membrane system. Alternately, the u-series can match the accuracy of the Ewald decomposition with reduced computational effort; based on our numerical tests, we estimate that u-series needs only about half the computational effort of Ewald for the same accuracy. In practice, these computational savings are effected by using a smaller cutoff radius for the near part and/or a coarser grid when employing a mesh-based method to evaluate the far part.
In certain situations, the u-series may have certain additional advantages: Its far part is a sum of Gaussians, which are separable functions (that is, products of one-dimensional functions; e.g., g(x,y,z) = f(x)f(y)f(z)), allowing its far part-unlike the far part of the Ewald decomposition-to be evaluated by algorithms that take advantage of this separability. Three-dimensional convolutions of separable kernels are computationally less expensive than convolutions of nonseparable kernels, 23 and we will show that they also admit implementations with lower communication latency on parallel machines. In particular, we present a real-space algorithm that evaluates the far part with a lower communication latency than standard Fourier space methods, at the expense of some overhead in communication bandwidth and computation. Such a tradeoff may be beneficial on massively parallel supercomputers, providing an additional potential performance advantage for the u-series beyond that which is illustrated in our numerical experiments.

II. The Problem of Evaluating Electrostatic Interactions
We begin by reviewing the problem of evaluating the electrostatic energy for periodic systems.
We briefly discuss Ewald's method and refer the reader to refs. 1-3 for additional details. For an 5 in-depth treatment of the structure of the electrostatic potential for periodic charge distributions, we refer the reader to the original analysis of de Leeuw et al. 24 Let ρ(r) be a charge density that satisfies periodic boundary conditions ρ(r) = ρ(r + n), where n denotes a vector (mxLx, myLy, mzLz) with mx, my, mz integers, and Lx, Ly, Lz the side lengths of an orthorhombic unit cell of volume V = LxLyLz.. The unit cell, denoted by Ω, is assumed to be charge neutral (that is, ∫ ( ) = 0). The charge density ( ) can be continuous, discrete, or a combination of the form where ρc(r) is the continuous part of the distribution, qi denotes the magnitude of the point charge at position ri within the unit cell, and δ is the Dirac delta function. The electrostatic potential Φ( ) generated by the infinite, periodic distribution of charge is formally given by The potential defines the electrostatic energy ℰ of the unit cell by the formula As is customary, the prime in the integral of Eq. (3) means that the potential has the Coulomb singularity at ri removed when interacting with the point charge qi (which itself was the source of the singularity). It can be shown that Φ(r) can be expressed as where σ > 0 is an arbitrary length scale. The splitting can be physically rationalized 18,19 by canceling the charge distribution ρ(r) in Eq. (2) with a smoother version of itself obtained by convolution with a Gaussian of width (standard deviation) σ. The second term in Eq. (6) (the far part) then represents the electrostatic potential generated by a unit charge that is distributed as a Gaussian of width σ centered about the origin. It can be verified that 7 which shows that the far part in the Ewald decomposition is a continuous superposition of Gaussians of width at least σ. Similarly, the near part is a superposition of Gaussians of width at most σ.
The near and far parts of the Ewald decomposition are substituted for the Coulomb interaction in Eq.
(2), resulting in the near and far potentials, respectively. The integral involving the near potential converges absolutely in real space, where it is truncated at a suitable cutoff radius rc and computed as an explicit sum of pairwise interactions.
The far part of the electrostatic potential, generates an absolutely convergent sum in k-space, which takes the form The far part of the electrostatic energy is then defined by where the constant ∑ term subtracted in Eq. (10) is the self-interaction energy.
The cost of computing the near potential is proportional to the number of particles and to the volume of interaction 4πrc 3 / 3 for each particle. The cost of evaluating the k-space sum depends on the specific method used. For direct Ewald summation, a k-space cutoff kc is set to a fixed multiple of 1 / σ, and the number of k-space lattice points in the ball of radius kc is proportional to the volume 4πkc 3 / 3. For grid-based techniques such as PME and GSE, the mesh spacing of the grid is proportional to σ (for a given amount of accuracy), and computational effort scales at least linearly with the number of grid points. For both methods, the cost of computing the far potential is thus at least proportional to 1 / σ 3 .

III. The u-series
In this section we introduce the u-series, a new decomposition of the Coulomb potential for Ewald-like electrostatics computations. Our starting point is a previously known 21,22 bilateral series approximation for 1 / r, which we study in Section IIIA. In Section IIIB we show how this bilateral series approximation leads directly to a decomposition of 1 / r. In Section IIIC we make small but important modifications to this decomposition to improve its accuracy and to make it more suitable for use in MD simulations, resulting in the decomposition we call the u-series.
A. The bilateral series approximation for 1 / r The function 1 / r can be approximated by a bilateral infinite series of Gaussians of geometrically spaced widths, which we refer to as the bilateral series approximation (BSA).
Let ( ) = e ∕ ∕ √2π be a Gaussian of width σ, and let b > 1 be a positive constant (the base). The bilateral series approximation for 1 / r is defined as 9 The BSA is motivated by the identity which follows from the change of variables u = rt. Rearranged, this identity provides a wellknown integral expression for 1 / r, A quadrature of Eq. (13) with geometrically spaced quadrature points (i.e., t = b −x ) leads directly to the BSA approximation for 1 / r in Eq. (11).
We are not the first to observe that 1 / r may be represented by the series of geometrically scaled Gaussian terms given in Eq. (11). Notably, Beylkin and Monzon 21,22 have previously made use of the same observation as a starting point to develop approximations to 1 / r with bounds on the relative error for a finite range   r  1, where  > 0. Such approximations have been used in the context of electronic structure calculations, 25 and other approximations that use a sum of Gaussians to represent an electrostatic potential have found use in that context 26 or in quantum mechanics/molecular mechanics calculations. [26][27][28] Our goal is to develop a decomposition of 1 / r that is particularly suitable for use in MD simulations, and as a first step we establish some important properties of the BSA.
The absolute relative error of the BSA has the asymptotic bound as b → 1 (see Appendix B for details). We denote this asymptotic error bound by Mb and remark that it is independent of σ, being strictly a function of the base b. Numerical investigations show that Mb (14) is within 1% of the exact bound for b ≤ 2. The relative error for the BSA with b = 2 is illustrated in Fig. 1, with the asymptotic error bound M2 plotted in red.
The BSA features the scaling property which follows immediately from the substitution j = j + 1 in Eq. (11). It follows that the relative error has the property Owing to Eq. (16), the relative error is uniquely determined as a function of r by its values on the interval [1,b). These values are repeated in the interval [b k ,b k+1 ) for each integer k. The relative error is thus uniform across the entire real axis, being no worse than it is on the interval [1,b).
This uniform approximation property is shown in Fig. 1, which also illustrates that the BSA relative error has infinitely many roots that accumulate both to the origin and to infinity. The scaling invariance expressed by Eq. (16) implies this statement provided that there is at least one root in the interval [1,b), which we prove in Appendix D.

B. The BSA decomposition
Ewald's method evaluates the near part in real space and the far part in k-space, where they are respectively localized. The Gaussian terms in the BSA decay super-exponentially (i.e., faster 11 than exponentially) in real space and k-space as functions of r / (b j σ) and kσ / b j , respectively (see Appendix C for the k-space form of the BSA). The BSA thus provides a natural approximate decomposition of 1 / r (the BSA decomposition) into terms with positive and negative j: The components of the BSA decomposition are illustrated in Fig. 2.
The BSA decomposition has several desirable properties. The near and far terms have fast decay in real and k-space, respectively. The relative error of the BSA series converges exponentially in the limit b → 1, as shown in Eq. (14). Finally, the Gaussian components of the BSA are separable; in fact, Gaussians are the only separable spherically symmetric functions. As will be discussed in Section V, this enables the use of lower-latency parallel algorithms if the infinite series can be truncated.

C. The u-series
Two properties of the BSA decomposition make it not ideally suited for direct use in MD simulations. First, the near part of the BSA decomposition is not range limited, which leads to truncation errors when the near term is evaluated only up to a cutoff radius rc. Second, the error of the BSA decomposition is large in absolute terms for values of r much smaller than rc, since the relative BSA error extends all the way down to r = 0. The u-series addresses these issues by retaining the far part of the BSA decomposition ℱ ( ) ≜ ℱ , ( ) = 2 ln( ) ∑ (19) while replacing the near part with The cutoff radius rc is chosen to be the smallest root of The u-series ( ) is the sum of the near and far parts: By construction, the u-series equals 1 / r exactly for r < rc, and the near part has its range limited to r < rc. For r  rc, the relative error of the u-series is equal to that of ℱ ( ), thus approaching the uniformly bounded relative error of the BSA ℬ ( ) for r ≫ rc. Owing to the fast decay of Gσ(r), the relative error of the BSA and the u-series approach each other rapidly when r  rc, as illustrated in Fig. 3.
By choosing rc to be a root of ℱ ( ) − 1, we ensure continuity of ( ) at rc, and thus along the entire real axis. As a result, the u-series does not incur the truncation errors associated with the cut-and-shift approach. The smallest root is chosen because it minimizes the O(rc 3 ) computational work required to compute the near term.
The definition of the u-series relies on the existence of roots of ℱ ( ) − 1. We can confirm their existence by observing again that the near part of the BSA decomposition decays super-exponentially in r, so the relative error ℱ ( ) − 1 of the far part approaches ℬ ( ) − 1 for sufficiently large r, and thus has infinitely many roots away from the origin, as illustrated in Fig. 3. Because the near part is missing, the roots of ℱ ( ) − 1 do not accumulate at the origin, as they do for ℬ ( ) − 1. Fig. 3 plots the first four roots of ℱ ( ) − 1.
Note that so that rc is equivalently the smallest root of ( / )ℱ ( / ) − 1. For a given base b, define r0 to be the smallest root of ℱ ( ) − 1; then we have the relation rc = r0σ. In practice, depending on the best approach for maximizing computational performance on a specific platform, one can either choose σ and then set rc = r0σ, or choose rc and then set σ = rc / r0.
The smoothness of the u-series at rc can be increased by modifying the definition of the far part.
We use the nomenclature C k construction for a family of u-series decompositions with k continuous derivatives at all r > 0; accordingly the simplest form of the u-series-with far part defined in Eq. (19)-is referred to as the C 0 construction. A C 1 -continuous form of the u-series that we refer to as the C 1 construction, and which numerical tests described in later sections suggest is particularly useful for MD applications, is obtained by varying the coefficient of the narrowest Gaussian in the far part of the series as follows: 14 To ensure that the resulting decomposition is C 1 -continuous for a given base b, we determine the pair (r0, w0) for which ℱ ( ) and its first derivative evaluated at r0 are equal to 1 / r0 and −1 / r0 2 , respectively. Using Eq. (24) we can solve for w0 in terms of r0 such that ℱ ( ) = 1/ : r0 is then obtained by solving the equation As is true for the C 0 construction, there exist multiple solutions to Eq. (26), and we pick r0 to be the smallest one. The many solutions possible for any given b give rise to the curves shown in Fig. 4. The C 1 construction contains some solutions that are not optimal; that is, there is another solution with both smaller b (higher accuracy) and smaller r0 (lower computational cost). The solid red curves in Fig. 4 highlight the set of all optimal C 1 constructions, which comprise a countably infinite set of intervals.
The points of discontinuity marked by bullets in Fig. 4 are points where two roots of Eq. (26) coalesce. For these special values of b, the u-series is C 2 continuous. The points of C 2 continuity can be found by solving the system of equations with unknowns b, r0, and w0, where ℱ is the C 1 construction term specified by Eq. (24).
Eq. (27) admits countably many solutions, forming a sequence such that r0 → ∞ as b → 1. Each C 2 solution is the most accurate u-series in its interval. Parameters for the first three are given in Table 1, along with the corresponding bounds Mb on the relative error. We believe that the C 1 construction suffices for all purposes relevant to MD simulations, especially since it has a subset of C 2 -continuous solutions that can be used if such solutions are required. For even smoother approximations, additional parameters could be varied, for example the width s0 of the narrowest Gaussian, as in Varying the parameters defining the narrowest Gaussian is preferable, in order to prevent the appearance of large errors away from r0. We refer to Eq. (28) as the C 2 construction and provide parameters for b = 2 in the third row of Table 2, but we found that this and smoother constructions suffer from larger values of r0, as well as larger overshoot or undershoot of the BSA relative error. With respect to the magnitude of the relative error for r > r0, the C 1 construction is well behaved, as illustrated in Fig. 6 for b = 2. Although it undershoots the lower bound -M2 near r0, it blends more smoothly at r0 than the C 0 construction (compare to the black solid line of Fig. 3, which does not go out of bounds, but exhibits a clear kink at r0). For b ≤ 2, 16 numerical investigations show that the overshoot/undershoot never exceeds 20%, in part because the optimized values for the parameter w0 remain close to 1 (see Fig. 7).

IV. Numerical Experiments
We now assess the accuracy and performance of the u-series. In Section IVA we calculate errors in the energy and pressure obtained using u-series C 1 and C 2 constructions for 100 configurations obtained from the simulation of a lipid bilayer. We compare these errors to those obtained for the same set of configurations using an Ewald decomposition. This allows us to establish parameters that give rise to similar accuracy between the two decomposition methods, and hence to estimate their relative performance. Then, in Section IVB, we perform a series of simulations using different u-series parameters, and calculate observables that are known to be sensitive to the electrostatics. These results help us recommend u-series parameters that are useful in practice for MD simulations.
The lipid bilayer system we simulated was composed of 72 dipalmitoylphosphatidylcholine (DPPC) lipids solvated by 2189 molecules of water. The simulations were performed at a temperature of 323 K, a pressure of 1 atm, and a surface tension of 0 dynes cm −1 . Under these conditions, the bilayer is in its fluid state, 29 and the aspect of the box fluctuates by a few percentage points. These large fluctuations couple to the highly non-homogeneous distribution of charges in the direction perpendicular to the bilayer, making DPPC a sensitive test for electrostatics. Indeed, simple truncation of Coulomb interactions is known to produce major artifacts in DPPC, 30 and, although smoothly tapering 1 / r to zero before truncation can alleviate some of the problems, there can still be large discrepancies in properties such as the electrostatic potential difference between the center of the bilayer and the surrounding water. 31 The 100 simulation configurations used in section IVA were sampled at 1-ns intervals from a 100-ns simulation that used a very accurate parameterization of the Ewald decomposition. The simulations described in section IVB used the u-series. In all our simulations, the cutoff radius for van der Waals interactions was 12 Å.

A. Comparison to the Ewald decomposition
We start by focusing on errors that are intrinsic to the decomposition scheme: those that remain even if the direct and k-space sums are performed exactly.
In Fig. 8  To see how these errors in approximating 1 / r translate into errors in the computation of energies and pressures for configurations obtained from MD simulations, we used 100 configurations from a simulation of the bilayer system described above. We computed the energy and pressure for these configurations using different decomposition schemes and parameterizations, including a very accurate Ewald parameterization that yields essentially exact results. The absolute errors in the energies and pressures for the various parameterizations were then calculated as differences from these essentially exact results. In order to isolate errors due to the decomposition, a very accurate grid-based method with a very fine grid (128  128  128) was used to evaluate the far part, and all computations were carried out in double precision.
The results of our computations, which all used rc = 8 Å, are shown in Fig. 9. The red lines show the energy and pressure errors for the C 1 -continuous u-series for the same values of b = 2 and r0 = 1.989 used in Fig. 8, so that σ = 8 / 1.989 ≈ 4.02 Å. Such a value of σ leads to large absolute errors for the Ewald decomposition (not shown), but these errors can be decreased by decreasing σ. We find that we must decrease σ to approximately 2.50 Å for the Ewald decomposition errors to match the magnitude of the errors characteristic of the C 1 -continuous u-series with b = 2 (see Fig. 9). The value of σ for the u-series is 4.02 / 2.50 ≈ 1.61 times larger, which results in substantial computational savings, as we discuss below. If greater accuracy is required, then a u-series with a smaller base b is a good choice: The C 2 -continuous u-series with b ≈ 1.63 and σ = 2.91 Å shown in Fig. 9, for example, has an error that is one order of magnitude smaller than both Ewald and u-series with b = 2, σ = 4.02 Å, yet is still more computationally efficient than Ewald (σ = 2.91 Å > 2.50 Å).
So far we have focused on the errors intrinsic to the decomposition scheme by computing the far part very accurately. In practice, however, such accurate computations would be prohibitively expensive, so faster but more approximate computations would be performed leading to another source of error. We now redo the calculations leading to Fig. 9, but evaluate the far part using the level of accuracy that we typically use in MD simulations, and show the results in Fig. 10.
Specifically, we use the GSE 15 method, which was originally described in the context of the Ewald decomposition, but which is also straightforward to use to calculate the far contribution of the u-series decomposition. In GSE, charges are spread from particles to grid points within a certain radius; the tradeoff between speed and accuracy is principally controlled by the ratio of this charge-spreading cutoff radius to the grid spacing. We have found through experience that a 19 reasonable compromise between speed and accuracy is obtained when this ratio is 3.8, a value we have used extensively in simulations of a variety of systems, and which we also choose here.
Other GSE parameters follow from this choice: The width of the spreading Gaussian, which doesn't affect speed and so can be chosen to optimize accuracy, is σs = 0.83Δ, where Δ is the grid spacing; the GSE requirement ≥ √2 yields a maximum allowable grid spacing of Δ  σ / 1.17. Based on the results of this section and the arguments about computational work scaling in Section II, we now estimate the ratio of total work required to compute the electrostatic interaction with comparable error using u-series and Ewald decompositions under certain assumptions. Because they decay super-exponentially with r0 = rc / σ and only polynomially with rc for both Ewald and u-series, the errors are primarily functions of r0. In Section II, we argued that the minimum number of operations required to evaluate the direct and k-space sums are proportional to at least rc 3 and 1 / σ 3 , respectively; the total number of operations required is 20 for some positive constants Cn and Cf. Choosing rc to minimize Nops yields If polynomial interpolation is used to compute the interaction functions, then the overall computational cost is largely independent of the specific near/far functional forms (Eq. (6) for Ewald vs. Eqs. (19) and (20)  terms, and computational work that is strictly proportional to rc 3 and 1 / σ 3 , with equal constant factors. In practice, the relative performance of the two decompositions is dependent on the particular algorithms and computer architecture used.

B. Choosing u-series parameters for use in simulations
Although errors in the energy and pressure of DPPC are a useful way to judge the fidelity of electrostatics calculations, it is not clear, a priori, how these errors will impact the quality of the simulations themselves. To assess the accuracy required to avoid substantial simulation artifacts, we looked at two properties mentioned earlier as being sensitive to electrostatics: the surface area of the bilayer and the electrostatic potential difference between the water and the center of the membrane. We used as our baseline an Ewald simulation with a large cutoff radius (rc = 13.0 Å) to ensure high accuracy. As a threshold for acceptable accuracy, we required our observed mean properties to fall within the natural fluctuations-chosen here to be plus or minus one standard deviation-as estimated from the baseline run; this should help ensure that simulations using the electrostatic approximations still significantly overlap the target ensemble. Simulations were run for 2 to 4 μs each, with frames saved every 0.24 ns, and the first 120 ns of each simulation discarded from the analysis. Surface area was defined as the area of the simulation box parallel to the membrane (the x-y plane), and the electrostatic potential drop was estimated by first using the tool pmepot 32 to compute a smoothed estimate of the electrostatic potential on a grid, and then averaging grid values that share the same z coordinate. 31 All simulations used a 2.5 fs time step and a 64 × 64 × 64 grid to compute the far electrostatics. to our choice of rc, suggesting that our accuracy was sufficient. By comparison, simulations that use cutoff electrostatics can have errors in the mean potential drop of well over 100%, 31 and are likely too inaccurate for membrane systems. Since DPPC is the most sensitive realistic system we have tested with respect to electrostatic approximations, it seems reasonable to expect that even the less accurate u-series settings shown in Fig. 11 should be conservative enough for most simulation purposes. In fact, we have used fairly aggressive parameters (b = 2, rc / σ ≈ 1.989, rc = 8 Å) in other tests-reversible folding of villin headpiece and the temperature-dependent helicity of an Ala15 peptide, for example-and have achieved results consistent with simulations that use more accurate electrostatics. We also discuss parameter choice in the Summary and Conclusion section.

V. A Separable Real-Space Algorithm for Evaluating the Electrostatic Energy
In arriving at our main finding that the u-series requires roughly half the computation of the Ewald approach, we assumed that the far part of each decomposition would be calculated using the same algorithm. Since the u-series far part is well approximated by a sum of a small number of Gaussians, however, it can be computed by certain algorithms that the Ewald far part cannot, leading to an additional potential performance advantage for the u-series in certain situations. In this section, we sketch one such algorithm that may be advantageous on massively parallel machines.
On massively parallel supercomputers, the time required to evaluate the Ewald k-space sum is strongly affected by communication latency. 6,7 Since the far interaction is not range limited, any parallel algorithm requires at least one global operation with the property that data from each processor affects the result of the computation on every other processor. In practice, computing the far interaction with a single global operation is rarely achieved. Many MD packages use PME 11,12 to compute the k-space sum, in which a fast Fourier transform (FFT) is performed on the grid, followed by a multiplication in k-space, followed by an inverse Fourier transform. Each Fourier transform is a global operation, so PME requires two global operations.
The u-series k-space sum can alternatively be computed in real space with a single global operation by taking advantage of the dimensional separability of the Gaussian kernel: 23 First, we truncate the far part of the u-series at N terms, with N chosen to be large enough so that the truncation error is acceptably small (this will be quantified shortly). As long as a separable function S(r) = S0(x)S1(y)S2(z) such as a B-spline (PME) or Gaussian (GSE) is used to spread particle charges onto the grid, the k-space kernel for the resultant on-grid convolution Computing the k-space sum in real space as described requires greater total computation and bandwidth because separate sets of convolutions must be performed for each of the N terms, whereas FFT-based approaches operate only on the sum of the terms. This tradeoff of increased computation and bandwidth for reduced latency may be advantageous on massively parallel machines. Additionally, the computation and bandwidth can be reduced using other techniques if necessary. The on-grid Gaussian kernels corresponding to different terms have increasing widths and, consequently, are more suitably handled on increasingly coarser grids. The u-series can thus be implemented at reduced computational cost by computing its far part using the multilevel summation method of Skeel et al. [33][34][35] The practicality of the reduced-latency real-space algorithm depends critically on the value of N, because both bandwidth and computation scale in proportion to N. Fortunately, periodicity causes the Gaussians that are broader than the longest side of the unit cell to self-cancel, and we can show (Appendix F) that the relative error for the far part of the periodic energy when truncated to N terms is bounded by where L = max{Lx, Ly, Lz}, generally leading to small values of N. Consider, as an example, the

VI. Summary and Conclusion
The u-series decomposes 1 / r into a sum of near and far parts such that the resulting approximation has a number of desirable properties: It is smooth on the entire real axis, is exact up to a cutoff radius rc, and has uniformly bounded relative error for r > rc. These factors contribute to the u-series being more accurate than the Ewald decomposition for a given amount of computational effort. This in turn lends the u-series a computational advantage: In our experiments, roughly half as much work was required to achieve the same degree of accuracy as the Ewald decomposition.
The u-series is also constructed so that its far part is a sum of separable functions, and we have shown that, with periodic boundary conditions, only a small number of summation terms need to be evaluated in order to compute the far part to machine precision. This enables the use of minimal-latency algorithms wherein the three-dimensional convolutions with each of these Gaussians are computed in parallel in real space as a product of three one-dimensional convolutions. This algorithmic advantage of the u-series for massively parallel supercomputers will likely become more important over time as computational density and communication bandwidth continue to scale more rapidly than communication latency. Section V describes only one possible algorithm for exploiting the separability of the u-series (one that has been implemented in Anton 2), 36 leaving room for further algorithmic research.
Among the u-series variants that we have presented, the C 1 construction appears to be the most useful in practice for MD simulations of biophysical systems. Our experience has been that most simulations are well served by the C 1 construction with b = 2, which is sufficiently accurate even for electrostatic cutoff radii as low as rc = 8 Å. The parameters for the C 1 construction with b = 2 are given by the second row of Table 2, and represent our recommended u-series 26 parameterization for most MD simulations carried out near or above room temperature. When the accuracy of the electrostatics is the primary concern, the C 2 construction may be used; even the least accurate C 2 parameterization given in Table 1 (b = 1.629…) is more accurate than the Ewald decomposition with the same cutoff radius and has a comparable computational cost.
In this work, we have focused on the decomposition of the Coulomb potential 1 / r, but the bilateral series approximations in Section III can be used more generally for 1 / r  , as noted by Beylkin and Monzon. 21,22 This in turn leads to u-series decompositions for 1 / r  , which may prove useful, for example, for splitting the Lennard-Jones dispersion interaction 1 / r 6 . Similarly, although the present work focuses exclusively on decompositions based on Gaussian series approximations, Appendix A shows how the BSA can be constructed based on series of arbitrary well-behaved functions ϕ not limited to Gaussians, leading to alternative formulations. Although Gaussians have several desirable properties for our purposes, formulations based on alternative functions may be useful in other contexts.
The present work leaves open some mathematical questions about the u-series that would benefit from further investigation. We have shown that the pointwise convergence of the u-series is exponentially fast as b approaches 1 at constant σ. We have not proven, however, that this convergence for the Coulomb kernel implies similarly fast convergence for the total energy.
While pointwise convergence typically entails convergence of integrals, a more precise statement and formal proof of the convergence of the total energy would be desirable. Additionally, for constant b, the u-series converges to 1 / r in the limit rc → ∞. The numerical evidence presented in Appendix E suggests that the total electrostatic energy converges polynomially fast, with a degree one larger than the degree of smoothness of the u-series, but we do not have a proof for this conjecture. Mathematical investigations beyond the scope of this paper are thus warranted. Retaining only the dominant terms in Eq. (B1), we infer in the limit Δ → 0. Adding the equal contribution of | Γ(α / 2  iπ / Δ) | and dividing by Γ(α / 2) produces the asymptotic bound This generalizes the bound given for  = 1 in Eq. (14).
Numerical investigations for b ≤ 2 have shown that this bound is a reliable estimate of the relative error maximum, meaning that it is either an upper bound (for α > 3/2), or within 1% of 30 the exact result (for α ≤ 3/2). The excellent agreement for small values of α is illustrated in Fig.   1 for Coulomb potentials. The maximum relative error is about 2 × 10 −3 for b = 2 and less than 2 × 10 −6 for b = 2 1/2 .

Appendix C. The BSA in Fourier Space
For the Coulomb kernel considered in the body of the paper, the expression for the BSA in kspace is given by where denotes Fourier transform. Eq. (C1) closely resembles the BSA itself in Eq. (11); we can generalize this result to arbitrary scaling functions ϕ(r) and potentials of the form 1 / r α .
Observe first that the three-dimensional Fourier transform of the scaling function ϕ(r) is spherically symmetric in k-space. As a reminder of this symmetry, we shall denote the Fourier transform by ( ), even though the proper notation is ( ). The Clearly, the expression in the square brackets is a BSA in k converging to 1 / k 3−α as k  , with an error bound that decays exponentially in b. When ϕ is a Gaussian of width σ, is a Gaussian of width 1 / σ, and setting α = 1 leads to Eq. (C1). More generally, we show that where For this to hold, it suffices to make the following assumptions about ϕ(r). First, ϕ(r) decays at infinity at least as fast as r −3 , together with its first two derivatives. Second, ϕ(r) has secondorder derivatives of bounded variation. Note that a Gaussian satisfies these assumptions. As can The integral of the relative error against the positive measure dr / r could not vanish if the relative error did not change sign on the interval [1, b). There thus exists at least one root in the interval [1, b), and the root structure illustrated in Fig. 1 is established as a universal property of the bilateral series.
Appendix E. Impact of rc on u-series vs. Ewald In Section IV we saw numerical evidence that reducing σ by a factor of 1.61 (Fig. 8) produces an Ewald decomposition similar in accuracy to the C 1 -continuous u-series with b = 2 and rc = 8 Å.
For larger values of rc, the factor 1.61 needs to be increased. Fig. 12 plots the root-mean-square deviations (RMSDs) of the errors as a function of the cutoff rc, subject to the same ratios rc / σ used for Fig. 9. The errors are calculated for the same 100 lipid bilayer configurations used in Section IV. As rc increases, the energy fluctuations for the different methods decay to 0 at rates that are related to their smoothness. Fig. 12 shows that the RMSDs for the errors in the electrostatic energy decay as fast as 1 / rc for the Ewald decomposition with constant rc / σ (which is C 0 continuous), 1 / rc 2 for the C 1 -continuous u-series with b = 2, and 1 / rc 3 for the C 2continuous u-series with b ≈ 1.63. This figure supports the general postulate that the errors for a C d -continuous u-series decrease as fast as 1 / rc d+1 .
Based on the results of the preceding paragraph, it is tempting to consider smoother u-series than those provided by the C 1 construction. For b = 2, the C 2 -continuous u-series obtained from Eq. (28) has r0 ≈ 2.394, a considerable increase from the corresponding values of 1.989 and 1.843 for the C 1 -and C 0 -continuous u-series of same base b (see Table 1). In Fig. 13, we compare this C 2 -continuous u-series to the C 1 -continuous u-series having the same r0 = rc / σ (that is, the same computational effort) rather than the same base b. The C 1 -continuous u-series appears to be about as good as the C 2 -continuous u-series for moderate values of rc and better for smaller values. It will eventually be overtaken by the C 2 -continuous u-series in the limit of large rc, but the values of rc for which this happens are so large that they are impractical. We believe this result suggests that the C 1 construction is sufficiently versatile to accommodate all needs arising in the practice of MD simulations, and we choose not to explore smoother approximations any further.   points where the BSA far part equals 1 / r are marked with dots, and the first of these (which is equal to r0 since  = 1) is selected as the cutoff radius rc. The u-series is defined to coincide with the BSA far part for r  rc, so their relative errors also coincide for r ≥ rc. To the left of rc, the useries is constructed to be exactly 1 / r; its relative error on this domain is thus 0.    configurations of a DPPC system. The configurations were taken at 1-ns intervals from a single 100-ns simulation. All calculations used the cutoff radius rc = 8 Å for the near electrostatics. A grid-based method with a very fine grid was used to evaluate the far part, so as not to introduce additional errors beyond those inherent to the decomposition scheme chosen. In red are the results of the C 1 -continuous u-series with b = 2 and σ = 4.02 Å (the same ratio of rc / σ as used in Fig. 8). In black are the results of the Ewald decomposition with σ = 2.50 Å. The errors of the two decompositions are comparable in magnitude, despite the smaller value of σ used for the Ewald decomposition. Also shown, in blue, are results of a simulation using the u-series with b ≈ 1.63, corresponding to the first C 2 -continuous solution from Table 1; even though this simulation uses σ = 2.91 Å, a larger value than in the simulations using the Ewald decomposition, it produces substantially more accurate results. 50 Figure 10. Errors in the energy and pressure for the same configurations of DPPC and same decomposition schemes as shown in Fig. 9, but with the far part evaluated using approximate methods typical of current simulation practice. Grid sizes used are given in parentheses in the legend. The larger value of σ provided by the u-series allows coarser grids to be used. For

Figures and Tables
Ewald and u-series with b = 2, the contribution to the error from the k-space sum is small, and the plots mirror those of Fig. 9. For the more accurate u-series with b ≈ 1.63, the error from the k-space sum contributes more substantially to the total error.   Table 2. Parameters for the C 0 , C 1 , and C 2 constructions with b = 2 (Mb = 2.289  10 −3 ). Refer to Eqs. (19), (24), and (28), respectively. The second row (C 1 construction with b = 2) is our recommended parameterization for most MD simulations.