Low rank factorization of the Coulomb integrals for periodic coupled cluster theory

We study the decomposition of the Coulomb integrals of periodic systems into a tensor contraction of six matrices of which only two are distinct. We find that the Coulomb integrals can be well approximated in this form already with small matrices compared to the number of real space grid points. The cost of computing the matrices scales as O(N^4) using a regularized form of the alternating least squares algorithm. The studied factorization of the Coulomb integrals can be exploited to reduce the scaling of the computational cost of expensive tensor contractions appearing in the amplitude equations of coupled cluster methods with respect to system size. We apply the developed methodologies to calculate the adsorption energy of a single water molecule on a hexagonal boron nitride monolayer in a plane wave basis set and periodic boundary conditions.


I. INTRODUCTION
The high dimensionality of the many electron wave function is one of the most limiting factors in applying highly accurate electronic structure theories to the solution of the many electron Schrödinger equation for real materials on an ab initio level 1 . Many of the most widely used wave function based theories have a good balance between accuracy, computational cost and the number of parameters used to approximate the exact wave function. The efficiency is strongly affected by the computational complexity required to evaluate resultant expectation values.
Tensor rank decompositions (TRD) and low rank tensor approximations are ubiquitous in the field of electronic structure theory calculations. These techniques are essential to reduce the computational cost and memory footprints to calculate and store the approximate many electron wave function. Already the simplest level of approximation, Hartree-Fock (HF) theory can be regarded as a low rank approximation, employing an antisymmetrized outer product of one electron orbitals to approximate the full wave function. This low rank tensor approximation is identical to a single Slater determinant. However, HF theory neglects electronic correlation effects. Electronic correlation effects can be captured by extending the wave function basis with additional determinants. For this purpose, excited HF determinants can be employed. They are constructed by replacing occupied orbitals with unoccupied orbitals, forming a complete and orthogonal basis. Computationally, the basis of (excited) Slater determinants is very convenient. It introduces a large degree of sparsity to the full many electron Hamiltonian and simplifies the solution of the many a) Electronic mail: f.hummel@fkf.mpg.de b) Electronic mail: a.grueneis@fkf.mpg.de electron problem. Most entries in the sparse many electron Hamiltonian can be calculated directly from electron repulsion integrals. The memory footprint for the storage of these integrals in a canonical basis is very large and grows rapidly with respect to the number of orbitals. Therefore it is often necessary to calculate these integrals in a computationally efficient on the fly manner. The most widely used schemes for the calculation of electron repulsion integrals include 2-6 : (i) prior calculation of the integrals in the employed atomic orbital basis and its subsequent transformation into a molecular orbital basis, and (ii) employing the resolution of identity approach. Computationally the resolution of identity approach is more efficient because it requires the calculation and storage of intermediate quantities with at most three indices. In passing we note that the expression of the integrals in terms of these intermediate quantities allows for rearranging nested summations in ring coupled cluster theories such that the scaling of the computational cost with respect to the system size can be reduced 7 .
Coupled cluster theory can also be viewed as a low rank tensor approximation to the exact configuration interaction wave function coefficients in the Slater determinant basis. The exponential ansatz used in coupled cluster theories effectively approximates the coefficients of highly excited determinants by outer products of cluster amplitudes with a lower rank. However, increasingly accurate levels of coupled cluster theories lead to increasingly steep polynomial scalings of the computational cost and memory with respect to the studied system sizes. In this work we seek to reduce the computational cost of coupled cluster theories without introducing additional approximations on the level of the employed wave function. This can be achieved by employing low rank tensor approximation techniques for the decomposition of the two electron integrals and the corresponding intermediate quantities obtained from the resolution of identity approach. Using the low rank decomposition, the nested summations in the amplitude equations of distin-guishable cluster singles and doubles theory (DCSD) 8-10 , as well as of linearized coupled cluster singles and doubles theory can be rearranged such that the scaling of the computational cost is reduced from O(N 6 ) to O(N 5 ) without any further approximations. Additionally, we highlight that the low rank factorization of the Coulomb integrals also allows for reducing the scaling of the computationally most expensive terms in coupled cluster singles and doubles (CCSD) theory using a plane wave basis set.
We note that the methods outlined in this work share many similarities with other approaches that aim at the reduction of the computational cost in correlated wave function based theories. In particular we want to point out that the tensor hypercontraction (THC) technique introduced by Hohenstein et al. in Refs. 11-13 also performs a low rank tensor decomposition of the Coulomb integrals. Furthermore, a similar approach to the tensor rank decomposition method introduced in this work was discussed by Shenvi et al. in Ref. 14. In the work of Benedikt et al. it was shown that tensor rank decomposition techniques can even be applied to the decomposition of the coupled cluster amplitudes directly 15 . However, in contrast to the methods mentioned above we introduce an efficient numerical procedure to achieve the low rank tensor decomposition of the Coulomb integrals in periodic systems without the necessity of defining an a priori real space grid as it is the case for THC methods.

A. Structure of this work
The factorization of the Coulomb integrals tensor is obtained in two steps. In Section II we first discuss how the Coulomb integrals can be decomposed into a contraction of two third order tensors: V pq sr ≈ Γ * pF s Γ q rF , where we refer to Γ q rF as optimized Coulomb vertex. Subsequently, we perform a Tensor Rank Decomposition (TRD) of the optimized Coulomb vertex into a contraction of three matrices: Γ q rF ≈ Λ R F Π * q R Π R r . Section III describes the employed algorithms to compute this factorization.
Section IV outlines how this factorization can be employed by quantum chemistry methods and in Section V we study the application of the discussed approximations to different systems. Subsection V A focuses on the convergence of the TRD for total energies of the LiH solid, while we compute coupled cluster adsorption energies of water on the surface of a single BN sheet in Subsection V B.

B. Notation
We imply a sum over all free indices occurring at least twice within a product but nowhere else. We will use the letters i, j, k, l to label occupied spin orbitals, a, b, c, d to label virtual spin orbitals and p, q, r, s to label general spin orbitals. The letters R, S, T, U are used to denote el-ements of the rank decomposition. The conjugate transpose of a tensor such as A q r is denoted by A * r q where lower and upper indices are swapped. Sequence numbers in iterations are given in superscript within parentheses, as in A (n) . The Frobenius norm of a tensor A is denoted by A . Examples are:

II. OPTIMIZED AUXILIARY FIELD APPROXIMATION
In this section we discuss how to approximate the Coulomb integrals, a tensor of fourth order, by a contraction of two considerably smaller tensors of third order: V pq sr ≈ Γ * pF s Γ q rF , without actually calculating the entire tensor V pq sr . Given the spin orbitals ψ q (x) from a Hartree-Fock (HF) or density functional theory (DFT) calculation, the (nonantisymmetrized) Coulomb integrals are defined by with x = (σ, r) and dx = σ dr. Owing to the translational invariance of the Coulomb kernel, we can separate the Coulomb integrals as follows where the Coulomb vertex Γ q r (G) is given by We let its discretization beΓ q rG = √ w G Γ q r (G G ) with the momentum grid points G G and the numerical integration weights w G such that The Coulomb vertex can be computed from the spin orbitals in O(N 2 p N G log N r ) time employing a Fast Fourier Transform (FFT), where N p , N G and N r denote the number of spin orbitals, momentum grid points and real space grid points, respectively. We note that the orbital overlap charge density ψ * q (x)ψ r (x) is approximated in the projector augmented wave method using Eq. (2.87) of Ref. 16 as implemented in the Vienna ab initio simulation package (VASP) [17][18][19] .
In general, any two body operator can be split into a product of two single body operators coupled by an auxiliary field 20 . In the case of the Coulomb interaction the auxiliary field has only one variable due to translational invariance, which is the momentum G mediated by the interaction. AlthoughΓ q rG is in practice already a third order tensor, the number of momentum grid points N G of the HF or DFT calculation is usually too large to continue with the tensor rank decomposition of the Coulomb vertex directly. A large set of momenta will have a small but nonnegligible contribution to the correlation energy and we seek a more compact set of auxiliary field variables with fewer relevant elements. Let be a singular value decomposition of the Coulomb ver-texΓ I G , written as a matrix with the compound index I = (q, r), where the singular values in Σ are sorted in descending order. Eq. (8) is also shown in form of a wiring diagram of the involved tensor contractions on the right. Taking only the largest N F < N G singular values of the unapproximated Coulomb vertexΓ into account we can define the optimized auxiliary field (OAF) Coulomb vertex Note that we write Γ without a tilde for the approximated Coulomb vertex, in contrast to usual convention, simply because we will not use the unapproximated vertex in any subsequent step.
We are only interested in the left singular vectors U F G associated to the largest singular values so we contract Eq. (8) from the right withΓ * G transforming a singular value problem of a large N G ×N 2 p matrix into an eigenvalue problem of a comparatively small N G × N G hermitian matrix. The eigenvalues of E are the squares of the singular values ofΓ I G , and the left eigenvectors of E associated to the largest eigenvalues are also the left singular vectors ofΓ we need in order to transform the Coulomb vertexΓ I G into the optimized auxiliary field Coulomb vertex Γ I F according to Eq. (9). Note that this approach becomes numerically problematic for very small singular values since one only has access to their squares. However, we find that all N F largest singular values needed for an accurate approximation of the Coulomb vertex are sufficiently large.
Inserting the singular value decomposition of the Coulomb vertex with sorted singular values from Eq. (8) into the discretized definition of the Coulomb integrals given in Eq. (7) yields a singular value decomposition of the Coulomb integrals, also with sorted singular values: sr are now written in matrix form V I J with I = (q, r) and J = (s, p). Thus, using the optimized auxiliary field Coulomb vertex Γ I F instead of the full Coulomb vertexΓ I G best approximates the Coulomb integrals with respect to the Frobenius norm of the difference:

III. DECOMPOSITION OF THE COULOMB VERTEX
The form of the Coulomb vertex in real space on the right hand side in Eq. (6) suggests that the optimized Coulomb vertex Γ q rF can be decomposed in an analogous manner into a product of three tensors of second order, denoted and depicted as follows We let N R denote the number of vertex indices R and refer to it as the rank of the Coulomb vertex for a given quality of the approximation. We call the matrices Π R r and Λ R F factor orbitals and Coulomb factors of the Coulomb vertex, respectively.
The decomposition is invariant under scaling of the Coulomb factors Λ R F with any real scalar a R > 0 while scaling the factor orbitals Π R r with a complex scalar c R with |c R | = 1/ √ a R for each value of R. One can also choose an alternative ansatz to Eq. (13) for approximating the Coulomb vertex Γ q rF which does not involve the conjugation of the factor orbitals on the outgoing index q. This ansatz reads The above decomposition is invariant under scaling of the Coulomb factors Λ R F with any complex scalar c R = 0 while scaling the factor orbitals Π R r with ±1/ √ c R for each value of R. This is in contrast to the symmetries of the ansatz according to Eq. (13). In the Alternating Least Square (ALS) approximation scheme, which we employ for fitting the factor orbitals Π and the Coulomb factors Λ, it is preferable to have the symmetries of Eq. (14) since fixing one of the two factors removes all continuous symmetries from the other. This accelerates convergence and allows a smaller rank N R in practice. A downside of this ansatz is that one loses the simple notion of particle and hole propagators as given in Eq. (34) and Eq. (35), respectively. The propagators can, for instance, be employed to calculate second order Møller-Plesset theory (MP2) correlation energies in O(N 4 ), as outlined in Subsection IV A. Unless otherwise stated we use the ansatz of Eq. (14) for the applications presented in Section V but we will continue to discuss the more widely applicable ansatz of Eq. (13). If one chooses N R = N r , where N r is the number of real space grid points, the validity of the ansatz follows directly from Eq. (6). We want to investigate how low N R can be chosen compared to N r for a sufficiently faithful decomposition having an error below 1% in the energies calculated from the factor matrices. For reference we use the error in the MP2 energy assuming that other terms, occurring for instance in the coupled cluster amplitude equations, exhibit a similar behavior. This accuracy is assumed sufficient since in practice only those terms will be calculated from the factor matrices that pose either computational or memory bottlenecks. Furthermore, we want to show that the ratio N R /N r for a sufficiently faithful decomposition is independent of the system size if the system is not too small.

A. Canonical polyadic decomposition algorithms
A factorization of a tensor according to the ansatz of Eq. (13) or (14) is referred to as canonical polyadic decomposition 21 (CPD). For a given rank N R , the factor orbitals Π R r and the Coulomb factors Λ R F can be fit by minimizing the square of the Frobenius norm of the difference The above optimization problem is high dimensional and nonquadratic. Conjugate gradient algorithms or other local algorithms may require thousands of steps until sufficiently converged. Global optimization algorithms try to tackle the problem by keeping a subset of the variables fixed and optimizing only the remaining variables. In the case of the alternating least squares 22 (ALS) algorithm the optimization is done in turn over each matrix, while in the case of the cyclic coordinate descent 23 (CCD) algorithm the optimization is done in turn over each value of the index R. We have studied the performance of a regularized version of the ALS here.

B. Alternating least squares
In the case of three distinct factors T ijk ≈ A iR B jR C kR two of them can be regarded fixed leaving a least squares problem for finding the optimal third factor. Each matrix is optimized in alternating order leading to the alternating least squares (ALS) algorithm which has to be solved iteratively until convergence, starting with random matrices A (0) , B (0) and C (0) . Each least squares problem has a unique solution, which can be written explicitly. For Eq. (16) it is for instance given by omitting the iteration specification on B and C for brevity. G + denotes the Moore-Penrose pseudoinverse 24,25 of the Gramian matrix G. For Eq. (16) the Gramian matrix is given by The expressions for the other matrices can be written in an analogous manner. When applying the ALS algorithm to decompose the Coulomb vertex the computationally most demanding steps are the calculation of the pseudo inverse G + scaling as O(N 3 R ), as well as the contraction of T ijk with either factor B * jS or C * kS in Eq. (19), depending on which is larger, scaling as O(N 2 p N F N R ).

C. Regularized alternating least squares
Although the ALS algorithm guarantees an improvement of the fit quality in each iteration the convergence can be very slow, especially when there are multiple minima for a factor A in different regions having all similar minimal values. In that case the best choice for A may vary strongly from iteration to iteration since updating the other factors B and C can change the order of the minima. This behavior is referred to as swamping 26 and it takes many iterations before the ALS algorithm converges to one region for each factor that globally minimizes the fit quality. Introducing a penalty on the distance to the previous iteration limits swamping and leads to the regularized ALS 27 (RALS) algorithm: The solution of each regularized least squares problem can again be given explicitly, here for instance for Eq. (21), and again omitting the iteration specification on B and C In the regularized case the Gramian G depends on the regularization parameter λ where δ RS denotes the Kronecker delta. The regularization parameter λ (n) A for finding A (n+1) in the nth iteration still remains to be determined. Too low values allow swamping to occur while too large values unnecessarily slow down the convergence. To estimate an efficient regularization parameter we assume that the fit quality AB (n) C (n) − T 2 in the term to minimize in Eq. (21) varies little from one iteration to the next. We also assume this for the local change of the fit quality with respect to each value in A. This allows us to relate the minimized term in the previous step n−1 to the minimized term in the step n for which we want to determine the regularization parameter: If A (n) and A (n+1) have similar norm we can also relate their relative step sizes s (n) where the relative step size in the nth iteration is given by We want the relative iteration step size s (n+1) A of the next iteration to be approximately as large as a chosen maximum value s 0 , which we refer to as swamping threshold. From that we define the estimated regularization parameter for the nth iteration λ (n) Using the above estimate directly results in a regularization which we find alternately too strong and too weak.
To ameliorate this we introduce a mixing of the estimated regularization parameterλ (n) A for the nth iteration, as above, with the regularization parameter λ (n−1) A of the previous iteration to obtain the regularization parameter λ (n) A employed for the nth iteration in the RALS: Regarding the choice of the swamping threshold s 0 and the mixing factor α we find that s 0 = 1.0 and α = 0.8 offers a good compromise allowing quick convergence while still preventing swamping for the systems studied so far.

D. Quadratically occurring factors
In the case of the Coulomb vertex the factor orbitals Π R r occur quadratically. For finding the next estimate Π (n+1) in the alternating least squares algorithm we use an iterative algorithm similar to the Babylonian square root algorithm. Each subiteration is given by with Π (n+1,0) := Π (n) and the mixing factor 0 < β < 1. Note that Π * is a fixed parameter rather than a fitted one and that the regularization parameter λ in the RALS algorithm, respectively. The number of subiterations M needs to be sufficiently large, such that Π (n+1) is at least an improved solution of the entire fit problem compared to Π (n) . A large number M of subiterations gives an estimate Π (n+1) that is close to the optimal choice of Π for a given Λ (n+1) . However, the cost of each subiteration are similar to the cost of the fit of Λ in the RALS algorithm and as a good choice for minimizing the overall computational cost we find β = 0.8 and M ≥ 2, but only as large such that the solution is an improvement. We point out that there are alternative methods for solving the quadratically occurring factors 28 .

IV. APPLICATION OF THE LOW RANK FACTORIZATION
The algorithms described so far yield an approximate factorization of the Coulomb integrals of the form where the factors Π and Λ are N p × N R and N F × N R matrices, respectively. We find that the rank of the decomposition N R is about an order of magnitude lower than the number of real space grid points of the original factors of the Coulomb integrals, being the orbitals ψ q (x). In Section V we study the convergence of the approximation in detail. In this section we discuss how this factorization can be applied to lower the scaling of the computational cost of wave function based methods such as second order Møller-Plesset (MP2) theory or coupled cluster theory.

A. MP2 from imaginary time propagators
The factorization of the Coulomb integrals permits evaluating the terms in the perturbation expansion by summing over all vertex indices R, S, T, . . . occurring in the term's diagram, contracting propagator matrices for each particle, hole and Coulomb line. For instance, the exchange term of second order Møller-Plesset (MP2) theory can be evaluated as follows: where the imaginary time dependent propagator matrices are given in terms of the decomposed factor orbitals Π R r and the Coulomb factors Λ R F : where µ is the Fermi level energy at zero temperature. The imaginary time integration can be done numerically on a small grid, for instance the minimax grid 29 , which allows an evaluation of the MP2 correlation energy in O(N 4 ) in principle. In practice, however, this approach outperforms canonical MP2 calculations only for very large systems. The propagator matrices G R S (±τ ) are analogous to the one body particle/hole Green's functions in real space and imaginary time G 0 (x, x ; ±τ ). The time independent matrix propagator V S R corresponds to the Coulomb kernel. The factorization of the one body energies ε a,i in imaginary time is equivalent to the Laplace transformed MP2 ansatz of Almlöf 30 . Note that the imaginary time dependent matrices defined above are actually not propagators in the sense that for all R, T and τ 1 , τ 2 > 0 for particles as well as τ 1 , τ 2 ≤ 0 for holes. They can only be used to directly connect vertices of the Coulomb interaction. If propagators in the above sense are required, one needs to employ a stricter ansatz for the factorization of the Coulomb vertex, namely where Π + denotes the Moore-Penrose pseudo inverse of Π. The convergence behavior of this ansatz remains, however, to be studied. One can also evaluate the perturbation terms stochastically, directly using the real space Green's functions G 0 (x, x ; ±τ ) rather than using the low rank propagators G R S (±τ ). This has been done for one dimensional solids 31,32 and for three dimensional solids 33 .

B. Reduced scaling coupled cluster theory
The most demanding step in the canonical CCD (DCD) method using a plane wave basis set is the calculation of the particle/particle ladder contribution T cd ij V ab can be exploited to break down the simultaneous contraction over the indices c and d into a sequence of contractions involving only at most one index, as can be seen from the wiring diagram of the involved tensors: The most expensive term in this sequence of contractions leads to a scaling of O(N 2 o N v N 2 R ) in time, without exceeding the memory complexity of the coupled cluster amplitudes. As will be demonstrated in Section V, we find N R to be proportional to the system size N , resulting in an O(N 5 ) scaling behavior in time of the particle/particle ladder contribution. Furthermore, the DCD amplitudes equations can be solely reformulated in an O(N 5 ) implementation with the use of the Coulomb vertex and its decomposed approximation, due to the absence of exchange terms between different clusters. Likewise, the most expensive term in CCSD (DCSD) amplitude equations includes the singles contribution to the Similarly, these terms can be evaluated via the factor orbitals and the Coulomb factors in an O(N 5 ) scaling in time and in the DCSD approximation no term exceeds this scaling behavior.

A. Total energies of the LiH solid
We first seek to discuss the convergence of the low rank factorization with respect to the number of iterations, N R and the system size. To this end we study different supercell sizes constructed from two atomic LiH crystal unit cells including 2×2×2, 3×2×2, 3×3×2, 4×3×2 and 3×3×3, corresponding to 16,24,36,48 and 54 atoms, respectively. In this subsection we only employ MP2 theory to investigate the behavior of the correlation energy calculated from the factorized Coulomb integrals. This study focuses on the decomposition of the Coulomb vertex, neither employing the optimized auxiliary field nor the pseudized Gaussian type virtual orbitals 34 technique. The resulting Coulomb vertices that need to be fit are large, such that only the particle/hole part Γ a iF is used for this study. The kinetic energy cutoff defining N F was set to 200 eV. The Li 2s 1 and H 1s 1 states have been treated as valence states. Fig. 1 shows the relative error of the MP2 correlation energy retrieved as a function of the number of iterations. The relative error is computed from MP2 energies that employ integrals that have been calculated with and without the low rank tensor factorization. Fig. 1 reveals that the rate of convergence for the relative error of the MP2 energy is very similar in all different system sizes. We note that the smallest supercell containing 16 atoms only exhibits a slightly faster convergence. From these results we conclude that the required number of iterations in the tensor factorization algorithm is system size independent for intensive properties. Furthermore we find that 100 iterations are sufficient to achieve a relative accuracy of 1% with a rank that corresponds to 2N F . The right side of Fig. 1 explores the convergence of the relative error in the MP2 correlation energy error retrieved as a function of N R /N F . Note that N F corresponds to the number of plane wave vectors and scales linearly with respect to system size. This plot shows that the required rank N R needed to achieve a certain relative level of accuracy also scales linearly with respect to the system size. Furthermore we find that systematically improvable exponential convergence can be achieved for this system and property by increasing N R /N F .
The computational cost for obtaining the TRD of the Coulomb vertex with N o = 27, N v = 8469, N F = N G = 1830 and N R = 1830 is roughly 1000 CPU hours. Therefore the computational cost of the TRD exceeds the computational cost of a full MP2 calculation which is roughly 10 CPU hours in the present case despite the fact that the TRD formally scales more favorably with system size. However, we note that the present TRD algorithm is sufficiently efficient to reduce the total computational cost of coupled cluster theory calculations as discussed in the following.

B. Molecular adsorption of water on hexagonal boron nitride
We now turn to the application of the newly developed methodologies to some more challenging problems. We calculate the interaction between a water molecule and a hexagonal boron nitride (hBN) monolayer. We employ periodic coupled cluster doubles (CCD) and examine to what extent the TRD and the optimized auxiliary field approximations are accurate and efficient. We used the structures obtained by Al-Hamdani et. al. 35 , whereby the molecule is oriented on top of an N site and the geometry has been optimized using the optB86b-vdW functional. The water-N distance was set to 3.2 Å. The hBN monolayer is modeled by 32 atoms in the periodic cell and the distance between two BN sheets was set to 16 Å. After checking convergence, we employed a 500 eV kinetic energy cutoff for the one particle orbitals along with Γ point sampling of the Brillouin zone. The B 2s 2 2p 1 , N 2s 2 2p 3 , O 2s 2 2p 4 , and H 1s 1 states have been treated as valence states. Occupied HF states were converged within the full plane wave basis, whereas the virtual orbitals were constructed using Dunning's contracted aug-cc-pVDZ (AVDZ) and aug-cc-pVTZ (AVTZ) 36,37 pseudized Gaussians in a plane wave representation, orthogonalized to the occupied space 34 . Pseudized Gaussians have proven to work reliably and efficiently for surface studies on periodic systems 34 . Counterpoise corrections to the basis set superposition error (BSSE) were included in all correlated calculations. The adsorption energy is defined as the difference in energy between the noninteracting fragments and the interacting system We note that in Ref. 35 the adsorption energy has been calculated as the difference between the total energy of water and hBN at the largest possible oxygen-surface distance of 8 Å and the total energy of water and hBN at the adsorption oxygen-surface distance. Initially, we investigate the convergence of the adsorption energy with respect to the number of momentum grid points N G , employed to evaluate the Coulomb ver-texΓ q rG according to Eq. (7). The selection of the plane waves vectors G is determined by a kinetic energy cutoff E χ such that For this purpose we utilize the pseudized AVDZ basis set for the virtual orbitals. The current system consists of  We conclude that a cutoff energy of 200 eV is sufficient to converge the adsorption energy to within 2-3 meV. We then investigate how accurately can the optimized auxiliary field approximate the Coulomb vertexΓ q rG for the current system at the level of CCD theory. First we obtain the optimized Coulomb vertex Γ q rF = U * G FΓ q rG , where U F G consists of the left singular values ofΓ q rG associated to the N F largest singular values, according to Eq. (8). The optimized Coulomb vertex is expected to be efficient since most of the space in the simulation cell is vacant, and the plane wave auxiliary basis contains redundant information. Different number of field variables N F were employed to approximate the plane wave vectors N G for the various cutoff energies. The behavior of the adsorption energy with respect to the number of field variables is shown in Fig. 2. The rapid convergence of the energy with increasing number of field variables owes to the locality of the molecular orbitals in the supercell. The adsorption energy obtained with a cutoff of 200 eV can be calculated within 0.5 meV accuracy using N F = 1450 field variables to approximate N G = 4504 plane wave vectors.
Since we can conclude that the adsorption energy can be computed within approximately 3 meV using a cutoff energy of 200 eV for the auxiliary plane wave basis and N F = 1450 field variables to construct the optimized Coulomb vertex, we chose these settings to assess the accuracy of the low rank factorization of the Coulomb integrals. We used different number of vertex indices R to compute the factor orbitals and Coulomb factors of the Coulomb vertex, following Eq. (13). We approximate the particle/particle ladder contribution T cd ij V ab cd in the am-plitude equation of CCD via the factor orbitals and the Coulomb factors as shown in Eq. (39). The adsorption energy versus the number of the decomposed vertex indices N R is shown in Fig. 3. The energy does not converge monotonically as in the case of the optimized auxiliary field approximation. This is due to the nonlinear nature of the canonical polyadic decomposition of the Coulomb vertex and the random initial choice for its factors. Nevertheless we observe a converged behavior with increasing N R . Furthermore, a value of N R = 4350 = 3N F is sufficient to yield an adsorption energy within 1 meV accuracy. This suggests that the TRD of the Coulomb vertex is a controllable approximation that can yield increasingly accurate results with increasing decomposition rank N R . In order to further validate the accuracy of the TRD method we show the convergence of the absolute energy of the interacting system with respect to the decomposition rank N R in the inset of Fig. 3. We observe an exponential convergence of the total energy. An accuracy better than 0.1% is achieved already with N R = 2N F . Nevertheless we stress that the corresponding accuracy in the adsorption energy is a result of an error cancellation of one to two orders of magnitude.
Having assessed the accuracy of the TRD we now calculate the adsorption energy of the water molecule on hBN using the AVTZ pseudized Gaussian basis set. The evaluation involves the decomposition of a Coulomb vertex with N o = 68, N v = 1564, N F = 0.33N G = 1450, and N R = 3N F = 4350. The computational cost to obtain the decomposed matrices is roughly 3000 CPU hours with 256 iterations. The results of the adsorption energy are shown in Table I. In order to grasp a physical insight of the system, we compare the CCD results with RPA+SOSEX and MP2 calculations 40 .MP2 theory usually overestimates dispersion driven interactions, although in the description of BN bilayer interaction is fortuitously accurate 41 . Consequently, one expects MP2 theory to slightly overestimate the adsorption energy, whereas RPA+SOSEX is likely to yield a very accurate estimate. It is not surprising that CCD  underbinds the water molecule, since there exist findings that indicate the inability of CCD for an accurate description. Higher levels of theories, such as inclusion of the single excitations and the perturbative triples, are required for a more appropriate treatment. Nevertheless, the purpose of the current work is to examine the accuracy and efficiency of the newly developed methodologies rather than the accuracy of the method itself. The CPU hours required for the CCD calculations obtained with and without the TRD technique are summarized in Table II. The time for the evaluation of the particle/particle ladder term per iteration is as much as 43 times faster using a decomposition with N R = 2N F and 22 times faster with N R = 3N F . This constitutes a significant gain in the computational effort of coupled cluster methods with only slight compromise in accuracy.

VI. CONCLUSIONS
In this work we have outlined an algorithm to obtain a low rank tensor approximation of the Coulomb integrals having the same algebraic structure as its definition from the molecular orbitals: The factorization is obtained by fitting Λ S F Π * q S Π S r to auxiliary three index quantities referred to as Coulomb vertices that are calculated from a resolution of identity approach using a plane wave basis set. In this manner the scaling of the computational cost for obtaining the low rank tensor approximation with respect to system size does not exceed O(N 4 ). To reduce the prefactor of the computational cost further we have outlined an approach to further compactify the representation of the Coulomb vertices. We linearly transform the momentum index of the Coulomb vertices into a (truncated) basis referred to as an optimized auxiliary field. The accuracy of this truncation is systematically improvable using a single parameter that is used for the truncation of a singular value decomposition. The tensor factorization of the transformed Coulomb vertices is achieved using a regularized alternating least squares algorithm that converges rapidly using about 10 2 iterations only. In contrast, the nonregularized alternating least squares algorithm would require 10 5 -10 6 iterations. We stress that we employ no prior assumptions for the real space grids used for expanding the low order tensors.
Once obtained, the tensor factorization of the Coulomb integrals can be employed to reduce the scaling of the computational cost of distinguishable coupled cluster theory to O(N 5 ) without further approximations. We demonstrate that the factorization can also be used to reduce the computational cost for evaluating the computationally most expensive term (particle/particle ladder diagram) in the coupled cluster doubles amplitude equations for the case of water adsorption on the hBN monolayer system. For system sizes containing 136 electrons in 1632 orbitals we achieve substantial reductions in the computational cost that are on the order of a factor 10-20 without compromising the accuracy and introducing any further approximation.
Future work will focus on combining the outlined techniques with explicitly correlated methods and finite size corrections in order to significantly expand the scope of periodic coupled cluster theory calculations using plane wave basis sets for solid state systems 42,43 .