CP2K: An electronic structure and molecular dynamics software package - Quickstep: Efficient and accurate electronic structure calculations

CP2K is an open source electronic structure and molecular dynamics software package to perform atomistic simulations of solid-state, liquid, molecular, and biological systems. It is especially aimed at massively parallel and linear-scaling electronic structure methods and state-of-the-art ab initio molecular dynamics simulations. Excellent performance for electronic structure calculations is achieved using novel algorithms implemented for modern high-performance computing systems. This review revisits the main capabilities of CP2K to perform efficient and accurate electronic structure simulations. The emphasis is put on density functional theory and multiple post–Hartree–Fock methods using the Gaussian and plane wave approach and its augmented all-electron extension.


I. INTRODUCTION
The geometric increase in the performance of computers over the last few decades, together with advances in theoretical methods and applied mathematics, has established * matthias.krack@psi.ch computational science as an indispensable technique in chemistry, physics, life and materials sciences. In fact, computer simulations have been very successful in explaining a large variety of new scientific phenomena, interpret experimental measurements, predict materials properties and even rationally design new systems. Therefore, conducting experiments in silico permits to investigate arXiv:2003.03868v2 [physics.chem-ph] 11 Mar 2020 systems atomistically that otherwise would be too difficult, expensive or simply impossible to perform. However, the by far most rewarding outcome of such simulations is the invaluable insights they provide into the atomistic behavior and the dynamics. Therefore, electronic structure theory based ab-initio molecular dynamics (AIMD) can be sought of as a computational microscope [1][2][3].
The open source electronic structure and molecular dynamics (MD) software package CP2K aims at providing a broad range of computational methods and simulation approaches, suitable for extended condensed-phase systems. The latter is made possible by combining efficient algorithms with excellent parallel scalability to exploit modern high-performance computing architectures. However, beside conducting efficient large-scale AIMD simulations, CP2K provides a much broader range of capabilities, which includes the possibility of choosing the most adequate approach for a given problem and the flexibility of combining computational methods.
The remaining of the manuscript is organized as follows. The Gaussian and plane wave (GPW) approach to density functional theory (DFT) is reviewed in section II, before Hartree-Fock and beyond Hartree-Fock methods are covered in sections III and IV, respectively. Thereafter, density functional perturbation theory (DFPT) and timedependent DFT (TD-DFT) are described in sections V and VI. Sections VII and XII are devoted to low-scaling eigenvalue solver based on sparse matrix linear algebra using the DBCSR library. Conventional orthogonal localized orbitals, non-orthogonal localized orbitals (NLMO), absolutely localized molecular orbitals (ALMO) and compact localized molecular orbtials (CLMO) are discussed in section VIII to facilitate linear-scaling AIMD, whose key concepts are detailed in section IX. Energy decomposition and spectroscopic analysis methods are presented in section X, followed by various embedding techniques, which are summarized in section XI. Interfaces to other programs and technical aspects of CP2K are specified in sections XIII and XIV, respectively.

II. GAUSSIAN AND PLANE WAVES METHOD
The electronic structure module Quickstep [4,5] in CP2K can handle a wide spectrum of methods and approaches. Semi-empirical (SE) and tight-binding (TB) methods, orbital-free and Kohn-Sham DFT (KS-DFT) and wavefunction-based correlation methods (e.g. MP2, dRPA, GW) all make use of the same infrastructure of integral routines and optimization algorithms. In this section, we give a brief overview of the methodology that sets CP2K apart from most other electronic structure programs, namely its use of a plane wave (PW) auxiliary basis set within a Gaussian orbital scheme. As many other programs, CP2K uses contracted Gaussian basis sets g(r) to expand orbital functions ϕ(r) = u d u g u (r), (1) where the contraction coefficients d u are fixed and the primitive Gaussians are centered at atomic positions. These functions are defined by the exponent α, the spherical harmonics Y lm with angular momentum (l, m) and the coordinates of its center A. The unique properties of Gaussians, e.g. analytic integration or the product theorem, are exploited in many programs. In CP2K we make use of an additional property of Gaussians, namely, that their Fourier transform is again a Gaussian function, i.e.
This property is directly connected with the fact, that integration of Gaussian functions on equidistant grids shows exponential convergence with grid spacing. In order to take advantage of this property we define, within a computational box or periodic unit cell, a set of equidistant grid points The three vectors a 1 , a 2 , and a 3 define a computational box with a 3 × 3 matrix h = [a 1 , a 2 , a 3 ] and a volume Ω = det h. Furthermore, N is a diagonal matrix with entries 1/N s , where N s is the number of grid points along vector s = 1, 2, 3, whereas q is a vector of integers ranging from 0 to N s − 1. Reciprocal lattice vectors b i , defined by b i · a j = 2πδ ij can also be arranged into a 3 × 3 matrix [b 1 , b 2 , b 3 ] = 2π(h t ) −1 which allows us to define reciprocal space vectors where g is a vector of integer values. Any function with periodicity given by the lattice vectors and defined on the real-space points R can be transformed into a reciprocal space representation by the Fourier transform The accuracy of this expansion is given by the grid spacings or the PW cutoff defining the largest vector G included in the sum.
In the GPW method [6], the equidistant grid, or equivalently the PW expansion within the computational box, is used for an alternative representation of the electron density. In the KS method, the electron density is defined by n(r) = µν P µν ϕ µ (r)ϕ ν (r), (7) where the density matrix P µν = i f i c µi c νi is calculated from the orbital occupations f i and the orbital expansion coefficients c µi of the common linear combination of atomic orbitals Φ i (r) = µ c µi ϕ µ (r). Therein, Φ i (r) are the so-called molecular orbitals (MOs) and ϕ µ (r) the atomic orbitals (AOs). In the PW expansion, however, the density is given by The definitions given above allow us to calculate the expansion coefficients n(G) from the density matrix P µν and the basis functions ϕ µ (r). The dual representation of the density is used in the definition of the GPW-based KS energy expression (see section II A) to facilitate efficient and accurate algorithms for the electrostatic, as well as exchange and correlation energies. The efficient mapping P µν → n(G) is achieved by using multigrid methods, optimal screening in real-space, and the separability of Cartesian Gaussian functions in orthogonal coordinates. Details of these algorithms that result in a linear scaling algorithm with small prefactor for the mapping is described elsewhere [4][5][6].
A. Kohn-Sham energy, forces, and stress tensor The KS electronic energy functional [7] for a molecular or crystalline system within the supercell or Γ-point approximation in the GPW framework [4][5][6] is defined as where E kin is the kinetic energy, E ext is the electronic interaction with the ionic cores (see section II B), E ES is the total electrostatic (Coulomb) energy and E XC is the exchange-correlation (XC) energy. An extension of Eq. (9b) to include a k-point sampling within the first Brillouin zone is also available in CP2K. This implementation follows the methods from Pisani [8], Scuseria [9] and coworkers. The electrostatic energy is calculated using an Ewald method [10]. A total charge density is defined by adding Gaussian charge distributions of the form to the electronic charge distribution n(r). Self and overlap terms as generated by this Gaussian distributions, have to be compensated. The double sum for E ovrl runs over unique atom pairs and has to be extended, if necessary, beyond the minimum image convention. The electrostatic term also includes an interaction of the compensation charge with the electronic charge density that has to be subtracted from the external potential energy. The correction potential is of the form The treatment of the electrostatic energy terms and the XC energy is the same as in PW codes [10]. This means that the same methods that are used in those codes to adapt Eq. (9b) for cluster boundary conditions can be used here. This includes analytic Green's function methods [10], the method of Eastwood and Brownrigg [11], the methods of Martyna and Tuckerman [12,13], and wavelet approaches [14,15]. Starting from n(R), using Fourier transformations we can calculate the combined potential for the electrostatic and XC energies. This includes the calculation of the gradient of the charge density needed for generalized gradient approximation (GGA) functionals. For non-local van der Waals functionals [16], we use the Fourier transform based algorithm of Soler et al. [17]. The combined potential calculated on the grid points V XC (R) is then the starting point to calculate the KS matrix elements This inverse mapping V XC (R) → H XC µν is using the same methods that have been used for the charge density. Also, in this case we can achieve linear scaling with small prefactors. same type, so the same routine can be used. Similarly, for XC functionals including the kinetic energy density, we can calculate τ (r), the corresponding potential and matrix representation using the same mapping functions.
For the internal stress tensor we use again the Fourier transform framework for the E XC terms and the simple virial for all pair forces. This applies to all Pulay terms and only for the XC energy contributions from GGA functionals special care is needed due to the cell dependent grid integration [19][20][21].

B. Dual-space pseudopotentials
The accuracy of the PW expansion of the electron density in the GPW method is controlled by cutoff value E cut restricting the maximal allowed value of |G| 2 . In case of a Gaussian basis sets the cutoff needed to get a given accuracy is proportional to the largest exponent. As can been easily seen by inspecting common Gaussian basis sets for elements of different rows in the periodic table, the value of the largest exponent rapidly increases with atomic number. Therefore, the prefactor in GPW calculations will increase similarly. In order to avoid this, we can either resort to a pseudopotential description of inner shells, or use a dual representation as described in Blöchl's projector augmented-wave method (PAW) [22]. The pseudopotentials used together with PW basis sets are constructed to generate nodeless atomic valence functions. Fully non-local forms are computationally more efficient and easier to implement. Dual-space pseudopotentials are of this form and are, because of their analytic form consistent of Gaussian functions, easily applied together with Gaussian basis sets [23][24][25].
The pseudopotentials are given in real-space as and a non-local part are Gaussian-type projectors resulting in a fully analytical formulation that requires only the definition of a small parameter set for each element. Moreover, the pseu-dopotentials are transferable and norm-conserving. The pseudopotential parameters are optimized with respect to atomic all-electron wavefunction obtained from relativistic density functional calculations using a numerical atom code, which is also part of CP2K. A database with many pseudopotential parameter sets optimized for different XC potentials is available together with the distribution of the CP2K program.

C. Basis sets
The use of pseudopotentials in the GPW method also requires the use of correspondingly adapted basis sets. In principle, the same strategies that have been used to generate molecular or solid-state Gaussian basis sets could be used. It is always possible to generate specific basis sets for an application type, e.g. for metals or molecular crystals, but for ease of use a general basis set is desirable. Such a general basis should fulfill the following requirements. High accuracy for smaller basis sets and a route for systematic improvements. One and the same basis set should perform in various environments from isolated molecules to condensed phase systems. Ideally, the basis sets should lead for all systems to well conditioned overlap matrices and be therefore well suited for linear scaling algorithms. To fulfill all the above requirements, generally contracted basis sets with shared exponents for all angular momentum states were proposed [26]. In particular, a full contraction over all primitive functions is used for both valence and polarization functions. The set of primitive functions includes diffuse functions with small exponents that are mandatory for the description of weak interactions. However, in contrast to the practice used in augmented basis sets, these primitive functions are always part of a contraction with tighter functions. Basis sets of this type were generate according to a recipe that includes global optimization of all parameters with respect to the total energy of a small set of reference molecules [26]. The target function was augmented with a penalty function that includes the overlap condition number. These basis sets of type molopt have been created for the first two rows of the periodic table. They show good performance for molecular systems, liquids and dense solids. Results for the Delta test are shown in Fig. 15. The basis sets have 7 primitive functions with a smallest exponent of ≈ 0.047 for oxygen. This is very similar to the smallest exponent found in augmented basis sets of the correlation consistent type, e.g. ≈ 0.060 for aug-cc-pVTZ. The performance of the grid mapping routines depends strongly on this most diffuse function in the basis sets. We have therefore optimized a second set of basis sets of the molopt type, where the number of primitives has been reduced (e.g. 5 for first row elements) which also leads to less diffuse functions. The smallest exponent for oxygen in this basis set is now ≈ 0.162. These basis sets still show good performance in many different environments and are especially suited for all types of condensed matter systems (e.g. see Delta test results in section XIV D). The reduction of Gaussian primitives and the removal of very diffuse functions leads to a 10-fold time reduction for the mapping routines for liquid water calculations using default settings.

D. Local density fitting approach
Within the GPW approach, the mapping of the electronic density on the grid is often the time determining step. With an appropriate screening this is a linear scaling step with a prefactor determined by the number of overlapping basis functions. Especially in condensed phase systems, the number of atom pairs that have to be included can be very large. For such cases it can be beneficial, to add an intermediary step in the density mapping. In this step, the charge density is approximated by another auxiliary Gaussian basis. The expansion coefficients are determined using the local density fitting approach by Baerends et al. [27]. They introduced a local metric, where the electron density is decomposed into pair-atomic densities, which are approximated as a linear combination of auxiliary functions localized at atoms A and B. The expansion coefficients are obtained by employing an overlap metric. This local resolution-of-the-identity (LRI) method combined with GPW is available in CP2K as the LRIGPW approach [28]. For details how to set-up such a calculation see Ref. 29.
The atomic pair densities n AB are approximated by an expansion in a set of Gaussian fit functions f (r) centered at atoms A and B, respectively. The expansion coefficients are obtained for each pair AB by fitting the exact density while keeping the number of electrons fixed. This leads to a set of linear equations that can be solved easily. The total density is then represented by the sum of coefficients of all pair expansions on the individual atoms. The total density is now presented as a sum over the number of atoms, whereas in GPW we have a sum over pairs. In the case of 64 water molecules in a periodic box, this means that the fitted density is mapped on the grid by 192 atom terms rather than ≈ 200000 atom pairs. LRIGPW requires the calculation of two-and threeindex overlap integrals that is computationally demanding for large auxiliary basis sets. To increase the efficiency of the LRIGPW implementation, we developed an integral scheme based on solid harmonic Gaussian functions [30], which is superior to the widely used Cartesian Gaussianbased methods.
An additional increase in efficiency can be achieved by recognizing that most of the electron density is covered by a very small number of atom pair densities. The large part of more distant pairs can be approximated by an expansion on a single atom. By using a distance criteria and a switching function, a method with a smooth potential energy is created. The single atom expansion reduces memory requirements and computational time considerably. In the above water box example about 99% of the pairs can be treated as distant pairs. E. Gaussian-augmented plane waves approach An alternative to pseudopotentials, or a method to allow for smaller cutoffs in pseudopotential calculations is provided by the Gaussian-augmented plane waves (GAPW) approach [31,32]. The GAPW method uses a dual representation of the electronic density, where the usual expansion of the density using the density matrix P is replaced in the calculation of the Coulomb and XC energy by The densitiesñ(r), n A (r), andñ A (r) are expanded in plane waves and products of primitive Gaussians centered on atom A, respectively, i.e.
In Eq. (19a),ñ(G) are the Fourier coefficients of the soft density, as obtained in the GPW method by keeping in the expansion of the contracted Gaussians only those primitives with exponents smaller than a given threshold. The expansion coefficients P A mn , andP A mn are also functions of the density matrix P αβ and can be calculated efficiently. The separation of the density from Eq. (18) is borrowed from the PAW approach [22]. Its special form allows the separation of the smooth parts, characteristic of the interatomic regions, from the quickly varying parts close to the atoms, while still expanding integrals over all space. The sum of the contributions in Eq. (18) gives the correct full density if the following conditions are fulfilled The first conditions are exactly satisfied only in the limit of a complete basis set. However, the approximation introduced in the construction of the local densities, can be systematically improved by choosing larger basis sets.
For semi-local XC functionals such as the local density approximation (LDA), GGA or meta functionals using the kinetic energy density, the XC energy can be simply written as (21) The first term is calculated on the real-space grid defined by the PW expansion and the other two are efficiently and accurately calculated using atom centered meshes.
Due to the non-local character of the Coulomb operator, the decomposition for the electrostatic energy is more complex. In order to distinguish between local and global terms, we need to introduce atom-dependent screening densities n 0 A that generate the same multipole expansion Q lm A as the local density n A −ñ A + n Z A , where n Z A is the nuclear charge of atom A, i.e.
The normalized primitive Gaussians g lm A (r) are defined with an exponent such that they are localized within an atomic region. Since the sum of local densities n A −ñ A + n Z A − n 0 A has vanishing multiple moments, it does not interact with charges outside the localization region, and the corresponding energy contribution can be calculated by one-center integrals. The final form of the Coulomb energy in the GAPW method then reads as where n 0 is summed over all atomic contributions, and E H [n] denotes the Coulomb energy of a charge distribution n. The first term in Eq. (23) can be calculated efficiently using fast Fourier transform (FFT) methods using the GPW framework. The one-centered terms are calculated on radial atomic grids.
The special form of the GAPW energy functional in-volves several additional approximations in addition to a GPW calculation. The accuracy of the local expansion of the density is controlled by the flexibility of the product basis of primitive Gaussians. As we fix this basis to be the primitive Gaussians present in the original basis we cannot independently vary the accuracy of the expansion. Therefore, we have to consider this approximation as inherent to the primary basis used.
With the GAPW method it is possible to calculate materials properties that depend on the core electrons. This has been used for the simulation of the X-ray scattering in liquid water [33]. X-ray absorption spectra are calculated using the transition potential method [34][35][36][37]. Several nuclear and electronic magnetic properties are also available [38,39].

III. HARTREE-FOCK AND HYBRID DENSITY FUNCTIONAL THEORY METHODS
Even though, semi-local DFT is a cornerstone of much of condensed phase electronic structure modeling, it is also recognized that going beyond GGA-based DFT is necessary to improve the accuracy and reliability of electronic structure methods. One path forward is to augment DFT by elements of wavefunction theory, or to adopt wavefunction theory itself. This is the approach taken in successful hybrid XC functionals such as B3LYP or HSE [40,41], where part of the exchange functional is replaced by exact Hartree-Fock exchange (HFX). The capability to compute HFX was introduced in CP2K by Guidon et al. [42][43][44]. The aim at that time was to enable the use of hybrid XC functionals for condensed phase calculations, of relatively large, disordered systems, in the context of AIMD simulations. This objective motivated a number of implementation choices and developments that will be described in the following. The capability was particularly important to make progress in the field of first principles electro-chemistry [45,46], but is also the foundation for the correlated wavefunction methods such as MP2, RPA and GW that are available in CP2K and will be described in section IV.
In the periodic case, HFX can be computed as where an explicit sum over the k-points is retained and a generalized Coulomb operator g(|r 1 − r 2 |) is employed. The k-point sum is important, as at least for the standard Coulomb operator 1 r , the term for k = k = 0 (the so-called Γ-point) is singular, though the full sum approximates an integrable expression. If the operator is short-ranged or screened, the Γ-point term is well-behaved. CP2K computes HFX at the Γ-point only and employs a localized atomic basis set, using an expression where the sum is explicit over the indices of the localized basis (λσµν), as well as the image cells (abc), thus For an operator with finite range, the sum over the image cells will terminate. This expression was employed to perform hybrid DFT-based AIMD simulations of liquid water with CP2K [42]. Several techniques have been introduced to reduce the computational cost. First, screening based on the overlap of basis functions is employed to reduce the scaling of the calculation from O(N 4 ) to O(N 2 ). This does not require any assumptions on the sparsity of the density matrix, nor the range of the operator, and makes HFX feasible for fairly large systems. Second, the HFX implementation in CP2K is optimized for 'in-core' calculations, where the four center integrals are computed (analytically) only once at the beginning of the SCF procedure, stored in main memory, and reused afterwards. This is particularly useful in the condensed phase, as the sum over all image cells multiplies the cost of evaluating the integral, relative to gas phase calculations. To store all computed integrals, the code has been very effectively parallelized using MPI and OMP, yielding super-linear speed-ups as long as added hardware resources provide additional memory to store all integrals. Furthermore, a specialized compression algorithm is used to store each integral with just as many bits as needed to retain the target accuracy. Third, a multiple-time-step (MTS) algorithm (see section IX D) is employed to evaluate HFX energies only every few time-steps during an AIMD simulation, assuming that the difference between the potential energy surface of a GGA and a hybrid XC functional is slowly varying with time. This technique has found reuse in correlated wavefunction simulations described in section IV. In Ref. 43, the initial implementation was revisited, in particular to be able to robustly compute HFX at the Γ-point for the case where the operator in the exchange term is 1 r , and not a screened operator as e.g. in HSE [47,48]. The solution is to truncate the operator (not any of the image cell sums), with a truncation radius R C that grows with the cell size. The advantage of this approach is that the screening of all terms in Eq. 25 can be performed rigorously, and that the approach is stable for a proper choice of screening threshold, also in the condensed phase with good quality (but non-singular) basis sets. The value of R C that yields convergence is system dependent, and large values of R C might require the user to explicitly consider multiple unit cells for the simulation cell. Note that the HFX energy converges exponentially with R C for typical insulating systems, and that the same approach was used previously to accelerate k-point convergence [49]. In Ref. 50, it was demonstrated that two different codes (CP2K and Gaussian), with very different implementations of HFX, could reach micro-Hartree agreement for the value of the HF total energy of the LiH crystal. In Ref. 43, a suitable GGA-type exchange function was derived that can be used as a long range correction (LRC) together with the truncated operator. This correction functional, in the spirit of the the exchange functional employed in HSE, effectively allows for model chemistries that employ very short range exchange (e.g. ≈ 2Å) only.
Important for the many applications of HFX with CP2K is the auxiliary density matrix method (ADMM) method, introduced in Ref. 44. This method reduces the cost of HFX significantly, often bringing it to within a few times the cost of conventional GGA-based DFT, by addressing the unfavorable scaling of the computational cost of Eq. 25 with respect to basis set size. The key ingredient of the ADMM method is the use of an auxiliary density matrix P, which approximates the original P, for which the HFX energy is more cost-effective to compute: Effectively, E HF X X [P ] is replaced with by computationally more efficient E HF X X [P ], and the difference between the two is corrected approximately with a GGA-style exchange functional. Commonly, the auxiliaryP is obtained by projecting the density matrix P using a smaller, auxiliary basis. This approximation, including the projection, can be implemented fully self-consistently. In Ref. 51, the efficiency of the ADMM method was demonstrated by computing over 300 ps of AIMD trajectory for systems containing 160 water molecules, and by computing spin densities for a solvated metallo-protein system containing approximately 3000 atoms [44].

IV. BEYOND HARTREE-FOCK METHODS
In addition, so-called post-Hartree-Fock methods that are even more accurate than the just describe hybrid-DFT approach are also available within CP2K.

Theory
Second-order Møller-Plesset perturbation theory (MP2) is the simplest ab-initio correlated wavefunction method [52], applied to the Hartree-Fock reference and able to capture most of the dynamic electron correlation [53]. In the DFT framework, the MP2 formalism gives rise to the doubly-hybrid XC functionals [54]. In the spin-orbital basis, the MP2 correlation energy is given by: where i, j, ... run over occupied spin-orbitals, a, b, ... run over virtual spin-orbitals (indexes without bar stand for αspin-orbitals, indexes with bar for β-spin-orbitals), ∆ ab ij = a + b − i − j ( a and i are orbital energies) and (ia|jb) are electron repulsion integrals in the Mulliken notation.
In a canonical MP2 energy algorithm, the time limiting step is the computation of the (ia|jb) integrals obtained from the electron repulsion integrals over AOs (µν|λσ) via four consecutive index integral transformations. The application of the resolution-of-identity (RI) approximation to MP2 [55], which consists of replacing (ia|jb) integrals with the approximated (ia|jb) RI , is given by where P, R, ... (the total number of them is N a ) are auxiliary basis functions and L are two-center integrals over them. The RI-MP2 method is also scaling O(N 5 ) with a lower prefactor: the main reason for the speed-up in RI-MP2 lies in the strongly reduced integral computation cost.
As the MP2 is non-variational with respect to wavefunction parameters, analytical expressions for geometric energy derivatives of RI-MP2 energies are complicated, since its calculation requires the solution of Z-vector equations [56].

Scaled opposite-spin MP2
The scaled opposite-spin MP2 (SOS-MP2) method is a simplified variant of MP2 [57]. Starting from Eq. 27, we neglect the same spin term in curly brackets and scale the remaining opposite spin term to account for the introduced error. We can rewrite the energy term with the RI approximation and the Laplace transform When we exploit a numerical integration, we obtain the working equation for SOS-MP2, which reads as with and similarly for the beta spin. The integration weights w q and abscissa t q are determined by a minimax procedure [58]. In practice, 7 quadrature points are needed for µHartree accuracy [57]. The most expensive step is the calculation of the matrix elements Q α P Q which scales like O(N 4 ). Due to similarities with the random phase approximation (RPA), we will discuss the implementation of this method in section IV B.
The implementation is massively parallel, takes the advantage of sparse matrix algebra and allows for GPU acceleration of large matrix multiplies. The evaluation of the gradients of the RI-MP2 energy can be performed within a few minutes for systems containing hundreds of atoms and thousands of basis functions on thousands of CPU cores, allowing for MP2-based structure relaxation and even AIMD simulations on HPC facilities. The cost of the gradient calculation is 4-5 times larger than the energy evaluation and open-shell MP2 calculation is typically 3-4 times more expensive than the closed-shell calculation.

Applications
The RI-MP2 implementation, which is the computationally most efficient MP2 variant available in CP2K, has been successfully applied to study a number of aqueous systems. In fact, ab-initio Monte Carlo (MC) and AIMD simulations of bulk liquid water (with simulation cells containing 64 water molecules) predicted the correct density, structure and IR spectrum [63][64][65]. Other applications include structure refinements of ice XV [66], AIMD simulations of the bulk hydrated electron (with simulation cells containing 47 water molecules) [67], as well as the first AIMD simulation of a radical in the condensed phase using wavefunction theory.

B. Random Phase Approximation Correlation Energy Method
Total energy methods based on the RPA correlation energy have emerged in the recent years as promising approaches to include non-local dynamical electron correlation effects at the fifth rung on the Jacob's ladder of density functional approximations [68]. In this context, there are numerous ways to express the RPA correlation energy depending on the theoretical framework and approximations employed to derive the working equations [69][70][71][72][73][74][75][76][77][78][79][80][81][82][83][84][85][86][87]. Our implementation uses the approach introduced by Eshuis et al. [88], which can be referred to as based on the dielectric matrix formulation, involving the numerical integral over frequency of a logarithmic expression including the dynamical dielectric function, expressed in an Gaussian RI auxiliary basis. Within this approach, the direct-RPA (sometimes referred as dRPA) correlation energy, which is a RPA excluding exchange contributions [88], is formulated as a frequency integral Tr(ln(1 + Q(ω)) − Q(ω)), (32) with the frequency dependent matrix Q(ω), expressed in the RI basis, determined by For a given ω, the computation of the integrand function in Eq. 32 and using Eq. 33 requires O(N 4 ) operations.
The integral of Eq. 32 can be efficiently calculated by a minimax quadrature requiring only 10 integration points to achieve µHartree accuracy [89,90]. Thus, the introduction of the RI approximation and the frequency integration techniques for computing E RPA c leads to a computational cost of O(N 4 N q ) and O(N 3 ) storage, where N q is the number of quadrature points [60].
We note here that the conventional RPA correlation energy methods in CP2K are based on the the exact exchange (EXX) and RPA correlation energy formalism (EXX/RPA), which has extensively been applied to a large variety of systems including molecules, systems with We summarize here the implementation in CP2K of the quartic scaling computation of the RPA and SOS-MP2 correlation energies. The reason to describe the two implementations here is due to the fact that the two approaches share several components. In fact, it can be shown that the direct MP2 energy can be obtained by truncating at the first non-vanishing term of the Taylor expansion to compute the logarithm in Eq. 32 and integrating over frequencies [88].
After the three center RI integral matrix B ia P is made available (via i.e. RI-GPW [60]), the key component of both methods is the evaluation of the frequency dependent Q P Q (ω) for the RPA and the τ dependent Q P Q (τ ) for the Laplace transformed SOS-MP2 method (see section IV A 2). The matrices Q P Q (ω) and Q P Q (τ ) are given by the contractions in Eqs. 33 and 31, respectively. Their computation entails, as a basic algorithmic motif, a large distributed matrix multiplication between tall and skinny matrices for each quadrature point. Fortunately, the required operations at each quadrature point are independent of each other. The parallel implementation in CP2K exploits this fact by distributing the workload for the evaluation of Q P Q (ω) and Q P Q (τ ) over pools of processes, where each pool is working independently on a subset of quadrature points. Furthermore, the operations necessary for each quadrature point are performed in parallel within all members of the pool. In this way, the O(N 4 ) bottleneck of the computation displays an embarrassingly parallel distribution of the workload and in fact, it shows excellent parallel scalability to several thousand nodes [60,89]. Additionally, since at the pool level the distributed matrix multiplication employs the widely adopted data layout of the parallel BLAS library, minimal modifications are required to exploit accelerators (such as graphics processing units (GPUs) and field-progammable gate arrays (FPGAs), see section XIV A for details), as interfaces to the corresponding accelerated libraries are made available [89].
Finally, the main difference between RPA and SOS-MP2 is the postprocessing. After the contraction step to obtain the Q P Q matrix, the sum in Eq. 30 is performed with computational costs of O(N 2 a N q ) for SOS-MP2 instead of O(N 3 a N q ), which is associated with the evaluation of the matrix logarithm of Eq. 32 for the RPA (in CP2K this operation is performed using the identity Tr[ln A] = ln(det[A]), where a Cholesky decomposition is used to efficiently calculate the matrix determinant). Therefore, the computational costs of the quartic-scaling RPA and SOS-MP2 are the same for large systems assuming the same number of quadrature points is used for the numerical integration.

Cubic Scaling RPA and SOS-MP2 method
The scaling of RPA and SOS-MP2 can be reduced from O(N 4 ) to O(N 3 ), or even better by alternative analytical formulations of the methods [103,104]. Here we describe the CP2K-specific cubic scaling RPA/SOS-MP2 implementation and demonstrate the applicability to systems containing thousands of atoms.
For cubic scaling RPA calculations, the matrix Q P Q (ω) from Eq. 33 is transformed to imaginary-time Q P Q (τ ) [90], as it is already present in SOS-MP2. The tensor B ia P is transformed from occupied-virtual MO pairs ia to pairs µν of AO basis set functions. This decouples the sum over occupied and virtual orbitals and thereby reduces the formal scaling from quartic to cubic. Further requirements for a cubic scaling behavior are the use of localized atomic Gaussian basis functions and the localized overlap RI metric such that the occurring 3-center integrals are sparse. A sparse representation of the density matrix is not a requirement for our cubic scaling implementation but it reduces the effective scaling of the usually dominant O(N 2 ) sparse tensor contraction steps to O(N ) [103].
The operations performed for the evaluation of Q P Q (τ ) are generally speaking contractions of sparse tensors of ranks 2 and 3 -starting from the 3-center overlap integrals and the density matrix. Consequently, sparse linear algebra is the key to good performance, as opposed to the quartic scaling implementation that relies mostly on parallel BLAS for dense matrix operations.
The cubic scaling RPA/SOS-MP2 implementation is based on the distributed block compressed sparse row (DBCSR) library [105], which is described in detail in section XII and was originally co-developed with CP2K, to enable linear scaling DFT [106]. The library was extended with a tensor API in a recent effort to make it more easily applicable to algorithms involving contractions of large sparse multi-dimensional tensors. Block-sparse tensors are internally represented as DBCSR matrices, whereas tensor contractions are mapped to sparse matrix-matrix multiplications. An in-between tall-and-skinny matrix layer reduces memory requirements for storage and reduces communication costs for multiplications by splitting the largest matrix dimension and running a simplified variant of the CARMA algorithm [107]. The tensor extension of the DBCSR library leads to significant improvements in terms of performance and usability compared to the initial implementation of the cubic scaling RPA [104].
In Fig. 1, we compare the computational costs of the quartic and the cubic scaling RPA energy evaluation for periodic water systems of different sizes. All calculations use the RI together with the overlap metric and a highquality cc-TZVP primary basis with a matching RI basis. The neglect of small tensor elements in the O(N 3 ) implementation is controlled by a filtering threshold parameter. This parameter has been chosen such that the relative error introduced in the RPA energy is below 0.01%. The favorable effective scaling of O(N 1.8 ) in the cubic scaling implementation leads to better absolute timings for systems of 100 or more water molecules. At 256 water molecules, the cubic scaling RPA outperforms the quartic scaling method by one order of magnitude.
The observed scaling is better than cubic in this example because the O(N 3 ) scaling steps have a small prefactor and would start to dominate in systems larger than the ones presented here -they make up for around 20% of the execution time for the largest system. The dominant sparse tensor contractions are quadratic scaling, closely matching the observed scaling of O(N 1.8 ). It is important to mention that the density matrices are not yet becoming sparse for these system sizes. Lower dimensional systems 32 256 864 Number of water molecules with large extend in one or two dimensions have an even more favorable scaling regime of O(N ) since the onset of sparse density matrices occurs at smaller system sizes.
All aspects of the comparison discussed here also apply to SOS-MP2 because it shares the algorithm and implementation of the dominant computational steps with the cubic scaling RPA method. In general, the exact gain of the cubic scaling RPA/SOS-MP2 scheme depends on the specifics of the applied basis sets (locality and size). The effective scaling, however, is O(N 3 ) or better for all systems, irrespective of whether the density matrix has a sparse representation, thus extending the applicability of the RPA to large systems containing thousands of atoms.

C. Ionization potentials and electron affinities from GW
The GW approach, which allows to approximately calculate the self-energy of a many-body system of electrons, has become a popular tool in theoretical spectroscopy to predict electron removal and addition energies as measured by direct and inverse photoelectron spectroscopy, respectively, see Figs. 2(a) and (b) [108][109][110]. In direct photoemission the electron is ejected from orbital ψ n to the vacuum level by irradiating the sample with visible, ultraviolet or X-rays, whereas in the inverse photoemission process an injected electron undergoes a radiative transition into an unoccupied state.
The GW approximation has been applied to a wide range of materials including two-dimensional systems, surfaces and molecules. A summary of applications and an introduction to the many-body theory behind GW and practical aspects can be found in a recent review article [110]. The electron removal energies are referred to as ionization potentials (IPs) and the negative of the electron addition energies as electron affinities (EAs); see Ref. 110 for details on the sign convention. The GW method yields a set of energies {ε n } for all occupied and unoccupied orbitals {ψ n }. For occupied states, ε n can be directly related to the IP and for the lowest unoccupied state (LUMO) to the EA. Hence, The difference between the IP of the highest occupied state (HOMO) and the EA is called the fundamental gap, a critical parameter for charge transport in, e.g, organic semiconductors [111]. It should be noted that the fundamental HOMO-LUMO gap does not correspond to the first optical excitation energy (also called optical gap) that is typically smaller than the fundamental gap due to the electron-hole binding energy [112]. For solids, the Bloch functions ψ nk carry an additional quantum number k in the first Brillouin zone and give rise to a bandstructure ε nk [109]. From the bandstructure we can determine the band gap, which is the solid-state equivalent to the HOMO-LUMO gap. Unlike for molecules, an angleresolved photoemission experiment is required to resolve the k-dependence.
For GW , mean absolute deviations of less than 0.2 eV from the higher-level CCSD(T) reference have been reported for IP HOMO and EA [113,114]. The deviation from the experimental reference can be even reduced to < 0.1 eV when including also vibrational effects [115]. On the contrary, DFT eigenvalues ε DFT n fail to reproduce spectroscopic properties. Relating ε DFT n to the IPs and EA as in Eq. 35, is conceptually only valid for the HOMO level [116]. Besides the conceptual issue, IPs from DFT eigenvalues are underestimated by several eV compared to experimental IPs due to the self-interaction error in GGA and LDA functionals yielding far too small HOMO-LUMO gaps [117,118]. Hybrid XC functionals can improve the agreement with experiment, but the amount of exact exchange can strongly influence ε DFT n in an arbitrary way. The most common GW scheme is the G 0 W 0 approximation, where the GW calculation is performed non-selfconsistently on top of an underlying DFT calculation. In The self-energy is computed from the Green's function G and the screened Coulomb interaction W , Σ = GW , hence the name of the GW approximation [110]. The G 0 W 0 implementation in CP2K [119] works routinely for molecules despite first attempts also have been made for periodic systems [67,120,121]. The standard G 0 W 0 implementation is optimized for computing valence orbital energies ε n (e.g. up to 10 eV below the HOMO and 10 eV above the LUMO) [119]. The frequency integration of the self-energy is based on the analytic continuation using either the 2-pole model [122], or Padé approximant as analytic form [123]. For core levels, more sophisticated implementations are necessary [124,125]. The standard implementation scales with O(N 4 ) with respect to system size N and enables the calculation of molecules up to a few hundreds atoms on parallel supercomputers. Large molecules exceeding thousand atoms can be treated by the low-scaling G 0 W 0 implementation within CP2K [126], which effectively scales with O(N 2 ) to O(N 3 ).
The GW equations should be in principle solved selfconsistently. However, a fully self-consistent treatment is computationally very expensive. In CP2K, an approximate self-consistent scheme is available, where the wavefunctions ψ DFT n from DFT are employed and only the eigenvalues in G and W are iterated, for details see Refs. 110,117,and 119. This eigenvalue-self-consistent scheme (evGW ) includes more physics than G 0 W 0 , but is still computationally tractable. Depending on the underlying DFT functional, evGW improves the agreement of the HOMO-LUMO gaps with experiment by 0.1 − 0.3 eV compared to G 0 W 0 [117,119]. For lower lying states the improvement over G 0 W 0 might be not as consistent [127].
Applications of the GW implementation in CP2K have been focused on graphene-based nanomaterials on gold surfaces complementing scanning tunneling spectroscopy (STS) with GW calculations to validate the molecular geometry and to obtain information about the spin configuration [128]. The excitation process generates an additional charge on the molecule. As a response, an image charge (IC) is formed inside the metallic surface which causes occupied states to move up in energy and unoccupied states to move down. The HOMO-LUMO gap of the molecule is thus significantly lowered on the surface compared to the gas phase, see Fig. 2(c). This effect has been accounted for by an IC model [129], which is implemented in CP2K. Adding the IC correction on-top of a gas phase evGW calculation of the isolated molecule, CP2K can efficiently compute HOMO-LUMO gaps of physisorbed molecules on metal surfaces [128].

V. DENSITY FUNCTIONAL PERTURBATION THEORY
Several experimental observables are measured by perturbing the system and then observing its response, hence they can be obtained as derivatives of the energy or density with respect to some specific external perturbation. In the common perturbation approach, the perturbation is included in the Hamiltonian, i.e. as an external potential, then the electronic structure is obtained by applying the variational principle and the changes on the expectation values are evaluated. The perturbation Hamiltonian defines the specific problem. The perturbative approach applied in the framework of DFT turns out to perform better than the straightforward numerical methods, where the total energy is computed after actually perturbing the system. For all kinds of properties related to derivatives of the total energy, DFPT is derived from the following extended energy functional where the external perturbation is added in the form of a functional and λ is a small perturbative parameter representing the strength of the interaction with the static external field [130,131]. The minimum of the functional is expanded perturbatively in powers of λ as whereas the corresponding minimising orbitals are If the expansion of the wavefunction up to an order (n−1) is known, then the variational principle for the 2nth-order derivative of the energy is given by under the constraint For zero-order, the solution is obtained from the ground state KS equations. The second-order energy is variational in the first-order wavefunction and is obtained as The electron density is also expanded in powers of λ and the first-order term reads as The Lagrange multipliers Λ ij are the matrix elements of the zeroth-order Hamiltonian, which is the KS Hamiltonian. Hence, and the second-order energy kernel is where E Hxc represents the sum of the Hartree and the XC energy functionals. Thus, the evaluation of the kernel requires the second-order functional derivative of the XC functionals.
The second-order energy is variational with respect to the {φ (1) i }, where the orthonormality condition of the total wavefunction gives at the first-order This also implies the conservation of the total charge.
The perturbation functional can often be written as the expectation value of a perturbation Hamiltonian However, the formulation through an arbitrary functional also allows orbital specific perturbations. The stationary condition then yields the inhomogeneous, non-linear system for {φ where j | is the projector upon the unoccupied states. Note that the right hand side still depends on the {φ (1) i } via the perturbation density n (1) . In our implementation the problem is solved directly using a preconditioned conjugate-gradient (CG) minimisation algorithm.

A. Polarizability
One case where the perturbation cannot be expressed in a Hamiltonian form is the presence of an external electricfield, which couples with the electric polarisation P el = e r , where i r is the expectation value of the position operator for the system of N electrons. In the case of periodic systems, the position operator is ill-defined, and we use the modern theory of polarization in the Γ-pointonly to write the perturbation in terms of the Berry phase where the matrix is defined as and h = [a, b, c] is the 3×3 matrix defining the simulation cell and h α = (a α , b α , c α ) [132,133]. The electric dipole is then given by Through the coupling to an external electric-field E ext , this induces a perturbation of the type where the perturbative parameter is the field component I | can be evaluated using the formula for the derivative of a matrix with respect to a generic variable [131], which gives the perturbative term as Once the first-order correction to the set of the KS orbitals has been calculated, the induced polarization is while the elements of the polarizability tensor are obtained as α αβ = ∂P el α /∂E β . The polarizability can be looked as the deformability of the electron cloud of the molecule by the electricfield. In order for a molecular vibration to be Raman active, the vibration must be accompanied by a change in the polarizability. In the usual Placzeck theory, ordinary Raman scattering intensities can be expressed in terms of the isotropic transition polarizability α i = 1 3 Tr[α], and the anisotropic transition polarizability α a = αβ 1 2 (3α αβ α αβ − α αα α ββ ) [134]. The Raman scattering cross section can be related to the dynamical autocorrelation function of the polarazability tensor. Along AIMD simulations, the polarizability can be calculated as a function of time [135,136]. As the vibrational spectra are obtained by the temporal Fourier transformation of the velocity autocorrelation function, and the IR spectra from that of the dipole autocorrelation function, the depolarized Raman intensity can be calculated from the autocorrelation of the polarizability components [137].

B. Nuclear magnetic resonance and electron paramagnetic resonance spectroscopy
The development of the DFPT within the GAPW formalism allows for an all-electron description, which is important when the induced current density generated by an external static magnetic perturbation is calculated. The so induced current density determines at any nucleus A the nuclear magnetic resonance (NMR) chemical shift and, for systems with net electronic spin 1/2, the electron paramagnetic resonance (EPR) g-tensor In the above expressions, R A is the position of the nucleus, j α is the current density induced by a constant external magnetic-field applied along the α axis, and g e is the free electron g-value. Among the different contributions to the g-tensor, the current density dependent ones are the spin-orbit (SO) interaction and the spin-other-orbit (SOO) interaction where which also depends on the spin density n spin and the spin-current density j spin . Therein, V ↑ eff is an effective potential in which the spin up electrons are thought to move (similarly V ↓ eff for spin down electrons), whereas B corr αβ is the β component of the magnetic-field originating from the induced current density along α. The SO-coupling is the leading correction term in the computation of the g-tensor. It is relativistic in origin and therefore becomes much more important for heavy elements. In the current CP2K implementation the SO term is obtained by integrating the induced spin-dependent current densities and the gradient of the effective potential over the simulation cell. The effective one-electron operator replaces the computationally demanding two-electrons integrals [138]. A detailed discussion on the impact of the various relativistic and SO approximations, which are implemented in the various codes, is provided by Van Yperen-De Deyne et al. [139].
In the GAPW representation, the induced current density is decomposed with the same scheme applied for the electron density distinguishing among the soft contribution to the total current density, and the local hard and local soft contributions, i.e.
In the linear response approach, the perturbation Hamiltonian at the first-order in the field strength is where p is the momentum operator and A is the vector potential representing the field B. Thus, with the cyclic variable d(r) being the gauge origin. The induced current density is calculated as the sum of orbital contributions j i and can be separated in a diamagnetic term . Both contributions individually are gauge dependent, whereas the total current is gauge-independent. The position operator appears in the definition of the perturbation operators and of the current density. In order to be able to deal with periodic systems, where the multiplicative position operator is not a valid operator, first we perform a unitary transformation of the ground state orbitals to obtain their maximally localised Wannier functions (MLWFs) counterpart [140]. Hence, we use the alternative definition of the position operator, which is unique for each localised orbital and showing a sawtooth-shaped profile centered at the orbital's Wannier center [141].
Since we work with real ground state MOs, in the unperturbed state, the current density vanishes. Moreover, the first-order perturbation wavefunction is purely imaginary, which results in a vanishing first-order change in the electronic density n (1) . This significantly simplifies the perturbation energy functional, since the second-order energy kernel can be skipped. The system of linear equations to determine the matrix of the expansion coefficients of the linear response orbitals C (1) reads as where i and j are the orbital indexes, ν and µ are basis set function indexes, and S νµ is an element of the overlap matrix. The optional subindex (j), labelling the matrix element of the perturbation operator, indicates that the perturbation might be orbital dependent. In our CP2K implementation [38], the formalism proposed by Sebastiani et al. is employed [141], i.e. we split the perturbation in three different types of operators, which are L = (r − d j ) × p, the orbital angular momentum operator p, the momentum operator and ∆ i = (d i − d j ) × p, the full correction operator. The vector d j is the Wannier center associated with the unperturbed j-th orbital, thus making L and ∆ i dependent on the unperturbed orbital to which they are applied. By using the Wannier center as relative origin in the definition of L, we introduce an individual reference system, which is then corrected by the ∆ i . As a consequence, the response orbitals are given by nine sets of expansion coefficients, as for each operator all three Cartesian components need to be individually considered. The evaluation of the orbital angular momen-tum contributions and of the momentum contributions can be done at the computational cost of just one total energy calculation. The full correction term, instead, requires one such calculation for each electronic state. This term does not vanish unless all d i are equal. However, in most circumstances this correction is expected to be small, since it describes the reaction of the state i to the perturbation of state j, which becomes negligible when the two states are far away. Once all contributions have been calculated, the x-component of the current density induced by the magnetic-field along y is where the first term is the paramagnetic contribution and the second the diamagnetic one. The convergence of the magnetic properties with respect to Gaussian basis set size is strongly dependent on the choice of the gauge. The available options in CP2K are the individual gauge for atoms in molecules (IGAIM) approach introduced by Keith and Bader [142], and the continuous set of gauge transformation (CSGT) approach [143]. The diamagnetic part of the current density vanishes identically when the CSGT approach is used, i.e. d(r = r). Yet, this advantage is weakened by the rich basis set required to obtain an accurate description of the current density close to the nuclei, which typically affects the accuracy within the NMR chemical shift. In the IGAIM approach, however, the gauge is taken at the closer nuclear center. Large-scale computations of NMR chemical shifts for extended paramagnetic solids are reported by Mondal et al. [39]. They show that the contact, pseudocontact, and orbital-shift contributions to paramagnetic NMR can be calculated by combining hyperfine couplings obtained with hybrid functionals with g-tensors and orbital shieldings computed using gradient-corrected functionals.

VI. TIME-DEPENDENT DENSITY FUNCTIONAL THEORY
The dynamics and properties of many-body systems in the presence of time-dependent potentials, such as electric or magnetic fields, can be investigated via TD-DFT.
A. Linear-response time-dependent density functional theory Linear-response TD-DFT (LR-TDDFT) [144] is an inexpensive correlated method to compute vertical transition energies and oscillator strengths between the ground and singly-excited electronic states. Optical properties are computed as a linear-response of the system to a perturbation caused by an applied weak electro-magnetic field.
where ω is a transition energy, X a response eigenvector, and A a response operator. In addition, the adiabatic approximation postulates independence of the employed XC functional on time and leads to the following form for the matrix elements of the response operator [149]: In the above equation, the indices i and j stand for occupied spin-orbitals, whereas a and b indicated virtual spin-orbitals, and σ and τ refers to specific spin components. The terms (i σ a σ |j τ b τ ) and (i σ a σ |f xc;στ |j τ b τ ) are standard electron repulsion and XC integrals over KS orbital functions {φ(r)} with corresponding KS orbital energies , hence Here, the XC kernel f xc;στ (r, r ) is simply the second functional derivative of the XC functional E xc over the ground state electron density n (0) (r) [150], hence To solve the Eq. (65), the current implementation uses a block Davidson iterative method [151]. This scheme is flexible enough and allows to tune performance of the algorithm. In particular, it supports hybrid exchange functionals along with many acceleration techniques (see section III), such as integral screening [42], truncated Coulomb operator [43], and ADMM [44]. Whereas in most cases the same XC functional is used to compute both the ground state electron density and the XC kernel, separate functionals are also supported. This can be used, for instance, to apply a long-term correction to the truncated Coulomb operator during the LR-TDDFT stage [43], when such correction has been omitted during the reference ground state calculation. The action of the response operator on the trial vector X for a number of excited states may also be computed simultaneously. This improves load-balancing and reduces communication costs allowing a larger number of CPU cores to be effectively utilized.

Applications
The favorable scaling and performance of the LR-TDDFPT code has been exploited to calculate the excitation energies of various systems with 1D, 2D and 3D periodicities, auch as cationic defects in aluminosilicate imogolite nanotubes [152], as well as surface and bulk canonical vacancy defects in MgO and HfO 2 [145]. Throughout, the dependence of results on the fraction of Hartree-Fock exchange was explored and the accuracy of the ADMM approximation verified. The performance was found to be comparable to ground state calculations for systems of ≈ 1000 atoms, which were shown to be sufficient to converge localized transitions from isolated defects, within these medium to wide band gap materials.

B. Real-time time-dependent density functional theory and Ehrenfest dynamics
Alternatively to perturbation based methods, real-time propagation-based TDDFT is also available in CP2K. The real-time TDDFT formalism allows to investigate non-linear effects and can be used to gain direct insights into the dynamics of processes driven by the electron dynamics. For systems in which the coupling between of electronic and nuclear motion is of importance, CP2K provides the option to propagate cores and electrons simultaneously using the Ehrenfest scheme. Both methods are implemented in a cubic-and linear scaling form. The cubic scaling implementation is based on MO coefficients (MO-RTP), whereas the linear scaling version acts on the density matrix (P-RTP). While the derivation of the required equations for origin independent basis functions is rather straightforward, additional terms arise for atom centred basis sets [153]. The time-evolution of the MO coefficients in a nonorthonormal Gaussian basis reads aṡ whereas the corresponding nuclear equations of motion is given by (70) Therein, S and H are the overlap and KS matrices, whereas M I and R I are the position and mass of ion I and U(R,t) is the potential energy of the ionion interaction. The terms involving these variables represent the basis set independent part of the equations of motion. The additional terms containing matrices B and D are arising as a consequence of the origin dependence and are defined as follows: with . For the P-RTP approach, the exponential is applied from both sides, which allows the expansion into a series similar to the Baker-Campbell-Hausdorff expansion. This expansion is rapidly converging and in theory only requires sparse matrix-matrix multiplications. Unfortunately, pure linear scaling Ehrenfest dynamics seems to be impossible due to the non-exponential decay of the density matrix during such simulations [154]. This leads to a densification of the involved matrices and eventually cubic scaling with system size. However, the linear scaling version can be coupled with subsystem DFT to achieve true linear scaling behaviour. In subsystem DFT, as implemented in CP2K, follows the approach of Kim and Gordon [155,156]. In this approach, the different subsystems are minimised independently with the XC functional of choice. The coupling between the different subsystem is added via a correction term using a kinetic energy functional: where ρ is the electron density of the full system and ρ i the one of subsystem i. Using an orbital-free density functional to compute the interaction energy does not affect the structure of the overlap, which is block diagonal in this approach. If P-RTP is applied to the Kim-Gordon density matrix, the block diagonal structure is preserved and linear scaling with the number of subsystems is achieved.

VII. DIAGONALIZATION-BASED AND LOW-SCALING EIGENSOLVER
After the initial Hamiltonian matrix for the selected method has been built by CP2K, like the KS matrix K in case of a DFT-based method using the Quickstep module, the calculation of the total (ground state) energy for the given atomic configuration is the next task. This requires an iterative self-consistent field (SCF) procedure, as the Hamiltonian depends usually on the electronic density. In each SCF step, the eigenvalues and at least the eigenvectors of the occupied MOs have to be calculated. Various eigensolver schemes are provided by CP2K for that task: • Traditional diagonalization (TD) • Pseudo diagonalization (PD) [157] • Orbital Transformation (OT) method [158] • Purification methods [159] The latter method, OT, is the method of choice concerning computational efficiency and scalability for growing system sizes. However, OT requires fixed integer MO occupations. Therefore, it is not applicable for systems with a very small or no band gap, like metallic systems, which need fractional ("smeared") MO occupations. Also, very large systems for which even the scaling of the OT method becomes unfavorable, one has to resort to linearscaling methods (see section VII E). By contrast, TD can also be applied with fractional MO occupations, but its cubic-scaling (N 3 ) limits the accessible system sizes. In that respect, PD may provide significant speedups (factor of 2 or more) once a pre-converged solution has been obtained with TD. In the following, we will restrict the description of the implemented eigensolver methods to spin-restricted systems, since the generalization to spin-unrestricted, i.e. spinpolarized systems is straightforward and CP2K can deal with both types of systems using each of these methods.

A. Traditional diagonalization
The TD scheme in CP2K employs an eigensolver either from the parallel program library ScaLAPACK (Scalable Linear Algebra PACKage) [160], or the ELPA library (Eigenvalue soLvers for Petascale Applications) [161,162] to solve the general eigenvalue problem where K is the KS and S is the overlap matrix. The eigenvectors C represent the MO coefficients, and the are the corresponding MO eigenvalues. Unlike to PW codes, the overlap matrix S is not the unit matrix, since CP2K/Quickstep employs atom-centered basis sets of non-orthogonal Gaussian-type functions (see section II C). Thus, the eigenvalue problem is transformed to its special form using a Cholesky decomposition of the overlap matrix as the default method for that purpose. Now, Ref. 75c can simply be solved by a diagonalization of K . The MO coefficient matrix C in the non-orthogonal basis are finally obtained by the back-transformation Alternatively, a symmetric Löwdin orthogonalization instead of a Cholesky decomposition can be applied [163], i.e.
On the one hand, the calculation of S 1/2 as implemented in CP2K involves, however, a diagonalization of S which is computationally more expensive than a Cholesky decomposition. On the other hand, however, linear dependencies in the basis set introduced by small Gaussian function exponents can be detected and eliminated when S is diagonalized. Eigenvalues of S smaller than 10 −5 usually indicate significant linear dependencies in the basis set and a filtering of the corresponding eigenvectors can help to ameliorate numerical difficulties during the SCF iteration procedure. For small systems, the choice of the orthogonalisation has no crucial impact on the performance since it has to be performed only once for each configuration during the initialization of the SCF run.
Only the occupied MOs are required for the built-up of the density matrix P in the AO basis This saves not only memory, but also computational time since the orthonormalization of the eigenvectors is a timeconsuming step. Usually only 10-20 % of the orbitals are occupied when standard Gaussian basis sets are employed with CP2K.

TD/DIIS
The TD scheme is combined with methods to improve the convergence of the SCF iteration procedure. The most efficient SCF convergence acceleration is achieved by the direct inversion in the iterative sub-space (DIIS) [164,165] exploiting the commutator relation between the KS and the density matrix, where the error matrix e is zero for the converged density. The DIIS method can be very efficient in the number of iterations required to reach convergence starting from a sufficiently pre-converged density, which is significant if the cost of constructing the Hamiltonian matrix is larger than the cost of diagonalization.

TD/Broyden and Kerker mixing
Yet, the DIIS method has frequently problems to converge for instance metallic systems because of an imbalance between the short-and long-range charge redistribution of consecutive SCF steps. Am effect that is also called "charge sloshing".

B. Pseudo diagonalization
Alternatively to TD, a pseudo diagonalization can be applied as soon as a sufficiently pre-converged wavefunction is obtained [157]. The KS matrix K AO in the AO basis is transformed into the MO basis in each SCF step via using the MO coefficients C from the preceding SCF step. The converged K MO matrix using TD is a diagonal matrix and the eigenvalues are its diagonal elements. Already after a few SCF iteration steps, the K MO matrix becomes diagonally dominant. Moreover, the K MO matrix shows the following natural blocking oo ou uo uu (82) due to the two MO sub-sets of C, namely the occupied (o) and the unoccupied (u) MOs. The eigenvectors are used during the SCF iteration to calculate the new density matrix (see Eq. 79), whereas the eigenvalues are not needed. The total energy only depends on the occupied MOs and thus a block diagonalization, which decouples the occupied and unoccupied MOs allows to converge the wavefunctions. As a consequence, only all elements of the block ou or uo have to become zero, since K MO is a symmetric matrix. Hence, the transformation into the MO basis has only to be performed for that matrix block. Then the decoupling can be achieved iteratively by consecutive 2 × 2 Jacobi rotations, i.e.
where the angle of rotation θ is determined by the difference of the eigenvalues of the MOs p and q divided by the corresponding matrix element K MO pq in the ou or uo block. The Jacobi rotations are computationally cheap as they can be performed with BLAS level 1 routines (DSCAL and DAXPY). The oo block is significantly smaller than the uu block, since only 10-20% of the MOs are occupied using a standard basis set. Consequently, the ou or uo block also includes only 10-20% of the K MO matrix. Furthermore, an expensive re-orthonormalization of the MO eigenvectors is not needed, since the Jacobi rotations preserve the orthonormality of the MO eigenvectors. Moreover, the number of non-zero blocks decreases rapidly when convergence is approached which results in a decreasing compute time for the PD part.

C. Orbital transformations
An alternative to the just described diagonalizationbased schemes are algorithms that relies on a direct minimization of the electronic energy functional [158,[166][167][168][169][170][171]. Convergence of this approach can in principle be guar-anteed if the energy can be reduced in each step. The direct minimization approach is thus more robust. It also replaces the diagonalization step by having fewer matrix-matrix multiplications, significantly reducing the time-to-solution. This is of great importance for many practical problems, in particular large systems that are difficult or sometimes even impossible to tackle with DIISlike methods. However, preconditioners are often used in conjunction with direct energy minimization algorithms to facilitate faster and smoother convergence.
The calculation of the total energy within electronic structure theory can be formulated variationally in terms of an energy functional of the occupied single-particle orbitals that are constrained with respect to an orthogonality condition. With M orbitals, C ∈ R N ×M is given in a nonorthogonal basis consisting of N basis functions The corresponding constrained minimization problem is expressed as where E[C] is an energy functional, C * is the minimizer of E[C] that fulfills the condition of orthogonality where P = CC T is the density matrix, whereas h, J and K are the core Hamiltonian, the Coulomb, and Hartree-Fock exact exchange matrices, respectively, and E XC [P] is the XC energy. Enforcing the orthogonality constraints within an efficient scheme poses a major hurdle in the direct minimization of E[C]. Hence, in the following, we describe two different approaches implemented in CP2K [158,171].

Orthogonality constraints
a. Orbital transformation functions: OT/Diag and OT/Taylor To impose the orthogonality constraints on the orbitals, VandeVondele and Hutter reformulated the non-linear constraint on C (see Eq. 85) by a linear constraint on the auxiliary variable X via and where X ∈ R N ×M , and C 0 is a set of initial orbitals that fulfill the orthogonality constraints C T 0 SC 0 = 1 [158]. The OT is parametrized as follows: where U = X T SX 1/2 . This parametrization ensures that C T XSCX = 1, for all X satisfying the constraints X T SC 0 = 0. The matrix functions cos U and U −1 sin U are evaluated either directly by diagonalization, or by a truncated Taylor expansion in X T SX [172]. b. Orbital transformation based on a refinement expansion: OT/IR In this method, Weber et al. replaced the constrained functional by an equivalent unconstrained functional (C → f (Z)) [171]. The transformed minimization problem in Eq. 85 is then given by and where Z ∈ R N ×M . The constraints have been mapped onto the matrix function f (Z), which fulfills the orthogonality constraint f T (Z)Sf (Z) = 1 for all matrices Z. The main idea of this approach is to approximate the OT in Eq. 90 by f n (Z) ≈ f (Z), where f n (Z) is an approximate constraint function, which is correct up to some order The functions derived by Niklasson for the iterative refinement of an approximate inverse matrix factorization are used to approximate f (Z) [173]. The first few refinement functions are given by where Y = Z T SZ and Z = Z 0 + δZ. It can be shown that Using this general ansatz for f n (Z), it is possible to extend the accuracy to any finite order recursively by an iterative refinement expansion f n (· · · f n (Z) · · · ).

Minimizer
a. Direct inversion of the iterative subspace The DIIS method introduced by Pulay is an extrapolation technique based on minimizing the norm of a linear combination of gradient vectors [165]. The problem is given by where g i is an error vector and c * the optimal coefficients. The next orbital guess x m+1 is obtained by linear combination of the previous points using the optimal coefficients c * , i.e.
where τ is an arbitrary step size chosen for the DIIS method. The method simplifies to a steepest descent (SD) for the initial step m = 1. While the DIIS method converges very fast in most of the cases, it is not particularly robust. In CP2K, the DIIS method is modified to switch to SD when a DIIS step brings the solution towards an ascent direction. This safeguard makes DIIS more robust and is possible because the gradient of the energy functional is available. b. Non-linear conjugate gradient minimization Nonlinear CG leads to a robust, efficient and numerically stable energy minimization scheme. In non-linear CG, the residual is set to the negation of the gradient r i = −g i and the search direction is computed by Gram-Schmidt conjugation of the residuals, i.e.
Several choices for updating β i+1 are available and the Polak-Ribière variant with restart is used in CP2K [174,175]: Similar to linear CG, a step length α i is found that minimizes the energy function f (x i + α i d i ) using an approximate line search. The updated position becomes Quickstep, a few different line searches are implemented. The most robust is the golden section line search [176], but the default quadratic interpolation along the search direction suffices in most cases.
Regarding time-to-solution, the minimization performed with the latter quadratic interpolation is in general significantly faster than the golden section line search. Non-linear CG are generally preconditioned by choosing an appropriate preconditioner M that approximate f (see Section VII C 3).
c. Quasi-Newton method Newton's method can also be used to minimize the energy functional. The method is scale invariant, and the zigzag behavior that can be seen in the SD method is not present. The iteration for Newton's method is given by where H is the Hessian matrix. On the one hand, the method exhibits super-linear convergence when the initial guess is close to the solution and β = 1, but, on the other hand, when the initial guess is further away from the solution, Newton's method may diverge. This divergent behavior can be suppressed by the introduction of line search or backtracking. As they require the inverse Hessian of the energy functional, the full Newton's method is in general too time-consuming or difficult to use. Quasi-Newton methods [177], however, are advantageous alternatives to Newton's method when the Hessian is unavailable or is too expensive to compute. In those methods, an approximate Hessian is updated by analyzing successive gradient vectors. A quasi-Newton step is determined by where G k is the approximate inverse Hessian at step k. Different update formulas exist to compute G k [178]. In Quickstep, the Broydens type 2 update is implemented to construct the approximate inverse Hessian with an adaptive scheme for estimating the curvature of the energy functional to increase the performance [179].

Preconditioners
a. Preconditioning the non-linear minimization Gradient based OT methods are guaranteed to converge, but will exhibit slow convergence behavior if not appropriately preconditioned. A good reference for the optimization can be constructed from the generalized eigenvalue problem under the orthogonality constraint of Eq. 85 and its approximate second derivative Therefore, close to convergence, the best preconditioners would be of the form As this form is impractical, requiring a different preconditioner for each orbital, a single positive definite preconditioning matrix P is constructed approximating In CP2K, the closest approximation to this form is the FULL ALL preconditioner. It performs an orbital dependent eigenvalue shift of H. In this way, positive definiteness is ensured with minimal modifications. The downside is that the eigenvalues of H have to be computed at least once using diagonalization and thus scales as O(N 3 ).
To overcome this bottleneck several more approximate, though lower scaling preconditioners have been imple-mented within CP2K. The simplest assume = 1 and H = T, with T being the kinetic energy matrix (FULL KINETIC), or even H = 0 (FULL S INVERSE) as viable approximations. These preconditioners are obviously less sophisticated. However, they are linear-scaling in their construction as S and T are sparse and still lead to accelerated convergence. Hence, these preconditioners are suitable for large-scale simulations. For many systems, the best trade off between quality and cost of the preconditioner is obtained with the FULL SINLGE INVERSE preconditioner. Instead of shifting all orbitals separately, only the occupied eigenvalues are inverted. Thus, making the orbitals closest to the band gap most active in the optimization. The inversion of the spectrum can be done without the need for diagonalization by where δ represents an additional shift depending on the HOMO energy, which ensures positive definiteness of the preconditioner matrix. It is important to note that the construction of this preconditioner matrix can be done with a complexity of O(N M 2 ) in the dense case and is therefore of the same complexity as the rest of the OT algorithm. b. Reduced scaling and approximate preconditioning All of the above mentioned preconditioners still require the inversion of the preconditioning matrix P. In dense matrix algebra this leads to an O(N 3 ) scaling behavior independent of the chosen preconditioner. For large systems, the inversion of P will become the dominant part of the calculation when low-scaling preconditioners are used. As Schiffmann et al. has shown [180], sparse matrix techniques are applicable for the inversion of the low-scaling preconditioners and can be activated using the INVERSE UPDATE option as preconditioner solver. By construction, the preconditioner matrix is symmetric and positive definite. This allows for the use of Hotteling's iterations to compute the inverse of P as Generally, the resulting approximants to the inverse become denser and denser [181]. In CP2K this is dealt with aggressive filtering on P −1 i+1 . Unfortunately, there are limits to the filtering threshold. While the efficiency of the preconditioner is not significantly affected by the loss of accuracy (see section XIV B), the Hotteling iterations may eventually become unstable [182]. Using a special way to determine the initial alpha based on the extremal eigenvalues of PP −1 0 it can be shown that any positive definite matrix can be used as initial guess for the Hotteling iterations [183]. For MD simulations or geometry optimizations this means the previous inverse can be used as initial guess as the changes in P are expected to be small [184]. Therefore, only very few iterations are required until convergence after the initial approximation for the inverse is obtained.

D. Purification methods
Linear-scaling DFT calculations can also be achieved by purifying the KS matrix K directly into the density matrix P without using the orbitals C explicitly [159]. These density matrix-based methods exploit the fact that the KS matrix K, as well as the density matrix P, have by definition the same eigenvectors C, and that a purification maps eigenvalues i of K to eigenvalues f i of P via the Fermi-Dirac-function with the chemical potential µ, the Boltzmann constant k B and electron temperature T . In practice, the purification is computed by an iterative procedure that is constructed to yield a linear-scaling method for sparse KS matrices. By construction, such purifications are usually grandcanonical purifications so that additional measures such as modifications to the algorithms or additional iterations to find the proper value of the chemical potential are necessary to allow for canonical ensembles. In CP2K, the trace-resetting fourth-order method (TRS4 [185]), the trace-conserving second order method (TC2 [186]) and the purification via the sign function (SIGN [106]) are available. Additionally, CP2K implements an interface to the PEXSI (Pole EXpansion and Selected Inversion) library [187][188][189][190], which allows to evaluate selected elements of the density matrix as the Fermi-Dirac-function of the KS matrix via a pole expansion.

E. Sign-Method
The sign function of a matrix can be used as a starting point for the construction of various linear-scaling algorithms [191]. The relation together with iterative methods for the sign function, such as the Newton-Schulz iteration, are used by default in CP2K for linear-scaling matrix inversions and (inverse) square roots of matrices [192,193]. Several orders of iterations are available: the second-order Newton-Schulz iteration, as well as third-order and fifth-order iterations based on higher-order Padé-approximants [183,191,194].
The sign function can also be used for the purification of the Kohn-Sham matrix K into the density matrix P, i.e. via the relation cutoff for matrix elements ǫ filter matrix-sign matrix-sqrt Figure 3. Wall time for the calculation of the computationally dominating matrix-sqrt (blue, boxes) and matrix-sign (red, circles) functions as a function of matrix truncation threshold filter for the STMV virus. The latter contains more than one million atoms and was simulated using the periodic implementation in CP2K of the GFN2-xTB model [195]. The fifth-order sign-function iteration of Eq. 109a and the third-order sqrtfunction iterations have been used. The calculations have been performed with 256 nodes (10240 cpu-cores) of the Noctua system at the Paderborn Center for Parallel Computing (PC 2 ).
Within CP2K, the linear-scaling calculation of the signfunction is implemented up to seventh-order based on Padé-approximants. For example, the fifth-order iteration has the form and is implemented in CP2K with just four matrix multiplications per iteration. After each matrix multiplication, all matrix elements smaller than a threshold filter are flushed to zero to retain sparsity. The scaling of the wall-clock time for the computation of the sign-and sqrtfunctions to simulate the STMV virus in water solution using the GFN2-xTB Hamiltonian is shown in Fig. 3 [154]. The drastic speedup of the calculation, when increasing the threshold filter , immediately suggests the combination of sign-matrix iteration based linear-scaling DFT algorithms with the ideas of approximate computing (AC), as discussed in section XIV B.
Recently, a new iterative scheme for the inverse p-th of a matrix has been developed which also allows to directly evaluate the density matrix via the sign function in Eq. 106 [183]. An arbitrary-order implementation of this scheme is also available in CP2K.  Comparison of the wall time required for the purification using the submatrix method (red circles, with a direct eigendecomposition for the density matrix of the submatrices) and the 2nd order Newton-Schulz sign iteration (blue boxes) for different relative errors in total energy after one SCF iteration. The corresponding reference energy has been computed by setting filter = 10 −16 . All calculations have been performed on a system consisting of 6192 H2O molecules, using KS-DFT together with a SZV basis set, utilizing two compute nodes (80 cpu-cores) of the "Noctua" system at PC 2 .

F. Submatrix Method
In addition to the sign method, the submatrix method presented in Ref. 181 has been implemented in CP2K as an alternative approach to calculate the density matrix P from the KS matrix K. Instead of operating on the entire matrix, calculations are performed on principal submatrices thereof. Each of these submatrices covers a set of atoms that originates from the same block of the KS matrix in the DBCSR-format, as well as those atoms in their direct neighborhood whose basis functions are overlapping. As a result, the submatrices are much smaller than the KS matrix, but relatively dense. For large systems, the size of the submatrices is independent on the overall system size so that a linear-scaling method immediately results from this construction. Purification of the submatrices can be performed either using iterative schemes to compute the sign function (see section VII E), or via a direct eigendecomposition. The submatrix method provides an approximation of the density matrix P, whose quality and computational cost depends on the truncation threshold filter used during the SCF iterations. Fig. 4 compares the accuracy provided and wall time required by the submatrix method with accuracy and wall time of a Newton-Schulz sign iteration.

VIII. LOCALIZED MOLECULAR ORBITALS
Spatially localized molecular orbitals (LMOs), also known as MLWFs in solid state physics and materials science, are widely used to visualize chemical bonding between atoms, help classify bonds and thus understand electronic structure origins of observed properties of atomistic systems (see section X A). Furthermore, localized orbitals are a key igredient in multiple local electronic structure methods that dramatically reduce the computational cost of modeling electronic properties of large atomistic systems. LMOs are also important to many other electronic structure methods that require local states such as XAS spectra modeling or dispersion-corrected XC functionals based on atomic polarizabilities.
A. Localization of orthogonal and non-orthogonal molecular orbitals CP2K offers a variety of localization methods, in which LMOs |j are constructed by finding a unitary transformation A of canonical MOs |i 0 , either occupied or virtual, i.e.
which minimizes the spread of individual orbitals. CP2K uses the localization functional proposed by Resta [133], which was generalized by Berghold et al. to periodic cells of any shape and symmetry: wherer is the position operator in three dimensions, ω K and G K are suitable sets of weights and reciprocal lattice vectors, respectively [196]. The functional in Eq. (111a) can be used for both gas-phase and periodic systems.
In the former case, the functional is equivalent to the Boys-Foster localization [197]. In the latter case, its applicability is limited to the electronic states described within the Γ-point approximation. CP2K also implements the Pipek-Mezey localization functional [198], which can be written in the same form as the Berghold functional above with K referring to atoms, z K i measuring the contribution of orbital i to the Mulliken charge of atom K and B being defined as where |χ µ and |χ µ are atom-centered covariant and contravariant basis set functions [196]. The Pipek-Mezey functional has the advantage of preserving the separation of σ and π bonds and is commonly employed for molecular systems.
In addition to the traditional localization techniques, CP2K offers localization methods that produce nonorthogonal LMOs (NLMOs) [199]. In these methods, matrix A is not restricted to be unitary and the minimized objective function contains two terms: a localization functional Ω L given by Eq. (111a) and a term that penalizes unphysical states with linearly dependent localize orbitals where c P > 0 is the penalty strength and σ is the NLMO overlap matrix The penalty term varies from 0 for orthogonal LMOs to +∞ for linearly dependent NLMOs, making the latter inaccessible in the localization procedure with finite penalty strength c P . The inclusion of the penalty term converts the localization procedure into a straightforward unconstrained optimization problem and produces NLMOs that are noticeably more localized than their conventional orthogonal counterparts ( Figure 5).

B. Linear scaling methods based on localized one-electron orbitals
Linear-scaling, or so-called O(N ), methods described in section VII E exploit the natural locality of the oneelectron density matrix. Unfortunately, the variational optimization of the density matrix is inefficient if calcula-tions require many basis functions per atom. From the computational point of view, the variation of localized one-electron states is preferable to the density matrix optimization because the former requires only the occupied states, reducing the number of variational degrees of freedom significantly, especially in calculations with large basis sets. CP2K contains a variety of orbital-based O(N ) DFT methods briefly reviewed here and in section IX C.
Unlike density matrices, one-electron states tend to delocalize in the process of an unconstrained optimization and their locality must be explicitly enforced to achieve linearscaling. To this end, each occupied orbital is assigned a localization center -an atom (or a molecule) -and a localization radius R c . Then, each orbital is expanded strictly in terms of subsets of localized basis functions centered on the atoms (or molecules) lying within R c from the orbital's center. In CP2K, contracted Gaussian functions are used as the localized basis set.
While the localization constraints are necessary to design orbital-based O(N ) methods, the reduced number of electronic degrees of freedom results in the variationally optimal CLMO energy being always higher than the reference energy of fully delocalized orbitals. From the physical point of view, enforcing orbital locality prohibits the flow of electron density between distant centers and thus switches off the stabilizing donor-acceptor (i.e. covalent) component of interactions between them. It is important to note that the other interactions such as long-range electrostatics, exchange, polarization, and dispersion are retained in the CLMO approximation. Thereby, the accuracy of the CLMO-based calculations depends critically on the chosen localization radii, which should be tuned for each system to obtain the best accuracy-performance compromise. In systems with non-vanishing band gaps, the neglected donor-acceptor interactions are typically short-ranged and CLMOs can accurately represent their electronic structure if R c encompasses the nearest and perhaps next nearest neighbors. On the other hand, the CLMO approach is not expected to be practical for metals and semi-metals because of their intrinsically delocalized electrons.
The methods and algorithms in CP2K are designed to circumvent the known problem of slow variational optimization of CLMOs [204,[217][218][219][220][221], the severity of which rendered early orbital-based O(N ) methods impractical. Two solutions to the convergence problem are described here. The first approach is designed for systems without strong covalent interactions between localization centers such as ionic materials or ensembles of small weaklyinteracting molecules [208,209,220,222]. The second approach is proposed to deal with more challenging cases of strongly bonded atoms such as covalent crystals.
The key idea of the first approach is to optimize CLMOs in a two-stage SCF procedure. In the first stage, R c is set to zero and the CLMOs -they can be called ALMOs in this case -are optimized on their centers. In the second stage, the CLMOs are relaxed to allow delocalization onto the neighbor molecules within their localization radius R c . To achieve a robust optimization in the problematic second stage, the delocalization component of the trial CLMOs must be kept orthogonal to the fixed AL-MOs obtained in the first stage. If the delocalization is particularly weak, the CLMOs in the second stage can be obtained using the simplified Harris functional [223]orbital optimization without updating the Hamiltonian matrix -or non-iterative perturbation theory. The mathematical details of the two-stage approach can be found in Ref. 209. A detailed description of the algorithms is presented in the supplementary material of Ref. [210].
The two-stage SCF approach resolves the convergence problem only if the auxiliary ALMOs resembles the final variationally optimal CLMOs and, therefore, is not practical for systems with noticeable electron delocalizationin other words, covalent bonds -between atoms. The second approach, designed for systems of covalently bonded atoms, utilizes an approximate electronic Hessian that is inexpensive to compute and yet sufficiently accurate to identify and remove the problematic optimization modes. The accuracy and practical utility of this approach relies on the fact that the removed low-curvature modes are associated with the nearly-invariant occupied-occupied orbital mixing, which produce only an insignificant lowering in the total energy.
The robust variational optimization of CLMOs combined with CP2K's fast O(N ) algorithms for the construction of the KS Hamiltonian enabled the implementation of a series of O(N ) orbital-based DFT methods with a low computational overhead. Fig. 6 shows that R c can be tuned to achieve substantial computational savings without compromising the accuracy of the calculations. It also demonstrates that CLMO-based DFT exhibits early-offset linear-scaling behavior even for challenging condensed-phase systems and works extremely well with large basis sets.
The current implementation of the CLMO-based methods is limited to localization centers with closed-shell electronic configurations. The nuclear gradients are available only for the methods that converge the CLMO variational optimization. This excludes CLMO methods based on the Harris functional and perturbation theory.
Section IX C describes how the CLMO methods can be used in AIMD simulations by means of the secondgeneration Car-Parrinello MD (CPMD) method of Kühne and coworkers [3,224,225]. Energy decomposition analysis (EDA) methods that exploit the locality of compact orbitals to understand the nature of intermolecular interactions in terms of physically meaningful components are described in section X A.

C. Polarized atomic orbitals from machine learning
The computational cost of a DFT calculation depends critically on the size and condition number of the employed basis set. Traditional basis sets are atom centered, static, and isotropic. Since molecular systems are never isotropic, it is apparent that isotropic basis sets are suboptimal. The polarized atomic orbitals from machine learning (PAO-ML) scheme provides small adaptive basis sets, which adjust themselves to the local chemical environment [226]. The scheme uses polarized atomic orbitals (PAOs), which are constructed from a larger primary basis function as introduced by Berghold et al. [227]. A PAO basis functionφ µ can be written as a weighted sum of primary basis functions ϕ ν , where µ and ν belong to the same atom:φ The aim of the PAO-ML method is to predict the transformation matrix B for a given chemical environment using machine learning (ML). The analytic nature of ML models allows for the calculation of exact analytic forces, as they are required for AIMD simulations. In order to train such an ML model, a set of relevant atomic motifs with their corresponding optimal PAO basis are needed. This poses an intricate non-linear optimization problem as the total energy must be minimal with respect to the transformation matrix B, and the electronic density, while still obeying the Pauli principle. To this end, the PAO-ML scheme introduced an improved optimization algorithm based on the Li, Nunes, and Vanderbilt formulation for the generation of training data [228].
When constructing the ML model, great care must be taken to ensure the invariance of the predicted PAO basis sets with respect to rotations of the simulation cell, to prevent artificial torque forces. The PAO-ML scheme achieve rotational invariance by employing potentials anchored on neighboring atoms. The strength of individual potential terms is predicted by the ML model. Collectively the potential terms form an auxiliary atomic Hamiltonian, whose eigenvectors are then used to construct the transformation matrix B.
The PAO-ML method has been demonstrated by means of AIMD simulations of liquid water. A minimal basis set yielded structural properties in fair agreement with basis set converged results. In the best case, the computational cost were reduced by a factor of 200 and the required flops by 4 orders of magnitude. Already a very small training set gave satisfactory results as the variational nature of the method provides robustness.

IX. AB-INITIO MOLECULAR DYNAMICS
The mathematical task of AIMD is to evaluate the expectation value O of an arbitrary operator O(R, P ) with respect to the Boltzmann distribution where R and P are the nuclear positions and momenta, while β = 1/k B T is the inverse temperature. The total energy function where the first term denotes the nuclear kinetic energy, E[{ψ i }; R] the potential energy function, N the number of nuclei and M I the corresponding masses.
However, assuming the ergodicity hypothesis, the thermal average O can not only be determined as the ensemble average, but also as a temporal average by means of AIMD.
In the following, we will assume that the potential energy function is calculated on-the-fly using KS-DFT, In CP2K, AIMD comes in two distinct flavors, which are both outlined in this section.

A. Born-Oppenheimer molecular dynamics
In Born-Oppenheimer MD (BOMD) the potential energy E [{ψ i }; R] is minimized at every AIMD step with respect to {ψ i (r)} under the holonomic orthonormality constraint ψ i (r)|ψ j (r) = δ ij . This leads to the following Lagrangian: where Λ is a Hermitian Lagrangian multiplier matrix. By solving the corresponding Euler-Lagrange equations d dt one obtains the associated equations of motion The first term on the right-hand side (RHS) of Eq. (121a) is the so-called Hellmann-Feynman force [229,230]. The second term that is denoted as Pulay [18], or wavefunction force F WF , is a constraint force due to the holonomic orthonormality constraint, and is nonvanishing if, and only if, the basis functions φ j explicitly depend on R. The final term stems from the fact that, independently of the particular basis set used, there is always an implicit dependence on the atomic positions. The factor 2 in Eq. (121a) stems from the assumption that the KS orbitals are real, an inessential simplification. Nevertheless, the whole term vanishes whenever ψ i (R) is an eigenfunction of the Hamiltonian within the subspace spanned by the not necessarily complete basis set [231,232]. Note that this is a much weaker condition than the original Hellmann-Feynman theorem, of which we hence have not availed ourselves throughout the derivation, except as an eponym for the first RHS term of Eq. (121a). However, as the KS functional is nonlinear, eigenfunctions of its HamiltonianĤ e are only obtained at exact selfconsistency, which is why the last term of Eq. (121a) is also referred to as non-self-consistent force F NSC [233]. Unfortunately, this can not be assumed in any numerical calculation and results in immanent inconsistent forces as well as the inequality of Eq. (121b). Neglecting either F WF or F NSC , i.e. applying the Hellmann-Feynman theorem to a non-eigenfunction, leads merely to a perturbative estimate of the generalized forces which, contrary to the energies, depends just linearly on the error in the electronic charge density. That is why it is much more exacting to calculate accurate forces rather than total energies.

B. Second-generation Car-Parrinello molecular dynamics
Until recently, AIMD has mostly relied on two general methods: The original CPMD and the direct BOMD ap-proach, each with its advantages and shortcomings. In BOMD, the total energy of a system, as determined by an arbitrary electronic structure method, is fully minimized in each MD time step, which renders this scheme computationally very demanding [234]. By contrast, the original CPMD technique obviates the rather time-consuming iterative energy minimization by considering the electronic degrees of freedom as classical time-dependent fields with a fictitious mass µ that are propagated by the use of Newtonian dynamics [235]. In order to keep the electronic and nuclear subsystems adiabatically separated, which causes the electrons to follow the nuclei very close to their instantaneous electronic ground state, µ has to be chosen sufficiently small [236]. However, in CPMD the maximum permissible integration time step scales like ∼ √ µ and therefore has to be significantly smaller than that of BOMD, hence limiting the attainable simulation timescales [237]. The so-called second-generation CPMD method combines the best of both approaches by retaining the large integration time steps of BOMD, while, at the same time, preserving the efficiency of CPMD [3,224,225]. In this Car-Parrinello-like approach to BOMD, the original fictitious Newtonian dynamics of CPMD for the electronic degrees of freedom is substituted by an improved coupled electron-ion dynamics that keeps the electrons very close to the instantaneous ground-state without the necessity of an additional fictitious mass parameter. The superior efficiency of this method originates from the fact that only one preconditioned gradient computation is necessary per AIMD step. In fact, a gain in efficiency between one to two orders of magnitude has been observed for a large variety of different systems ranging from molecules and solids [238][239][240][241][242][243][244][245], including phase-change materials [246][247][248][249][250][251][252], over aqueous systems [253][254][255][256][257][258][259][260][261], to complex interfaces [262][263][264][265][266][267][268].
Within mean-field electronic structure methods, such as Hartree-Fock and KS-DFT, E [{ψ i }; R] is either a functional of the electronic wavefunction that is described by a set of occupied MOs |ψ i or, equivalently, of the corresponding one-particle density operator ρ = i |ψ i ψ i |. The improved coupled electron-ion dynamics of secondgeneration CPMD obeys the following equations of motion for the nuclear and electronic degrees of freedom: The former is the conventional nuclear equation of motion of BOMD consisting of Hellmann-Feynman, Pulay and non-self-consistent force contributions [18,229,230,233], whereas the latter constitutes an universal oscillator equation as obtained by a nondimensionalization. The first term on the RHS in Eq. 123b can be sought of as an "electronic force" to propagate |ψ i in dimensionless time τ . The second term is an additional damping term to relax more quickly to the instantaneous electronic ground-state, where γ is an appropriate damping coefficient and ω the undamped angular frequency of the universal oscillator. The final term derives from the constraint to fulfill the holonomic orthonormality condition ψ i |ψ j = δ ij , by using the Hermitian Lagrangian multiplier matrix Λ. As can be seen in Eq. 123b, not even a single diagonalization step, but just one "electronic force" calculation is required. In other words, the second-generation CPMD method not only entirely bypasses the necessity of a SCF cycle, but also the alternative iterative wavefunction optimization.
However, contrary to the evolution of the nuclei, for the short-term integration of the electronic degrees of freedom accuracy is crucial, which is why a highly accurate yet efficient propagation scheme is essential. As a consequence, the evolution of the MOs is conducted by extending the always-stable predictor-corrector integrator of Kolafa to the electronic structure problem [269]. But, since this scheme was originally devised to deal with classical polarization, special attention must be paid to the fact that the holonomic orthonormality constraint, which is due to the fermionic nature of electrons that forces the wavefunction to be antisymmetric in order to comply with the Pauli exclusion principle, is always satisfied during the dynamics. For that purpose, first the predicted MOs at time t n are computed based on the electronic degrees of freedom from the K previous AIMD steps: (124) This is to say that the predicted one-electron density operator ρ p (t n ) is used as a projector onto to occupied subspace |ψ i (t n−1 ) of the previous AIMD step. In this way, we take advantage of the fact that ρ p (t n ) evolves much more smoothly than |ψ p i (t n ) and is therefore easier to predict. This is particularly true for metallic systems, where many states crowd the Fermi level. Yet, to minimize the error and to further reduce the deviation from the instantaneous electronic ground-state, |ψ p i (t n ) is corrected by performing a single minimization step |δψ p i (t n ) along the preconditioned electronic gradient direction, as computed by the orthonormality conserving orbital OT method described in section VII C [158]. Therefore, the modified corrector can be written as: The eventual predictor-corrector scheme leads to an electron dynamics that is rather accurate and time-reversible up to O(∆t 2K−2 ), where ∆t is the discretized integration time step, while ω is chosen so as to guarantee a stable relaxation towards the minimum. In other words, the efficiency of the present second-generation CPMD method stems from the fact that the instantaneous electronic ground state is very closely approached within just one electronic gradient calculation. We thus totally avoid the SCF cycle and any expensive diagonalization steps, while remaining very close to the BO surface and, at the same time, ∆t can be chosen to be as large as in BOMD. However, in spite of the close proximity of the propagated MOs to the instantaneous electronic ground state, the nuclear dynamics is slightly dissipative, most likely because the employed predictor-corrector scheme is not symplectic. The validity of this assumption has been independently verified by various numerical studies of others [2,[270][271][272]. Nevertheless, presuming that the energy is exponentially decaying, which had been shown to be an excellent assumption [3,224,225], it is possible to rigorously correct for the dissipation by modeling the nuclear forces of second-generation CPMD F CP where F BO I are the exact, but inherently unknown BO forces and γ D is an intrinsic, yet to be determined friction coefficient to mimic the dissipation. The presence of damping immediately suggests a canonical sampling of the Boltzmann distribution by means of the following modified Langevin-type equation: where Ξ D I is an additive white noise term, which must obey the fluctuation-dissipation theorem Ξ D I (0)Ξ D I (t) = 2γ D M I k B T δ(t) in order to guarantee an accurate sampling of the Boltzmann distribution.
This is to say that if one knew the unknown value of γ D it would nevertheless be possible to ensure an exact canonical sampling of the Boltzmann distribution in spite of the dissipation. Fortunately, the inherent value of γ D does not need to be known a priori, but can be bootstrapped so as to generate the correct average temperature [3,224,225], as measured by the equipartition theorem More precisely, in order to determine the hitherto unknown value of γ D , we perform a preliminary simulation in which we vary γ D on-the-fly using a Berendsen-like algorithm until Eq. 128 is eventually satisfied [253]. Alternatively, γ D can also be continously adjusted automatically using the adaptive Langevin technique of Leimkuhler and coworkers [273][274][275]. In this method, the friction coefficient γ D of Eq. 127 is reinterpreted as a dynamical variable, defined by a negative feedback loop control law as in the Nosé-Hoover scheme [276,277]. The corresponding dynamical equation for γ D reads aṡ where K is the kinetic energy, n is the number of degrees of freedom and Q = k B T τ 2 N H is the Nose-Hoover fictitious mass with time constant τ N H .
C. Low-cost linear-scaling ab-initio molecular dynamics based on compact localized molecular orbitals The computational complexity of CLMO DFT described in section VIII B grows linearly with the number of molecules, while its overhead cost remains very low because of the small number of electronic descriptors and efficient optimization algorithms (see Fig. 6). These features make CLMO DFT a promising method for accurate AIMD simulations of large molecular systems.
The difficulty of adopting CLMO DFT for dynamical simulations arises from the nonvariational character of compact orbitals. While CLMOs can be reliably optimized using the two-stage SCF procedure described in section VIII B, the occupied subspace defined in the first stage must remain fixed during the second stage to achieve convergence. In addition, electron delocalization effects can suddenly become inactive in the course of a dynamical simulation when a neighboring molecule crosses the localization threshold R c . Furthermore, the variational optimization of orbitals is in practice inherently not perfect and terminated once the norm of the electronic gradient drops below a small, but nevertheless finite convergence threshold SCF . These errors accumulate in long AIMD trajectories leading to non-physical sampling and/or numerical problems.
Traditional strategies to cope with these problems are computationally expensive and include increasing R c , decreasing SCF and computing the nonvariational contribution to the forces via a coupled-perturbed procedure [278,279]. CP2K implements another approach that uses the CLMO state obtained after the two-stage CLMO SCF procedure to compute only the inexpensive Hellmann-Feynman and Pulay components and neglects the computationally intense nonvariational component of the nuclear forces. To compensate for the missing component, CP2K uses a modified Langevin equation of motion that is fine-tuned to retain stable dynamics even with imperfect forces. This approach is known as the second-generation CPMD methodology of Kühne and workers [3,224,225], which is discussed in detail in section IX B. Its combination with CLMO DFT is described in Ref. 210.
An application of CLMO AIMD to liquid water demon- strates that the compensating terms in the modified Langevin equation can be tuned to maintain a stable dynamics and reproduce accurately multiple structural and dynamical properties of water with tight orbital localization (R c = 1.6 vdWR) and SCF as high as 10 −2 a.u. [210].
The corresponding results are shown in Fig. 7. The low computational overhead of CLMO AIMD, afforded by these settings, makes it ideal for routine medium-scale AIMD simulations, while its linear-scaling complexity allows to extend first-principle studies of molecular systems to previously inaccessible length scales. It is important to note that AIMD in CP2K cannot be combined with the CLMO methods based on the Harris functional and perturbative theory (see Section VIII B). A generalization of the CLMO AIMD methodology to systems of strongly interacting atoms (e.g. covalent crystals) is underway.

D. Multiple-time-step integrator
The AIMD-based MTS integrator presented here is based on the reversible reference system propagator algorithm (r-RESPA), which was developed for classical molecular dynamics (MD) simulations [280]. Using a carefully constructed integration scheme, the time evolution is reversible, and the MD simulation remains accurate and energy conserving. In AIMD-MTS, the difference in computational cost between a hybrid and a local XC functional is exploited, by performing a hybrid calculation only after several conventional DFT integration timesteps. r-RESPA is derived from the Liouville operator representation of Hamilton mechanics: where L is the Liouville operator for the system containing f degrees of freedom. The corresponding positions and momenta are denoted as x j and p j , respectively. This operator is then used to create the classical propagator U (t) for the system, which reads as Decomposing the Liouville operator into two parts and applying a 2nd-order Trotter-decomposition to the corresponding propagator yields = e iL1(δt/2) e iL2δt e iL1(δt/2) n + O(δt 3 ), with δt = ∆t/n. For this propagator, several integrator schemes can be derived [281]. The extension for AIMD-MTS is obtained from a decomposition of the force in the Liouville operator into two or more separate forces, i.e.
For that specific case, the propagator can be written as This allows to treat F 1 and F 2 with different time-steps, while the whole propagator still remains time-reversible. The procedure for F 1 and F 2 will be referred to as the inner and the outer loop, respectively. In the AIMD-MTS approach, the forces are split in the following way where F accur are the forces computed by the more accurate method, e.g. hybrid DFT, whereas F approx are forces as obtained from a less accurate method, e.g. by GGA XC functionals. Obviously, the corresponding Liouville operator equals the purely accurate one. The advantage of this splitting is that the magnitude of F 2 is usually much smaller than that of F 1 . To appreciate that, it has to be considered how closely geometries and frequencies obtained by hybrid DFT normally match the ones obtained by a local XC functional, in particular for stiff degrees of freedom. The difference of the corresponding Hessians is therefore small and low-frequent. However, high-frequency parts are not removed analytically, thus the theoretical upper limit for the time-step of the outer loop remains half the period of the fastest vibration [282]. The gain originates from an increased accuracy and stability for larger time-steps in the outer loop integration. Even using an outer loop time-step close to the theoretical limit, a stable and accurate AIMD is obtained. Additionally, there is no shift to higher frequencies as the (outer loop) time-step is increased, contrary to the single time-step case.

X. ENERGY DECOMPOSITION AND SPECTROSCOPIC ANALYSIS METHODS
Within CP2K, the nature of intermolecular bonding and vibrational spectra can be rationalized by EDA, normal mode analysis (NMA) and mode selective vibrational analysis (MSVA) methods, respectively.

A. Energy decomposition analysis based on compact localized molecular orbitals
Intermolecular bonding is a result of the interplay of electrostatic interactions between permanent charges and multipole moments on molecules, polarization effects, Pauli repulsion, donor-acceptor orbital interactions (also known as covalent, charge-transfer or delocalization interactions) and weak dispersive forces. The goal of EDA is to measure the contribution of these components within the total binding energy, thus gaining deeper insight into physical origins of intermolecular bonds.
To that extent, CP2K contains an EDA method based on ALMOs -molecular orbitals localized entirely on single molecules or ions within a larger system [208,212]. The ALMO EDA separates the total interaction energy (∆E TOT ) into a frozen density (FRZ), polarization (POL) and charge-transfer (CT) terms, i.e.
which is conceptually similar to the Kitaura-Morokuma EDA [283] -one of the first EDA methods. The frozen interactions term is defined as the energy required to bring isolated molecules into the system without any relaxation of their MOs, apart from modifications associated with satisfying the Pauli exclusion principle: where E(R x ) is the energy of isolated molecule x and R x the corresponding density matrix, whereas R FRZ is the density matrix of the system constructed from the unrelaxed MOs of the isolated molecules. The ALMO EDA is also closely related to the block-localized wavefunction EDA [284], because both approaches use the same variational definition of the polarisation term as the energy lowering due to the relaxation of each molecule's ALMOs in the field of all other molecules in the system: The strict locality of ALMOs is utilized to ensure that the relaxation is constrained to include only intramolecular variations. This approach, whose mathematical and algorithmic details have been described by many authors [203,208,211,285,286], gives an upper limit to the true polarisation energy [287]. The remaining portion of the total interaction energy, the CT term, is calculated as the difference in the energy of the relaxed ALMO state and the state of fully delocalized optimized orbitals (R SCF ): A distinctive feature of the ALMO EDA is that the chargetransfer contribution can be separated into contributions associated with forward and back-donation for each pair of molecules, as well as a many-body higher-order (induction) contribution (HO), which is very small for typical intermolecular interactions. Both, the amount of the electron density transferred between a pair of molecules and the corresponding energy lowering can be computed:  The ALMO EDA implementation in CP2K is currently restricted to closed-shell fragments. The efficient O(N ) optimization of ALMOs serves as its underlying computational engine [209]. The ALMO EDA in CP2K can be applied to both gas-phase and condensed matter systems. It has been recently extended to fractionally occupied ALMOs [288], thus enabling the investigation of interactions between metal surfaces and molecular adsorbates. Another unique feature of the implementation in CP2K is the ability to control the spatial range of charge-transfer between molecules using the cutoff radius R c of CLMOs (see Section VIII B). Additionally, the ALMO EDA in combination with CP2K's efficient AIMD engine allows to switch off the CT term in AIMD simulations, thus measuring the CT contribution to dynamical properties of molecular systems [289].
The ALMO EDA has been applied to study intermolecular interactions in a variety of gas-and condensed-phase molecular systems [288,290]. The CP2K implementation of ALMO EDA has been particularly advantageous to understand the nature of hydrogen bonding in bulk liquid water [208,214,216,255,258,259,261,289], ice [291] and water phases confined to low dimensions [292]. Multiple studies have pointed out a deep connection between the donor-acceptor interactions and features of the X-ray absorption [214,293], infrared (IR) [255,256,258,261,289,294,295] (Fig. 8), and nuclear magnetic resonance spectra of liquid water [216]. Extension of ALMO EDA to AIMD simulations have shown that the insignificant covalent component of HB determines many unique properties of liquid water including its structure (Fig. 8), dielectric constant, hydrogen bond lifetime, rate of self-diffusion and viscosity [210,289].
B. Normal mode analysis of infrared spectroscopy IR vibrational spectra can be obtained with CP2K by carrying out a NMA, within the Born-Oppenheimer approximation. The expansion of the potential energy in a Taylor series around the equilibrium geometry gives At the equilibrium geometry, the first derivative is zero by definition, hence in the harmonic approximation only the second derivative matrix (Hessian) has to be calculated. In our implementation, the Hessian is obtained with the three-point central difference approximation, which for a system of M particles corresponds to 6M force evaluations. The Hessian matrix is diagonalised to determine the 3M Q vectors representing a set of decoupled coordinates, which are the normal modes. In mass weighted coordinates, the angular frequencies of the corresponding vibrations are then obtained by the eigenvalues of the second derivative matrix.

C. Mode selective vibrational analysis
For applications in which only few vibrations are of interest, the MSVA presents a possibility to reduce the computational cost significantly [296,297]. Instead of computing the full second derivative matrix, only a subspace is evaluated iteratively using the Davidson algorithm. As in NMA, the starting point of this method is the eigenvalue equation where H (m) is the mass weighted Hessian in Cartesian coordinates. Instead of numerically computing the Hessian elementwise, the action of the Hessian on a predefined or arbitrary collective displacement vector is chosen, where e (m) i are the 3M nuclear basis vectors. The action of the Hessian on d is given in first approxi-mation as (145) The first derivatives with respect to the nuclear positions is simply the force which can be computed analytically. The derivative of the force along components i with respect to d can then be evaluated as a numerical derivative using the three-point central difference approximation to yield the vector σ. The subspace approximation to the Hessian at the i-th iteration of the procedure is a i × i matrix where B and Σ are the collection of the d and Σ vectors up to the actual iteration step. Solving the eigenvalue problem for the small Davidson matrix we obtain approximate eigenvaluesλ i k . From the resulting eigenvectors. The residuum vector where the sum is over all the basis vectors d k , the number of which increases at each new iteration i. The residuum vector is used as displacement vector for the next iteration. The approximation to the the exact eigenvector m at the i-th iteration is This method avoids the evaluation of the complete Hessian, and therefore requires fewer force evaluations. Yet, in the limit of 3M iterations, the exact Hessian is obtained and thus the exact frequencies and normal modes in the limit of the numerical difference approximation.
As this is an iterative procedure (Davidson subspace algorithm), the initial guess is important for convergence. Moreover, there is no guaranty that the created subspace will contain the desired modes in case of a bad initial guess. In CP2K the choice of the target mode can be a single mode, a range of frequencies, or modes localised on preselected atoms. If little is known about the modes of interest (e.g. a frequency range and contributing atoms) an initial guess can be build by a random displacement of atoms. In case, where a single mode is tracked, one can use normal modes obtained from lower quality methods. The residuum vector is calculated with respect to the mode with the eigenvalue closest to the input frequency. The method will only converge to the mode of interest if the initial guess is suitable. With the implemented algorithm always the mode closest to the input frequency is improved. Using an arbitrary initial guess, the mode of interest might not be present in the subspace at the beginning. It is important to note that the Davidson algorithm might converge before the desired mode becomes part of the subspace. Therefore, there is no warranty that the algorithm would always converges to the desired mode. By giving a range of frequencies as initial guess, it might happen that either none or more than one mode is already present in the spectrum. In the first case, the mode closest to the desired range will be tracked. In the second case, always the least converged mode will be improved. If the mode is selected by a list of contributing atoms, at each step the approximations to the eigenvectors of the full Hessian are calculated, and the vector with the largest contributions of the selected atoms is tracked for the next iteration step.
The MSVA scheme can be efficiently parallelized by either distributing the force calculations or using the block Davidson algorithm. In the latter approach, the parallel environment is split into n sub-environments, each consisting out of m processors performing the force calculations. The initial vectors are constructed for all n environments, such that n orthonormal vectors are generated. After each iteration step, the new d and σ vectors are combined into a single D and Σ matrix. These matrices are then used to construct the approximate Hessian, and from this the n modes closest to the selected vectors are again distributed.
The IR intensities are defined as the derivatives of the dipole vector with respect to the normal mode. Therefore, it is necessary to activate the computation of the dipoles together with the NMA. By applying the mode selective approach large condensed matter systems can be addressed. An illustrative example is the study by Schiffmann et al. [298], where the interaction of the N3, N719, and N712 dyes with anatase(101) have been modelled. The vibrational spectra for all low-energy conformations have been computed and used to assist the assignment of the experimental IR spectra, revealing a protonation dependent binding mode and the role of self-assembly in reaching high coverage.

XI. EMBEDDING METHODS
CP2K aims to provide a wide range of potential energy methods, ranging from empirical approaches such as classical force-fields over DFT-based techniques to quantum chemical methods. In addition multiple descriptions can be arbitrarily combined at the input level, so that many combinations of methods are directly available. Examples are schemes that combine two or more potential energy surfaces via (150) linear combinations of potentials as necessary in alchemical free-energy calculations or propagation of the lowest potential energy surfaces However, beside such rather simplistic techniques [299], more sophisticated embedding methods described in the following are also available within CP2K.

A. QM/MM methods
The QM/MM multi-grid implementation in CP2K is based on the use of an additive QM/MM scheme [300][301][302]. The total energy of the molecular system can be partitioned into three disjointed terms: These energy terms depend parametrically on the coordinates of the nuclei in the quantum region (R α ) and on classical atoms (R a ). Hence, E QM is the pure quantum energy, computed using the Quickstep code [5], whereas E M M is the classical energy, described through the use of the internal classical molecular-mechanics (MM) driver called FIST. The latter allows the use of the most common force-fields employed in MM simulations [303,304]. The interaction energy term E QM/M M contains all nonbonded contributions between the QM and the MM subsystems, and in a DFT framework we express it as: where R a is the position of the MM atom a with charge q a , ρ(r, R α ) is the total (electronic plus nuclear) charge density of the quantum system, and v NB (R α , R a ) is the non-bonded interaction between classical atom a and quantum atom α. The electrostatic potential of the MM atoms v a (|r − R a |) is described using for each atomic charge a Gaussian charge distribution (155) with width r c,a , eventually resulting in: This renormalised potential has the desired property of tending to 1/r at large distances and going smoothly to a constant for small r (see Ref. 305 for renormalization details). Due to the Coulomb long-range behavior, the computational cost of the integral in Eq. 154 can be very large. In CP2K, we designed a decomposition of the electrostatic potential in terms of Gaussian functions with different cutoffs combined with a real-space multi-grid framework to accelerate the calculation of the electrostatic interaction. We named this method Gaussian Expansion of the Electrostatic Potential or GEEP [305]. The advantage of this methodology is that grids of different spacing can be used to represent the different contributions of v a (r, R a ), instead of using only the same grid employed for the mapping of the electronic wavefunction. In fact, by writing a function as a sum of terms with compact support and with different cutoffs, the mapping of the function can be efficiently achieved using different grid levels, in principle as many levels as contributing terms, each optimal to describe the corresponding term.

QM/MM for isolated systems
For isolated systems, each MM atom is represented as a continuous Gaussian charge distribution and each GEEP term is mapped on one of the available grid levels, chosen to be the first grid whose cutoff is equal to or bigger than the cutoff of that particular GEEP contribution. However, all atoms contribute to the coarsest grid level through the long-range R low part, which is the smooth GEEP component [305]. The result of this collocation procedure is a multi-grid representation of the QM/MM electrostatic potential V (r, R a ) is sequentially interpolated starting from the coarsest grid level up to the finest level, using real-space interpolator and restrictor operators.
Using the real-space multi-grid operators together with the GEEP expansion, the prefactor in the evaluation of the QM/MM electrostatic potential has been lowered from where N f is the number of grid points on the finest grid and N c is the number of grid points on the coarsest grid. The computational cost of the other operations for evaluating the electrostatic potential, such as the mapping of the Gaussians and the interpolations, becomes negligible in the limit of a large MM system, usually more than 600-800 MM atoms.
Using the fact that grids are commensurate (N f /N c = 2 3(N grid −1) ), and employing for every calculation 4 grid levels, the speed-up factor is around 512 (2 9 ). This means that the present implementation is 2 orders of magnitude faster than the direct analytical evaluation of the potential on the grid.

QM/MM for periodic systems
The effect of the periodic replicas of the MM subsystem is only in the long-range term, and comes entirely from the residual function R low (r, R a ): where L labels the infinite sums over the period replicas.
Performing the same manipulation used in Ewald summation [306], the previous equation can be computed more efficiently in reciprocal space, i.e.
. (158) The termR low (k), representing the Fourier transform of the smooth electrostatic potential, can be evaluated analytically via: The potential in Eq. 158 can be mapped on the coarsest available grid. Once the electrostatic potential of a single MM charge within periodic boundary conditions is derived, the evaluation of the electrostatic potential due to the MM subsystem is easily computed employing the same multi-grid operators (interpolation and restriction) used for isolated systems. The description of the long-range QM/MM interaction with periodic boundary conditions requires the description of the QM/QM periodic interactions, which plays a significant role if the QM subsystem has a net charge different from zero, or a significant dipole moment. Here, we exploit a technique proposed few years ago by Blöchl [307], for decoupling the periodic images and restoring the correct periodicity also for the QM part. A full and comprehensive description of the methods is reported in [308].

Image charge augmented QM/MM
The image charge (IC) augmented QM/MM model in CP2K has been developed for the simulation of molecules adsorbed at metallic interfaces [309]. In the IC-QM/MM scheme, the adsorbates are described by KS-DFT and the metallic substrate is treated at the MM level of theory. The interactions between the QM and MM subsystems are modeled by empirical potentials of, e.g. the Lennard-Jones-type to reproduce dispersion and Pauli repulsion. Polarization effects due to electrostatic screening in metallic conductors are explicitly accounted for by applying the IC formulation.
The charge distribution of the adsorbed molecules generates an electrostatic potential V e (r), which extends into the substrate. If the electrostatic response of the metal is not taken into account in our QM/MM setup, the electrostatic potential has different values at different points r inside the metal slab, as illustrated in Fig. 9. However, the correct physical behavior is that V e (r) induces an electrostatic response such that the potential in the metal is zero or at least constant. Following Ref. 310, we describe the electrostatic response of the metallic conductor by introducing an image charge distribution ρ m (r), which is modeled by a set of Gaussian charges {g a } centered at the metal atoms, so that where R a are the positions of the metal atoms. The unknown expansion coefficients c a are determined selfconsistently imposing the constant-potential condition, i.e. the potential V m (r) generated by ρ m (r) screens V e (r) within the metal, so that V e (r) + V m (r) = V 0 , where V 0 is a constant potential that can be different from zero if an external potential is applied. The modification of the electrostatic potential upon application of the IC correction is shown in Fig. 9. Details on the underlying theory and implementation can be found in Ref. 309. The IC-QM/MM scheme provides a computationally efficient way to describe the MM-based electrostatic interactions between adsorbate and metal in a fully selfconsistent fashion allowing the charge densities of the QM and MM parts to mutually modify each other. Except for the positions of the metal atoms no input parameters are required. The computational overhead compared to a conventional QM/MM scheme is in the current im-plementation negligible. The IC augmentation adds an attractive interaction because ρ m (r) mirrors the charge distribution of the molecule, but has the opposite sign (see Ref 309 for the energy expression). Therefore, the IC scheme strengthens the interactions between molecules and metallic substrates, in particular if adsorbates are polar or even charged. It also partially accounts for the rearrangement of the electronic structure of the molecules when they approach the surface [309].
The IC-QM/MM approach is a good model for systems, where the accurate description of the molecule-molecule interactions is of primary importance and the surface acts merely as a template. For such systems, it is sufficient when the augmented QM/MM model predicts the structure of the first adsorption layer correctly. The IC-QM/MM approach has been applied, for example, to liquid water/Pt (111) interfaces [309,311], organic thin-film growth on Au(100) [312], as well as aqueous solutions of DNA molecules at gold interfaces [313,314]. Instructions how to set up an IC-QM/MM calculation with CP2K can be found in Ref. 315.

Partial atomic charges from electrostatic potential fitting
Electrostatic potential (ESP) fitting is a popular approach to determine a set of partial atomic point charges {q a }, which can then be used in, e.g. classical MD simulations to model electrostatic contributions in the force-field. The fitting procedure is applied such that the charges {q a } optimally reproduce a given potential V QM obtained from a quantum chemical calculation. To avoid unphysical values for the fitted charges, restraints (and constraints) are often set for q a , which are then called RESP charges [316].
The QM potential is typically obtained from a preceding DFT or Hartree-Fock calculation. The difference between V QM (r) and the potential V RESP (r) generated by {q a } is minimized in a least squares fitting procedure at defined grid points r k . The residual R esp that is minimized is where N is the total number of grid points included in the fit. The choice of {r k } is system dependent and choosing {r k } carefully is important to obtain meaningful charges. For isolated molecules or porous periodic structures (e.g. metal-organic frameworks) {r k } are sampled within a given spherical shell around the atoms defined by a minimal and maximal radius. The minimal radius is usually set to values larger than the van der Waals radius to avoid fitting in spatial regions, where the potential varies rapidly, which would result in a destabilization of the fit. As a general guideline, the charges should be fitted in the spatial regions relevant for interatomic interactions. CP2K offers also the possibility to fit RESP charges for slab-like systems. For these systems, it is important to reproduce the potential accurately above the surface where, e.g. adsorption processes take place. The sampling technique is flexible enough to follow a corrugation of the surface, see Ref 317. When calculating RESP charges for periodic systems, periodic boundary conditions have to be employed for the calculation of V RESP . In CP2K, the RESP point charges q a are represented as Gaussian functions. The resulting charge distribution is presented on a regular real-space grid and the GPW formalism is employed to obtain a periodic RESP potential. Details of the implementation can be found in Ref. 317.
CP2K features also a GPW implementation of the REPEAT method [318], which is a modification of the RESP fitting for periodic systems. The residual in Eq. 161 is modified such that the variance of the potentials instead of the absolute difference is fitted: The REPEAT method was originally introduced to account for the arbitrary offset of the electrostatic potential in infinite systems, which depends on the numerical method used. In CP2K, V QM and V RESP are both evaluated with the same method, the GPW approach, and have thus the same offset. However, fitting the variance is an easier task than fitting the absolute difference in the potentials and stabilizes significantly the periodic fit.
Using the REPEAT method is thus recommended for the periodic case, in particular for systems that are not slab-like. For the latter, the potential above the surface changes very smoothly and we find that δ ≈ 0 [317]. The periodic RESP and REPEAT implementation in CP2K has been used to obtain atomic charges for surface systems, such as corrugated hexagonal boron nitride (hBN) monolayers on Rh(111) [317]. These charges were then used in MD QM/MM simulations of liquid water films (QM) on the hBN@Rh(111) substrate (MM). Other applications comprise two-dimensional periodic supramolecular structures [319], metal-organic frameworks [320][321][322], as well as graphane [323]. Detailed instructions how to set up a RESP or REPEAT calculation with CP2K can be found under Ref. 324.
B. Density functional embedding theory

Theory
Quantum embedding theories are multi-level approaches applying different electronic structure methods to subsystems, interacting with each other quantum-mechanically [325]. Density functional embedding theory (DFET) introduces high-order correlation to a chemically relevant subsystem (cluster), whereas the environment and the interaction between the cluster and the environment is described with DFT via the unique local embedding potential v emb (r) [326]. The simplest method to calculate the total energy is in the first-order perturbation theory fashion: where E DF T cluster,emb and E DF T env,emb are DFT energies of the embedded subsystems, whereas E CW cluster,emb is the energy of the embedded cluster at the correlated wavefunction (CW) level of theory. All these entities are computed with an additional one-electron embedding term in the Hamiltonian.
The embedding potential is obtained from the condition that the sum of embedded subsystem densities should reconstruct the DFT density of the total system. This can be achieved by maximizing the Wu-Yang functional with respect to v emb [327]: (165) with the functional derivative to be identical to the density difference δW δV emb = ρ total − ρ cluster − ρ env .

Implementation
The DFET workflow consists of computing the total system with DFT and obtaining v emb by repeating DFT calculations on the subsystems with updated embedding potential until the total DFT density is reconstructed. When the condition is fulfilled, the higher-level embedded calculation is performed on the cluster. The main computational cost comes from the DFT calculations. The cost is reduced by a few factors such as employing an wavefunction extrapolation from the previous iterations via the ASPC integrator [269].
The DFET implementation is available for closed and open electronic shells in unrestricted and restricted openshell variants for the latter. It is restricted to GPW calculations with pseudopotentials describing the core electrons (i.e. full-electron methods are currently not available). Any method implemented within Quickstep is available as a higher-level method, including hybrid DFT, MP2 and RPA. It is possible to perform property calculations on the embedded clusters using an externally provided v emb . The subsystems can employ different basis sets, although they must share the same PW grid.
In our implementation, v emb may be represented on the real-space grid, as well as using a finite Gaussian basis set, the first option being preferable, as it allows a much more accurate total density reconstruction and is computationally cheaper.
C. Implicit solvent techniques AIMD simulations of solutions, biological systems or surfaces in presence of a solvent are often computationally dominated by the calculation of the explicit solvent portion, which may easily amount to 70% of the total number of atoms in a model system. Although the first and second sphere or layer of the solvent molecules around a solute or above a surface might have a direct impact via chemical bonding, for instance via a hydrogen bonding network, the bulk solvent mostly interacts electrostatically as a continuous dielectric medium. This triggered the development of methods treating the bulk solvent implicitly. Such implicit solvent methods have also the further advantage that they provide a statistically averaged effect of the bulk solvent. That is beneficial for AIMD simulations that consider only a relatively small amount of bulk solvent and quite short sampling times. The self-consistent reaction field (SCRF) in a sphere, which goes back to the early work by Onsager [328], is possibly the most simple implicit solvent method implemented in CP2K [329,330].
More recent continuum solvation models like the conductor-like screening model (COSMO) [331], as well as the polarizable continuum model (PCM) [332,333], take also into account the explicit shape of the solute. The solute-solvent interface is defined as the surface of a cavity around the solute. This cavity is constructed by interlocking spheres centered on the atoms or atomic groups composing the solute [334]. This introduces a discontinuity of the dielectric function at the solute-solvent interface and thus causes non-continuous atomic forces, which may impact the convergence of structural relaxations, as well as the energy conservation within AIMD runs. To overcome these problems, Fatteberg and Gygi proposed a smoothed self-consistent dielectric model function of the electronic density ρ elec (r) [335,336]: with the model parameters β and ρ 0 , which fulfills the asymptotic behaviour (r) ≡ ρ elec (r) = 1 large ρ elec (r) The method requires a self-consistent iteration of the polarization charge density spreading across the solutesolvent interface in each SCF iteration step, since the dielectric function depends on the actual electronic density ρ elec (r). This may cause a non-negligible computational overhead depending on the convergence behaviour. More recently, Andreussi et al. proposed in the framework of a revised self-consistent continuum solvation (SCCS) model an improved piecewise defined dielectric model function [337]: with t = t ln ρ elec (r) that employs a more elaborated switching function for the transition from the solute to the solvent region using the smooth function Both models are implemented in CP2K [338,339]. The solvation free energy ∆G sol can be computed by (170) as the sum of the electrostatic contribution ∆G el = G el − G 0 , where G 0 is the energy of the solute in vacuum, the repulsion energy G rep = α S, the dispersion energy G dis = β V , and the cavitation energy G cav = γ S, with adjustable solvent specific parameters α, β and γ [340]. Therein, S and V are the (quantum) surface and volume of the solute cavity, respectively, which are evaluated in CP2K based on the quantum surface and quantum volume a definition that was introduced by Cococcioni et al. with using the smoothed dielectric function either from Eq. 167 or Eq. 168, respectively [337,341]. The thermal motion term G tm and the volume change term P ∆V are often ignored and are not yet available in CP2K.

D. Poisson solvers
The Poisson equation describes how the electrostatic potential V (r), as a contribution to the Hamiltonian, relates to the charge distribution within the system with electron density n(r). In a general form, which incorporates a non-homogeneous position-dependent dielectric constant ε(r), the equation is formulated as − ∇ · (ε(r)∇V (r)) = 4πn(r), (174) endowed with some suitable boundary conditions determined by the physical characteristics or setup of the considered system. Under the assumption that everywhere inside the simulation domain the dielectric constant is equal to that of free space, i.e. 1, and for periodically repeated systems or boundary conditions for which the analytical Green's function of the standard Poisson operator is known, like free, wire or surface boundary conditions, a variety of methods has been proposed to solve Eq. 174. Among these approaches, CP2K implements the conventional PW scheme, Ewald summation based techniques [306,[342][343][344][345], the Martyna-Tuckerman method [12], and a wavelet based Poisson solver [14,15]. Within the context of continuum solvation models [332], where solvent effects are included implicitly in the form of a dielectric medium [335], the code also provides the solver proposed by Andreussi et al. that can solve Eq. (174) subject to periodic boundary conditions and with ε defined as a function of the electronic density [337].
Despite the success of these methods in numerous scenarios (for some recent applications see, e.g., [338,346]), the growing interest in simulating atomistic systems with more complex boundary configurations, e.g. nanoelectronic devices, at an ab-inito level of theory, demands for Poisson solvers with capabilities that exceed those of the existing solvers. To this end, a generalized Poisson solver has been developed and implemented in CP2K with the following key features [347]: • The solver converges exponentially with respect to the density cutoff.
• Periodic or homogeneous von Neumann (zero normal electric field) boundary conditions can be imposed on the boundaries of the simulation cell. The latter models insulating interfaces such as e.g. airsemiconductor and oxide-semiconductor interfaces in a transistor.
• Fixed electrostatic potentials (Dirichlet-type boundary conditions) can be enforced at arbitrarily-shaped regions within the domain. These correspond e.g. to source, drain and gate contacts of a nanoscale device.
• The dielectric constant can be expressed as any sufficiently smooth function.
Therefore, the solver offers advantages associated with the two categories of Poisson solvers, i.e. PW and realspace-based methods.
The imposition of the above-mentioned boundary setups is accomplished by solving an equivalent constrained variational problem that reads as: where Here, V D is the potential applied at the predefined subdomain Ω D that may have an arbitrary geometry, whereas u(V ) satisfies the desired boundary conditions at the boundaries of the cell Ω and µ(λ) are Lagrange multipliers introduced to enforce the constant potential. Note that, the formulation of the problem within Eqs. 175-176 is independent of how ε is defined. Furthermore, consistent ionic forces have been derived and implemented which make the solver applicable to a wider range of applications like, energy-conserving BOMD [348], as well as EMD simulations [349].

XII. DBCSR LIBRARY
DBCSR has been specifically designed to efficiently perform block-sparse and dense matrix operations on distributed multicore CPUs and GPUs systems, covering a range of occupancy between 0.01% up to dense [105,350,351]. The library is written in Fortran and is freely available under the GPL license from https://github. com/cp2k/dbcsr. Operations include sum, dot product, and multiplication of matrices, and the most important operations on single matrices, such as transpose and trace.
The DBCSR library was developed to unify the previous disparate implementations of distributed dense and sparse matrix data structures and the operations among them. The chief performance optimization target is parallel matrix multiplication. Its performance objectives have been to be comparable to ScaLAPACK's PDGEMM for dense or nearly-dense matrices [160], while achieving high performance when multiplying sparse matrices [105].
DBCSR matrices are stored in a blocked compressed sparse row (CSR) format distributed over a twodimensional grid of P message passing interface (MPI) processes. Although the library accepts single and double precision complex and real numbers, it is only optimized for the double precision real type. The sizes of the blocks in the data structure are not arbitrary, but are determined by the properties of the data stored in the matrix, such as the basis sets of the atoms in the molecular system. While this property keeps the blocks in the matrices dense, it often results in block sizes that are suboptimal for computer processing, e.g. blocks of 5 × 5, 5 × 13, and 13 × 13 for the H and O atoms described by a DZVP basis set.
To achieve high performance for the distributed multiplication of two possibly sparse matrices, several techniques have been used. One is to focus on reducing communication costs among MPI ranks. Another is to optimize the performance of multiplying communicated data on a node by introducing a separation of concerns: what will be multiplied vs. performing the multiplications. The CPUs on a node determine what needs to be multiplied, creating batches of work. These batches are handled by hardware-specific drivers, such as CPUs, GPUs, or a combination of both. DBCSR's GPU backend, LIBSMM ACC, supports both NVIDIA and AMD GPUs via CUDA and HIP, respectively. A schema of the library is shown in Fig. 10.

A. Message passing interface parallelization
At the top level, we have the MPI parallelization. The data-layout exchange is implemented with two different algorithms, depending on the sizes of the involved matrices in the multiplications: • for general matrices (any size) we use the Cannon algorithm, where the amount of communicated data by each process scales as O(1/ √ P ) [105,352]; • only for "tall-and-skinny" matrices (one large dimension) we use an optimized algorithm, where the amount of communicated data by each process scales as O(1) [353].
The communications are implemented with asynchronous point-to-point MPI calls. The local multiplication will start as soon as all the data has arrived at the destination process, and it is possible to overlap the local computation with the communication if the network allows that.

B. Local multiplication
The local computation consists of pairwise multiplications of small dense matrix blocks, with dimensions (m×k) for A blocks and (k × n) for B blocks. It employs a cacheoblivious matrix traversal to fix the order in which matrix blocks need to be computed, to improve memory locality (Traversal phase in Fig. 10). First, the algorithm loops over A matrix row-blocks and then, for each row-block, over B matrix column-blocks. Then, the corresponding multiplications are organized in batches (Generation phase in Fig. 10), where each batch consists of maximum 30,000 multiplications. During the Scheduler phase, a static assignment of batches with a given A matrix rowblock to OpenMP threads is employed to avoid data-race conditions. Finally, batches assigned to each thread can be computed in parallel on the CPU and/or executed on a GPU. For the GPU execution, batches are organized in such a way that the transfers between the host and the GPU are minimized. The multiplication kernels take full advantage of the opportunities for coalesced memory operations and asynchronous operations. Moreover, a double-buffering technique, based on CUDA streams and events, is used to maximize the occupancy of the GPU and to hide the data transfer latency [350]. When the GPU is fully loaded, the computation may be simultaneously done on the CPU. Multi-GPU execution on the same node is made possible by distributing the cards to multiple MPI ranks via a round-robin assignment.

C. Batched execution
Processing batches of small-matrix-multiplications (SMMs) has to be highly efficient. For this reason specific libraries were developed that outperform vendor BLAS libraries, namely LIBSMM ACC (previously called LIBCUSMM, which is part of DBCSR) for GPUs [353], as well as LIBXSMM for CPUs [354,355].
In LIBSMM ACC, GPU kernels are just-in-time (JIT) compiled at runtime. This allows to reduce DBCSR's compile time by more than half and its library's size by a factor of 6 compared to generating and compiling kernels ahead-oftime. Fig. 11 illustrates the performance gain that can be observed since the introduction of the JIT framework: because including a new (m,n,k)-kernel to the library incurs no additional compile time, nor does it bloat the library size, all available (m,n,k) batched-multiplications can be run on GPUs, leading to important speedups.
LIBSMM ACC's GPU kernels are parametrized over 7 parameters, affecting the memory usages and patterns of the multiplication algorithms, the amount of work and number of threads per CUDA block, the number of matrix elements computed by each CUDA thread and the tiling sizes. An autotuning framework finds the optimal parameter set for each (m,n,k)-kernel, exploring about 100,000 possible parameter combinations. These parameter combinations result in vastly different performances: for example, the batched multiplication of kernel 5 × 6 × 6 can vary between 10 Gflops and 276 Gflops in performance, with only 2% of the possible parameter combinations yielding performances within 10% of the optimum. Kernels are autotuned for NVIDIA's P100 (1,649 (m,n,k)s) and V100 (1,235 (m,n,k)s), as well as for AMD's Mi50 (10 kernels). This autotuning data is then used to train a machine learning performance prediction model, which finds near-optimal parameter sets for the rest of the 75,000 available (m,n,k)-kernels. In this way, the library can achieve a speedup in the range of 2-4x with respect to batched DGEMM in cuBLAS for {m, n, k} < 32, while the effect becomes less prominent for larger sizes [355]. Performance of LIBSMM ACC and LIBXSMM saturate for {m, n, k} > 80, for which DBCSR directly calls cuBLAS or BLAS respectively. LIBXSMM generates kernels using a machine model, e. g. considering the size of the (vector-)register file. Runtime code generation in LIBXSMM does not depend on the compiler or flags used and can naturally exploit instruction set extensions presented by CPUID features. Raw machine code is emitted JIT with no compilation phase and one-time cost (10k-50k kernels per second depending on system and kernel). Generated code is kept for the runtime of an application (like normal functions), and ready-for-call at a rate of 20M-40M dispatches per second. Both timings include the cost of thread-safety or potentially concurrent requests. Any such overhead is even lower when amortized with homogeneous batches (same kernel on a per batch basis).

D. Outlook
An arithmetic intensity of AI = 0.3-3 FLOPS / Byte (double-precision) is incurred by, e. g.CP2K's needs. This low AI is bound by Stream Triad given that C = C +A * B is based on SMMs rather than scalar values. Since memory bandwidth is precious, lower or mixed precision (including single-precision) can be of interest given emerging support in CPUs and GPUs (see section XIV B).

XIII. INTERFACES TO OTHER PROGRAMS
Beside containing native f77 and f90 interfaces, CP2K also provides a library, which can be used in your own executable [356,357], and comes with a small helper program called CP2K-shell that includes a simple interactive command line interface with a well defined, parseable syntax. In addition, CP2K can be interfaced with external programs such as i-PI [358], PLUMED [359], or PyRETIS [360], and its AIMD trajectories analyzed using TAMkin [361], MD-Tracks [362], and TRAVIS [363], to name just a few.

A. Non-equilibrium Green's function formalism
The non-equilibrium Green's function (NEGF) formalism provides a comprehensive framework for studying the transport mechanism across open nanoscale systems in the ballistic, as well as scattering regimes [364][365][366]. Within the NEGF formalism, all observables of a system can be obtained from the single-particle Green's functions. In the ballistic regime, the main quantity of interest is the retarded Green's function G, which solves the following equation: where E is the energy level of the injected electron, S and H are the overlap and Hamiltonian matrices, respectively, I is the identity matrix and Σ the boundary (retarded) self-energy that incorporates the coupling between the central active region of the device and the semi-infinite leads. A major advantage of the NEGF formalism is that no restrictions are made on the choice of the Hamiltonian that means, empirical, semi-empirical, or ab initio Hamiltonians can be used in Eq. (177). Utilizing a, typically Kohn-Sham, DFT Hamiltonian, the DFT+NEGF method has established itself as a powerful technique due to its capability of modeling charge transport through complex nanostructures without any need for system-specific parameterizations [367,368]. With the goal of designing a DFT+NEGF simulator that can handle systems of unprecedented size, e.g. composed of tens of thousands of atoms, the quantum transport solver OMEN has been integrated within CP2K [369][370][371]. For a given device configuration, CP2K constructs the DFT Hamiltonian and overlap matrices. The matrices are passed to OMEN where the Hamiltonian is modified for open boundaries and charge and current densities are calculated in the NEGF or, the equivalently, wave function/quantum trans-mitting boundary formalisms [372]. Due to the robust and highly efficient algorithms implemented in OMEN and exploiting hybrid computational resources effectively, devices with realistic sizes and structural configurations can now be routinely simulated [373][374][375].
B. SIRIUS: Plane wave density functional theory support CP2K supports computations of the electronic ground state, including forces and stresses [376,377], in a PW basis. The implementation relies on the quantum engine SIRIUS [378].
The SIRIUS library has full support for GPUs and CPUs and brings additional functionalities to CP2K. Collinear and non-collinear magnetic systems with or without spin-orbit coupling can be studied with both pseudopotential PW [379], as well as full-potential linearized augmented PW methods [380,381]. All GGA XC functionals implemented in libxc are available [382]. Hubbard corrections are based on Refs. 383 and 384.
To demonstrate a consistent implementation of the PW energies, forcesand stresses in CP2K + SIRIUS, an AIMD of a Si 7 Ge supercell has been performed in the isothermalisobaric NPT ensemble. To establish the accuracy of the calculations, the ground state energy of the obtained atomic configurations at each time step has been recomputed with Quantum ESPRESSO (QE) using the same cutoff parameters, XC functionals and pseudopotentials [385,386]. In Fig. 12, it is shown that the total energy is conserved up to 10 −6 Ha. It also shows the evolution of the potential energies calculated with CP2K + SIRIUS (blue line) and QE (red dashed line). The two curves are shifted by a constant offset, which can be attributed to differences in various numerical schemes employed. After removal of this constant difference, the agreement between the two codes is of the order of 10 −6 Ha.

XIV. TECHNICAL AND COMMUNITY ASPECTS
With an average growth of 200 lines of code per day, the CP2K code comprises currently more than a million lines of code. Most of CP2K is written in Fortran95, with elements from Fortran03 and extensions such as OpenMP, OpenCL and CUDA C. It also employs various external libraries in order to incorporate new features and to decrease the complexity of CP2K, thereby increasing the efficiency and robustness of the code. The libraries range from basic functionality such as MPI [387], fast Fourier transforms (FFTW) [388], dense linear algebra (BLAS, LAPACK, ScaLAPACK, ELPA) [160][161][162], to more specialized chemical libraries to evaluate electron repulsion integrals (libint) and XC functionals (libxc) [382,389]. access to some part of the functionality by external programs. Having lean, library-like interfaces within CP2K has facilitated the implementation of features such as farming (running various inputs within a single job), general input parameter optimization, PIMD, or Gibbs-ensemble MC.
Good performance and massive parallel scalability are key features of CP2K. This is achieved using a multi-layer structure of specifically designed parallel algorithms. On the highest level, parallel algorithms are based on message passing with MPI, which is suitable for distributed memory architectures, augmented with shared memory parallelism based on threading and programmed using OpenMP directives. Ongoing work aims at porting the main algorithms of CP2K to accelerators and GPUs, as these energy efficient devices become more standard in supercomputers. At the lowest level, auto-generated and auto-tuned code allows for generating CPU-specific libraries that deliver good performance without a need for dedicated code development.

A. Hardware Acceleration
CP2K provides mature hardware accelerator support for GPUs and emerging support for FPGAs. The most computationally intensive kernels in CP2K are handled by the DBCSR library (see Sec. XII), which decomposes sparse matrix operations into BLAS and LIBXSMM library calls. For GPUs, DBCSR provides a CUDA driver, which offloads these functions to corresponding libraries for GPUs (LIBSMM ACC and cuBLAS). FFT operations can also be offloaded to GPUs using the cuFFT library. For FPGAs, however, there exist currently no general and widely used FFT libraries that could be reused. Hence, a dedicated FFT interface has been added to CP2K and an initial version of an FFT library for offloading complex-valued, single-precision 3D FFTs of sizes 32 3 to 128 3 to Intel Arria 10 and Stratix 10 FPGAs has been released [390].

B. Approximate Computing
The AC paradigm is an emerging approach to devise techniques for relaxing the exactness to improve the performance and efficiency of the computation [391]. The most common method of numerical approximation is the use of adequate data-widths in computationally intensive application kernels. Although many scientific applications use double-precision floating-point by default, this accuracy is not always required. Instead, low-and mixed-precision arithmetic has been very effective for the computation of inverse matrix roots [182], or solving systems of linear equations [392][393][394][395]. Driven by the growing popularity of artificial neural networks that can be evaluated and trained with reduced precision, hardware accelerators have gained improved low-precision computing support. For example, NVidia V100 GPU achieves 7 GFlops in double-precision and 15 GFlops single-precision, but up to 125 GFlops tensor performance with half-precision floating point operations. Using low-precision arithmetic is thus essential for exploiting the performance potential of upcoming hardware accelerators.
However, in scientific computing, where the exactness of all computed results is of paramount importance, attenuating accuracy requirements is not an option. Yet, for specific problem classes it is possible to bound or compensate the error introduced by inaccurate arithmetic. In CP2K, we can apply the AC paradigm to the force computation within MD and rigorously compensate the numerical inaccuracies due to low-accuracy arithmetic operations and still obtain exact ensemble-averaged expectation values, as obtained by time averages of a properly modified Langevin equation.
For that purpose, the notion of the second-generation CPMD method is reversed and we model the nuclear forces as where Ξ N I is an additive white noise that is due to a lowprecision force computation on an GPU or FPGA-based accelerator. Given that Ξ N I is unbiased, i.e.
F I (0) Ξ N I (t) ∼ = 0 (179) holds, it is nevertheless possible to accurately sample the Boltzmann distribution by means of a modified Langevintype equation [184,396,397]: This is to say that that the noise, as originating from a lowprecision computation, can be thought of as an additive white noise channel associated with a hitherto unknown damping coefficient γ N , which satisfies the fluctuationdissipation theorem As before, the specific value of γ N is determined in such a way so as to generate the correct average temperature, as measured by the equipartition theorem of Eq. 128, by means of the adaptive Langevin technique of Leimkuhler and coworkers [273][274][275].  Figure 14. Energy deviation and deviation from the idempotency condition of the density matrix as a function of matrix truncation threshold for the same system and calculation as in Fig. 3. For the reference energy a calculation with filter = 10 −7 was used.

C. Benchmarking
To demonstrate the parallel scalability of the various DFT-based and post-Hartree-Fock electronic structure methods implemented in CP2K strong-scaling plots with respect to number of cpu-cores are shown in Fig. 13.
The benefit of the AC paradigm, described in section XIV B, in terms of reduction of wall time for the computation of the STMV virus, which contains more than one million atoms, is illustrated in Fig. 3. Using the periodic implementation in CP2K of the GFN2-xTB model [195], an increase in efficiency of up to one order of magnitude can be observed within the most relevant matrix-sqrt and matrix-sign operations of the sign-method described in section VII E. The corresponding violation of the idempotency condition of the density matrix and the energy deviation are shown in Fig. 14. The resulting error within the nuclear forces can be assumed to be white, and hence can be compensated by means of the modified Langevin equation of Eq. 180.

D. MOLOPT basis set and deltatest
Even though CP2K supports different types of Gaussian-type basis sets, the MOLOPT type basis set have been found to perform particularly well for a large number of systems and also targets a wide range of chemical environments, including the gas phase, interfaces, and the condensed phase [26]. These generally rather contracted basis sets, which include diffuse primitives, are obtained by minimizing a linear combination of the total energy and the condition number of the overlap matrix for a set of molecules with respect to the exponents and contraction coefficients of the full basis. To verify reproducibility of DFT codes in the condensed matter community, the ∆-gauge (182) based on the Birch-Murnaghan equation of state has been established [398]. To gauge our ∆-test values, we have compared them to the ones obtained by plane wave code Abinit [399], where exactly the same dual-space Goedecker-Teter-Hutter (GTH) pseudopotentials as described in section II B have been employed [23][24][25]. As can be seen in Fig. 15, CP2K generally performs fairly well, with the remaining deviation being attributed to the particular pseudization approach.  Figure 15. ∆-values for DFT calculations (using the PBE XC functional and GTH pseudopotentials) of 1st to 5th row elements without Co & Lu, as computed between the bestperforming MOLOPT (i.e. DZVP, TZVP or TZV2P(X)) basis sets and WIEN2k reference results [400]. For comparison, the corresponding average for Abinit is 2.1 meV/atom for the semicore and 6.3 meV/atom for the regular pseudopotentials, respectively [401].

E. CP2K workflows
With the abundance of computing power, computational scientist are tackling ever more challenging problems. As a consequence, today's simulations often require complex workflows. For example, first the initial structure is prepared, then a series of geometry optimizations at increasing levels of theory are performed, before the actual observables can computed, which then finally need to be post-processed and analyzed. There is a strong desire to automate these workflows, which not only saves time, but also makes them reproducible and shareable. CP2K is interfacing with two popular frameworks for automating of such workflows: The Atomic Simulation Environment (ASE) [402] and the Automated Interactive Infrastructure and Database for Computational Science (AiiDA) [403].
The ASE framework is a Python library that is build around the Atoms and the Calculator classes. An Atoms object stores the properties of individual atoms such as their positions and atomic numbers. A Calculator object provides a simple unified API to a chemistry code such as CP2K. A calculation is performed by passing an Atoms object to a Calculator, which then returns energies, nuclear forces, and other observables. For example, running a CP2K calculation of a single water molecule requires just a few lines of Python code: $ export ASE_CP2K_COMMAND="cp2k_shell.sopt" $ python >>> from ase.calculators.cp2k import CP2K >>> from ase.build import molecule >>> calc = CP2K() >>> atoms = molecule('H2O', calculator=calc) >>> atoms.center(vacuum=2.0) >>> print(atoms.get_potential_energy()) -467.191035845 Based on these two powerful primitives, the ASE provides a rich library of common building blocks including structure generation, MD, local and global geometry optimizations, transition-state methods, vibration analysis, and many more. As such, the ASE is an ideal tool for quick prototyping and automatizing a small number of calculations.
The AiiDA framework, however, aims to enable the emerging field of high-throughput computations [404]. Within this approach, databases of candidate materials are automatically screened for the desired target properties. Since a project typically requires thousands of computations, very robust workflows are needed that can handle also rare failure modes gracefully. To this end, the AiiDA framework provides a sophisticated event-based workflow engine. Each workflow step is a functional building block with well defined inputs and outputs. This design allow the AiiDA engine to trace the data dependencies throughout the workflows and thereby record the provenance graph of every result in its database.

F. GitHub and general tooling
The CP2K source code is publicly available and hosted in a Git repository on https://github.com/cp2k/cp2k. While we still maintain a master repository similar to the previous Subversion repository, the current development process via pull requests foster code reviews and discussion, while the integration with a custom continuous integration (CI) system based on the Google Cloud Platform ensures the stability of the master branch by mandating successful regression tests prior to merging the code change [405]. The pull request based testing is augmented with a large number of additional regression testers running on different supercomputers [406], providing over 80% code coverage across all variants (MPI, OpenMP, CUDA/HIP, FPGA) of CP2K [407]. These tests are run periodically, rather than triggered live by Git commits, to workaround limitations imposed by the different sites and developers are being informed automatically in case of test failures.
Over time additional code analysis tools have been developed to help avoid common pitfalls and maintain consistent code style over the large code base of over 1 millions lines of code. Some of them -like our fprettify tool [408] -have now been adopted by other Fortran-based codes [409]. To simplify the workflow of the developers, all code analysis tools not requiring compilation are now invoked automatically on each Git commit if the developer has setup the pre-commit hooks for her clone of the CP2K Git repository [410]. Since CP2K has a significant number of optional dependencies, a series of toolchain scripts have been developed to facilitate the installation, currently providing the reference environment for running the regression tests.  (RGPIN-2016-0505). GKS and CJM are supported by the Department of Energies Basic Energy Sciences, Chemical Sciences, Geosciences, and Biosciences Division. UK based work was funded under the embedded CSE programme of the ARCHER UK National Supercomputing Service (http://www.archer.ac.uk), grants eCSE03-011, eCSE06-6, eCSE08-9, eCSE13-17 and the EPSRC (EP/P022235/1) grant "Surface and Interface Toolkit for the Materials Chemistry Community". The project received funding via the CoE MaX as part of the Horizon 2020 program (grant number No. 824143), the Swiss platform for advanced scientific computing (PASC), and the centre on Computational Design and Discovery of Novel Materials (MARVEL). Computational resources were provided by the Swiss National Supercomputing Centre (CSCS) and Compute Canada. The generous allocation of computing time on the FPGA-based supercomputer "Noctua" at PC 2 is kindly acknowledged.

DATA AVAILABILITY STATEMENT
The data that support the findings of this study are available from the corresponding author upon reasonable request.