Graph-based linear scaling electronic structure theory

We show how graph theory can be combined with quantum theory to calculate the electronic structure of large complex systems. The graph formalism is general and applicable to a broad range of electronic structure methods and materials, including challenging systems such as biomolecules. The methodology combines well-controlled accuracy, low computational cost, and natural low-communication parallelism. This combination addresses substantial shortcomings of linear scaling electronic structure theory, in particular with respect to quantum-based molecular dynamics simulations.


I. INTRODUCTION
The importance of electronic structure theory in materials science, chemistry, and molecular biology, relies on the development of theoretical methods that provide sufficient accuracy at a reasonable computational cost. Currently, the field is dominated by Kohn-Sham density functional theory [1][2][3][4] , which often combines good theoretical fidelity with a modest computational workload that is constrained mainly by the diagonalization of the Kohn-Sham Hamiltonian -an operation that scales cubically with the system size. However, for systems beyond a few hundred atoms, the diagonalization becomes prohibitively expensive.
This bottleneck was removed with the development of linear scaling electronic structure theory 5,6 , which allows calculations of systems with millions of atoms 7,8 . Unfortunately, the immense promise of linear scaling electronic structure theory has never been fully realized because of some significant shortcomings, in particular, a) the accuracy is reduced to a level that is often difficult, if not impossible, to control; b) the computational pre-factor is high and the linear scaling benefit occurs only for very large systems that in practice often are beyond acceptable time limits or available computer resources; and c) the parallel performance is generally challenged by a significant overhead and the wall-clock time remains high even with massive parallelism. In quantum-based molecular dynamics (QMD) simulations 9 all these problems coalesce and we are constrained either to small system sizes or short simulation times.

A. Graph-based electronic structure theory
Our graph-based electronic structure theory relies on the equivalence between the calculation of thresholded sparse matrix polynomials and a graph partitioning approach. Let P (X) be a M th-order polynomial of a N × N symmetric square matrix X that is given as a linear combination of some basis polynomials T (n) (X), We define an approximation P τ (X) of P (X) using a globally thresholded sparse matrix algebra, where matrix elements with a magnitude below a numerical threshold τ in all terms, T (n) (X), are ignored. The pattern of the remaining matrix entries, which at any point of the expansion have been (or are expected to be) greater than τ , can be described by a data dependency graph S τ that represents all possible data dependencies between the matrix elements in the polynomial expansion. Formally, we define the graph S τ with a vertex for each row of X and an edge (i, j) between vertices i and j if For a matrix A, we denote by A Sτ the thresholded version of A, where The thresholded polynomial P τ (X) of P (X) with respect to S τ is given by where the thresholded T with T Sτ (X) = I. A key observation of this paper is that the calculation of P τ (X) in Eqs. (4) and (5) is equivalent to a partitioned subgraph expansion on S τ . This approach is illustrated in Fig. 1. For any vertex i of S τ , let s i τ be the subgraph of S τ induced by the core (meaning belonging to a single subgraph) vertex i and all halo (shared) vertices that are directly connected to i in S τ . Then the ith matrix column of P τ (X) is given by the thresholded expansion determined by s i τ only, i.e.
Here j is the column (or row) of the polynomial for the subgraph s i τ containing all edges from the core vertex i that corresponds to column i of the complete matrix polynomial on the left-hand side. x[s i τ ] is the small dense principal submatrix that contains only the entries of X corresponding to s i τ . The full matrix P τ (X) can then be assembled, column by column, from the set of smaller dense matrix polynomials P (x[s i τ ]) for each vertex i. The calculation of a numerically thresholded matrix polynomial P τ (X) thus can be replaced by a sequence of fully independent small dense matrix polynomial expansions determined by a graph partitioning. Equation (6) represents an exact relation between a globally thresholded sparse matrix algebra and a graph partitioning approach, which is valid for a general matrix polynomial P (X). Several observations can be made about this equivalence: i) P τ (X) is not symmetric and with the order of the matrix product for the threshold in Eq. (5) we collect P τ (X) column by column in Eq. (6) as illustrated by the directed graph at the bottom of Fig. 1; ii) the accuracy of the matrix polynomial increases (decreases) as the threshold τ is reduced (increased) and the number of edges of S τ increases (decreases); iii) we may thus include additional edges in S τ without loss of accuracy; iv) the polynomial P τ (X) is zero at all entries outside of S τ ; v) apart from spurious cancellations, the non-zero pattern of P τ (X) is therefore the same as S τ and we can expect a numerically thresholded exact matrix polynomial, P (X) τ , to have a non-zero structure similar to S τ ; vi) the graph partitioning can be generalized such that each vertex corresponds to a combined set of vertices, i.e. a community, without loss of accuracy; vii) we may reduce the computational cost by identifying such communities using graph partitioning schemes, which can be tailored for optimal platformdependent performance; viii) the exact relation given by Eqs. (4) -(6) holds for any structure of S τ and is not limited to the threshold in Eq. (2); ix) the particular sequence of matrix operations in the calculation of P τ (X) is of importance because of the thresholding in Eq. (5), whereas the order (or grouping) of the matrix multiplications is arbitrary for the contracted matrix polynomials P (x[s i τ ]) in Eq. (6); and x) the computational cost of each polynomial expansion is dominated by separate sequences of dense matrix-matrix multiplication that can be performed independently and in parallel.
A main point of this paper is that the equivalence between the calculation of the thresholded sparse matrix polynomial and the graph partitioned expansion in Eq. (6) Here D is the density matrix, H the Hamiltonian, µ the chemical potential, and β the With f n (X) being 2nd-order polynomials 35 we reach an expansion order of over a billion in only 30 iterations. The ability to use a fast recursive expansion is motivated from ix) above, and since any recursive expansion also can be written in the general form of Eq. (1).
Once the density matrix D is known, the expectation value of any operator A is given by . Generalizations to quantum perturbation theory are straightforward 53,54 .
B. Macromolecular test system A similar optimization can be performed for divide and conquer methods, but may not be applicable to inhomogeneous systems 17 . Figure In our MD simulation below we use the symbolic representation of S τ (t) in Eq. (8), which is given from the non-zero pattern of the thresholded density matrix (with τ = 10 −4 ) combined with the non-zero pattern of H(t), and instead of the matrix square we use paths of length two, corresponding to the symbolic operation ( = 0). This approach that adapts S τ (t) to each new MD time step by including additional redundant edges works surprisingly well, though S τ (t) cannot increase by more than paths of length two between two MD steps.
Generalized estimates can also be designed for the iterative SCF optimization. Figure 4 shows the fluctuations of the total energy during a microcanonical MD simulation of liquid water that was performed using LATTE 58 and the extended Lagrangian formulation of Born-Oppenheimer MD 50,66-69 . The density matrix was calculated from a partitioning over separate subgraphs of S τ (t), with one water molecule per core. For the Fermi-operator expansion (at zero temperature) we used the recursive SP2 algorithm 35 . In each time step the complete SP2 sequence (the same for each subgraph expansion) for the correct total occupation is pre-determined from the HOMO-LUMO gap that is estimated from the previous time step as in Ref. 41 . In this way each expansion can be performed independently, without exchange of information during or between the matrix multiplications as otherwise would be required 8, 28 . Communication is reduced to a minimum and no additional adjustments of the electronic occupation, as in divide and conquer calculations 14 , is required.
The inset of Fig. (4) shows the number of water molecules of a single subgraph (core + halo)  42,74 . The problem is illustrated in Fig. 5, which shows a comparison between a divide and conquer approach and our graph-based calculation of the density matrix for a snapshot from a MD simulation of the water system in Fig. 4. Without the adaptivity of the graph-based method, the divide and conquer approach needs a large cutoff radius, R cut , to reach sufficient convergence in the calculation of the density matrix for the water system, which leads to a significant overhead. With the graph-based framework as demonstrated here in combination with a modern formulation of Born-Oppenheimer MD [42][43][44][45][46][47][48][49][50] , these problems can be avoided.
The off-the-shelf METIS graph partitioning scheme used for the macromolecular system in Figs. 2 and 3 works very well and drastically reduces the overhead compared to a straightforward implementation. However, by adjusting the graph partitioning to the particular requirements of the electronic structure calculation as well as the computational platform, further optimizations are possible 75 . We may also incorporate an on-the-fly graph partitioning that updates the subgraphs during a MD simulation as the structure changes.

IV. CONCLUSIONS
In this article we have shown how graph theory can be combined with quantum theory to calculate the electronic structure of large complex systems with well-controlled accuracy.
The graph formalism is general and applicable to a broad range of electronic structure methods and materials, for which sparse matrix representations can be used, including QMD simulations, overcoming significant gaps in linear scaling electronic structure theory.  the red dashed border in the inset (associated with the data-dependency graph S τ for the large red molecule at the center), the cutoff radius needs to be large, which leads to a significant overhead for the divide and conquer approach. The efficiency would be similar only for a homogeneous system.
The computational cost was estimated from the sum of the number arithmetic operations (a.o.) required to calculate the density matrices from all the separate subgraph partitions or divide and conquer regions -one for each water molecule.