NECI: N-Electron Configuration Interaction with emphasis on state-of-the-art stochastic methods

We present NECI, a state-of-the-art implementation of the Full Configuration Interaction Quantum Monte Carlo algorithm, a method based on a stochastic application of the Hamiltonian matrix on a sparse sampling of the wave function. The program utilizes a very powerful parallelization and scales efficiently to more than 24000 CPU cores. In this paper, we describe the core functionalities of NECI and recent developments. This includes the capabilities to calculate ground and excited state energies, properties via the one- and two-body reduced density matrices, as well as spectral and Green's functions for ab initio and model systems. A number of enhancements of the bare FCIQMC algorithm are available within NECI, allowing to use a partially deterministic formulation of the algorithm, working in a spin-adapted basis or supporting transcorrelated Hamiltonians. NECI supports the FCIDUMP file format for integrals, supplying a convenient interface to numerous quantum chemistry programs and it is licensed under GPL-3.0.

We present NECI, a state-of-the-art implementation of the Full Configuration Interaction Quantum Monte Carlo algorithm, a method based on a stochastic application of the Hamiltonian matrix on a sparse sampling of the wave function. The program utilizes a very powerful parallelization and scales efficiently to more than 24000 CPU cores. In this paper, we describe the core functionalities of NECI and recent developments. This includes the capabilities to calculate ground and excited state energies, properties via the one-and two-body reduced density matrices, as well as spectral This article has been accepted by the Jouranl of Chemical Physics, after it is published, it will be found at https://aip.scitation.org/journal/jcp. a) Electronic mail: k.guther@fkf.mpg.de b) Electronic mail: a.alavi@fkf.mpg.de

I. INTRODUCTION
NECI started off in the late 1990s as an exact diagonalisation code for model quantum dots 1,2 , and has evolved into a code to perform stochastic diagonalisation of large fermionic systems in finite but large quantum chemical basis sets, using the Full Configuration Interaction Quantum Monte Carlo (FCIQMC) algorithm 3 . This algorithm samples Slater determinant (i.e. antisymmetrized) Hilbert spaces using signed walkers, by propagation of the walkers through stochastic application of the second-quantized Hamiltonian onto the walker population. In philosophy, it is similar to the continuum real-space Diffusion Monte Carlo (DMC) algorithm. However, unlike DMC, no fixed node approximation needs to be applied. Instead, the nodal structure of the wavefunction, as encoded by the signed coefficients of the sampled Slater determinants, emerges from the dynamics of the simulation itself. However, being based on an FCI parametrization of the wave function, the FCIQMC method exhibits a steep scaling with the number of electrons and is thus only suited for relatively small chemical systems compared to those accessible to DMC. While the common energy measures in FCIQMC methods, namely the projected, trial energies (cf section IV) and the energy "shift", are not variational, a variational energy can be computed from two parallel FCIQMC calculations either directly (cf section VI), or via the reduced density matrix (RDM) based energy estimator (cf section VII).
There are also similarities between the FCIQMC approach and the Auxiliary-Field Quantum Monte Carlo (AFQMC) method 4-6 , both being stochastic projector techniques formulated in second quantized spaces. The latter however works in an over-complete space of non-orthogonal Slater determinants and relies on the phase-less approximation 7 to eliminate the phase problem associated with the Hubbard-Stratonovich transformation of the Coulomb interaction kernel, the quality of this approximation being reliant on the trial wavefunction used to constrain the path. The objective of AFQMC is the measurement of observables such as the energy by sampling over the Hubbard-Stratonovich fields. FCIQMC on the other hand works in a fixed Slater determinant space and relies on walker annihilation to overcome the fermion sign problem. The phase-less approximation renders the AFQMC method polynomial scaling, with an uncontrolled approximation, while i-FCIQMC, which is an in principle exact method, remains exponential scaling. Finally FCIQMC provides a direct measure of the CI amplitudes of the many-body wavefunction expressed in the given orbital basis, from which observables can be computed including elements of reduced density matrices (which do not commute with the Hamiltonian) via pure estimators. Exact symmetry constraints, including total spin, can be incorporated into the formalism 8 . In this sense, the FCIQMC method is closer in spirit to multi-reference CI methods used in quantum chemistry to study multi-reference problems rather than the AFQMC method.
In its original formulation, the algorithm is guaranteed to converge onto the ground-state wavefunction in the long imaginary-time propagation limit, provided a sufficient number of walkers is used. This number is generally found to scale with the Hilbert space size, and is a manifestation of the sign-problem in this method, essentially implying an exponential memory cost in order to guarantee stable convergence onto the exact solution. In the subsequent development of the initiator method (i-FCIQMC) 9 , this condition was relaxed to allow for stable simulations at relatively low walker populations, much smaller than the full Hilbert space size, albeit at the cost of a systematically improvable bias. While the initiator adaptation removes the strict need for a minimum walker number, it does not eliminate the exponential scaling of the method, such that calculations become more and more challenging with increasing system size. To give an idea of the capabilities of the NECI implementation, estimates for the accessible system sizes are given below. The rate of convergence of the initiator error with walker number has been found to be slow for large systems. This is a manifestation of size-inconsistency error which generally plagues linear Configuration Interaction methods. A very recent development of the adaptive shift method 10 , mitigates this error substantially, enabling near-FCI quality results to be obtained for systems as large as benzene.
The development of the semi-stochastic method by Umrigar et al. 11 and its further refinements 12 dramatically reduced the stochastic noise and hence improved the efficiency of the method.
The FCIQMC algorithm, as well as its semi-stochastic and initiator versions, are scalable on large parallel machines, thanks to the fact that walker distribution can be distributed over many processors with relatively small communication overhead. The methods, however, are not embarrassingly parallel, owing to the annihilation step of the algorithm (see also figure 1). For this reason, parallelisation over very large numbers of processors is a highly non-trivial task, but substantial progress has been made, and here we show that efficient parallelisation up to more than 24000 CPU cores can be achieved with the current NECI code.
The FCIQMC method has been generalised to excited states 13 of the same symmetry as the ground state and to the calculation of pure one-and two-particle reduced density matrices via the "replica-trick" 14-17 (and more recently three and four-particle RDMs 18 ). A number of stochastic methods have been developed as an extension or variation of the FCIQMC approach. These include density matrix quantum Monte Carlo (DMQMC), which allows the exact thermal density matrix to be sampled at any given temperature, and also allows straightforward estimation of general observables, including those which do not commute with the Hamiltonian 16, 26 . Applications of DMQMC include providing accurate data for the warm dense electron gas 27 . Although not implemented in NECI, DMQMC is available in the HANDE-QMC code 28 .
The FCIQMC method has lead to the development of a number of highly efficient deterministic selected CI methods, including the adaptive sampling CI method of Head-Gordon and co-workers 29 , who also establish the connection with the much older perturbatively selected CIPSI method of Malrieu et al 30 but with a modified search procedure, while the Heat-Bath CI method of Umrigar and coworkers 31 was developed from the Heat-Bath excitation generation for FCIQMC 32 together with an initiator-like criterion to select the connected determinants with extreme efficiency. Later a sign-problem-free semi-stochastic evaluation of the Epstein-Nesbet perturbation energy was developed by Sharma et al 33 to compute the missing dynamical correlation energy at second-order in a memory and CPU efficient manner. Other highly related developments to FCIQMC which originate in the numerical analysis literature include the Fast Randomised Iteration 34 and further developments by Weare, Berkelbach and coworkers 35 , and co-ordinate descent FCI of Lu and coworkers 36 .
Depending on the utilized features, the number of electrons and accessible basis sizes can vary. The i-FCIQMC implementation including the semi-stochastic version is highly scalable and has been successfully applied to Hilbert space sizes of up to 10 108 with 54 electrons 37 . Atomic basis sets up to aug-cc-pCV8Z for first-row atoms (1138 spin orbitals) are treatable 38 . Reduced density matrices can routinely be calculated for use in accurate Stochastic-MCSCF 19 for active spaces containing up to 40 electrons and 38 spatial orbitals 39,40 . Real-time calculations are computationally more demanding but can still be performed for first-row dimers using cc-pVQZ basis sets 21 . For the similarity transformed FCIQMC method, the limiting factor is not the convergence of the FCIQMC, but the storage of the three-body interaction terms imposing a limit of ∼ 100 spatial orbitals on currently available hardware 24 . Optimized implementations for the application to lattice model systems, like the Hubbard 41 (in a real-and momentum space formulation), t − J and the Heisenberg models for a variety of lattice geometries, are implemented in NECI . The applicability of FCIQMC to the Hubbard model strongly depends on the interaction strength U/t. For the very weakly correlated regime U/t < 1, FCIQMC is employable up to 70 lattice sites 42 , using a momentum-space basis. In the interesting, yet most problematic, intermediate interaction strength regime in two dimensions, the transcorrelated (similaritytransformed) FCIQMC is necessary to obtain reliable energies in systems up to 50 sites (at and near half-filling) 23 .
The FCIQMC algorithm as implemented in NECI is based on a sparse representation of the wave function and a stochastic application of the Hamiltonian. We start with the full wave function with coefficients C i in a many-body basis |D i . NECI supports Slater determinants or CSFs as a many-body basis, for simplicity, for now, the usage of determinants is assumed, but the algorithm is analogous for CSFs, see also section XII B 2. The FCIQMC wave function is not normalized. The ground state of a HamiltonianĤ is now obtained by iterative imaginary time-evolution, with the propagator expanded to first order using a discrete time-step ∆τ such that which converges to the ground state ofĤ for τ → ∞ for |∆τ | < 2 W where W is the difference between the largest and smallest eigenvalue ofĤ 43 . Here, S(τ ) is a diagonal shift applied toĤ, which is iteratively updated to match the ground state energy.
The full wave function is stored in a compressed manner, where only coefficients above a given threshold value C min are stored. Coefficients smaller than C min are stochastically rounded. That is, in every iteration, a wave function given by coefficients C i (τ ) is stored such that such that C i (τ ) = C i . This compression is applied in every step of the algorithm that affects the coefficients. The value |C i (τ )| is referred to as walker number of the determinant Applying the Hamiltonian to this compressed wave function is done by separating it into a diagonal and an off-diagonal part as and then performed in the three labeled steps (a) − (c). First, in the spawning step, the off-diagonal part is evaluated by stochastically sampling the sum over j, storing the resulting spawned wave-function as a separate entity as described in the flow chart in Figure 1.
Then, in the death step, the diagonal contribution is evaluated deterministically, following a stochastic rounding of the resulting coefficients. This step is performed in-place, since the coefficients of the previous iterations are not required anymore. Finally, the spawned wave-function from the off-diagonal part is added in the annihilation step, summing up all contributions from the spawned wave-function to each determinant. NECI implements the initiator method, too, which labels a class of determinants as initiators, typically those with an associated walker number above a given threshold, and effectively zeroes all matrix elements between non-initiator determinants and determinants with C i (τ ) = 0. The implementation thereof is also sketched in figure 1.
In the context of FCIQMC calculations, the core functionality of NECI consists of a highly parallelizable implementation of the initiator FCIQMC method 9 for both real and complex Hamiltonians. There is both a generic interface for ab initio systems, specialized implementations for the Hubbard and Heisenberg models, as well as the uniform electron gas.
The interface for passing input information on the system to NECI is discussed in section XIV.

II. EXCITATION GENERATION
A key component of the FCIQMC algorithm is the sampling of the Hamiltonian matrix elements in the spawning step, where the Hamiltonian is applied stochastically. This requires an efficient algorithm to randomly generate connected determinants with a known probability p gen for any given determinant, referred to as excitation generation. This typically means making a symmetry constrained choice of (up to) two occupied orbitals in a determinant and (up to) two orbitals to replace them with, such that the corresponding Hamiltonian matrix element is non-zero. If spin-adapted functions are used rather than determinants, the connectivity rules change but the main principles are same.
The spawning probability for a spawn from a determinant |D i to a determinant |D j is in practice given by This means, the purpose of p gen (j|i) of selecting |D j from |D i in the spawning probability p s is to allow the flexibility in the selection of determinants |D j from |D i so that, irrespective of how we choose |D j from |D i , the rate at which transitions occur is not biased by the selection algorithm. In other words, if a particular determinant |D j is only selected rarely from |D i (i.e. with low generation probability), then the acceptance of the move (i.e. the spawning probability) will be with correspondingly high probability (i.e. proportional to the inverse of the generation probability). Conversely if a determinant |D j is selected with relatively high generation probability from |D i , then its acceptance probability will be correspondingly low. In other words, from the point of view of the exactness of the FCIQMC algorithm, the precise manner in which excitations are made is immaterial: as long as the probability p gen (j|i) > 0 when |H ij | > 0, the algorithm will ensure that transitions from as an optimized special case thereof, which only takes into account the maximal ratio 53 . If required, internal weights of the excitation generators such a bias towards double excitations are then optimized in the same fashion to maximise the time-step.
However, as a result, the time-step and thus overall efficiency of the simulation is driven by the worst-cases of the |H ij | pgen(j|i) ratio discovered within the explored Hilbert space. Thus an optimal excitation generator should create excitations with a probability distribution to the Hamiltonian matrix elements, such that This is the optimal probability distribution, since then, the acceptance rate is solely determined by the time step 32 .
NECI supports a variety of algorithms to perform excitation generation, with the most notable being the pre-computed heat-bath (PCHB) sampling (a variant of the heat-bath sampling presented in 32 , as described in the appendix A 3), the on-the-fly Cauchy-Schwartz method 54 (described in the appendix A 2), the pre-computed Power-Pitzer method 55 and lattice-model excitation generators both for real-space and momentum-space lattices. Additionally, a three-body excitation generator and a uniform excitation generator are available, which are essential for treating systems with the transcorrelated ansatz when including three-body interactions.
As heat-bath excitation generation can have high memory requirements, it might be impractical for some systems. There, the on-the-fly Cauchy-Schwartz method can maintain very good |H ij | pgen(j|i) ratios without significant memory cost, albeit at O(N) computational cost, N being the number of orbitals, and possibly with dynamic load imbalance. The details of the Cauchy-Schwartz excitation generation are discussed in the appendix.

III. SEMI-STOCHASTIC FCIQMC
In many chemical systems the wave function is dominated by a relatively small number of determinants. In a stochastic algorithm, the efficiency can be improved substantially by treating these determinants in a partially deterministic manner.
Petruzielo et al. suggested a semi-stochastic algorithm 11 , where the FCIQMC projection operatorP = ij P ij |D i D j |, is applied exactly within a small but important subspace, which we call the deterministic space, D. Specifically, we writê whereP D = i∈D, j∈D TheP D operator therefore accounts for all spawnings which are both from and to determinants in D. The stochastic projection operator,P S , contains all remaining terms. The matrix elements ofP D are calculated and stored in a fixed array, and applied exactly each iteration by a matrix-vector multiplication. The operatorP S is then applied stochastically by the usual FCIQMC spawning algorithm.
The semi-stochastic adaptation requires storing the Hamiltonian matrix within D, which we denote H D . In NECI, H D is stored in a sparse format, distributed across all processes. To calculate H D , we have implemented the fast generation scheme of Li et al. 56 This approach has allowed us to use deterministic spaces containing up to ∼ 10 7 determinants. However, a more typical size of D is between 10 4 and 10 5 .
Ideally, a deterministic space of a given size (N D ) should be chosen to contain the determinants with the largest value of |C i | in the exact FCI wave function. This optimal choice is not possible in practice, but various approaches exist to make an approximate selection.
Umrigar and co-workers suggest using selected configuration interaction (SCI) to make the selection. 11 Within NECI, the most common approach is to choose the N D determinants which have the largest weight in the FCIQMC wave function, at a given iteration. 12 Therefore, a typical FCIQMC simulation in NECI will be performed until convergence (at some iteration number N conv. ) using the fully-stochastic algorithm, at which point the semi-stochastic approach is turned on, selecting the N D most populated determinants in the instantaneous wave function to form D. The appropriate parameters (N D and N conv. ) are specified in the NECI input file. NECI supports performing periodic re-evaluation of the N D most populated determinants, updating the deterministic space D with a given frequency.
Using the semi-stochastic adaptation with a moderate deterministic space (on the order of ∼ 10 4 ) can improve the efficiency of FCIQMC by multiple orders of magnitudes. This is particularly true in weakly correlated systems. The semi-stochastic approach can also be used in NECI when sampling reduced density matrices (RDMs) as described in section VII. Here, contributions to RDMs are included exactly between all pairs of determinants within D. It has been shown that this can substantially reduce the error on RDM-based estimators. 12 Using the semi-stochastic adaptation in NECI disables the load-balancing unless a periodic update of D is performed.

IV. TRIAL WAVE FUNCTIONS
The most common energy estimator used in FCIQMC is the reference-based projected estimator, where |D Ref is an appropriate reference determinant (usually the Hartree-Fock determinant). In case |Ψ is an eigenstate, this yields the exact energy, but in general it is a non-variational estimator. This is the default estimator for the energy, and can be obtained with minimal overhead.
NECI has the option to use projected estimators based on more accurate trial wave functions, which can significantly reduce statistical error in energy estimates. For this reason we define a trial subspace T , which is spanned by N T determinants. Similarly to the deterministic space, T should ideally be formed from the determinants with the largest contribution in the FCI wave function, or some good approximation to these determinants. Projectinĝ H into T gives us a N T × N T matrix, which we denote H T , whose eigenstates can be used as trial wave functions for more accurate energy estimators.
Let us denote an eigenstate of H T by |Ψ T = i∈T C T i |D i , with eigenvalue E T . Then a trial function-based estimator can be defined as Here, C is the space of all determinants connected to T by a single application ofĤ (not including those in T ). C i denotes walker coefficients in the FCIQMC wave function, and V j is defined within C as To calculate the estimator E Trial we therefore require several large arrays: first, H T , which is stored in a sparse format, in the same manner as the deterministic Hamiltonian in the semi-stochastic scheme; second, |ψ T , which must be calculated by the Lanczos or Davidson algorithm; third, V , which is a vector in the entire C space. The number of coefficients to store in C is larger than in T by a significant amount, typically by several orders of magnitude. Indeed, storing V can become the largest memory requirement. Because of this, using trial wave functions is typically more memory intensive in NECI than using the semi-stochastic approach, for a given space size. We therefore suggest using a smaller trial space, T , compared to the deterministic space, D.
Note that the initiator error on E Trial is not the same as the initiator error on E Ref . For example, E Trial becomes exact as |Ψ T approaches the FCI wave function. For practical trial wave functions, however, the two energy estimates typically give similar initiator errors for ground-state energies in our experience. An exception occurs for excited states (see Section VIII). In this case, the wave function is usually not well approximated by a single reference determinant, and E Trial with an appropriate T yields a great improvement, both for the statistical and initiator error.

V. ADAPTIVE SHIFT
The initiator criterion 9 is important in making FCIQMC a practical method allowing us to achieve convergence at a dramatically lower number of walkers than the full FCIQMC 3 .
However, this approximation introduces a bias in the energy when an insufficient number of walkers is used. This bias can be attributed to the fact that non-initiators are systematically undersampled due to the lack of feedback from their local Hilbert space. To correct this, we can allow each non-initiator determinant |D i to have its own local shift S i (τ ) as an appropriate fraction of the full shift S(τ ) The fraction f i is computed by monitoring which spawns are accepted due to the initiator criterion and accumulating positive weights over the accepted and rejected ones: These weights w ij are derived from perturbation theory 57 where the first-order contribution of determinant |D i to the amplitude of determinant |D j is used as a weight for spawns It is worth noting that, regardless of how the weights are chosen, expression (14) guarantees that initiators get the full shift. Also as the number of walkers increases, the local Hilbert space of a non-initiator becomes more and more populated, restoring the full method in the large walker limit.
We call the above approach for unbiasing the initiator approximation, the adaptive-shift has a bias of over 10 mH. Also notice how by using the adaptive shift, the results become, to a large extent, independent of the initiator parameter n a .

VI. PERTURBATIVE CORRECTIONS TO INITIATOR ERROR
An alternative approach to removing initiator error in NECI is through a perturbative correction 60 . In the initiator approximation, spawning events from non-initiators to unoccupied determinants are typically discarded. These discarded events make up a significant fraction of all spawning attempts made, which in turn accounts for much of the total simulation time. While it is necessary to discard these spawned walkers to prevent disastrous noise from the sign problem 61 , this step is extremely wasteful.
These discarded walkers actually contain significant information which can be used to greatly increase the accuracy of the initiator FCIQMC approach. Specifically, these walkers Adaptive Shift, n a = 20 Normal Initiator, n a = 10 Adaptive Shift, n a = 10 Normal Initiator, n a = 3 Adaptive Shift, n a = 3 DMRG -6000 may sample up to double excitations from the currently-occupied determinants (a similar argument can be used to justify the above adaptive shift approach). In analogy with a comparable approach taken in selected CI methods, these discarded walkers can be used to sample a second-order correction to the energy from Epstein-Nesbet perturbation theory.

The correction is calculated by
Here, ∆τ is the time step, E 0 is the i-FCIQMC estimate of the energy, and S r i is the total spawned weight onto determinant |D i in replica r (the replica approach will be discussed in more detail in Section VII). This correction requires that two replica FCIQMC simulations are being performed simultaneously, to avoid biases in this estimator. The summation here is performed over all spawning attempts which are discarded on both replicas simultaneously.
This must only be applied to correct the variational energy estimator from i-FCIQMC.
Such variational energies in NECI can either be calculated directly 62,63 , or from two-body reduced density matrices, which may be sampled in FCIQMC.
This perturbative correction is essentially free to accumulate, since all spawned walkers contributing to Eq. (16) are created regardless. The only significant extra cost comes from the requirement to perform two replica simulations. However, for large systems the noise on this correction can become significant, which necessitates further running time to reduce statistical errors. This correction has proven extremely successful in practice, particularly for weakly correlated systems, where it is typical to see 80 − 90% of remaining initiator error removed 60,62,63 .

VALUES
While the total energy is an important quantity to extract from quantum systems, a more complete characterization of a system requires the ability to extract information about other expectation values. If these expectation values are derived from operators which do not commute with the Hamiltonian of the system, then a 'projected' estimate of the expectation value akin to Eq. 9 is not possible, and alternatives within FCIQMC are required in order to compute them. This is the case for many key quantities such as nuclear derivatives (forces on atoms), dipole moments and higher-order electrical moments, as well as other observables such as pair distribution functions 64 . They all can be obtained via the corresponding n-body reduced density matrix (n-RDM), where n is the rank of the operator in question, that fully characterizes the correlated distribution and coherence of n electrons relative to each other. This information can also be used to calculate quantum information measures, which are not observables but which characterize the entanglement within a system, such as correlation entropies 15 .
To characterize the strength of coupling between different states under certain operators, e.g. the oscillator strength of optical excitations, as well as obtaining other dynamical information requires computing transition density matrices (tRDMs) between stochastic samples of different states, which can be sampled within FCIQMC using the excited state feature discussed in section VIII 17,65 . Furthermore, the two states considered may not sample eigenstates of the system, but one of them can be a response state of the system, then the resulting tRDMs characterize the response of a system to a perturbation, corresponding to a higher derivative of the energy such as the polarizability of the system, which will be addressed in section IX 66 . Finally, RDMs can also be used to characterize the expectation value of an effective Hamiltonian in a subspace of a system 67,68 . This effective Hamiltonian can include effects such as electronic correlations coupling the space to a wider external set of states.
The plurality of electronic structure methods of this kind, such as explicitly correlated 'F12' corrections for basis set incompleteness [69][70][71] ; multi-configurational self-consistent field 19,20 ; internally-contracted multireference perturbation theories 18 ; embedding methods 72,73 ; and the Multi-Configuration Pair-Density Functional Theory (MC-PDFT) 74 , further attest the importance of faithful and efficient sampling of RDMs in electronic structure theory.
All expectation values of interest can be derived from contractions with a general reduced density matrix object, defined as where n denotes the 'rank' of the RDM, and the choice of the states A and B define the type of RDM, as described above. In this section we focus on the sampling of the 2-RDM.
This is generally the most common RDM required, as most expectation values of interest are (up to) two-body operators, including the total energy of the system. Furthermore, within FCIQMC, the fact that the rank of the RDM required is then the same as the rank of the Hamiltonian which is sampled within the stochastic dynamics, leads to a novel algorithm which ensures that the overhead to compute the 2-RDM is relatively small and manageable 15 .
Expanding the expression for the 2-RDM in terms of the exact FCI wave function (Eq. 1), we find where i, j index the many-electron Slater determinants and k, l, m, n denote single-particle orbitals. We will focus on the case where we are sampling |Ψ A = |Ψ B = |Ψ 0 , the ground state of the system, since the same basic principles are applied to sampling the tRDMs, where the other walker distribution may represent an excited state or a response state, with more details for these cases considered in Refs A final point to note, is that the n-RDM is a non-linear functional of the FCI amplitudes -specifically being a quadratic form. Within the FCIQMC sampling, the C i amplitudes are stochastic variables represented as walkers (C i (τ )) which at any one iteration are in general very different from the true C i , but when averaged over long times have an expected mean amplitude which is the same as (or a very good approximation to) C i . However, due to this non-linearity in the form of the 2-RDM, the average of the sampled amplitude product is not equal to the product of the average amplitude, The result is that even if each determinant were correctly sampled on average, the stochastic error in the sampling would manifest as systematic error in the RDMs, and thus only give correct results in the large walker limit, but not the large sampling limit, even if the wave function were correctly resolved.
The resolution to this problem came via the 'replica trick' 15,16 , which changes the quadratic RDM functional into a bilinear one 14 . This formally removes the systematic error in the RDM sampling, at the expense of requiring a second walker distribution. The premise is to ensure that these two walker distributions are entirely independent and propagated in parallel, sampling the same (in this instance ground-state) distribution. This ensures an unbiased sampling of the desired RDM, by ensuring that each RDM contribution is derived from the product of an uncorrelated amplitude from each replica walker distribution. The sampling algorithm then proceeds by ensuring that during the spawning step, the current amplitudes are packaged and communicated along with any spawned walkers.
During the annihilation stage, these amplitudes are then multiplied by the amplitude on the child determinant from the other replica distribution, and this product then contributes to all n-RDMs which are accumulated, and equal to the rank of the excitation or higher.
In this way, the efficient and parallel annihilation algorithm is used to avoid latency of The dominant cost of RDM sampling in large systems comes from the sampling of elements defined by pairs of creation and annihilation operators with the same orbital index.
These correspond to tuples of occupied orbitals common to both |D i and |D j states. We term these contributions promotions, as they contribute to a rank of a RDM greater than the excitation level between |D i and |D j . For instance, single excitation spawning events operations required to sample these promoted contributions from the diagonal of Eq. 18.
Other efficiency boosting modifications to the algorithm, such as the semi-stochastic adaptation 12 (detailed in Sec III) are also seamlessly integrated with the RDM accumulation. Within the deterministic core space the RDM contributions are exactly accumulated along with the exact propagation, with the connections from the deterministic to the stochastic spaces sampled in the standard fashion. This combination of RDM sampling with the semi-stochastic algorithm can greatly reduce the stochastic errors in the RDMs by ensuring that contributions from large weighted determinant amplitudes are explicitly and deterministically included. Furthermore, the reference determinant and its direct excitations are also exactly accumulated. This is partly because these are likely important contributions, but principally, if the reference is a Hartree-Fock determinant then the coupling to its single excitations via the Hamiltonian will be zero due to Brillouin's theorem. These single excitations will nevertheless contribute to the RDMs, and therefore are included explicitly.
The sampling of RDMs with a rank greater than two is also now possible within the FCIQMC algorithm and NECI code. The importance of these quantities is primarily in their use in internally-contracted multireference perturbation theories, although a number of other uses for these quantities also exist 18 . These methods allow for the FCIQMC dynamics to only consider an active orbital subspace, hugely reducing both the full Hilbert space of the stochastic dynamics as well as the required timestep, while the accumulation of up to 4-RDMs (or contracted lower-order intermediates for efficiency) allows for a rigorous coupling of the strong correlation in the low-energy active space to the dynamic correlation in the wider 'external' space via post-processing of these higher-body RDMs with integrals of the external space. Sampling of higher-body RDMs cannot use the identical algorithm to the 2-RDMs, since it now requires the product of determinant amplitudes separated by up to 4-electron excitations, which are not explicitly sampled via the standard FCIQMC propagation algorithm. To allow for this sampling, we include an additional spawning step per walker of excitations with a rank between three and n, where n is the rank of the highest RDM accumulated. This additional spawning is controlled with a variable stochastic resolution, ensuring that the frequency of these samples is relatively rare to control the cost of sampling these excitations (approximately only one higher-body spawn for every 10-20 traditional (up to two-body) spawning attempts). There is no timestep associated with these excitations, and every attempt is 'successful', transferring information about higherbody correlations in the system and contributing to these higher-body excitations, but not modifying the distribution of the sampled wave function. However, the dominant cost of sampling these higher-body RDMs is not the sampling events themselves, but rather the promotion of lower-rank excitations to these higher-body intermediates. Nevertheless, the faithful sampling of these higher-body properties has allowed for the stochastic estimate of fully internally-contracted perturbation theories in large active spaces, with similar number of walkers required to sample the 2-RDM in an active space 18 .

VIII. EXCITED STATE CALCULATIONS
In many applications, besides ground state energies, the properties of excited states are of interest. If states in different symmetry sectors are targeted, this can be easily achieved by performing separate calculations in each sector, yielding the ground state with a given symmetry. If, however, several eigenstates with the same symmetry are required, then this approach is not sufficient. The FCIQMC method is not inherently limited to ground state calculations, and can employ a Gram-Schmidt orthogonalization technique to calculate a set of orthogonal eigenstates 13,17 . The obtained states will then be the lowest energy states with a given symmetry.
Calculating eigenstates sequentially and orthogonalizing against all previously calculated states carries the problem of only orthogonalizing against a single snapshot of the wave function, which will lead to a biased estimate of the excited states. Instead, calculating all states in parallel and orthogonalizing after each iteration gives much better results.
with the orthogonalization operator for the n-th statê With this definition of the orthogonalization operator, the ground state FCIQMC wave function (n = 0) is left unaffected. The first excited state (n = 1) is then orthogonalized against the ground state (using the updated wave functions at τ + ∆τ , after annihilation has been performed). The second excited state is orthogonalized against both the ground and first excited state, and so on.
To enforce the FCIQMC wave function discretization, after performing the orthogonalization, all determinants with a coefficient smaller than the minimal threshold (typically 1) are stochastically rounded (either down to 0 or up to 1, in an unbiased manner). This is required to prevent proliferation of very small walkers, which adversely affects the wave function compression.

IX. RESPONSE THEORY WITHIN FCIQMC TO CALCULATE STATIC MOLECULAR PROPERTIES
Response theory is a well-established formalism to calculate molecular properties using quantum chemical methods [75][76][77][78] . It is, in general, formulated for a time-dependent field which allows to compute both static and dynamic molecular properties. However, it is currently only implemented for a static field within NECI 66 .
Calculation of molecular properties using response theory relies on the evaluation of the response vectors which are the first or higher order wave functions of the system in the presence of an external perturbationV . According to Wigner's "(2n+1)" rule, response vectors up to order n are required to obtain response properties up to order 2n + 1 77 . For calculating second-order properties such as dipole polarizability, the first-order response vector, C (1) , needs to be obtained along with the zero-order wave function parameter C (0) .
While C (0) uses the original FCIQMC working equation 4, C (1) is updated according to The response vector is discretized into signed walkers in the same way it is done for C (0) . The dynamics of the response-state walker is simulated according to Eq. 21 using an additional pair of replica and it works in parallel with the dynamics of the zero-order state.
Additional spawning and death steps are devised for the response-state walker dynamics, due to the presence of the perturbation, alongside the original spawning and death steps in the dynamics. The dependence of the response state on the zero-order states comes from these two aforementioned additional steps. A Gram-Schmidt orthogonalization is applied to the response-state walker distribution with respect to the zero-order walker distribution at each iteration using the same functionality as described in section VIII. This ensures orthogonality of the response vectors with respect to all lower-order wave function parameters.
The norm of the response walkers is fixed by the choice of the normalization of the zeroorder walkers and it can, in principle, grow at a much faster rate than the zero-order norm.
Therefore, in Eq. 21 we introduce the parameter α to control the norm of the response walkers and to reduce the computational effort expended in simulating their dynamics. We aim at matching the number of response-state walkers (N w ) with the number of zero-order walkers (N (0) w ) by updating α periodically as Once the walker number stabilizes, the value of α is kept fixed, while accumulating statistics.
As α scales the norm of the response vector, it needs to be taken into account while evaluating response properties.
Response properties are then obtained from transition reduced density matrices (tRDMs) which are stochastically accumulated following Eq. 18. For example, dipole polarizability is obtained from the one-electron tRDMs between the zero-and first-order wave function as with the γ y p,q being calculated from the two-electron tRDM as Due to the use of two replica per state while sampling both zero-and first-order states, statistically independent and unbiased estimator of tRDMs can be constructed in two alternative ways which are denoted here as ' [1]' and ' [2]'. The perturbation used in the computation of the tRDMs in Eq. 24 is the dipole operatorŷ. The factor 1 α appears due to the re-scaling of the response vector following Eq. 21.

X. REAL-TIME FCIQMC
For the purpose of obtaining spectroscopic data or targeting highly excited states, the calculation of orthogonal sets of eigenstates quickly becomes unfeasible, as to obtain a certain eigenstate, all eigenstates of lower energy with the same symmetry have to be computed as well. Spectral functions and the resulting excitation energies can however be calculated using excited state calculations discussed in section VIII is mandatory, as is running with complex coefficients. The real-time propagation can be used to obtain energy gaps from spectral densities and thus target excited states. In contrast to the direct calculation of excited states, these have not to be calculated one by one and in order of ascending energy, however.
In Figure 3, a simple example for applying both the excited-state search and the real-time evolution to the Beryllium atom in a cc-pVDZ basis set to obtain the singlet-triplet gap of the lowest P-state is given. An issue with running real-time calculations is the difficulty of population control, as the death step is essentially replaced by a rotation in the complex plane. This issue can be mitigated by a rotation of the trajectory, evolving along a trajectory in complex plane. NECI supports an automated trajectory selection that updates the angle α of the time trajectory in the complex plane to maintain a constant population. The Green's function obtained in the complex plane can then be used to obtain real-frequency spectral functions using analytic continuation 79,80 , with the analytic continuation being significantly easier and more information being recoverable the closer to the real axis the trajectory is 21 . As, in contrast to the projector FCIQMC, errors arising from the expansion of the propagator are a concern when running complex-time calculations, NECI uses a second order Runge-Kutta integrator here, to sufficiently reduce the time-step error.

XI. TRANSCORRELATED METHOD
The computational cost of a Full CI method usually scales exponentially with respect to the size of the basis set. On the other hand, the low regularity of wave functions (characterized by the electronic cusp 83 ) causes a very slow convergence towards the basis set limit.
For calculations aiming at highly accurate results, it is very helpful to speed up such slow convergence.
A Jastrow Ansatz 84 offers a way to factor out the cusp from the wave function whereT = i<j u(r i , r j ) is a symmetric function (u(r i , r j ) = u(r j , r i )) over electron pairs, and |Φ is an anti-symmetric many-body function. By including the cusp term |r i − r j |/2  in u(r i , r j ), the regularity of |Φ is improved at least by one order over |Ψ 85 . We can also include other terms in u(r i , r j ) to capture as much dynamic correlations as possible.
By using variational quantum Monte Carlo methods (VMC), the pair correlation function u(r i , r j ) can be obtained for a single determinant |Φ (e.g., |Φ HF ) or a linear combination of small number of determinants (e.g., a small CAS wave function).

The transcorrelated method of Boys and Handy
The advantage of this form ofT is that the similarity transformation leads to an expansion which terminates at second order The similarity transformation introduces a novel two body operatorK and a three-body The whole transcorrelated Hamiltonian can be written in second quantised form as where h p q = φ p |h|φ q and V pq rs = φ p φ q |r −1 12 |φ r φ s are the one-and two-body terms of the molecular Hamiltonian, while K pq rs = φ p φ q |K|φ r φ s and L pqr stu = φ p φ q φ r |L|φ s φ t φ u originate from theK andL operators.
This transcorrelated method has been investigated by FCIQMC using NECI , as it can es- The matrix elements K pq rs and L pqr stu are pre-calculated and have to be supplied as input. The matrix elements of K can be passed combined with the Coulomb integrals, while the matrix elements of L are passed in a separate input file. This treatment is efficient for small atomic and molecular systems, but for large systems the storage of the L matrix becomes a bottleneck. Here, efficient low rank tensor product expansion of L, might in the future make it practical to treat even larger systems. NECI supports storage of L in a dense and a sparse format as well as on-the-fly calculation of L pqr stu from a tensor decomposition. Additionally, major technical changes to the FCIQMC implementation are required for sampling up to triple excitations, which generally leads to reduced time-steps. The development of efficient excitation generation, which can alleviate the time-step bottleneck, is the subject of current work.
This method has been tested on the first row atoms 24 , which shall serve as an example here. Two different correlation factors obtained by Schmidt and Moskowitz 87 based on variance-minimisation VMC, which contain 7 and 17 terms of polynomial type basis functions have been employed there. The 7 term factor (SM7) contains mainly electron-electron correlation terms together with some electron-nuclear terms, while the 17 term factor (SM17) uses more terms to describe also the electron-electron-nuclear correlation. For the full CI expansion of |Φ , three different basis sets, cc-pVDZ, cc-pVTZ and cc-pVQZ respectively have been used. In Fig. 4 the convergence of the total energies errors are displayed for the two different correlation factors, in comparison with the the CCSD(T)-F12 method. This demonstrates that improving the correlation factor can lead to a significant speed up of the basis set convergence. Using the 17 term factor, the CBS limit results can already be reached (within errors < 1 mH) using a cc-pVQZ basis sets.

XII. SYMMETRIES AND SPIN-ADAPTED FCIQMC
Symmetry is a concept of paramount importance in the description and understanding of physical and chemical processes. According to Noether's theorem there is a direct connection between conserved quantities of a system and its inherent symmetries. Thus, identifying them allows a deeper insight in the physical mechanisms of studied systems. Moreover, the usage of symmetries in electronic structure calculations enables a much more efficient formulation of the problem at hand. The Hamiltonian formulated in a basis respecting these symmetries has a block-diagonal structure, with zero overlap between states belonging to different 'good' quantum numbers. This greatly reduces the necessary computational effort to solve these problems and allows much larger systems to be studied.

NECI
There are several symmetries which are commonly used in electronic structure calculations, due to the above mentioned benefits and their ease of implementation. And our FCIQMC code NECI is no exception in this regard.

Conservation of theŜ z spin-projection
As mentioned in section I, FCIQMC is usually formulated in a complete basis of Slater determinants (SDs). SDs are eigenfunctions of the totalŜ z operator, and consequently, if the studied Hamiltonian,Ĥ, is spin-independent (no applied magnetic field and spin-orbit interaction) it commutes withŜ z , [Ĥ,Ŝ z ] = 0. The conservation of the m s eigenvalue in a FCIQMC calculation thus follows quite naturally: the initial chosen m s sector, determined by the starting SD used, will never be left by the random excitation generation process sketched in section II. No terms in the spin-conserving Hamiltonian will ever cause any state in the simulation to have a different m s value than the initial one. As a consequence the sampled wavefunction will always be an eigenfunction ofŜ z with a chosen m s , determined at the start of a calculation.

Discrete and Point Group Symmetries in FCIQMC
NECI is also capable of utilizing Abelian point group symmetries, with D 2h being the 'largest' spatial group (similar to other quantum chemistry packages, e.g. Molcas 88 and Molpro 89,90 ), momentum conservation (due to translational invariance) in the Hubbard model and uniform electron gas calculations and preservation of the m l eigenvalues of the orbital angular momentum operatorL z (the underlying molecular orbitals have to be constructed as eigenfunction ofL z ). This is implemented via a symmetry-conserving excitation generation step and is explained in more detail in Appendix A 1 a.

B. Total spin conservation
One important symmetry of spin-preserving, nonrelativistic Hamiltonians is the global SU(2) spin-rotation symmetry. However, despite the theoretical benefits, the total SU(2) spin symmetry is not as widely used as other symmetries, like translational or point group symmetries, due to their usually impractical and complicated implementation.
There are two kind of implementations of total spin conservation in our FCIQMC code NECI . One approximate one is based on Half-Projected Hartree-Fock (HPHF) functions 44,91-94 . Their rationale relies on the fact that for an even number of electrons, every spin state |S contains degenerate eigenfunctions with m s = 0. Using time-reversal symme-try arguments a HPHF function can be constructed as

The (graphical) Unitary Group Approach (GUGA)
Our full implementation of total spin conservation is based on the graphical Unitary group approach (GUGA). It relies on the observation that the spin-free excitation operatorŝ E ij in the spin-free formulation of the electronic Hamiltonian, have the same commutation relations, The last restriction in Eq. 36 corresponds to the fact that the (intermediate) total spin must never be less than 0.
The most important finding of Paldus and Shavitt 102,103 was that the Hamiltonian matrix elements-more specifically the coupling coefficients between two CSFs, e.g. m ′ |Ê ij |m -can be entirely formulated within the framework of the GUGA; without any reference and thus necessity to transform to a Slater determinant based formulation. Although CSFs can be expressed as a linear combination of SDs, the complexity of this transformation scales exponential with the number of open-shell orbitals of a specific CSF 104 . Thus, it is prohibitively hard to rely on such a transformation and for already more than ≈ 15 electrons a formulation without any reference to SDs is much more preferable.
Furthermore, Shavitt and Paldus 102,103 were able to find a very efficient formulation of the coupling coefficients as a product of terms, via the graphical extension of the UGA. Matrix elements between two given CSFs only depend on the shape of the loop enclosed by their graphical representation, as depicted in Fig. 5. The coupling coefficient of the one-body operatorÊ ij is given by where the product terms depend on the step-values of the two CSFs, d ′ k and d k , the difference in the current spin ∆S k (with the restriction S ′ k − S k = ±1/2) and the intermediate spin S k of |m at orbital k. Q k in Eq. (37) depends on the shape of the loop formed by |m and |m ′ at level k and is tabulated in e.g. Ref. [102]. Additionally, the two CSFs, |m and |m ′ , must coincide outside the range (i, j) for Eq. (37) to be non-zero.

Spin-adapted excitation generation -GUGA-FCIQMC
The compact representation of spin-adapted basis functions in form of step-vectors and the product form of the coupling coefficients (37) allow for a very efficient implementation in our stochastic FCIQMC code NECI. As mentioned in Sec. II, the excitation generation step is at the heart of any FCIQMC code.
The main difference to a SD-based implementation of FCIQMC, apart from the more involved matrix element calculation (37), is the higher connectivity within a CSF basis.
For a given excitation operatorÊ ij , with spatial orbital indices (i, j), there is usually more than one possible excited CSF |m ′ when applied to |m ,Ê ij |m = k c k |m ′ k . All valid spin-recouplings within the excitation range (i, j) can have a non-zero coupling coefficient as well. This fact is usually the prohibiting factor in spin-adapted approaches. However, there is a quite virtuous combination of the concepts of FCIQMC and the GUGA formalism, as

Example: Hydrogen chain in a minimal basis
The GUGA-FCIQMC method has been benchmarked 105 by applying it to a linear chain of L equidistant hydrogen atoms 106 recently studied to test a variety of quantum chemical methods 107 , which shall serve as an example here. Using a minimal STO-6G basis there is only one orbital per H atom and the system resembles a one-dimensional Hubbard model 41,108-110 with long-range interaction. Studying a system of hydrogen atoms removes complexities like core electrons or relativistic effects and thus is an convenient benchmark system for quantum chemical methods.

XIII. PARALLEL SCALING
When applying for access to large computing clusters, it is often necessary to demonstrate that the software being used (in this case NECI ) is capable of using the hardware efficiently.
Ideally, the speed-up relative to using some base number of compute cores should grow perfectly linearly with the number of cores. In 2014, Booth et. al. 94 , presented an example with 500×10 6 walkers in which no deviation from a linear speed-up is noticeable when    (31) comparing using 512 cores to using 32, and even at 2048 cores, a speed-up by a factor of 57.5 was reported, which is 90% of the ideal speed-up factor of 64. In that work, the largest number of cores explored was 2048. By comparing the performance for a calculation with 100×10 6 walkers and 500×10 6 walkers, the same figure showed that the speed-up became closer to the ideal speed-up when the number of walkers was increased, suggesting that when using even more walkers, the efficiency comes even closer to 100% of the ideal speed-up factor.
Since 90% of the ideal speed-up factor was achieved in 2014 with only 500×10 6 walkers on 2048 cores, and large compute clusters nowadays tend to have tens of thousands of cores available, we report scaling data for a much larger number of walkers on up to 24,800 cores in Table IV. The calculations were done using the integrals in FCIDUMP format for the (54e,54o) active space first described in 116 for the FeMoco molecule, and the output files are provided in the supplementary material 117 .
The scaling analysis presented in Table IV was done with 32 billion walkers on each of the two replicas used for the RDM sampling. Calculations at 32 billion walkers are expensive, so we only completed enough iterations to determine an accurate estimate of the average runtime per iteration for the scaling analysis, and not enough iterations to accurately estimate the energy.
One may ask whether or not the scaling observed in Table IV was performed for a reasonable number of walkers for this active space. To answer this question, we compare in  to energies obtained with i-FCIQMC at only 8 billion walkers/replica, and find that the i-FCIQMC-RDM and i-FCIQMC-PT2 energies are closer together than the sHCI-VAR and sHCI-PT2 energies, indicating that the i-FCIQMC energies are closer to the true FCI limit where the difference between variational and PT2 energies should vanish. The DMRG result lies about half-way between the two i-FCIQMC results, but fairly well below the lower of the sHCI results (a forthcoming publication specifically about the FeMoco system is planned, in which more details will be presented, but the purpose of this paper is to give an overview of the NECI code).
Furthermore, comparing the time per iteration between 8 × 10 9 and 32 × 10 9 walkers shows that a high parallel efficiency is also achieved at the lower walker number. The determinants in NECI are stored using a hash table, making i-FCIQMC linearly scaling in the walker number 94 , so the ideal time per iteration with 32 × 10 9 walkers at 19960 cores according to the result for 8 × 10 9 walkers at 16000 cores would be 23.4 seconds, which is only marginally smaller than the reported 23.5 seconds. Note however, that this is the relative efficiency between large scale calculations, which demonstrates performance gain from extending parallelization at large scales, not from parallelization over the entire range of scales, which is addressed to some extent by the Chromium dimer example below.
In the case of the Chromium dimer (cc-pVDZ, 28

A. Load balancing
The parallel efficiency of NECI is made possible by treating static load imbalance. NECI contains a load-balancing feature 28 , which is enabled by default and periodically re-assigns some determinants to other processors in order to maintain a constant number of walkers per processor. As can be seen in figure 7, for the given benchmarks, no significant load

XIV. INTERFACING NECI
The ongoing development of NECI is focused on an efficiently scaling solver for the CIproblem. It is not desirable to reimplement functionality that is already available in existing quantum chemistry codes. Since the CI-problem is defined by the electronic integrals and subsequent methods depend on the results of the CI-step, namely the reduced density matrices, it is easily possible to replace a CI-solver of existing quantum chemistry code with NECI .
To use NECI only an input file and a FCIDUMP file 122 , which is the widely understood file format for the electronic integrals, are required. After running NECI the stochastically sampled reduced density matrices are available as input for further calculations in other codes. It is possible to link NECI as library and call it directly or to run it as external process and do the communication with explicit copying of files. The first alternative will be referred to as embedded, the second is the decoupled form.
Due to the stochastic nature of the Monte Carlo algorithm, it is not yet possible to use NECI as a black box CI-solver for larger systems. In this case it is recommended to use the decoupled form for a better manual control of the convergence. Another advantage of the decoupled form is the combination of NECI with different quantum chemical algorithms or implementations that do not benefit from massive parallelisation as much as NECI . This way it is possible to switch from serial or single node execution to multiple nodes in the CI-step.
The wave function is improved by mixing single excitations to the |0 wave function. As the CASSCF optimization proceeds, the χ pq coefficients decrease until they vanish, and pled to the adaptive shift approach discussed in section V with a great enhancement in performance.

XVI. CONCLUSION
With NECI we present a state of the art FCIQMC program capable of running a large and FCIDUMP_Ne_st_pVDZ integral files and output_file_Ne_st_pVDZ.txt output stats_ file_Ne_st_pVDZ.txt files). All output files contain the corresponding FCIQMC input.

XVIII. DATA AVAILABILITY STATEMENT
The data that supports the findings of this study are available within the article and its supplementary material 131 . The NECI program can be obtained at https://github. com/ghb24/NECI_STABLE, the development version can be obtained from the corresponding author upon reasonable request.

ACKNOWLEDGMENTS
The early development of NECI was supported by the EPSRC under grant numbers EP/J003867/1 and EP/I014624/1.
We would like to thank Olle Gunnarsson, David Tew, Daniel Kats, Aron Cohen, and Vamshi Katakuri for insightful discussions.
The high performance benchmarks discussed in section XIII, were ran on the MPCDF (Max Planck Computing & Data Facility) system Cobra.

Appendix A: Stochastic excitation generation and p gen
In the following appendices we will consider in some detail the process of (random) excitation generation in FCIQMC -a crucial yet rather flexible aspect of the algorithm.
We will consider some general aspects, such as implementation of Abelian symmetries in the excitation process, as well as non-uniform excitation generation, as is often desirable in quantum chemical Hamiltonians. There are other classes of systems (such as Hubbard models, Transcorrelated Hamiltonians, spin models such as Heisenberg systems, etc) for which more specialised considerations are necessary for efficient excitation generation but we will not consider them here.
The first general point about excitation generation, (by which we mean starting from a given determinant |D i we randomly pick either one or two electrons, and a corresponding number of holes to substitute them with, to create a second determinant |D j ), is that if |H ij | > 0, then the probability (p gen (j|i)) to select |D j and |D i , must also be greater than 0. Furthermore, p gen (j|i) must be computable, and in general the effort to do so will depend on the algorithm chosen to execute the excitation process.
Let us discuss in more detail the process of stochastic excitation generation, and its impact on p gen . Suppose we are simulating a system of n electrons in 2N spin orbitals {φ 1 , ..., φ 2N }. A given determinant |D i can be defined by its occupation number representation, I = |n 1 , ..., n 2N , which is a binary string such that n i = 1 if orbital i is occupied ('an electron in |D i '), and n i = 0 if it is unoccupied ('a hole in |D i '). Each orbital carries a spin quantum number σ(φ i ), and may also carry a symmetry label, Γ(φ i ). These are both discrete symmetries, with σ = ±1/2, and Γ = Γ 1 , ..., Γ G , where G is the number of irreducible representations available in the point-group of system under consideration. We will only consider Abelian groups, so that the product of symmetry labels uniquely specifies another symmetry label. This simplifies the task of selecting excitations, although it does not necessarily exploit the full symmetry of the problem.

Uniform excitation generation
Now we wish to perform a stochastic excitation generation, which we will initially consider without the use of any symmetry/spin information. For example, we can select a pair of electrons, i, j (with i < j) in |D i , at random, and a pair of holes a, b (with a < b), and perform the transition ij → ab. The corresponding matrix element is We will denote the electron pair simply as ij and the hole pair ab.
For this simple procedure, it is clear that the probability to choose |D j from |D i is simply: from which it follows that p gen (j|i) ∼ (nN) −2 . This procedure does not take symmetry or spin quantum numbers into account, and it is quite possible that the corresponding Hamiltonian matrix element is zero. To ensure that we do not generate such excitations, we need to select the hole pairs so that following two conditions are met: These restriction greatly impact the way in which we will select i, j and a, b, and the resulting generation probability.

a. Imposing symmetries via conditional probabilities
One way to impose symmetries in excitation generation while keeping track of the generation probabilities is via the notion of conditional probabilities. For example, rather than drawing (ij) and (ab) independently, with probability p(ab, ij) = p(ab)p(ij), one can instead draw (ab) given that one has already drawn (ij); the probability for this process is given by where p(ij) is the probability to select (ij) in the first place. If (ij) has a particular characteristic that confers a physical (e.g. symmetry-related) constraint on (ab), this can be implemented at the stage in which we select (ab): (ab) need only be selected from among those hole-pairs for which the constraint is satisfied. For example, if the electrons (ij) have opposite spins then the holes (ab) must also have opposite spins. The smaller number of possibilities in choosing the ab pair then leads to a larger p(ab|ij) compared to p(ab), which can be thought of as a renormalisation of the latter probability to take into account the constraint.
The concept of conditional probabilities can be further extended so that the pair (ij) itself is made to satisfy a particular condition. Suppose we introduce a set of of conditions ..} such that the union of all such conditions is exhaustive. It is possible to draw conditional probabilities with respect to such conditions. For example, p(ab, ij) = p(ab, ij|C 1 )p(C 1 ) + p(ab, ij|C 2 )p(C 2 ) (A8) with p(C 1 ) + p(C 2 ) = 1.
(A9) p(C 1 ), the probability to select same-spin excitations, can be chosen arbitrarily, which then fixes p(C 2 ) according to the above.
The advantage of this formulation is that we can skew the selection of electron pairs, for example, towards opposite spin excitations if that proves advantageous, and to be able to compute the resulting probabilities. Furthermore, we can write: which allows us to select a pair of electrons satisfying condition C 1 , and subsequently draw a pair of holes given one has selected an electron pair with the same spin (which implies that hole-pair must be chosen to have the same spin as the electron-pair).

Cauchy-Schwartz excitation generation
Let us now consider how to generate the hole pairs in a non-uniform manner, to reflect the fact that, in ab initio Hamiltonians, the matrix elements vary strongly in magnitude. Since the spawning probability is proportional to the ratio |H ij |/p gen (j|i), it is clearly desirable to generate excitations which make this ratio as uniform as possible, ideally with p gen (j|i) ∝ |H ij |. In this way, one would ensure a relatively uniform probability of successful spawning, which ideally would be close to one, implying a low rejection rate. Keeping the discussion focussed on double excitations (the generalisation to single excitations being straightforward) the question that arises is: how best can one select ij and ab such that p gen (j|i) ∝ |H ij | to a good approximation, and p gen remains exactly computable without excessive cost. We will see there is a compromise to be made. One can ensure precise proportionality between p gen (j|i) and |H ij | but only at prohibitive cost. Alternatively, one might be able to select ij and ab to effect the transition |D i → |D j based on computationally inexpensive heuristics, to provide approximate proportionality, which will nevertheless allow for a large overall improvement in efficiency.
To ensure exact proportionality between p gen (j|i) and |H ij | it is necessary to enumerate all electron-pairs and hole-pairs which are possible from |D i , and to construct the cumulative probability function (CPF), from which the desired distribution can be straightforwardly sampled. The (unnormalised) CPF is: In this expression, the sum over ee ′ runs over all enumerated electron pairs in D up to ij, and similarly for the hole-pairs (up to ab). The CPF is a non-decreasing function of its discrete arguments, and its inverse transform enables one to select ab and ij with probability proportional to | ij||ab |. From the point of generation probabilities, this is the ideal excitation generator, allowing for a uniform spawning probability (which can be made to equal unity, implying zero rejection rates.) Unfortunately the CPF costs O(n 2 N 2 ) to set up (for each determinant |D i ), making it prohibitive in practice.
To make practical progress, we need an approximate distribution function which is much cheaper to calculate. Two observations can be made in this relation. First, if the two electrons have different spin, then the Hamiltonian matrix element consists of only one rather than two terms. This is because upon excitation ij → ab, the two holes must match the spins of the two electrons. For example σ(a) = σ(i) and σ(b) = σ(j). In this case, the Hamiltonian matrix element reduces to: with the exchange term ij|ba = 0.
With this simpler matrix element, we now ask: given that we have chosen an electron pair ij, how can we select the hole pair ab so that, with high probability, the resulting matrix element ij|ab is large? At this point we can appeal to the Cauchy-Schwarz inequality, which provides a strict upper bound: This suggests that, as long as ij|ab is non-zero by symmetry, it may be advantageous to select the hole a so that ii|aa is large, and the hole b so that jj|bb large. Because i and j have different spins, the selection of a and b will be independent of each other, with a for example being chosen from the α-spin holes available, and b from the β-spin holes. To do this, we set up two CPFs: where the sums over h runs over the α or β holes in D.
In practice, we must draw two holes a and b from the same set of holes, avoiding the possibility of drawing the same hole twice. Because we would like to avoid setting up a twodimensional CPF (which would cost O(N 2 )), we create one one-dimension CPF in order to draw hole a, and then remove this hole in the CPF before drawing the second hole. In other words we set up two related CPFs drawing hole a from F a and hole b from F ′ b . Our exploration of excitation generation has led us to discover many highly performing schemes. The Cauchy-Schwarz (CS) scheme presented above is a good starting point, but it has a number of weaknesses that can be further addressed. In particular, as noted above, the upper bound obtained is particularly poor for double excitations with the same spin, and in general the specified bound can be too loose. Fortunately, the selection of the second hole, b, is made once the first hole, a, has already been chosen, and as such the exact double excitation Hamiltonian matrix elements can be used at this stage, such that an updated CPF for selecting the second electron is given by This Part-Exact (PE) scheme no longer provides a strict bound, but by better representing the cancellation of terms present in these matrix elements, it provides a substantially better approximation. More crucially, it improves the prediction of the elements that were previously handled least effectively, and thus relaxes the time-step constraints on the overall calculation.
Due to the increase in computational cost involved in constructing two lists, and the additional normalisation of the probabilities required by causing the two selections not to be made in the same manner, this update to the scheme increases computational cost per iteration. In almost all systems examined this is far outweighed by the time-step changes, especially in systems with large basis sets or with translational symmetry. However, it is possible to find systems where the pure CS scheme is more optimal.

a. Preparing for excitation generation
For determinant D, to pick an excited determinant, first construct a table of hole occupancies for each spin and irreducible representation, so that n σΓ [D] is the number of holes with spin σ in irrep Γ available in D. This is an O(n) process.
We next decide whether we wish to make a single excitation or a double excitation from D. A single excitation is chosen with probability p sing , a parameter which can be optimised as the simulation proceeds to maximise the acceptance ratio and time-step of the simulation.
The probability to create a double excitation is chosen such that the maximal ratios |H ij | pgen(j|i) for single and double excitations are equal, which for ab initio systems typically means double excitations dominate. To a first approximation p sing = nN/(nN + n 2 N 2 ), which is in general a small number on the order of (nN) −1 . The probability of attempting a double excitation is then p doub = 1 − p sing .

b. Single Excitations
If a single excitation is being attempted, first select an electron (say i) at random, with probability n −1 . The spin σ = σ(i) and irrep Γ = Γ(φ i ) of the electron determines the spin and irrep of the hole.
To select the hole a, run over all n σΓ holes available in D with spin and symmetry σΓ, and compute the (unnormalised) cumulative probability function, where |D h i is a single-excitation i → h from |D , and D h i |Ĥ|D is the Hamiltonian matrix element between them. The normalisation of the CPF is give by the last element in the array: where n σΓ is the number of holes available with spin σ in irrep Γ in D. Using F (1) a , select hole a (with probability | D a i |H|D |), by inverting the CPF. This is selected by generating a uniform random number ξ in the interval [0, Σ i ), and determining the index of a such that the condition is met. The overall generation probability for this excitation is: where This completes the selection of a singly excited determinant. The computation of F

c. Double Excitations with opposite spin electron-pairs
If a double excitation is being attempted, then firstly a pair of electrons needs to be selected. The first electron, i, should be selected uniformly at random. Following this, the should be constructed, where p opp is a optimisable biasing factor towards excitations with electrons having opposite spins. The second electron is selected through inversion of the CPF.
If the two selected electrons have opposite spins, then the first hole to be chosen is, by convention, always a β electron, and the second hole always α. This choice is entirely arbitrary, and in some high-spin systems it may make sense to reverse this selection. Considering all available orbitals of this spin, the CPF is constructed, where i is taken to be the electron from the selected pair with β spin, and the hole selected by inverting the CPF.
Once this first electron has been chosen, the symmetry of the target orbital is now fixed by the constraint that Γ a ⊗ Γ ′ = Γ i ⊗ Γ j . This greatly restricts the number of holes that must be considered when constructing the final CPF, Note that with the conventional choice of orbital i above, ji|ah = 0, and can thus be excluded. The second hole is then also obtained by inverting the CPF. The generation probability is then given by: p gen (ab, ij) = p(ij)p(a|i)p(b|ija)p doub , (A32) where Σ i , Σ (β) (i), Σ (αΓ ′ ) (ija) are the normalisations of F j , F The asymetric selection of α and β holes is somewhat peculiar. It should be noted that it is possible to make this selection symmetrically, considering all available holes in the selection of the first hole, and then renormalising the probabilities to account for the possibility of selecting b first. The symmetric scheme increases computational cost substantially (twice as many holes need to be considered in the CPF, and a further CPF must be calculated for the renormalisation). It also makes the overall time-step behaviour worse as, although it improves the general smoothness, for the worst-case scenario with a very rarely selected excitation with very different a, b and b, a probabilities, the denominator is increased substantially by considering more orbitals, whilst leaving the numerator essentially unchanged.

d. Double excitations with same-spin electron pairs
If the pair of electrons, selected as described above, have the same spin the process needs to account for the fact that the holes can be selected in either order, and the probabilities need to be adjusted to compensate. Now, considering only holes with the same spin as the two electrons, construct the CPF Hole a can then be selected through inversion of this CPF, which fixes the symmetry of hole b such that Γ a ⊗ Γ b = Γ i ⊗ Γ j . The CPF for selecting the second hole can then be constructed from the (much smaller) set of holes with the appropriate symmetry, such that The second hole, b, can then be selected through inversion of this CPF. It is important to note that as the selection of the first hole includes all holes of the hole with the given spin, the selection of the holes could have been made in the reverse order, and this needs to be taken into account in the generation probability, which is given by: p gen (ab, ij) = [p(a|ijb)p(b|ij) + p(b|ija)p(a|ij)]p(ij)p double , where b are the normalisations of the three CPFs, given by their final elements. Note that in the implementation, the normalisations of four CPFs must be calculated to be able to calculate p(a|ijb) as well as p(b|ija).

Pre-computed heat-bath sampling
While the Cauchy-Schwartz excitation generator has negligible memory cost, picking an excitation requires O(N) steps, each involving Hamiltonian matrix elements, making the procedure expensive. The pre-computed heat-bath algorithm employed in NECI is a simple approximation derived from the heat-bath sampling 32 and offers a much faster excitation generation, at the cost of increased memory requirement. The heat-bath probability distribution can also used to determine a cutoff in a deterministic scheme, leading to the heat-bath CI (HCI) method 31 . The sampling can either use uniform single excitations or the weighted scheme outlined in section A 2 b, and approximates the exact heat-bath sampling of doubleexcitations by uniformly picking the occupied orbitals, and then picking two target orbitals simultaneously weighted with the Hamiltonian matrix element. Since the double excitations play the largest role in excitation generation, and the singles' matrix elements depend on the determinants in addition to the excitation, it is typically most efficient to generate only double excitations in a weighted fashion, resulting in an excellent tradeoff between optimal weights and the cost of excitation generation.
To create a double excitation using pre-computed heat-bath generation, first, two occupied orbitals i, j are chosen uniformly at random, using a bias towards spin-opposite exctitations, which is determined similar to the bias towards double excitations. This works analogously to the Cauchy-Schwartz excitation generation outlined in section A 2. Then, a pair a, b of orbitals is chosen using pre-computed weights where H ab ij is the matrix element for a double excitation from orbitals i, j to orbitals a, b. These are independent of the determinant and thus can be pre-computed at memory cost O(M 4 ). Then, pairs of orbitals can be picked using these weights via alias sampling 132 in O(1) time. If one of the picked orbitals a, b is occupied, or all matrix elements H ab ij are zero, the excitation is immediately rejected, otherwise, we continue with the FCIQMC scheme.
As it is desirable to use spatial orbital indices to save memory, but the matrix element depends on the relative spin of the orbitals in the case of a spin-opposite excitation since it determines if an exchange integral is used, for each pair of spatial orbitals i, j, three probability distributions are generated, one for the spin-parallel case, one for the spinopposite case without exchange and one for the spin-opposite case with exchange. Between the latter two, we then choose the exchange case with probability p exch (ij) = ab |H aβbα iαjβ | ab |H aβbα iαjβ | + |H aαbβ iαjβ | .

(A49)
The denominator is the same as the denominator in Eq. (A48) for spin orbitals, while the numerator is the denominator in Eq. (A48) for spatial orbitals in the exchange case. The bias p exch hence relates the spatial orbital distributions to the original distribution (A48).
This approach is tailored for rapid excitation generation, as the process is in principle