Multi-level methods and approximating distribution functions

Biochemical reaction networks are often modelled using discrete-state, continuous-time Markov chains. System statistics of these Markov chains usually cannot be calculated analytically and therefore estimates must be generated via simulation techniques. There is a well documented class of simulation techniques known as exact stochastic simulation algorithms, an example of which is Gillespie's direct method. These algorithms often come with high computational costs, therefore approximate stochastic simulation algorithms such as the tau-leap method are used. However, in order to minimise the bias in the estimates generated using them, a relatively small value of tau is needed, rendering the computational costs comparable to Gillespie's direct method. The multi-level Monte Carlo method (Anderson and Higham, Multiscale Model. Simul. 10:146-179, 2012) provides a reduction in computational costs whilst minimising or even eliminating the bias in the estimates of system statistics. This is achieved by first crudely approximating required statistics with many sample paths of low accuracy. Then correction terms are added until a required level of accuracy is reached. Recent literature has primarily focussed on implementing the multi-level method efficiently to estimate a single system statistic. However, it is clearly also of interest to be able to approximate entire probability distributions of species counts. We present two novel methods that combine known techniques for distribution reconstruction with the multi-level method. We demonstrate the potential of our methods using a number of examples.

time Markov chains. System statistics of these Markov chains usually cannot be calculated analytically and therefore estimates must be generated via simulation techniques. There is a well documented class of simulation techniques known as exact stochastic simulation algorithms, an example of which is Gillespie's direct method.
These algorithms often come with high computational costs, therefore approximate stochastic simulation algorithms such as the tau-leap method are used. However, in order to minimise the bias in the estimates generated using them, a relatively small value of tau is needed, rendering the computational costs comparable to Gillespie's direct method.
The multi-level Monte Carlo method (Anderson and Higham, Multiscale Model. Simul. 10:146-179, 2012) provides a reduction in computational costs whilst minimising or even eliminating the bias in the estimates of system statistics. This is achieved by first crudely approximating required statistics with many sample paths of low accuracy. Then correction terms are added until a required level of accuracy is reached. Recent literature has primarily focussed on implementing the multi-level method efficiently to estimate a single system statistic. However, it is clearly also of interest to be able to approximate entire probability distributions of species counts.
We present two novel methods that combine known techniques for distribution reconstruction with the multi-level method. We demonstrate the potential of our methods using a number of examples. a) Electronic mail: daniel.wilson@dtc.ox.ac.uk

I. INTRODUCTION
Biochemical reaction networks describe interactions between the molecular populations of biological and physiological systems. A classical deterministic approach to modelling these networks is to use systems of ordinary or partial differential equations to describe the evolution of the concentrations of each species. However, experimental researchers, such as Elowitz et al. 2 , have shown that stochasticity can be observed in a variety of biological phenomena including gene expression within the cell. Thus, deterministic modelling can neglect important characteristics of a system. For example, in systems with low molecular populations using a deterministic model to describe the molecular concentrations can fail to predict a phenomenon known as stochastic focussing in signalling networks 20 . Deterministic approaches can also fail to accurately replicate the dynamics of systems with large molecular numbers, such as systems operating near bifurcation points 4 .
To include stochasticity into models of biochemical reaction networks one can consider the framework of the chemical master equation 10 . This assumes spatial homogeneity in the system, which is the case we shall consider in this paper. The chemical master equation describes the temporal evolution of molecular species in terms of the number of molecules present rather than their concentrations. This description provides a system of ordinary differential equations (ODEs, one for each possible state of the system) that, for simple cases involving only zeroth and first order reactions, can be solved analytically 10 . For more complicated systems that include higher order reactions analytical solutions are not generally feasible so one may try and use a numerical scheme to approximate the solution to the chemical master equation 3,9,10 . However, this is also often infeasible due to the high (possibly infinite) number of distinct states in the system, so the only option to explore system behaviour is the use of simulation techniques.
There is a class of simulation techniques known as exact stochastic simulation algorithms (eSSAs) which model systems precisely, in the sense that the sample paths generated using them are mathematically equivalent to the chemical master equation. However, the more complex the system, the longer these simulations take to run, and for practical purposes approximate stochastic simulation algorithms (aSSAs) that favour speed over accuracy are often used. Unfortunately, obtaining accurate summary statistics for a system using aSSAs often renders the computational costs of those algorithms on a par with those of eSSAs.
Recently proposed by Anderson and Higham 1 is the discrete-state multi-level method.
This method vastly reduces the computational cost associated with generating summary statistics for a given system compared with traditional eSSAs and aSSAs. It does so by generating many sample paths of poor accuracy with very little computational effort and then reduces the error of summary statistics generated using these sample paths by adding correction terms from a smaller number of sample paths of higher accuracy. The paper by Lester et al. 13 provides a clear introduction to the multi-level method with the focus being on implementation. In this work, we show how to exploit the multi-level method to accurately and efficiently estimate cumulative distribution functions (CDFs) of molecular species at a terminal time, T .
In Section II we briefly describe the basic simulation techniques that will be used throughout the paper, namely Gillespie's direct method 7 (Gillespie's DM) and the tau-leap method 8 .
We also introduce the multi-level method, stating how to optimally choose the number of sample paths on each level and simulate these paths with low sample variance. Section III focusses on two methods of approximating the CDF of a chosen molecular species at time T , where we use the Schlögl system 16 and a dimerisation model 1 as example cases. Then we combine these two approximation methods with the multi-level method. We investigate how to optimise implementation of our proposed methods, resulting in two adaptive algorithms designed to use the multi-level method efficiently without any prior knowledge of the system. We conclude, in Section IV, with a short discussion.

NETWORKS
Let S 1 , . . . , S N represent N distinct populations or species that may be involved in M reaction channels, R 1 , . . . , R M , in a system with volume ν. We assume that the system is well stirred in order that we may ignore any spatial effects in our model. Denote the size of population S i at time t as X i (t). We define the state space vector at time t as We also introduce the stoichiometric or state-change matrix, ν = {ν ij }, where ν ij is the change in the copy number of S i after the reaction channel R j fires. Thus ν is an N × M matrix. If reaction R j fires at time t + s given no other reactions in [t, t + s), we update the state space vector as follows where ν j is the stoichiometric vector, defined as the j th column of the stoichiometric matrix, Each reaction R j has a propensity function a j (X(t)) at time t, this quantity determines the rate at which the j th reaction occurs. The propensity functions, a j (X(t)), can be calculated by considering the number of ways in which a reaction can occur given the copy numbers at time t. A detailed description can be found in the paper by Erban et al. 5 .
The Kurtz representation 12 allows us to describe evolution of the state space vector as follows: where the Y j 1 , j = 1, . . . , M , are Poisson processes of unit rate.

A. Gillespie's direct method
Gillespie's DM 7 is a standard example of an eSSA for generating sample paths from (4) We include a step by step algorithm below: 1. Initialise the copy numbers, X(0), and the stoichiometric matrix, ν. Choose a terminal time, T , and set t = 0.
4. If t + ∆ > T then terminate the algorithm, otherwise continue to step 5.
5. Choose the next reaction to occur by finding the minimal k such that k j=1 a j > a 0 ×r 2 , where r 2 ∼ U (0, 1). 6. Update the copy numbers, X(t + ∆) = X(t) + ν k . 7. Let t = t + ∆ and return to step 2.
The main advantage of this algorithm is that it can be used to produce unbiased estimates of system statistics at a terminal time, T . As an example, for the mean copy number of S i we generate n sample paths to calculate where X (r) i represents the copy number of S i from the r th sample path. The disadvantage of this eSSA is that it simulates every reaction individually, which can render it very computationally expensive. In order to reduce computation time, we now introduce a more efficient algorithm, known as the tau-leap method, which provides sample paths that approximately follow (4).

B. The tau-leap method
The tau-leap method is an aSSA, first proposed by Gillespie 8 . It is designed to reduce computation time by temporarily fixing the propensity functions and firing several reactions during each time interval of length τ . The method is equivalent to taking a Forward Euler approximation 19 to equation (4), A step by step algorithm is again presented below: 1. Initialise the copy numbers, X(0), and the stoichiometric matrix, ν. Choose a terminal time, T , and a time step, τ , such that T /τ is an integer. Set t = 0.
2. If t + τ > T then terminate the algorithm, otherwise continue to step 3.
3. Calculate the propensity functions, a j = a j (X(t)), for each reaction channel, j.

Update the copy numbers
6. Let t = t + τ and return to step 2.
An attractive feature of the algorithm is that the accuracy of the approximation is determined by the choice of τ . The smaller the time step, τ , the greater the accuracy of the method, and indeed as τ ↓ 0 the method is equivalent to Gillespie's DM 8 . However, often the value of τ needed to gain the required level of accuracy is small enough that the CPU time becomes comparable to that of Gillespie's DM, and so for further computational savings we utilise the multi-level Monte Carlo method 1,6 .

C. The multi-level Monte Carlo method
In this section we provide a very brief introduction to the multi-level Monte Carlo method 1,6 . Heuristically, the method combines the best attributes of Gillespie's DM and the tau-leap algorithm, to produce fast and unbiased estimates of system statistics. Further information can be found in the work by Anderson and Higham 1 and Giles 6 , as well as Lester et al. 13 .
Suppose we are interested in calculating an estimate of X i (T ). We start at the base level, otherwise known as level 0. The estimate at this level is calculated using the tau-leap algorithm, as discussed in the previous section, with τ = τ 0 . If n 0 samples are generated, then we have the biased estimate, where Z

(r)
τ 0 is the copy number of population i at time T from the r th sample using the tau-leap algorithm with τ = τ 0 . Typically, a large τ 0 is chosen so that Q 0 can be calculated quickly. The purpose of the other levels is to reduce the bias of this crude approximation, Q 0 , by adding correction terms.
The first level correction is calculated by generating two sets of n 1 sample paths using the tau-leap aSSA. One set has τ = τ 0 and the other τ = τ 1 < τ 0 . From the two sets of paths, we calculate the correction term as follows: 6 This correction is then added to the initial estimate from the base level: This estimator has bias equivalent to just using the tau-leap aSSA with τ = τ 1 . To further decrease the bias of the estimate we introduce time steps τ < τ −1 < . . . < τ 0 , and then add additional correction levels, to give This means that the estimator Q b has bias equivalent to that generated using the tauleap aSSA with τ = τ L . The number of levels, L, can be chosen such that a desired level of accuracy is reached. However, a final correction level may be used in order to gain an unbiased estimate, if required. Again we generate two sets of paths, one set from a tauleap algorithm where τ = τ L and one set from an eSSA such as Gillespie's DM, using n L+1 samples. We can then calculate where X i is an identical and independently distributed (i.i.d.) copy of X i . This last correction can then be added to the previous estimate and, using the linearity of expectation, we have For the rest of this paper we shall choose τ = τ 0 /K where K ∈ N + . This means that the time steps used on consecutive levels are nested and renders the algorithm simple to implement. However, we note that it is not always the most appropriate choice: when dealing with systems that exhibit behaviours on a range of timescales a more sophisticated choice is necessary 14 .
The reduction in computational cost for the multi-level method is achieved by introducing two coupling techniques. A tau-leap coupling technique for the sets of sample paths on the first L levels, and an exact coupling technique for the (L + 1) st level. The coupling is designed to create a positive correlation between coarse, Z (r) τ −1 , and fine, Z (r) τ , sample paths so that Var Z τ − Z τ −1 is reduced and fewer sample paths need to be generated to obtain an estimator with a desired accuracy. The following algorithm is the tau-leap coupling technique: 1. Initialise the copy numbers for the coarse and fine paths, Z c (0) and Z f (0), and the stoichiometric matrix, ν. Choose a terminal time, T , base level leap size, τ 0 , and scaling factor, K. Set t = 0.
2. For α = 1 to T /τ −1 : (a) Calculate the propensity functions at the current time t for the coarse system, a c j = a j (Z c (t)), for each reaction channel, j.
(b) For β = 1 to K: i. Calculate the propensity functions at the current time t for the fine system, ii. Define the following three propensity functions: for each reaction channel, j.
iv. Update the copy numbers of the two systems as follows: Let t = t + τ .
Further details, including those detailing the exact coupling technique, can be found in the paper by Lester et al. 13 .
Finally, we need an algorithm to evaluate how many paths, n , to generate on each level, , such that we minimise the expected CPU time of the method whilst controlling the variance, V = L =0 V (the unbiased case can be considered analogously), where V is the estimator variance on the th level, of the summary statistic of interest (here the first moment E (X i )) to within a user prescribed tolerance, .
If a pair of paths on the th level takes c units of CPU time to be generated, we wish to subject to This optimisation problem can be solved using the method of Lagrange multipliers 13 and suggests that we should take where . represents the ceiling function. One means to estimate c and V involves generating an initial number of sample paths on each level (usually 100 or 1000) 1 .

III. RECONSTRUCTING CUMULATIVE DISTRIBUTION FUNCTIONS
In this section we introduce two methods to approximate CDFs of discrete state systems and then how to implement these methods efficiently alongside the multi-level method. We demonstrate the utility of our approach using two explanatory examples.

A. Using a maximum entropy approach
We first consider approximating the CDF of species If m = ∞ we select an integer N at which to truncate the state space, so that we work with The optimisation problem is now to find the PDF, p, which maximises the entropy function (22) subject to the moment constraints for j ∈ {0, . . . , M }. Note we include µ 0 , and set µ 0 = 1 as we require N i=1 p i = 1. The solution is obtained by introducing M + 1 Lagrange multipliers 18 , λ = (λ 0 , . . . , λ M ), and the Lagrange functional where c = λ 0 − 1. By solving ∂Γ/∂p i = 0, we obtain the solution for i ∈ {1, . . . , N }. Now that we know the form of the solution, which has M + 1 unknowns, λ = (λ 0 , . . . , λ M ), we have a system of M + 1 nonlinear equations, for j ∈ {0, . . . , M }. We need to employ a numerical method to calculate λ = (λ 0 , . . . , λ M ), the one used in this paper is adapted from work by Mohammad-Djafari 15 . The CDF can then be trivially calculated.
We now present possible ways of efficiently implementing this method in combination with the multi-level method in order to estimate CDFs of given species numbers. To aid us we use two example models, the Schlögl model and a dimerisation model.

The Schlögl model
The Schlögl model 16,21 consists of a single species, A, and four possible reactions: We fix the rates to be [k 1 , k 2 , k 3 , k 4 ] = [2200, 37.5, 0.18, 0.00025] for the remainder of this paper. We chose the initial condition A(0) = 0 and terminal time T = 20. The first question we need to answer is how many moments are necessary to approximate the CDF of A to within a desired degree of accuracy. To gain insight into this problem, we generated 10   The multi-level method is used to generate the moments required by the maximum entropy method. In practice we do not want to rely on results from an eSSA to calculate the Kolmogorov-Smirnov distance so instead we assume that the PDFs on the left of Figure 1 converge to the empirical distribution. We can therefore choose the number of moments, i, to be the minimal i such that D P i ,P i−1 < δ, where δ is chosen to provide the desired level of accuracy and P i represents the CDF calculated using the first i moments in the method of maximum entropy.
The question we now need to answer is which moment should have its estimator variance controlled when choosing the number of paths on each level in the multi-level method. Recall in Section II we controlled the estimator variance of the first moment as this was the only system statistic of interest. However, now we have multiple system statistics. We choose to control the accuracy of every moment, E(X k ), by controlling its variance, Var X k , to within · E X k for some > 0. We now present an adaptive algorithm to generate an approximate CDF using the multi-level method together with the method of maximum entropy: 1. Set the number of levels L and the number of initial moments m ≥ 2. Set the variance control parameter, , and the tolerance parameter for the Kolmogorov-Smirnov distance test, δ.
2. Generate 1000 initial sample paths on each level using the multi-level algorithm.
5. Let P 1 be the CDF generated using the maximum entropy method using m−1 moments, and P 2 the CDF using m moments. If D P 1 ,P 2 < δ terminate the algorithm, otherwise let m = m + 1 and return to step 3.
This adaptive algorithm systematically increases the number of moments used in the entropy calculation until the addition of a new moment has negligible effect on the resulting CDF, which is determined by the tolerance parameter, δ. The algorithm relies on a convergence assumption, i.e. increasing the number of moments indefinitely will yield CDFs that converge to the true CDF. We note that, however, in practice, the algorithm is limited by the robustness of the numerical scheme chosen to calculate the CDFs.
In Figure 2 we present for the Schlögl model: the CDF, P (x), generated using the adaptive algorithm with = 0.03 and δ = 0.02; and the empirical CDF, F (x), generated using Gillespie's DM which took 4568 seconds and 38853 seconds of CPU time to compute, respectively.
The adaptive algorithm selected six as the optimal number of moments. As our next example we consider gene expression for a single gene 1 (G). It transcribes mRNA, known as messengers (M ), which can, in turn, translate proteins (P ). Pairs of these proteins may combine to form a dimer (D). The mRNA and proteins may also degrade. This biochemical network is described by the following set of reactions 1 : We fix the rates to be [k 1 , k 2 , k 3 , k

B. Using an indicator function approach
We now consider another method to approximate CDFs. Let X be a distribution on a discrete state space Ω. We can approximate the CDF of X directly by estimating the following statistics, for i ∈ Ω. Although we refer to the collection of estimates c i , for i ∈ Ω, as an approximate CDF it is important to note that this method does not always generate a true CDF because estimates are not necessarily monotonic increasing i.e. c i ≤ c i+1 for {i, i + 1} ∈ Ω. For two solutions to this problem see Section IV.
These statistics, c i , can be estimated using the multi-level method for which we control the variance of each of the c i , requiring, as before, Var(1{X ≤ i}) < for some > 0. One option is then to generate N = max i∈Ω {n i } paths on each level, where n i is the number of paths required on the th level to control the variance of c i . However, this requires large CPU times. Instead we introduce I where I ⊂ Ω. We generateN = max i∈I {n i } on each level and then calculate the estimates, c i , for i ∈ Ω. Then we iteratively increase the size of the subset I until further increases have only negligible effects on the approximate CDFs produced. We update I by finding the maximal set J ⊂ I where each element j ∈ J corresponds to the estimate c j that required the largest number of paths on some level, , of the multi-level method. Then, for each j ∈ J, we add new entries to I as the midpoints of j and the nearest neighbours of j that are already in I. We now present an algorithm for the adaptive method and the indicator function approach: 1. Set the number of levels L and the variance tolerance parameter, , for all the estimates c i . Set the tolerance parameter, δ, for the Kolmogorov-Smirnov distance test. Set the initial subset I ⊂ Ω from which the number of paths to be generated will be decided.
2. Generate 1000 initial sample paths on each level using the multi-level method.
3. Calculate the number of additional paths, n i , necessary on each level, , to ensure Var (1 {X ≤ i}) < for each i ∈ I using equation (21).
4. GenerateN paths on each level, , whereN is given by, using the multi-level method.
5. Let P 1 be the approximate CDF generated by estimating c i for each i ∈ Ω.
6. Construct the maximal set J such that for every j ∈ J, n j =N for some level . Then for every j ∈ J add additional entries to I as the midpoints between j and its direct neighbours. If I cannot be updated further, i.e. there is no midpoint not already in I, then terminate the algorithm.
7. Calculate the number of additional paths, n i , necessary on each level to ensure Var (1 {X ≤ i}) < , for each i ∈ I using equation (21).
8. GenerateN paths on each level, , whereN is given by, using the multi-level method.
9. Let P 2 be the approximate CDF created by calculating c i for each i ∈ Ω.
10. If D P 1 ,P 2 < δ, terminate the algorithm, otherwise let P 1 = P 2 and return to step 6.
We note that c i ∈ [0, 1] for every i ∈ Ω and so we do not scale the variance tolerance of each estimate by c i , instead we select a single variance tolerance, , for all estimates. We now use this adaptive algorithm on the two example systems we have seen previously.

The Schlögl model
We consider the molecular species A and the reactions (27). We truncate the state space to

IV. DISCUSSION
We have presented two novel algorithms designed to employ the multi-level Monte Carlo method to generate approximate CDFs for molecular populations at a terminal time T . The first was the method of maximum entropy which used the multi-level method to generate a finite set of moments and from these moments calculated an approximate distribution. We note that this method is favourable if an analytical approximation of the PDF is required.
The second method was an indicator function approach. This adaptive method focussed on generating more paths to control the estimates of c i in regions of Ω where the random variables 1{X ≤ i} have a higher variance. These regions are often found to be near the peaks of the corresponding PDFs which is where the method of maximum entropy returns the largest errors. Thus we see that the indicator function approach achieves smaller Kolmogorov-Smirnov distances than the method of maximum entropy when using the Schlögl model.
However, we note that the indicator function approach does not necessarily provide a CDF as the estimates c i are not always monotonic increasing in i. Although the estimates can still be used to give an accurate and meaningful description of the distribution, if a physical CDF is required there are at least two solutions. Firstly one can vastly decrease the value of the variance control parameter to reduce the probability of non-montonicity, this however will significantly increase the CPU time of the method such that it is comparable with the tau-leap method and Gillespie's DM. Secondly one can take estimates for c i from a coarse subset S ⊂ Ω, and generate the CDF using polynomial interpolation whilst enforcing a positive gradient over the range of Ω. This second approach uses a coarser subset S to reduce the probability of non-monotonicity without the further increase in the CPU time.
Future work will involve investigating the potential for improvements on the efficiency of the maximum entropy method, as well as improvements in the efficiency of using multi-level methods to produce approximate CDFs in general.