Model reduction for stochastic chemical systems with abundant species

Biochemical processes typically involve many chemical species, some in abundance and some in low molecule numbers. Here we first identify the rate constant limits under which the concentrations of a given set of species will tend to infinity (the abundant species) while the concentrations of all other species remains constant (the non-abundant species). Subsequently we prove that in this limit, the fluctuations in the molecule numbers of non-abundant species are accurately described by a hybrid stochastic description consisting of a chemical master equation coupled to deterministic rate equations. This is a reduced description when compared to the conventional chemical master equation which describes the fluctuations in both abundant and non-abundant species. We show that the reduced master equation can be solved exactly for a number of biochemical networks involving gene expression and enzyme catalysis, whose conventional chemical master equation description is analytically impenetrable. We use the linear noise approximation to obtain approximate expressions for the difference between the variance of fluctuations in the non-abundant species as predicted by the hybrid approach and by the conventional chemical master equation. Furthermore we show that surprisingly, irrespective of any separation in the mean molecule numbers of various species, the conventional and hybrid master equations exactly agree for a class of chemical systems.


I. INTRODUCTION
The chemical master equation (CME) is the accepted stochastic description of chemical reaction systems [1]. Since intrinsic noise roughly scales as the inverse square root of the mean number of molecules [1], it follows that the CME provides a more accurate description than deterministic rate equations (REs), when species are in low concentrations. However, exact solution of the CME has proved to be impossible for all but the simplest systems (see for example [2][3][4][5]), and Monte Carlo simulations using the stochastic simulation algorithm (SSA) [6] are also time-consuming in many cases of interest. One way to bypass these issues is to use a hybrid model which treats some parts of the system using the SSA and the rest using a simulation method which is computationally more efficient. A common example of such hybrid modelling utilises time scale separation whereby some reactions occur on a fast timescale and are modelled using continuous approaches such as REs or chemical Langevin equations, while the rest of reactions occur on slow timescales and are modelled using the standard SSA [7][8][9]. Other methods which enable a considerable improvement over the SSA when time scale separation is present are: the nested SSA [10], a coarse-grained equation-free approach [11], the constrained multiscale algorithm [12,13], an approach based on finite-state projection [14,15], the slow-scale SSA [16] and the slow-scale linear-noise approximation [17,18].
In this paper we consider a different type of hybrid model, one which uses a separation in the abundance of species (abundance separation) rather than timescale separation. In particular, we no longer split reactions into fast and slow, but rather categorise species based on how abundant they are. These methods utilise a continuous approach to model the abundant species and a discrete approach to model the less abundant species. While less popular than timescale separation, some hybrid algorithms have been developed to take advantage of this idea (see for example [19][20][21][22]). Stochastic simulations verify that these hybrid models can capture important features of the fully stochastic model. In particular, the model by Hellander and Lotstedt [19] has been shown by Jahnke [21] to be exact for monomolecular systems, i.e., the marginal distributions of non-abundant species in the hybrid model are exactly the same as the same obtained from the full stochastic model. More sophisticated (and computationally expensive) approaches have been postulated [21,22] to deal with systems which are not well described by the Hellander and Lotstedt hybrid model.
The advantages of methods using abundance separation over timescale separation are that: (i) the timescales of reactions are often unknown while the abundances are readily measurable; (ii) there is evidence suggesting that abundance separation is at least as significant, if not more, than timescale separation for biochemical networks inside cells. For example it has been shown that the mean number of proteins per E. Coli cell is roughly a thousand times that of the mean number of mRNA molecules per cell [23] -in contrast, the ratio of protein to mRNA lifetime in E. Coli is expected to be above 1 but in the single digits [3]. For mammalian cells, the same has been found; the median number of protein per cell is roughly 3000 times that of median number of mRNA per cell -in contrast, the ratio of median protein to median mRNA lifetime is about 5 [24]. Clearly in these cases, abundance separation is significant while timescale separation is weak, and thus a method which takes advantage of the former appears to be ideal as a means to infer information about the stochastic dynamics of mRNA and of other proteins present in low copy numbers.
In this paper we postulate a novel simple hybrid model based on abundance separation consisting of a CME for the non-abundant species coupled to REs for all species. Subsequently we identify the rate constant limits under which the concentrations of a given set of species will tend to infinity (the abundant species) while the concentrations of all other species remain constant (the non-abundant species). This limit we shall refer to as the abundance or abundant limit. We show that in this limit, the marginal distributions of the non-abundant species given by the hybrid model converge to the same distributions given by the CME of the full system. This fact is particularly useful when the hybrid model can be solved analytically, which is the case in several examples that we study. We illustrate the accuracy of our hybrid model by comparing the exact stochastic simulations of its reduced CME with exact stochastic simulations of the full CME. We also show that there are several chemical systems for which our hybrid model is exact even without abundance separation. In Section 4 we offer an error analysis which provides an easy means to estimate the error incurred by the use of the hybrid model when the ratios of abundant to non-abundant species concentrations are finitely large. We conclude with a Summary and Discussion in Section 5.

II. A REDUCED CHEMICAL MASTER EQUATION
In this section, we first propose a reduced CME which constitutes our hybrid model, and subsequently rigorously prove that it converges to the CME of the full system (that describing all species) in the abundance limit.

A. A heuristic reduction of the CME
The CME for a system of N chemical species which interact through R reactions has the form: where Ω is the volume in which the reactions occur, E x i is the step operator which replaces n i with n i + x, the entries of the state vector n = (n 1 , ..., n N ) are the number of molecules of each species, P (n, t) is the probability of the system being in state n at time t, and the stoichiometric matrix elements S ij are given by the net change in the number of molecules of species i when the jth reaction occurs. The probability that a reaction j occurs in an time interval [t, t + dt) is given by Ωf (n, Ω)dt, wheref j is a function of elements of the state vector and reaction rates. The REs are defined by: where φ = (φ 1 , ..., φ N ) are the concentrations of the N chemical species, and f j ( φ) = lim Ω→∞fj (Ω, Ω φ). We wish to reduce the number of species in this CME from N to M with M < N . Without loss of generality, we will keep species 1 to M , and remove species M + 1 to N , which we consider to be "abundant". We will do this by first summing the CME over n M+1 , ..., n N to leave us with an equation for the time evolution of the exact marginal distribution P * (n ′ , t), where n ′ = (n 1 , ..., n M ), then subsequently we use a mean-field assumption to obtain a time evolution equation for the approximate marginal distributionP (n ′ , t).
We will be considering the different possible forms thatf j can take assuming elementary reactions, specifically up to bimolecular reactions (reactions involving more than three molecules are rare in a biological setting). We will first investigate what happens to the CME if we sum over, say, n h , using the notation that n −h is the state vector n without the h th entry, in other words, n −h = (n 1 , ..., n h−1 , n h+1 , ..., n N ). In what follows, we will use X = (X 1 , ..., X N ) to refer to the vector of chemical species, and Y = (Y 1 , ..., Y N ) to refer to the vector of random variables which give the number of molecules of each species. The state vector n = (n 1 , ..., n N ) therefore refers to a particular realisation of the random vector Y.
If reaction j does not feature X h amongst its reactants, thenf j has no n h dependence and the corresponding term in the CME remains unchanged.
If reaction j is a unimolecular reaction of the type X h → ... thenf j (Ω, n) = k j n h Ω −1 , and we will have: Propensityfj (Ω, n) Reduced propensityfj (Ω, n ′ , φ(t)) kj kj If reaction j is a bimolecular reaction of the type X h + X h → ... thenf j (Ω, n) = k j n h (n h − 1)Ω −2 and we will have: Finally, if reaction j is a bimolecular reaction of the type X h + X i → ... thenf j (Ω, n) = k j n h n i Ω −2 and we will have: These results follow from the definition of conditional expectation: Given the above results, we can now see what will happen to eachf j if we sum over all h = M + 1, ..., N . The result of this summation leads to new propensities involving conditional expectations which we call f ⋆ j . It then follows that the exact marginal CME is given by: where n ′ = (n 1 , ..., n M ). In theory, we have simplified the CME while keeping it exact, but we should be careful because the dependence of the conditional expectations on n ′ is currently unknown. We proceed by making the heuristic mean-field assumption that these conditional expectations can be replaced by deterministic concentrations φ i as defined earlier in Eq. (2), for example, where we have approximated away all conditional dependence. We can correspondingly update the exact effective propensities f ⋆ j with the approximate effective propensitiesf j . A general recipe for convertingf j tof j is given in Table 1.
The approximate marginal CME is now: In the rest of the paper, we will refer to Eq. (1) as the full CME and Eq. (9) as the reduced CME. An alternative way of summarising our reduction method, is that it consists of approximating a general chemical system: by the reduced chemical system: when species X M+1 , ..., X N are abundant.

Reaction index
Reaction type Reaction rate We wish to show that the approximate marginal CME given in Eq. (9) is a good approximation to the exact marginal CME given in Eq. (7) when species X M+1 , ..., X N are abundant. To do this, we will need to define an abundance limit. Precisely, we want to know which parameters we should tweak in order that some species concentrations should go to infinity, while others stay constant. We will assume, without loss of generality, that we want to take the abundant limit of species X N . For systems with multiple abundant species, we can just repeat the below process for each one in turn.
The convention we use for numbering reactions is given in Table 2. We will introduce the functions a(i) and b(i) so that we can say that the bimolecular reaction with rate k i has species X a(i) and X b(i) as its reactants, where a(i) = b(i). We will have N input reactions with rate k i , i = −1, ..., −N which lead to the production of each species, monomolecular reactions with rates k i , i = 1, ..., N , bimolecular reactions between different species with rates k i , i = N + 1, ..., L (for some L ∈ N), and bimolecular reactions between the same species with rates k i , i = L + 1, ..., L + N . Now the rate equation for the concentration of X r is: where S ri is the net change in the number of molecules of species X r when reaction i occurs. An intuitive means to obtain an abundant species X N is to make the rate constants of the reactions which remove this species, to be very small, whilst the rest of the rate constants remain at their constant value. In particular, if exactly one molecule of X N is a reactant, then we let k j ∝ 1 x ; if two molecules of X N are reactants, then we let x for j such that a(j) or b(j) equal N and j = N + 1, ..., L. In what follows we shall refer to these rate constant limits as the abundance or abundant limit. Plugging in the aforementioned rates constant scalings and the trial solution: in Eq. (12) where c i are constants independent of x, and considering steady-state by setting the time derivative to zero, one obtains a set of N simultaneous equations in the N constants c i (i = 1, ..., N ). Importantly the coefficients of these simultaneous quadratic equations are independent of x which implies that if these equations can be solved for c i then the solution is independent of x, as they should indeed be, given the form of the trial solution above. Thus it follows that provided the simultaneous equations can be solved, the steady-state solution of the REs is given by Eq. (13). Note this implies that that in the abundance limit, the ratio of the abundant to the non-abundant species concentrations, φ N /φ i (i = N ), scales as x where x → ∞, whilst the concentration of the non-abundant species remains constant. Note also that we have here implicitly assumed that there is no chemical conservation law which involves an abundant species (chemical conservation laws which involve only non-abundant species are however allowed). This is since this law necessitates a finite upper bound on the concentrations which is contrary to the manner in which we have here defined the abundant limit.
For systems with multiple abundant species, the above recipe implies that k j ∝ 1 x q where q is the total number of reactant molecules of abundant species involved in reaction j. For example for the reaction X h + X v → ... where both species X h and X v are abundant, the rate constant of the reaction scales as 1 x 2 . The limits here derived assume a steady-state rate equation description for all species. This derivation is here presented to simplify the presentation and since it is very intuitive. However as we show in the next section, the limits elucidated here, also constitute abundance limits for the time-dependent stochastic description.

C. Proof of N Species Abundant Convergence
We will now use the limits derived in section 2.2 to prove that the approximate marginal distributionP (n ′ , t) governed by the heuristic marginal CME Eq. (9) converges to the exact marginal distribution P ⋆ (n ′ , t) governed by the exact marginal CME Eq. (7) in the abundance limit.

Taylor expansion of exact marginal distribution
A full N species CME with R reactions has the form of Eq. (1). We will expand the solution P (n, t) as a Taylor series in time about t = 0. We assume deterministic initial conditions, P (n, t = 0) = δ n 0 1 n1 ...δ n 0 N nN , where n 0 i denotes the initial value of n i . We can write the Taylor expansion: where P (k) is the k th time derivative of P . Since the full CME is a coupled set of first-order ordinary differential equations with constant coefficients, the Taylor series above is guaranteed to have an infinite radius of convergence by Fuchs's theorem [25]. From this series we can compute the marginal distribution: We already know P (n, 0) = δ n 0 1 n1 ...δ n 0 N nN , so our first problem is the second term of the expansion, which is the first time derivative. This is given by the CME Eq. (1) itself: and thus we can compute the k th derivative, where we are careful to note that P (k) (n, 0) ≡ Ṗ (n, 0) k , since the E operators do not commute with the propensitieŝ f j . Now we can compute the marginal distribution P ⋆ , which is made much simpler by the presence of the Kroneckerdeltas: where E x i ′ now acts on the initial conditions n 0 i rather than the variable n i .

Taylor expansion of the approximate reduced distribution
The approximate marginal distributionP (n ′ , t) is defined by the reduced CME: Its Taylor expansion about t = 0 is:P Fuchs's theorem guarantees that a first-order ordinary differential equation with time-dependent coefficients will have a radius of convergence at least as large as the minimum of the radius of convergence of the time-dependent parameters [25]. The reduced CME is a set of coupled first-order equations with time-dependent coefficients determined by the solution of the REs. Hence if the REs admit a Taylor series solution with an infinite radius of convergence then the Taylor series of the reduced CME also does.
We already knowP (n ′ , 0) = δ n 0 1 n1 ...δ n 0 M nM , so our first problem is the second term of the expansion. Again, this is given by the approximate CME Eq. (19), where we note that the propensitiesf j will in general depend on t as well as n ′ . This will cause complications, for instance, the second derivative has the form: We now have an extra term in our sum which depends on the time derivative of thef j , and if we take higher order Taylor coefficients, we get higher time derivatives of thef j . We get, using the notation used in the previous section,

Convergence of full and reduced Taylor series
The absolute difference between the k th terms of the two Taylor CMEs is given by the difference between Eqs. (18) and (22): This will tend to zero in the abundant limit if we can prove two claims: (I) For any α i ∈ Z, and j ∈ 1, ..., R, in the abundant limit.
(II) The time derivatives of thef j tend to zero in the abundant limit.
To prove (I), we must recall the different possible forms thatf j andf j can take, which are given in Table 3.
As before, we say that the indices i, r ≤ M correspond to non-abundant species, while the indices h, v > M correspond to abundant species. Convergence is trivial for reactions ∅ → ..., X i → ..., X i + X r → ..., and X i + X i → ..., sincef j andf j agree. For the other reactions, we have to decide how the initial conditions should scale in the abundant limit. We will suppose that the initial concentration for species X M + 1, ..., X N should tend to infinity O(x) (since they are the abundant species), while the initial concentration for species 1, ..., M should stay constant.
For the X h → ... reaction, we are interested in However by definition, φ h (0) = n 0 h Ω , so the absolute error becomes kj α h Ω which tends to zero since k j → 0 in accordance with our abundant limit elucidated in section 2.2.
For the X h + X v → ... reaction, we are interested in which simplifies to and which again tends to zero, since in the abundant limit k j ∝ 1 x 2 while n 0 h and n 0 v are proportional to x and x → ∞.
For the X h + X h → ... reaction, we are interested in, which simplifies to and which again tends to zero since k j ∝ 1 Finally, for the X i + X h → ... reaction we consider which we write as and which also tends to zero since k j → 0 in the abundance limit. We have therefore proved claim (I).
To prove claim (II), we will use the convention for reactions introduced in Table 2. We will say that k 0 corresponds to the rate of a null reaction, k 1 , ..., k N correspond to the rates of monomolecular reactions, k N +1 , ..., k L correspond to the rates of bimolecular reactions involving distinct species with reaction k j involving species X a(j) and X b(j) , and k L+1 , ..., k L+N correspond to homodimerisation reactions.
We will prove claim (II) for the reaction Note that generally we havef j (Ω, n ′ , φ(0)) but because the reaction involves two abundant species there is no explicit dependence on Ω and n ′ and hence in what follows we use the less cumbersome notationf j ( φ(0)). Using Leibniz's rule,f For this reaction, k j ∝ 1 i (0) is bounded in the abundant limit, k > 0, then the above expression forf (k) j will tend to zero in that limit, owing to the limiting prefactor k j . Firstly, the definition of the first derivative of φ h is given by the rate equation: Wherever we have a concentration which tends to infinity, (φ j (0), j = M + 1, ..., N ) it is by definition cancelled out by the corresponding parameter k j . So this expression remains constant, and therefore bounded, in the abundant limit. For the purposes of mathematical induction, assume now that all derivatives up to φ (k) h (0) remain bounded in that limit. Then, using Leibniz's rule, The only terms here that have a possibility of being unbounded are those involving zeroth derivatives of the φ j , specifically, But these φ j (0) will only go to infinity if j > M , and in that case the reaction rate k j will go to zero at least fast enough to counteract the limiting concentrations. So therefore we have shown that each time derivative of the concentrations is bounded in the abundant limit, and therefore each time derivative of thef j goes to zero in that limit. The proofs for the other reactions are very similar. Therefore we have proved claim (II), and consequently we have proved the convergence of the full marginal CME and the reduced CME for all times, in the abundant limit.

III. THE ACCURACY OF THE REDUCTION FOR FINITELY LARGE ABUNDANT CONCENTRATIONS AND SPECIAL CASES
The abundant limit as stated previously, is the limit that the rate constants of the reactions removing the abundant species go to zero (in a particular manner) which ensures that the ratios of the abundant to non-abundant concentrations go to infinity. It is in this limit that we have proved that the difference between the reduced and full marginal CME goes to zero. Generally we are interested in the case where the ratio of the abundant to non-abundant concentrations is finitely large, not infinite. In this case the reduced CME is approximate. Given two identical copies of a chemical system, one placed in a small volume and the other in a much larger volume, and given they have the same finite large ratio of abundant to non-abundant concentrations, we expect the reduced CME to be a better approximation for the system confined in the larger volume. The reason is that the number of molecules of abundant and non-abundant species is larger in the system confined in the larger volume and hence the REs in this case provide a good approximation to the dynamics of the abundant species [26,27], the key principle behind our reduced CME. Another equivalent point of view is that in the larger volume the reactions occur on faster timescales than the reactions in low volumes, due to the larger number of interacting molecules and hence the dynamics of the abundant species in the larger volume are more amenable to being modelled by a continuous approach like the chemical Langevin equation or REs [28].
Of course, as Gillespie pointed out in Ref. [28], it is difficult to tell how large should the compartment volume be so that a macroscopic approach becomes a feasible approximation. Hence in this section we explore the accuracy of the reduced CME, for finite ratios of abundant to non-abundant concentrations, by means of stochastic simulations. In particular we will use the SSA [6] to sample the probability distribution of the full CME (1) and the Extrande algorithm [29] to sample the probability distribution of the reduced CME (9) for the case where the rate constants of the reactions removing the abundant species scale as k j ∝ 1 x q (where q is the total number of reactant molecules of abundant species involved in reaction j) and x is a finite real number. Note that the reduced CME cannot be sampled using the SSA since the propensities are generally time-dependent and hence the use of Extrande. An alternative to the use of the latter algorithm would be to use a method involving the numerical integration of reaction propensities [30]. We also show in this section that curiously, for some chemical systems, the exact and approximate marginals are identical even without taking abundance limit.
A. Stochastic simulations

Homodimerisation
We will investigate an open homodimerisation reaction studied in Ref. [26]: Species X 1 is produced with rate k 1 , and is either consumed with rate k 3 or else forms a dimer X 2 with rate k 2 . The dimer X 2 is then consumed with rate k 4 . We consider the case where X 1 is abundant and X 2 is not. The RE for the concentration of X 1 is:φ which has the solution: where Therefore the approximate reduced CME Eq. (9) for non-abundant species X 2 is given by: The parameter x equals 1, 10, 100, 1000 in (a)-(d), respectively.
In Figs. 1 and 2 we compare the exact marginal distribution of X 2 from the full CME and the approximate marginal distributions of the reduced CME. These are obtained by means of an ensemble average of SSA and Extrande trajectories, respectively. In Fig. 1 we plot the distributions for four different relative abundances of X 1 and X 2 at the same time point t = 0.01. The abundance is adjusted by choosing the rate constants to scale as k 2 ∝ 1 x 2 and k 3 ∝ 1 x (in accordance with the limits delineated in Section 2.2) and varying x over a certain finite range (see caption of Fig. 1). The distance between the two distributions clearly becomes smaller as the ratio of the abundant to non-abundant species concentrations increases, in line with the proof of the previous section; the two distributions are practically indistinguishable when this ratio is of the order of 100. Nevertheless we find that some salient features of the two distributions are fairly similar (namely the position of mode and the width of distribution) over a large range of the ratio of the abundant to non-abundant species concentrations (0.4 to to 263). In Fig. 2 we show that the high accuracy of the reduced CME also extends to predicting the whole time-evolution of the distribution. Both of these figures indicate that the results of simulations using the reduced CME bear a significant closeness to those obtained using the full CME under a wide range of abundances and hence point towards the utility of the reduced CME as a low dimensional approximation of the full CME.

Genetic Feedback Loop
We next investigate a negative genetic feedback loop studied in Ref. [2].
The "on" promoter D on produces proteins P with rate v 0 . These proteins bind to the promoter with rate d 1 to generate the inactive "off" promoter D off , which can then unbind back into active promoter and protein with rate v 1 . Furthermore, the protein P is consumed by a unimolecular reaction with rate d 0 . We consider the case where P is abundant.   The REs for this system are given by: where φ 1 is the concentration of D off , φ 2 is the concentration of P , Ω is the volume and 1 Ω is the total (fixed) gene concentration (equivalent to one gene). Unlike the previous example, these equations do not admit an analytical solution, so we must solve them numerically, and then use the numerical solution in the reduced CME: where n 1 is the number of molecules of D off . Since at any one time, the gene is either on or off, the distribution of D off is Bernoulli. In particular, since the parameter of the Bernoulli distribution, θ(t), is equal toP (1, t), we can say, whereθ(t) is obtained by setting n 1 = 1 in the above reduced CME, Our reduction method therefore provides us with an "exact" solution at all times in this case, since we don't need to perform any stochastic simulations sampling the reduced CME, but rather just numerically solve the ordinary differential equation above. Figure 3 shows how the approximate expression for θ given in Eq. (45) compares with the true θ (obtained by computing n 1 from an ensemble average of SSA trajectories of the full CME) for different relative abundances at t = 50 seconds. The relative abundance is controlled by choosing the rate constants to scale as d 0 , d 1 ∝ 1 x (in accordance with the limits delineated in Section 2.2) and varying x over a certain finite range (see caption of 3). For the parameter set chosen, we find that the approximation for θ using the reduced CME is in good qualitative agreement with that calculated from the full CME when the ratio of abundant to non-abundant concentrations varies over the range 10 3 − 10 7 . In particular both the full and reduced CME predict that the probability of the gene being in the off state increases monotonically, in a step-like manner, as the protein concentration increases at constant gene concentration (consistently with a negative feedback loop). For low relative abundances (ratios less than a few hundreds), the approximate θ is almost double the true value implying that the reduced CME in this case over-estimates the strength of the negative feedback.

Metabolic network
We consider an arbitrarily large, sequential enzyme reaction network which has been previously associated with metabolism [31,32]. The network consists of N + 1 enzymes, each converting a substrate into a product which then serves as the substrate for the next enzyme in the cascade. The first substrate is produced by a zeroth-order reaction, and the final substrate is converted into a product which we ignore. We seek the approximate distribution of the enzymes, which we expect to be exact in the limit where substrates are abundant. The full chemical system is described by the scheme: . . .
The conservation law n Ei + n Ci = D i is implied for each enzyme, where D i is a given positive integer which represents the total number of both free and bound enzymes of type i, which is constant in time. According to our method, encapsulated by Eq. (11), the reduced chemical system takes the simpler form: It is hence clear that the molecule numbers of each enzyme species are binomially distributed in steady-state conditions: where φ Si is the steady-state solution of the REs of the full system, Eqs. (46), given by: For a time-dependent description, the reduced CME corresponding to the reduced chemical system (47) cannot be exactly solved and stochastic simulations are required. In Fig. 4 we plot both the approximate and exact distributions (using Extrande for the reduced system and the SSA for the full system) of the enzyme E 1 at a fixed time for different abundances of substrate S 1 . It is clear that the approximation improves as the substrate becomes more abundant than enzyme, and is essentially exact in the bottom right panel where the relative abundance is Ω φS 1 D1 = 10. It is also remarkable that the approximation is good even when there is essentially no clear separation in abundance, i.e, ΩφS 1 D1 = 1. Indeed even though the approximation suffers quantitatively when the relative abundance is not high, yet it captures the main distinctive qualitative feature, namely that the distribution changes from positive to negative skewness as a function of the relative abundance (the switch happens at a relative abundance between 1 and 10).
For this system we have the added benefit that the distribution of the number of molecules of each enzyme E i is independent in the approximate description. This means that if we are interested in the distribution of the number Simulations were performed using MATLAB on a computer with a 3GHz Intel Core 2 Duo processor and 4GB RAM. Note that as the total number of species increases, stochastic simulations of the reduced system (approximate) becomes significantly more computationally efficient than the SSA for the full system (exact).
of molecules of a given enzyme, say, E 1 , then we only need to simulate the three reactions involving that particular enzyme, rather than the 3N + 4 reactions of the full system. There is therefore a marked reduction in computational time for our reduced SSA, particularly for large N , as shown in Fig. 5, where the approximate SSA is roughly 3 times faster than the exact SSA when N = 20. We note, however, that the computational time of the approximate method does increase slightly with N , owing to the need to solve a coupled system of 2N + 2 rate equations.

Genetic oscillator with transcriptional feedback
We consider an arbitrarily large gene regulatory network which has been previously studied as a model of a circadian oscillator [33,34]. The mechanism is as follows. A protein P 1 is translated by mRNA, M , which is itself transcripted by a gene in the on-state, D on . Subsequently the protein P 1 generates P 2 , and P 2 generates P 3 , etc until a final protein P N is generated. The latter can bind to D on to deactivate it as D off , which can reversibly unbind into P N and D on . We seek the approximate distribution of the number of molecules of D on and M , which we expect to be accurate when the proteins are abundant. The full chemical system is described by the scheme: According to our method, encapsulated by Eq. (11), the reduced chemical system takes the simpler form: We note that the distribution of D on is independent of M , and is therefore simply Bernoulli v1 d1φP N (t)+v1 . The steadystate distribution of M can be straightforwardly obtained using the method in Ref. [2] or else a fast implementation of the finite-state projection algorithm is equally effective [15]. For a time-dependent description, however, we must use stochastic simulations to determine the accuracy of our method. In Fig. (6) we plot the time development of the distribution of the number of mRNA molecules, M , for parameters such that the steady state is characterised by a fixed large abundance of proteins, in particular, φP i φM = 100, ∀i, which is a physically realistic ratio for some cells [23]. Remarkably the approximate distribution provides an excellent match to the exact distribution for all times, reproducing even the transition from unimodality to bimodality and back to unimodality as a function of time. In Fig. (7) we plot the computational time taken to simulate an individual trajectory of length 10 time units with the SSA for the full system (50) and Extrande for the approximate system (51). For the approximate system, only 4 reactions must be simulated, while the full system has N + 5 reactions. On the other hand, the approximate system requires the time-dependent solution of an N + 2 dimensional system of REs. This trade-off implies that for N = 1, 2, the full system is slightly faster, but for any N > 3, the approximate system is faster.

B. Exact Reductions
We have already shown that our approximation is exact in the limit of infinite concentration of the abundant species, but as we now show, surprisingly, there is also a wide class of systems where the method is exact, regardless of abundance separation.

Systems in Detailed Balance
We now show that for systems in detailed balance [1], where the species we remove from the system (what we would previously call "abundant" species) are not involved in a chemical conservation law, the approximation is exact. Note that detailed balance is a property of some systems in steady-state conditions, and hence necessarily, the exactness mentioned does not apply to finite times, rather it applies only in the limit of infinitely long times.
Consider a detailed balance system of R reversible reactions, and N chemical species, X 1 , ..., X N . Let us denote reaction j as: .. = kN+1 = 0.01. Simulations were performed using MATLAB on a computer with a 3GHz Intel Core 2 Duo processor and 4GB RAM. Note that as the total number of protein species increases, stochastic simulations of the reduced system (approximate) becomes significantly more computationally efficient than the SSA for the full system (exact).
where we allow a conservation law on the species X 1 , ..., X M of the form: where α i and k are time-independent constants. Application of the method described by van Kampen in [35], leads to an explicit expression for the steady-state solution of the CME of this chemical system which is given by a constrained multivariate Poisson distribution of the form: where φ i are the steady-state rate equation solutions, δ(..., ...) is a Kronecker delta and C is a normalisation constant. The marginal distribution is obtained by summing over n M+1 , ..., n N . The exact marginal is therefore, where C ′ is a normalisation constant. Now the approximate reduction introduced in Section 2 is equivalent to approximating the chemical system (52) by the reduced chemical system: with the same conservation law as above. Application of the method in [35] to the reduced CME describing the above system, immediately leads to a steady-state solution which is exactly the same as Eq. (55) since the RE solution of the reduced system is the same as that of the full system. Hence the approximation is exact for this class of chemical systems in detailed balance.

Open Michaelis-Menten reaction with one enzyme molecule
In the following example, we show that the approximation can be exact in steady-state conditions without taking any abundant limits, even if the system is not in detailed balance.
The open Michaelis-Menten reaction is given by: where substrate molecules S are input into the system, they reversibly bind with enzyme E to form a complex C which in turn irreversibly decays into the original enzyme E and product molecules P . We will consider the case with the conservation law n E + n C = 1, that is where there is just one enzyme molecule in the compartment. The CME describing the above reaction system is: This equation has been solved exactly in steady-state conditions in Appendix G of Schnoerr et al. [36]. In particular therein it was shown that the average enzyme molecule number in steady-state condition is given by n E = 1 − kinΩ k2 . This together with the fact that a single enzyme molecule, at any given time, can be in only one of two states, implies that the steady-state marginal distribution of enzyme number fluctuations is: Next we show that our reduction gives exactly the same distribution, regardless of the abundance of substrate. The reduced CME describing enzyme fluctuations is given by: In steady state, setting n E = 0 gives us: Therefore, with the conditionP (0) +P (1) = 1, we find that, The REs for this system are:φ which possess a steady state solution: Substituting Eq. (64) into Eq. (62), we find, As by arguments before, the steady-state distribution is Bernoulli and hence it follows that: which is equal to the exact solution Eq. (59). By similar arguments, it can be easily deduced that the marginal distribution of any species which exists in two states and for which the average number of molecules predicted by the REs is the same as the CME, is exactly predicted by the reduced CME. The second criterion on average molecule numbers is bound to generally be the limiting one since it is typically not the case that the REs exactly agree with the mean concentrations calculated from the CME (see for example Ref. [26]). For example for the genetic feedback loop (41) the marginal distributions of the gene in the on or off state cannot be exactly predicted by the reduced CME because as shown in Ref. [2], the average number of genes in each state (equivalently the fraction of time spent in each state) predicted by the CME does not equal that of the REs.

IV. ESTIMATING THE APPROXIMATION ERROR OF THE HYBRID MODEL
As we have shown for most systems, the reduction is exact only in the limit of infinite concentrations of certain species, and the reduction is therefore an approximation if concentrations are finitely large.
We now investigate the use of the Linear Noise Approximation (LNA) to obtain an estimate of the error made by the use of the reduced CME. By comparing this estimate with that obtained from stochastic simulations of both the full and approximate systems, we demonstrate that the LNA's estimate is accurate for a wide range of parameters and systems. Since the LNA is obtained by solving a system of coupled ordinary differential equations, our results suggest the use of the LNA as a computationally efficient means of estimating the error which bypasses lengthy stochastic simulations using stochastic simulations of the full and reduced CMEs.
The LNA is an approximation which assumes the fluctuations in each chemical species are normally distributed. More precisely, it is the leading order approximation of the system-size expansion of the CME [1] in the limit of large volumes. The general formulation of the LNA is as follows (see for example [37] for more details).
Consider a system of N chemical species, with R reactions, where the j th reaction is: The REs for the system are then given by: where we remind the reader that S is the stochiometric matrix with elements S ij = r ij − s ij and f is the macroscopic propensity vector f defined as: The Jacobian matrix J is the derivative of Eq. (68) with respect to φ: The diffusion matrix D is given by the matrix product: The time-evolution of the second moments of the fluctuations is then approximately given by the Lyapunov differential equation: where ΩC ij is the LNA estimate for the CME's prediction of the covariance in the number fluctuations of species X i and X j . Now the proposed reduction approximates reaction scheme (67) by: when species X M+1 , ..., X N are the abundant species. Note that for the reduced system with M species, the abundant concentrations no longer function as concentrations, but instead as parameters like the reaction rates k j . The REs remain unchanged, however. The Jacobian and diffusion matrices for this system,J andD , are hence the upper left M × M blocks of J and D previously defined in Eqs. (70) and (71), respectively. Thus the LNA leads to an estimate of the reduced CME's prediction of the covariance of fluctuations, ΩC, which is the solution of the Lyapunov equation: Hence it follows that the LNA's estimate of the absolute relative difference in the variance predictions of the full and reduced CME's for species i is given by: Of course one can also calculate this quantity as a function of time, by solving the Lyapunov equations of the full and reduced CMEs numerically; however in what follows we shall assume steady-state conditions to simplify the presentation. Though its generally impossible to obtain a closed-form simple analytical solution to the LNA equations, one can show that the error R i is approximately proportional to the inverse of the ratio of the abundant to non-abundant species concentrations in the abundant limit. The proof is as follows.
Referring to Table 2, if species X N is abundant, then the abundant limit consists of the rate constant scalings: x for j such that a(j) or b(j) equal N (j denotes a bimolecular reaction which involves X N and another species) and the steady-state concentration scalings: φ i = c i for i = N , where c i are constants independent of x and φ N ∝ x. It is easy to verify using this limit and the REs given by Eq. (12) that the Jacobian matrix can be written as J = J 0 + yJ 1 , where y = 1/x and J i are matrices to be determined from the REs. In particular J 0 is the Jacobian of the REs with the terms describing the removal of the abundant species set to zero. The same scaling form for the Jacobian is obtained for any number of abundant species.
On the other hand, the diffusion matrix D is unchanged under the abundance limit. This is since by Eq. (71), the elements of the D are linear functions of the macroscopic rate functions f j (see Eq. (69)) which are unchanged by the abundance limit since each limit of k j tending to zero will be counterbalanced by the opposite limit of a concentration of an abundant species tending to infinity.
Hence in the abundance limit, the Lyapunov Eq. (72) can be written as: The form of this equation suggests a solution of the type C = C (0) + C (1) y + C (2) y 2 + O(y 3 ). Indeed plugging this ansatz in the above equation, one transforms it into a coupled set of equations for the matrices C (i) which can be solved iteratively, i.e., the equation for C (i) depends on C (j) where j < i except for C (0) which is a function of J 0 and D only. Now the abundance limit is the limit y = 1/x → 0 and hence the relative error in the variance can be written as: Now in the abundance limit, the ratios of abundant to non-abundant concentrations are proportional to x = 1/y and hence it follows that in this limit, the relative error R i is proportional to the inverse of these ratios. Next we demonstrate the accuracy of the LNA's estimate of the relative error in the predictions of the reduced CME, i.e., Eq. (75). This is done by comparison of the LNA estimate with the relative error directly computed from the SSA of the full CME and the steady-state analytical solution of the reduced CME for three examples of biochemical relevance.

A. Open Michaelis-Menten reaction with multiple enzyme molecules
We consider the open Michaelis Menten system (57) with multiple number of enzyme molecules, i.e., (n E +n C ) = E T , where E T is the total number of enzyme molecules. We consider the case where the substrate is much more abundant than the enzyme. Computing the LNA of the full CME and of the reduced CME for the non-abundant enzyme species, we find that the relative error in the variance of the enzyme number fluctuations, as given by Eq. (75), is: where a = kinΩ k2ET and K M = is the Michaelis-Menten constant. The steady-state substrate concentration solution of the REs for this system is: We define L = ΩφS ET as a measure of the relative abundance of substrate S. It then follows that R can be written as: The condition 0 < a < 1 is a requirement for the existence of a steady state, and R is a monotonically increasing function of a, so the maximum possible value of R is at a = 1, in other words, That is, if the substrate concentration is ten times the total enzyme concentration, then the percentage relative error in the reduced CME's estimate of the variance of enzyme number fluctuations will be less than ten percent. The reduced CME can in this case be exactly solved in steady-state conditions and one obtains a binomial distribution with parameters E T and 1 − a describing the fluctuations in enzyme molecule numbers; indeed for the case E T = 1, the binomial distribution reduces to the Bernoulli distribution found earlier for the open Michaelis Menten system with one enzyme molecule (see Eq. (59)). In Fig. (8) we use the variance calculated from this solution together with the variance calculated from time-averages of SSA (for the full CME) to compute the true error in the reduced CME's variance of enzyme number fluctuations for the open Michaelis-Menten system. This is done for two different volumes, Ω = 1 and Ω = 10 3 . The true error is also compared in the same figure with the LNA estimate given by Eq. (80). The relative concentrations of substrate and enzyme are controlled by setting the rate constant k 1 proportional to 1/x and varying x (in accordance with the abundance limits discussed earlier; see caption of Fig. 4 for details). The LNA estimates are reasonably good for both volumes but practically indistinguishable from the true error for the larger volume of Ω = 10 3 . This is to be expected since the LNA becomes exact in the limit of large volumes.

B. Open Homodimerisation reaction
Next we use the LNA to estimate the errors in the reduced CME description for the homodimerisation example (36). We consider the case in which species X 1 is abundant compared to species X 2 . Choosing the scalings k 2 = c 1 /x 2 and k 3 = c 2 /x (where c i are proportionality constants), it follows by the considerations of Section 2.2 that the steady-state concentration of X 1 is proportional to x while that of X 2 is a constant; hence by varying x we have a convenient way to control the ratio of the two concentrations. In particular in steady-state, the solution of the REs is given by: The LNA relative error in the variance of the fluctuations of the non-abundant species X 2 as given by Eq. (75) is: where λ = (c 2 + c 2 2 + 8c 1 k 1 )k 4 /2c 1 k 1 , λ 0 = 4c 2 1 φ 2 λ 2 (k 1 − φ 2 λ(c 2 + 4c 1 φ 2 λ)), λ 1 = c 2 2 k 4 + 8c 1 c 2 φ 2 k 4 λ + c 1 c 2 2 φ 2 λ 2 + 4c 2 1 φ 2 (k 1 + 4φ 2 k 4 )λ 2 + 4c 2 1 c 2 φ 2 2 λ 3 are constants. Note that the ratio of the abundant to the non-abundant species concentrations is proportional to x. Hence by Eq. (83) it follows that the relative error has a maximum equal to λ 0 /λ 1 and decreases monotonically as X 1 becomes more abundant relative to X 2 . Next we test the accuracy of the LNA estimate. The reduced CME for this system can be exactly solved in steady-state conditions and one obtains a Poisson distribution for the fluctuations in the number of molecules of X 2 with parameter Ωk 2 φ 2 1 /k 4 . In Fig. 9 we use the variance calculated from this solution together with the variance calculated from time-averages of SSA (for the full CME) to compute the true error in the reduced CME's variance of number of X 2 fluctuations. This is done for two different volumes, Ω = 1 and Ω = 10 3 . The true error is also compared in the same figure with the LNA estimate given by Eq. (83). As for the previous example, the LNA accuracy is good across a wide range of volumes and becomes particularly accurate in the limit of large volumes. It is also noteworthy that the LNA estimate of the relative error is good over a wide range of relative abundances; in particular it even provides an accurate value (about 0.3) for the maximum relative error which occurs in the limit of small relative abundance of X 1 compared to X 2 .

C. Genetic Feedback Loop
Here we use the LNA to estimate the error in the reduced CME description for the genetic feedback loop (41), where the gene concentration is fixed to 1/Ω, i.e, a single gene. We consider the case where the protein P is much more abundant than the gene. Choosing the scalings d 0 = c 0 /x and d 1 = c 1 /x (where c i are proportionality constants), it follows by the considerations of Section 2.2 that the steady-state concentration of X 2 (the protein) is proportional to x while that of X 1 (the gene) is a constant; hence by varying x we have a convenient way to control the ratio of the two concentrations. In particular in steady-state, the solution of the REs is given by: The LNA relative error in the variance of the fluctuations of the non-abundant gene as given by Eq. (75) is: at Ω = 1 (red circles) and Ω = 10 3 (green crosses) are computed using the time-averages of SSA for the full CME and the analytical steady-state distribution of the reduced CME, and compared with the LNA estimate given by Eq. (83) (blue line). For this example, the LNA gives a good estimate of the true error for Ω = 1 and an excellent estimate for Ω = 10 3 . Parameter values are k1 = 10 3 , k4 = 10, k2 = 100 Ωλ)) are constants. Note that the ratio of the abundant to the non-abundant species concentrations is proportional to x. Hence by Eq. (86) it follows that the relative error has a maximum equal to λ 0 /λ 1 and decreases monotonically as X 2 becomes more abundant relative to X 1 . Note that the form of the LNA estimate of the error in this example is the same as that in the previous example.
In Fig. 10 we plot the true error in the variance of the fluctuations of the non-abundant gene computed using time-averages of SSA (for the full CME) and the analytical solution of the reduced CME in steady-state conditions (a Bernoulli distribution with parameter given by the steady-state solution of Eq. (45)) at Ω = 1. This is compared with the LNA estimate Eq. (86) which is found to be particularly accurate, as found previously for the enzyme and dimerisation examples. However unlike the previous examples, for the gene system, in the limit of large Ω, the LNA's estimate does not become more accurate. The reason is that the LNA is accurate in the deterministic limit in which all species molecule numbers increase to infinity at constant concentration whereas in this example the gene molecule number is fixed to one and only the protein molecule number is taken to infinity.

D. Genetic oscillator
Finally, we use the LNA to estimate the error in the reduced CME description for the genetic oscillator (50). We consider the case where the proteins P i are more abundant than the mRNA, while the gene is restricted to a maximum concentration of 1 Ω , i.e, a single molecule. Choosing the scalings d 1 = c 1 /x and k 2 = ... = k N +1 = c 2 /x (where c i are proportionality constants), it follows by the considerations of Section 2.2 that the steady-state concentrations of P i (the proteins) is proportional to x while that of D on (the gene) and M (the mRNA) are constant; hence by varying x we have a convenient way to control the ratio of the two concentrations. In particular in steady-state, the solution  (41) in steady-state conditions. The true relative error at Ω = 1 (red circles) is computed using a time-average of SSA for the full CME and the analytical steady-state solution of the reduced CME, and compared with the LNA estimate given by Eq. (86) (blue line). Parameter values are v0 = 1, v1 = 0.1, d0 = 1 x , d1 = 0.01 x , where 1 < x < 10 4 . The relative abundance of protein to gene concentrations is φ2/φ1 (according to the REs) of the REs is given by: Given the arbitrarily large number of species, there is no compact analytic expression for the LNA relative error in the variance of the fluctuations of the non-abundant mRNA as given by Eq. (75), however the error can be calculated by numerical solution of the REs and the Lyapunov equations of the full and reduced systems. In Fig. 11 we plot the true error in the variance of the fluctuations of the non-abundant mRNA computed using time-averages of SSA (for the full CME) and the solution of the reduced CME in steady-state conditions (computed with the finite-state projection algorithm) at Ω = 1. This is compared with the LNA estimate which is found to be accurate for physically realistic abundances (ratio of protein to mRNA concentrations are commonly larger than a hundred in bacteria; see for example [23]). However unlike some of the previous examples, for the gene system, in the limit of large Ω, the LNA's estimate does not become more accurate. The reason is that the LNA is accurate in the deterministic limit in which all species molecule numbers increase to infinity at constant concentration whereas in this example the gene molecule number is fixed to one and only the protein molecule number is taken to infinity.

V. SUMMARY AND DISCUSSION
Summarising, in this paper we have introduced a novel reduced stochastic description of chemical systems in which some species are abundant. The key intuitive idea is to replace the conditional expectation of the number of molecules of abundant species in the propensities of the exact marginal CME by the solutions of the deterministic REs, and hence obtain a reduced CME for the non-abundant species only. Therefore our method is a hybrid approach. We note that our method is different and simpler than that presented in [19,21,22] since the latter postulate a more complicated approach than REs to model the abundant species. This relative simplicity indeed leads to three major strengths of our approach over existing approaches: (i) it is easier to implement and computationally more efficient than present approaches; (ii) our reduced CME can be explicitly solved for a number of biochemically relevant examples; (iii)  (50) in steady-state conditions. The true relative error at Ω = 1 (red circles) is computed using a time-average of the SSA for the full CME and the steady-state solution of the reduced CME computed with the finite-state projection algorithm, and compared with the LNA estimate (blue line). Parameter values are N = 10, v0 = 100, v1 = 1, d0 = 1, d1 = 1 x , k1 = 1, k2 = ... = kN+1 = 1 x where 1 < x < 10 3 . The relative abundance of protein to mRNA concentrations is φP i /φM (according to the REs) which is the same for all i = 1, ..., N .
simple rational expressions can be derived which estimate the errors inherent in the hybrid approximation relative to the fully stochastic description. Curiously we also found that the reduced CME at the heart of our hybrid method is exact for some chemical systems, i.e., without requiring the necessity of abundance separation or without restricting the system to purely monomolecular systems (as was found to be the case in [21] to ensure exactness for the Hellander and Lotstedt model). The major disadvantage of our approach is that its unlikely that it will be able to capture as many features of the fully stochastic model as the more sophisticated approaches mentioned above. The present work also suggests some new avenues of research. The first is finding exact error bounds for the reduced CME, which could provide useful reassurance for mathematical modellers using this method. A second interesting direction would be to develop a more refined reduction of the CME by replacing the conditional expectation of the number of molecules of abundant species in the propensities of the exact marginal CME by the solutions of effective mesoscopic rate equations (EMREs) [26] instead of REs. EMREs have been demonstrated to be more accurate than REs in the sense that the difference between their mean concentration solution and that of the CME is considerably smaller than the difference between the RE solution and that of the CME [27]. Hence a CME reduction based on EMREs is highly likely to be more accurate than the one developed in this paper, particularly for cases where the compartment volume is small such that even though there is a large ratio of abundant to non-abundant concentrations, the number of molecules of abundant species is small. Finally another interesting area for future work is the relationship between time scale separation and abundance separation. It is clear that latter does not typically imply the type of time scale separation typically used to obtain reduced CMEs (see for example [16]) because it does not lead to a partitioning of reactions into fast and slow types; this is since within our abundance separation method, each limit of a rate constant tending to zero is counterbalanced by the opposite limit of a concentration of an abundant species tending to infinity. Yet it is not difficult to show that our abundance separation limit does lead to a separation of the eigenvalues of the Jacobian and hence point to timescale separation of concentration transients on the deterministic level. Thus a deeper investigation into the relationship between abundance separation and time scale separation could improve understanding of both types of separation and as well lead to a clearer picture regarding when CMEs can be effectively reduced.