Tensor-train approximation of the chemical master equation and its application for parameter inference

In this work, we perform Bayesian inference tasks for the chemical master equation in the tensor-train format. The tensor-train approximation has been proven to be very efficient in representing the high-dimensional data arising from the explicit representation of the chemical master equation solution. An additional advantage of representing the probability mass function in the tensor-train format is that parametric dependency can be easily incorporated by introducing a tensor product basis expansion in the parameter space. Time is treated as an additional dimension of the tensor and a linear system is derived to solve the chemical master equation in time. We exemplify the tensor-train method by performing inference tasks such as smoothing and parameter inference using the tensor-train framework. A very high compression ratio is observed for storing the probability mass function of the solution. Since all linear algebra operations are performed in the tensor-train format, a significant reduction of the computational time is observed as well.


I. INTRODUCTION
Traditional chemical kinetic models use ordinary differential equations (ODEs) to predict the concentrations of the involved molecule types. The evolution of the corresponding probability distribution is given by the chemical master equation (CME) 1 which, in principle, can be solved by numerical integration. In practice, the state space of even simple models is too large for a naive integration of the CME. Therefore, a number of approximation techniques have been developed over the years, e.g. stochastic simulation methods [2][3][4] and suitable time and space discretizations 5,6 . Many CME approximations are based on the observation that the probability mass is often concentrated on a small fraction of the state space. For example, the finite state projection method solves the CME on a rectangular subspace with appropriate boundary conditions 7,8 . A problem here is that the region where the significant part of the probability mass function (PMF) is located may change over time. This is tackled in the sliding window method, where the region of interest is adapted based on the solution at the previous time step 9 . Unfortunately, the computational cost of these methods grows exponentially with the number of species, due to the fact that the system states must be labeled explicitly to cast the CME into an ODE.
A different line of research has explored approximations of the CME based on low-rank tensor formats [10][11][12][13][14] . The idea is to project the probability distribution onto a subspace of the tensor-product space induced by the reaction system. The solution is then propagated by a small time step and projected back onto the chosen space. Alternatively, considering time as an additional dimension, a joint approximation of the space-time system in a low-rank tensor format can be obtained 10,11 . The low-rank tensor representation not only preserves the structure of the CME, but is also much more memory-efficient compared to the matrix representation approach 15 . In addition, the low-rank tensor representation allows for accurate, dynamic approximations via rank rounding. Low-rank tensor decompositions have also been considered in the context of parameter-dependent CMEs, however, with limited applications so far 10 . For example, considering systems and synthetic biology, stochastic chemical kinetics are used to construct quantitatively predictive models of biomolecular circuits. This requires the solution of an inverse problem, where it is often necessary to estimate the rate parameters of a structurally known candidate system from time course data. For population snapshot data, e.g. obtained from flow cytometry measurements, calibration via moment-based inference is a well-established method [16][17][18] . In the last decade, advances in fluorescence microscopy have led to an increasing availability of single-cell time course data. When the number of observed cells is small, standard moment-based methods break down because they rely on the central limit theorem to compute a likelihood function for sample moments [19][20][21] . In such a scenario, inference based on the path likelihood of individual trajectories provides a principled way to extract more information from the data.

a) Electronic mail: ion@temf.tu-darmstadt.de
Unfortunately, these approaches are computationally challenging, since they require integrating the CME multiple times for different parameter configurations. More effective approaches can be obtained by approximating the likelihood, e.g. by Gaussian moment closure 22 or the linear noise approximation 23,24 . While reducing the computational demand significantly, the underlying approximations are not applicable to systems with low copy numbers of species, such as single genes. Alternatively, approximate likelihoods can be computed via particle filtering 25 . For Bayesian parameter inference, the approximate likelihood expressions are typically used within a Markov chain Monte Carlo scheme. Alternatively, the parameters can be included into an augmented state space allowing for direct estimation via sequential Monte Carlo 26 .
In this work, we suggest a framework for performing Bayesian inference tasks for the parameter-dependent CME, by exploiting the so-called tensor-train (TT) decomposition 27 to approximate the joint distribution over the CME states and parameters. For that purpose, we construct an explicit representation of the evolution operator in the TT format and show that it can be constructed without ever assembling the corresponding matrix. The TT format has the advantage that the storage requirement scales linearly with respect to the number of dimensions, while at the same time being a numerically robust tensor decomposition 27,28 . We also develop a time-domain solver based on the time-dependent alternating minimal energy (tAMEn) algorithm 29 , which additionally incorporates parameter dependence. To that end, we combine the state space and the parameter space into a higher-dimensional tensor-product space. The parameter dependence is expressed by means of a B-spline basis expansion and a Galerkin formulation is employed to derive the multilinear system with respect to the full tensor. Since typically every reaction is governed by an individual rate constant, the parameters can be seamlessly included in the tensor representation, thus allowing for efficiently solving the joint system. In practice, however, the system parameters are often unknown. Therefore, we develop a framework for filtering, smoothing, and parameter inference based on the efficient TT representation of the joint system. The proposed framework allows us to perform Bayesian inference for the model parameters with a single forward-backward pass, as demonstrated on several synthetic examples.
The remainder of this paper is organized as follows. In section II, we recall the CME and explain how it can be expressed in tensor format. Next, in section III, we present the TT decomposition, apply it to the tensor-formatted CME, and present a TT-based solution method. In section IV, we consider the setting of a CME with parameter dependencies, which we include in the TT-based CME format. Subsequently, we exploit the TT format of the parameter-dependent CME for inference tasks, such as filtering, smoothing, and parameter identification. Numerical results are presented in section V, where we validate the TT-based CME solver and showcase the benefits of performing inference in the TT format. The paper closes with our conclusions, presented in section VI.

II. CHEMICAL MASTER EQUATION IN TENSOR FORMAT
We consider a well-mixed reaction system with $N$ chemical species, denoted with $\{S_1, \dots, S_N\}$, involved in $M$ reactions. We consider reactions of the type
$$R_m: \; \sum_{d=1}^{N} a_{d,m} S_d \longrightarrow \sum_{d=1}^{N} b_{d,m} S_d, \qquad (1)$$
with $m = 1, \dots, M$ and $a_{d,m}, b_{d,m} \in \mathbb{N}_0$, $d = 1, \dots, N$. The state of the system at time $t$ is described by the vector $\mathbf{x}_t \in \mathbb{N}_0^N$, which contains the number of elements per species at that time instant. To describe the change in the state vector after reaction $m$ occurs, we introduce the stoichiometric change vector $\boldsymbol{\nu}^{(m)} \in \mathbb{Z}^N$, the $d$th element of which is given as $\nu_d^{(m)} = b_{d,m} - a_{d,m}$. The change in the state vector due to the $m$th reaction is then given as
$$\mathbf{x}_t \longrightarrow \mathbf{x}_t + \boldsymbol{\nu}^{(m)}. \qquad (2)$$
Assuming a stochastic model of the system, the state vector $\mathbf{x}_t$ is a realization of a continuous-time jump process $\{\mathbf{X}(t)\}_{t \ge 0}$. Then, the evolution of the time-dependent PMF is described by the so-called chemical master equation (CME) 1 , such that
$$\frac{\partial P(t, \mathbf{x})}{\partial t} = \sum_{m=1}^{M} \alpha_m\big(\mathbf{x} - \boldsymbol{\nu}^{(m)}\big)\, P\big(t, \mathbf{x} - \boldsymbol{\nu}^{(m)}\big) - \alpha_m(\mathbf{x})\, P(t, \mathbf{x}), \qquad (3a)$$
$$P(0, \mathbf{x}) = P_0(\mathbf{x}), \qquad (3b)$$
where $P_0$ is the initial probability and $\alpha_m$ is called the propensity function. For a well-mixed system at thermal equilibrium, the mass-action propensity reads
$$\alpha_m(\mathbf{x}) = c_m \prod_{d=1}^{N} \binom{x_d}{a_{d,m}}, \qquad (4)$$
where $c_m$ is the so-called specific probability rate, which is a measure for the probability that reaction $m$ occurs.
Until now, the CME has been defined on the infinite state space $\mathbb{N}_0^N$, which is computationally intractable, thus inappropriate for a numerical solution. To that end, a truncation of the state space is necessary, such that $x_d < n_d$, $d = 1, \dots, N$. We denote the truncated state space as $\mathbb{X} = \big\{\mathbf{x} \in \mathbb{N}_0^N \mid x_d < n_d, \; d = 1, \dots, N\big\}$. The choice of a box domain truncation is not strictly necessary; however, it is beneficial for the TT format used in the following. All states in $\mathbb{X}$ can be uniquely indexed as $\mathbf{x}(\mathbf{i})$, where $\mathbf{i} = (i_1, \dots, i_N) \in \mathbb{N}^N$ and $x_d(\mathbf{i}) = i_d - 1$. Accordingly, $i_d = 1, \dots, n_d$.
Using the truncated state space defined above and for a given time instance $t$, the PMF $P(t, \mathbf{x})$ can be represented as a multidimensional array $\mathbf{p}(t) \in \mathbb{R}^{n_1 \times \cdots \times n_N}$, the elements of which are given as
$$p_{\mathbf{i}}(t) = P\big(t, \mathbf{x}(\mathbf{i})\big). \qquad (5)$$
In the following, we shall refer to such multidimensional arrays as tensors 28 . The evolution equation (3a) can then be written in tensor format, such that
$$\frac{\mathrm{d}\mathbf{p}(t)}{\mathrm{d}t} = \mathbf{A}\, \mathbf{p}(t), \qquad (6)$$
where $\mathbf{A} \in \mathbb{R}^{(n_1 \times \cdots \times n_N) \times (n_1 \times \cdots \times n_N)}$ is a tensor-operator, also called a tensor-matrix, that acts on the tensor $\mathbf{p}(t)$. Tensor-operators can be seen as generalizations of the commonly employed matrix-based operators to more than two dimensions. The elements of the CME tensor-operator are given as
$$A_{\mathbf{i},\mathbf{j}} = \sum_{m=1}^{M} \alpha_m\big(\mathbf{x}(\mathbf{j})\big)\, \delta_{\mathbf{i},\, \mathbf{j} + \boldsymbol{\nu}^{(m)}} - \alpha_m\big(\mathbf{x}(\mathbf{i})\big)\, \delta_{\mathbf{i},\mathbf{j}}, \qquad (7)$$
where $\delta_{\mathbf{i},\mathbf{j}} = \delta_{i_1 j_1} \cdots \delta_{i_N j_N}$, with $\delta$ denoting the Kronecker delta. Accordingly, the product between a tensor-operator and a tensor, the result of which is elementwise given as
$$(\mathbf{A}\mathbf{p})_{\mathbf{i}} = \sum_{\mathbf{j}} A_{\mathbf{i},\mathbf{j}}\, p_{\mathbf{j}}, \qquad (8)$$
can be seen as a generalization of the standard matrix-vector product.
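As a small dense illustration of this generalized matrix-vector product, the sketch below contracts a tensor-operator with a two-dimensional tensor via `np.einsum` and checks the result against the flattened matrix-vector product; all sizes and values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 3, 4
A = rng.standard_normal((n1, n2, n1, n2))  # tensor-operator A_{i1 i2, j1 j2}
p = rng.standard_normal((n1, n2))          # tensor (e.g. a PMF snapshot)

# Contract over the "column" multi-index (j1, j2).
q = np.einsum('abcd,cd->ab', A, p)

# Sanity check against the flattened matrix-vector product.
q_flat = (A.reshape(n1 * n2, n1 * n2) @ p.reshape(-1)).reshape(n1, n2)
assert np.allclose(q, q_flat)
```

The flattened check works because NumPy's row-major reshaping orders the multi-indices consistently on both the operator and the tensor.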
The complexity of storing the tensor $\mathbf{p}(t)$, which contains all state probabilities at time instance $t$, is $\mathcal{O}(n^N)$, where $n = \max_d \{n_d\}$. Therefore, even if a truncated state space is employed, the storage needs can become intractable even for a relatively small number of species. The exponential dependence of the storage needs on the number of species is one manifestation of the so-called curse of dimensionality 30 . As a remedy to this problem, low-rank tensor formats 28 can be employed, such as the TT decomposition 27 discussed next.
In the following, a commonly employed operation between tensors, tensor-operators, or matrices, is the Kronecker product. The Kronecker product between two tensors $\mathbf{x} \in \mathbb{R}^{m_1 \times \cdots \times m_N}$ and $\mathbf{y} \in \mathbb{R}^{l_1 \times \cdots \times l_N}$ is the tensor $\mathbf{x} \otimes \mathbf{y} \in \mathbb{R}^{m_1 l_1 \times \cdots \times m_N l_N}$, defined elementwise as
$$(\mathbf{x} \otimes \mathbf{y})_{\overline{i_1 j_1}, \dots, \overline{i_N j_N}} = x_{i_1, \dots, i_N}\, y_{j_1, \dots, j_N}, \qquad (9)$$
where $\overline{i_d j_d} = (i_d - 1)\, l_d + j_d$ denotes the combined index. The definition holds also for matrices and vectors, as they can be interpreted as two-dimensional and one-dimensional tensors, respectively.
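NumPy's `np.kron` implements this elementwise definition for arrays of any (matching) dimensionality, which gives a quick way to sanity-check the index convention; the shapes below are arbitrary assumptions.

```python
import numpy as np

# Two-dimensional case: shapes multiply elementwise.
x = np.arange(6).reshape(2, 3)
y = np.ones((4, 5))
z = np.kron(x, y)
assert z.shape == (8, 15)

# Combined-index convention: z[i1*4 + j1, i2*5 + j2] = x[i1, i2] * y[j1, j2].
assert z[1 * 4 + 2, 2 * 5 + 3] == x[1, 2] * y[2, 3]

# The same holds for genuinely multidimensional tensors.
a = np.ones((2, 2, 2))
b = np.ones((3, 1, 2))
assert np.kron(a, b).shape == (6, 2, 4)
```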

A. Tensor-train decomposition
As discussed in the previous section, the size of a tensor scales exponentially with the number of dimensions, equivalently, the number of species in this work. To mitigate the curse of dimensionality, we employ tensor decompositions, resulting in tensor formats whose sizes scale linearly with the number of dimensions, instead of exponentially. Several tensor decompositions have been developed over the last decades, resulting in better-scaling tensor formats 28 . In this work, we focus on the so-called tensor-train (TT) decomposition, which combines linear complexity scaling with respect to the dimensions and computational stability 27 .
A tensor $\mathbf{x} \in \mathbb{R}^{n_1 \times \cdots \times n_N}$ is said to be in the TT format if it can be elementwise written as
$$x_{i_1, \dots, i_N} = \sum_{r_0, \dots, r_N} g^{(1)}_{r_0, i_1, r_1}\, g^{(2)}_{r_1, i_2, r_2} \cdots g^{(N)}_{r_{N-1}, i_N, r_N}, \qquad (10)$$
where the three-dimensional tensors $\mathbf{g}^{(d)} \in \mathbb{R}^{r_{d-1} \times n_d \times r_d}$ are called the TT-cores and $\mathbf{R} = (1, r_1, \dots, r_{N-1}, 1)$ are called the TT-ranks. The storage complexity of a tensor in the TT format is reduced to $\mathcal{O}(N n r^2)$, with $r = \max_d \{r_d\}$, i.e. it is linear with respect to the number of dimensions $N$. Moreover, all basic multilinear algebraic operations also scale linearly with the dimensions and polynomially with the TT-ranks 27 . It should be noted that the TT-ranks grow after a multilinear algebraic operation is performed in the TT format. Therefore, a rank-reduction procedure, called rounding, is performed after the operation 27 . Rounding decreases the TT-ranks of the tensor while maintaining a prescribed accuracy and its complexity is $\mathcal{O}(N n r^3)$.
An exact TT decomposition of a full tensor typically leads to high ranks $\mathbf{R}$, hence, to high storage needs and computational costs as well. However, in many cases, using a low-rank TT approximation $\tilde{\mathbf{x}} \approx \mathbf{x}$ is sufficient. If the full tensor is available, a low-rank TT approximation can be efficiently computed with sequential singular value decompositions (SVDs) of the unfolding matrices $\mathbf{X}^{(d)} \in \mathbb{R}^{(n_1 n_2 \cdots n_d) \times (n_{d+1} \cdots n_N)}$ 27 . For tensors defined implicitly by multidimensional functions, e.g. similar to (5), efficient interpolation-based TT approximation methods are available 31,32 , in which case the assembly of the full tensor is not necessary.
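The sequential-SVD construction can be sketched in a few lines of NumPy; this is a minimal TT-SVD with an assumed relative truncation threshold, not the reference implementation from the cited works.

```python
import numpy as np

def tt_svd(x, eps=1e-10):
    """Sequential-SVD TT decomposition (a minimal TT-SVD sketch)."""
    dims = x.shape
    N = len(dims)
    cores, r = [], 1
    C = x.reshape(r * dims[0], -1)
    for d in range(N - 1):
        U, s, Vt = np.linalg.svd(C, full_matrices=False)
        rnew = int(np.sum(s > eps * s[0])) or 1   # adaptive rank truncation
        cores.append(U[:, :rnew].reshape(r, dims[d], rnew))
        C = (s[:rnew, None] * Vt[:rnew]).reshape(rnew * dims[d + 1], -1)
        r = rnew
    cores.append(C.reshape(r, dims[-1], 1))
    return cores

def tt_full(cores):
    """Contract the TT-cores back into the full tensor."""
    x = cores[0]
    for g in cores[1:]:
        x = np.tensordot(x, g, axes=([-1], [0]))
    return x[0, ..., 0]

# A rank-2 test tensor: the sum of two rank-1 (separable) terms.
rng = np.random.default_rng(1)
u = [rng.random(n) for n in (4, 5, 6)]
v = [rng.random(n) for n in (4, 5, 6)]
x = np.einsum('i,j,k->ijk', *u) + np.einsum('i,j,k->ijk', *v)

cores = tt_svd(x)
assert np.allclose(tt_full(cores), x)            # exact up to round-off
assert all(g.shape[2] <= 2 for g in cores[:-1])  # TT-ranks detected as <= 2
```

The SVD both compresses and reveals the ranks: for the rank-2 input, the truncation keeps exactly two singular values per unfolding.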
A tensor-operator $\mathbf{A} \in \mathbb{R}^{(n_1 \times \cdots \times n_N) \times (n_1 \times \cdots \times n_N)}$ can be similarly decomposed using the TT format, in which case it is elementwise written as
$$A_{(i_1, \dots, i_N), (j_1, \dots, j_N)} = \sum_{r_0, \dots, r_N} g^{(1)}_{r_0, i_1, j_1, r_1} \cdots g^{(N)}_{r_{N-1}, i_N, j_N, r_N}, \qquad (11)$$
where the TT-cores $\mathbf{g}^{(d)} \in \mathbb{R}^{r_{d-1} \times n_d \times n_d \times r_d}$ are now four-dimensional. The product between a tensor-operator and a tensor, defined in (8), can be efficiently computed directly in the TT format 27 .

B. Linear system solutions in the TT format
Of particular interest in the context of this work is to solve efficiently a multilinear system Ax = b, where A ∈ R ( 1 ×···× )×( 1 ×···× ) and x, b ∈ R 1 ×···× are given in the TT format. Iterative Krylov subspace solvers, e.g. based on the conjugate gradient (CG) or generalized minimal residual (GMRES) methods, can be generalized to multilinear systems given in the TT format 33 . However, without preconditioning, the number of iterations is large and leads to a large number of rounding operations in order to keep the TT-rank small, thus resulting in an undesirable computational cost 33,34 .
An alternative approach is to minimize the norm of the system's residual with respect to the cores of the solution's TT decomposition. The corresponding minimization problem reads
$$\min_{\mathbf{x} \in \mathcal{M}} \big\| \mathbf{A}\mathbf{x} - \mathbf{b} \big\|_F, \qquad (12)$$
where $\| \cdot \|_F$ denotes the Frobenius norm. The feasible set $\mathcal{M}$ is restricted to a subset of $\mathbb{R}^{n_1 \times \cdots \times n_N}$ which contains all $N$-dimensional tensors that can be represented in the TT format with TT-ranks $r_d \le r_{\max}$, $d = 1, \dots, N-1$. In this case, the minimization problem is nonlinear with respect to the core tensors $\mathbf{g}^{(d)}$ of the TT representation of $\mathbf{x}$ given in (10).
The nonlinear minimization problem (12) can be solved using the alternating least squares (ALS) method 35,36 . The method fixes all cores but one, thus resulting in a quadratic optimization problem for the minimizing core. The process is then repeated iteratively, minimizing one core at a time until the objective function decreases to a sufficiently small value 34 . The main drawback of the ALS method is that the TT-ranks must be given a priori. This can be avoided by minimizing over two consecutive cores instead of just one, an idea that was first introduced by the density matrix renormalization group (DMRG) algorithm 37 and which was later used in the alternating minimal energy (AMEn) method for solving high-dimensional multilinear systems 29,38 . In order to bring the minimization problem to a quadratic form, a so-called supercore $\mathbf{g}^{(d,d+1)} \in \mathbb{R}^{r_{d-1} \times n_d \times n_{d+1} \times r_{d+1}}$ is first defined as
$$g^{(d,d+1)}_{r_{d-1}, i_d, i_{d+1}, r_{d+1}} = \sum_{r_d} g^{(d)}_{r_{d-1}, i_d, r_d}\, g^{(d+1)}_{r_d, i_{d+1}, r_{d+1}}, \qquad (13)$$
thus removing all information regarding the rank $r_d$. Then, the TT representation of $\mathbf{x}$ takes the following form (in elementwise notation):
$$x_{i_1, \dots, i_N} = \sum_{\substack{r_1, \dots, r_{N-1} \\ \text{excluding } r_d}} g^{(1)}_{r_0, i_1, r_1} \cdots g^{(d-1)}_{r_{d-2}, i_{d-1}, r_{d-1}}\, g^{(d,d+1)}_{r_{d-1}, i_d, i_{d+1}, r_{d+1}}\, g^{(d+2)}_{r_{d+1}, i_{d+2}, r_{d+2}} \cdots g^{(N)}_{r_{N-1}, i_N, r_N}. \qquad (14)$$
The minimization then proceeds similarly to the ALS method, where now one supercore is minimized in each iteration. After the optimization procedure is completed, the supercore is divided into two separate TT-cores, e.g. by means of an SVD of an auxiliary matricization 31 . Then, the rank $r_d$ is not given a priori, but identified as part of the supercore's separation.
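The supercore merge and its SVD-based separation, through which the rank $r_d$ is rediscovered rather than fixed a priori, can be illustrated densely as follows; all sizes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
r0, n1, n2, r2 = 2, 4, 5, 3
r1 = 6  # internal rank, to be rediscovered by the SVD split below
g1 = rng.standard_normal((r0, n1, r1))
g2 = rng.standard_normal((r1, n2, r2))

# Merge two TT-cores into a supercore, removing the rank r1.
sc = np.einsum('aib,bjc->aijc', g1, g2)   # shape (r0, n1, n2, r2)

# Split the supercore back into two cores via an SVD; the kept rank is
# identified from the singular values instead of being fixed a priori.
M = sc.reshape(r0 * n1, n2 * r2)
U, s, Vt = np.linalg.svd(M, full_matrices=False)
r_new = int(np.sum(s > 1e-12 * s[0]))
h1 = U[:, :r_new].reshape(r0, n1, r_new)
h2 = (s[:r_new, None] * Vt[:r_new]).reshape(r_new, n2, r2)

# The factorization reproduces the supercore exactly (up to round-off).
assert np.allclose(np.einsum('aib,bjc->aijc', h1, h2), sc)
```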

C. Low-rank TT representation of the CME operator
The CME operator in (7) can be represented in the TT format using a sum of rank-1 tensors, without ever assembling the full tensor-operator 10,39 . For a reaction of type (1) with the propensity function (4), one can use the separation
$$\alpha_m(\mathbf{x}) = c_m \prod_{d=1}^{N} \alpha^{(d)}_m(x_d), \qquad (15a)$$
$$\alpha^{(d)}_m(x_d) = \binom{x_d}{a_{d,m}}. \qquad (15b)$$
Then, the CME tensor-operator defined in (7) can be written as the difference
$$\mathbf{A} = \mathbf{B} - \mathbf{C} \qquad (16)$$
between two tensor-matrices $\mathbf{B}, \mathbf{C} \in \mathbb{R}^{(n_1 \times \cdots \times n_N) \times (n_1 \times \cdots \times n_N)}$, both of which admit exact rank-1 TT decompositions, with the corresponding four-dimensional TT-cores $\mathbf{g}^{(d)}$ (for $\mathbf{B}$) and $\mathbf{h}^{(d)}$ (for $\mathbf{C}$) given by
$$g^{(d)}_{1, i_d, j_d, 1} = \alpha^{(d)}_m\big(x_d(j_d)\big)\, \delta_{i_d,\, j_d + \nu_d^{(m)}}\, \mathbb{I}_d(i_d), \qquad h^{(d)}_{1, i_d, j_d, 1} = \alpha^{(d)}_m\big(x_d(j_d)\big)\, \delta_{i_d, j_d}, \qquad (17)$$
where the rate $c_m$ is absorbed into one of the cores and the indicator functions $\mathbb{I}_d$ are defined as $\mathbb{I}_d(i_d) = 1$ if $i_d \le n_d$ and 0 otherwise, i.e. they keep the shifted states inside the truncated box. Then, for the $m$th reaction, the maximum TT-rank of the tensor-operator $\mathbf{A}$ is at most equal to 2. If the operators coming from all $M$ reactions are added, the maximum TT-rank is at most $2M$. However, in practical examples, rank rounding with a very high accuracy, e.g. $\|\mathbf{A} - \tilde{\mathbf{A}}\|_F \approx 10^{-12} \|\mathbf{A}\|_F$, yields a TT approximation $\tilde{\mathbf{A}} \approx \mathbf{A}$ with much lower TT-ranks.
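As a dense illustration of the rank-1 Kronecker structure of a single-reaction operator, the sketch below assembles the generator of a "birth of species 1" reaction on a truncated two-species box; the diagonal indicator factor keeps probability inside the box. All sizes and the rate value are illustrative assumptions.

```python
import numpy as np

n1, n2, c = 5, 4, 0.7

# One-dimensional factors acting on the first species only.
S = np.eye(n1, k=-1)                                  # shift: inflow from x1 - 1
D = np.diag((np.arange(n1) < n1 - 1).astype(float))   # indicator: outflow only while inside the box
I2 = np.eye(n2)                                       # identity on the second species

A = c * np.kron(S - D, I2)   # rank-1 Kronecker structure: inflow minus outflow

# Columns of a CME generator sum to zero (probability is conserved).
assert np.allclose(A.sum(axis=0), 0.0)

# Check one element against the expected structure: inflow from (x1, x2)
# into (x1 + 1, x2) carries the rate c.
full = A.reshape(n1, n2, n1, n2)
assert np.isclose(full[1, 0, 0, 0], c)
```

The boundary treatment mirrors the indicator functions above: at the last state of the box both the inflow and the matching outflow are suppressed, so no probability leaks out of the truncated domain.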

D. Solving the CME in the TT format
The CME can be solved numerically in the TT format using finite differences 40 . This limits the choice of the time discretization to classical implicit and explicit schemes 29 . An alternative method 10,29 is to employ a basis representation of the time-dependent solution over an interval $[0, \Delta t]$, such that
$$\mathbf{p}(t) \approx \sum_{k=1}^{K} \mathbf{p}_k\, \varphi_k(t), \qquad t \in [0, \Delta t], \qquad (18)$$
where $\varphi_k(t)$ are basis functions for the interpolation in time.
In this work, we employ a Chebyshev basis; however, other options are also possible, e.g. hat functions or Lagrange polynomials. By including time as an additional dimension next to the states, a $(N+1)$-dimensional tensor is obtained. We then perform a Galerkin projection to recover the degrees of freedom for the entire subinterval, by solving the system
$$\mathbf{M}\, \mathbf{p} = \mathbf{f}, \qquad (19a)$$
with
$$\mathbf{M} = \mathbf{I_n} \otimes \mathbf{S} - \mathbf{A} \otimes \mathbf{P}, \qquad (19b)$$
$$\mathbf{f} = \mathbf{p}(0) \otimes \mathbf{v}, \qquad (19c)$$
where $\mathbf{S}$ is the stiffness matrix and $\mathbf{P}$ is the mass matrix of the temporal basis 29,41 , $\mathbf{v}$ accounts for the initial condition expressed in the chosen basis, $\mathbf{I}_d \in \mathbb{R}^{n_d \times n_d}$ denotes an identity matrix, and $\mathbf{I_n} = \mathbf{I}_1 \otimes \cdots \otimes \mathbf{I}_N$. The system (19a) can then be solved as described in section III B. Due to the increased complexity of the solution over long simulation times, one can divide the time domain and apply the presented method to the individual subdomains, taking the end state of each subinterval as initial condition for the next one. For a constant subinterval length and if the solution is smooth, the convergence is exponential in $K$ 29,41 . An error indicator is obtained by computing the norm of the residual of (19a) for an enriched temporal basis 29 ,
$$\epsilon(\mathbf{p}, \Delta t) = \big\| \hat{\mathbf{M}}\, \mathbf{Q}\, \mathbf{p} - \hat{\mathbf{f}} \big\|_F \big/ \big\| \hat{\mathbf{f}} \big\|_F,$$
where $\hat{\mathbf{M}}$ and $\hat{\mathbf{f}}$ are the tensors from (19b) and (19c) constructed for $\hat{K} = 2K$, and the operator $\mathbf{Q}$ interpolates the solution on the finer time-domain basis with $\hat{K} = 2K$. This indicator can be used to adapt the subinterval length: if $\epsilon(\mathbf{p}, \Delta t)$ is larger than a prescribed tolerance $\epsilon_{\text{tol}}$ 29 , the step is recomputed with a modified, smaller subinterval length $\Delta t'$. One bottleneck is the prescribed accuracy of the TT-solver, which, as shown in the numerical results, acts as a lower limit for the error.
In the same framework, one can also include classical implicit time-stepping schemes, such as Crank-Nicolson and implicit Euler, by appropriately choosing the stiffness and mass matrices 29 . The motivation behind this is that the time dynamics is captured in the low-rank structure of the tensor, reducing the storage complexity. Moreover, as observed from numerical experiments, the runtime is also reduced.
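As a stand-alone (non-TT) illustration of sub-interval time stepping with such a classical scheme, the sketch below propagates a one-species birth-death CME with Crank-Nicolson; the rates, truncation size, and step size are illustrative assumptions.

```python
import numpy as np

n, birth, death = 32, 5.0, 1.0
states = np.arange(n)

# Generator of the birth-death process on the truncated state space;
# the birth outflow is suppressed at the boundary to conserve probability.
A = np.zeros((n, n))
for x in range(n):
    if x < n - 1:                    # birth, kept inside the truncated box
        A[x + 1, x] += birth
        A[x, x] -= birth
    if x > 0:                        # death with mass-action propensity death * x
        A[x - 1, x] += death * x
        A[x, x] -= death * x

p = np.zeros(n)
p[0] = 1.0                           # start in state x = 0 with probability 1
dt, n_steps = 0.05, 200              # 10 time units split into sub-steps

# Crank-Nicolson propagator over one sub-step.
L = np.linalg.solve(np.eye(n) - 0.5 * dt * A, np.eye(n) + 0.5 * dt * A)
for _ in range(n_steps):
    p = L @ p

# Crank-Nicolson preserves total probability, since columns of A sum to 0.
assert np.allclose(p.sum(), 1.0)
# The chain relaxes towards a (truncated) Poisson law with mean birth/death = 5.
assert abs(states @ p - birth / death) < 0.1
```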

E. Quantized TT CME solver
One way to optimize the aforementioned TT-based CME solver is to employ the so-called quantized tensor-train (QTT) decomposition 42 . The QTT format has proven to further increase the storage and computational efficiency of the TT representation by reshaping the tensors into higher-dimensional ones while reducing the mode sizes. Let $\mathbf{x} \in \mathbb{R}^{n_1 \times \cdots \times n_N}$ be a tensor with $\log_2 n_d \in \mathbb{N}$, $d = 1, \dots, N$, i.e. with mode sizes that are powers of 2, given in the TT format with cores $\mathbf{g}^{(d)}$. The quantization process reshapes the individual cores to tensors of shape $r_{d-1} \times 2 \times \cdots \times 2 \times r_d$ and then computes the TT decomposition of the reshaped cores. The resulting cores correspond to the QTT decomposition of the $\big(\sum_d \log_2 n_d\big)$-dimensional reshaped tensor 32 . In the case of tensor-matrices, the procedure is similar. Compared to the computational complexity of the solver, the complexity of the transformation between TT and QTT can be neglected. This procedure has been proven to be effective for reducing the storage requirements and the computation time for solving the CME in the TT format 10,11 . If the ranks of the QTT decomposition remain bounded, the storage complexity becomes $\mathcal{O}(N \log_2 n)$ 43 . Moreover, the solver benefits from the linearity of the complexity with respect to the number of dimensions.
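The quantization idea itself can be illustrated without any TT machinery: a smooth vector of length $2^L$ is reshaped into $L$ binary modes, and its unfoldings then have very low rank (here exactly 1, since the exponential separates across binary digits); the signal choice is an illustrative assumption.

```python
import numpy as np

# A smooth signal of length 2**6, viewed as a 6-dimensional binary tensor.
t = np.linspace(0.0, 1.0, 64)
x = np.exp(t)                  # exponentials are known to have low QTT ranks
q = x.reshape((2,) * 6)        # 64 = 2**6 entries, now 6 "quantized" modes
assert q.shape == (2, 2, 2, 2, 2, 2)

# Since exp(a + b) = exp(a) * exp(b), every unfolding across the binary
# digits is a rank-1 matrix.
assert np.linalg.matrix_rank(q.reshape(2, 32)) == 1
assert np.linalg.matrix_rank(q.reshape(8, 8)) == 1
```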

IV. PARAMETER-DEPENDENT CME AND INFERENCE

A. Parameter-dependent CME
We now consider a parameter-dependent CME 10 , described by
$$\frac{\partial P(t, \mathbf{x} \,|\, \boldsymbol{\theta})}{\partial t} = \sum_{m=1}^{M} \alpha_m\big(\mathbf{x} - \boldsymbol{\nu}^{(m)}; \boldsymbol{\theta}\big)\, P\big(t, \mathbf{x} - \boldsymbol{\nu}^{(m)} \,|\, \boldsymbol{\theta}\big) - \alpha_m(\mathbf{x}; \boldsymbol{\theta})\, P(t, \mathbf{x} \,|\, \boldsymbol{\theta}), \qquad (22)$$
where $\boldsymbol{\theta} \in \mathbb{R}^{N_p}$ denotes the parameter vector. The parameter vector $\boldsymbol{\theta}$ is assumed to take values in the tensor-product space $\Theta = \Theta_1 \times \cdots \times \Theta_{N_p}$. Solving the CME for one parameter realization $\boldsymbol{\theta}^{(\mathbf{l})} \in \Theta$, $\mathbf{l} = (l_1, \dots, l_{N_p})$, yields the conditional PMF of $\mathbf{x} \,|\, \boldsymbol{\theta}^{(\mathbf{l})}$. If, instead of a fixed initial PMF, one starts with the joint PMF $P_0(\mathbf{x}, \boldsymbol{\theta})$, the solution will be the joint PMF over the discretized time interval. Since our goal is to compute the conditional/joint PMF over the entire parameter space, we use a basis expansion 44 to describe the parameter dependence, such that
$$P\big(t, \mathbf{x}(\mathbf{i}) \,\big|\, \boldsymbol{\theta}\big) \approx \sum_{\mathbf{l}} p_{\mathbf{i}, \mathbf{l}}(t)\, b_{\mathbf{l}}(\boldsymbol{\theta}), \qquad b_{\mathbf{l}}(\boldsymbol{\theta}) = b^{(1)}_{l_1}(\theta_1) \cdots b^{(N_p)}_{l_{N_p}}(\theta_{N_p}). \qquad (23)$$
In this work, B-spline basis functions are chosen for the parameter dependence 45 .
In order to retrieve the degrees of freedom, a Galerkin formulation is used to derive a multilinear system with respect to the full tensor. We choose the test functions $q(\boldsymbol{\theta})$ from the same space, which yields the formulation
$$\Big\langle q, \frac{\partial P}{\partial t} \Big\rangle = \big\langle q, \mathbf{A}(\boldsymbol{\theta})\, P \big\rangle, \qquad (24)$$
where $\langle \cdot, \cdot \rangle$ is an inner product with respect to $\boldsymbol{\theta}$. One can then derive the multilinear system
$$\mathbf{M}\, \frac{\mathrm{d}\mathbf{p}(t)}{\mathrm{d}t} = \mathbf{K}\, \mathbf{p}(t), \qquad (25)$$
with the mass tensor-matrix
$$M_{(\mathbf{i},\mathbf{l}), (\mathbf{j},\mathbf{r})} = \delta_{\mathbf{i},\mathbf{j}}\, \langle b_{\mathbf{l}}, b_{\mathbf{r}} \rangle \qquad (26)$$
and the stiffness tensor-matrix
$$K_{(\mathbf{i},\mathbf{l}), (\mathbf{j},\mathbf{r})} = \big\langle b_{\mathbf{l}},\, A_{\mathbf{i},\mathbf{j}}(\boldsymbol{\theta})\, b_{\mathbf{r}} \big\rangle. \qquad (27)$$
The mass matrix can be easily constructed as a rank-1 TT-operator using the individual mass matrices of the univariate bases. On the contrary, the stiffness matrix requires the evaluation of the parameter-dependent CME operator over a tensor-product quadrature grid $\Theta_g \subset \Theta_1 \times \cdots \times \Theta_{N_p}$, such that
$$K_{(\mathbf{i},\mathbf{l}), (\mathbf{j},\mathbf{r})} \approx \sum_{\mathbf{k}} w_{\mathbf{k}}\, b_{\mathbf{l}}\big(\boldsymbol{\theta}^{(\mathbf{k})}\big)\, b_{\mathbf{r}}\big(\boldsymbol{\theta}^{(\mathbf{k})}\big)\, A_{\mathbf{i},\mathbf{j}}\big(\boldsymbol{\theta}^{(\mathbf{k})}\big), \qquad (28)$$
where $\mathbf{w}$ is the weight tensor and the extended operator is defined as $\bar{A}_{(\mathbf{i},\mathbf{k}), (\mathbf{j},\mathbf{r})} = A_{\mathbf{i},\mathbf{j}}\big(\boldsymbol{\theta}^{(\mathbf{r})}\big)\, \delta_{\mathbf{k},\mathbf{r}}$. The tensor-matrix $\mathbf{K}$ admits a TT representation or approximation, assuming that the evaluation of the CME operator can also be represented or approximated in the TT format.
A direct construction of the extended operator $\bar{\mathbf{A}}$ can be easily accomplished if each parameter affects only one reaction. For example, this situation occurs if the parameters are the reaction rates, i.e. $\boldsymbol{\theta} = (c_1, c_2, \dots)$. Then, the operator is extended using the Kronecker product, such that
$$\bar{\mathbf{A}} = \sum_{m=1}^{M} \mathbf{A}^{(m)} \otimes \mathbf{D}^{(m)}_1 \otimes \cdots \otimes \mathbf{D}^{(m)}_{N_p}, \qquad (29)$$
where $\mathbf{A}^{(m)}$ is the CME operator corresponding to reaction $m$ for a unity reaction rate and $\mathbf{D}^{(m)}_p$ is the diagonal matrix holding the grid values of $\theta_p$ if $\theta_p$ is the rate of reaction $m$, and the identity matrix otherwise. We note that parameter dependence is not necessarily restricted to the reaction rates. Equation (29) can be extended to accommodate other types of parameter dependencies. If the propensity functions have a representation as in (15b) and every propensity depends on at most one parameter, then the individual CME operator for every reaction evaluated on the grid $\Theta_g$ can be expressed as a sum of rank-1 TT tensors and then rounded to eliminate superfluous ranks. Moreover, even the structure of the reaction network can be incorporated as a parameter; however, this is out of scope for the present work.
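The Kronecker-product extension of a unit-rate operator onto a parameter grid, in the spirit of (29), can be checked densely as follows; the grid, sizes, and the single birth reaction are illustrative assumptions.

```python
import numpy as np

n = 4
# Unit-rate CME operator for one birth reaction on a truncated 1D state space.
A_unit = np.eye(n, k=-1) - np.diag((np.arange(n) < n - 1).astype(float))

c_grid = np.array([0.5, 1.0, 2.0])         # quadrature grid over the rate parameter

# Extended operator: state mode first, parameter mode last; the diagonal
# factor holds the grid values of the rate.
A_ext = np.kron(A_unit, np.diag(c_grid))

# Each parameter slice is the unit-rate operator scaled by its grid value.
full = A_ext.reshape(n, len(c_grid), n, len(c_grid))
for k, c in enumerate(c_grid):
    assert np.allclose(full[:, k, :, k], c * A_unit)
```

The block-diagonal structure in the parameter index reflects that the parameter is static: the CME couples states, never different parameter values.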

B. Filtering and smoothing in the TT format
One relevant inference task in computational biology applications is the filtering and smoothing of observations, for example, in order to estimate the dynamics of genes that are measured indirectly via a fluorescent reporter protein. We consider $N_o$ state observations $\{\mathbf{y}^{(k)}\}_{k=1}^{N_o}$, which are sampled at discrete time steps $\{t_k\}_{k=1}^{N_o}$. The observations are considered to be realizations of the random variables $\{\mathbf{Y}(t_k)\}_{k=1}^{N_o}$, which are assumed to be conditionally independent given the latent states $\{\mathbf{X}(t_k)\}_{k=1}^{N_o}$. Thus, the observation model is assumed to depend only on the current state. Its probability density function (PDF) is denoted with $p_{Y|X}(\mathbf{y} \,|\, \mathbf{x})$. In practice, $p_{Y|X}$ often corresponds to additive Gaussian or multiplicative lognormal noise. However, the presented framework is not limited to these particular cases.
The conditional probability $\pi(\mathbf{x}, t) = \Pr\big(\mathbf{X}(t) = \mathbf{x} \,\big|\, \mathbf{y}^{(1)}, \dots, \mathbf{y}^{(k)}\big)$, with $k = \max\{k \in \mathbb{N} \mid t_k < t\}$, satisfies the unconditional master equation 46 between observation times, with the reset conditions
$$\pi(\mathbf{x}, t_k) = \frac{1}{Z_k}\, \pi^-(\mathbf{x}, t_k)\, p_{Y|X}\big(\mathbf{y}^{(k)} \,\big|\, \mathbf{x}\big),$$
where $\pi^-(\mathbf{x}, t_k)$ represents the left approaching limit $t \to t_k$, $t < t_k$, and $Z_k = \sum_{\mathbf{x}} \pi^-(\mathbf{x}, t_k)\, p_{Y|X}\big(\mathbf{y}^{(k)} \,\big|\, \mathbf{x}\big)$. If all observations are taken into consideration, we deal with the smoothing case. The PMF $\tilde{\pi}(\mathbf{x}, t)$ of $\Pr\big(\mathbf{X}(t) = \mathbf{x} \,\big|\, \mathbf{y}^{(1)}, \dots, \mathbf{y}^{(N_o)}\big)$ can be factorized as $\tilde{\pi}(\mathbf{x}, t) \propto \pi(\mathbf{x}, t)\, \beta(\mathbf{x}, t)$, where $\beta$ satisfies the backward master equation with the terminal condition $\beta(\mathbf{x}, t_{N_o}) = 1$ and analogous reset conditions at the observation times $t_k$ with $k = \min\{k \in \mathbb{N} \mid t_k > t\}$. The PMF $\tilde{\pi}(\mathbf{x}, t)$ satisfies an evolution equation of CME type with the time-varying smoothing propensity functions
$$\tilde{\alpha}_m(\mathbf{x}, t) = \alpha_m(\mathbf{x})\, \frac{\beta\big(\mathbf{x} + \boldsymbol{\nu}^{(m)}, t\big)}{\beta(\mathbf{x}, t)}.$$
The method is also known in the literature as the forward-backward algorithm 47 and can be interpreted as a message-passing algorithm in a hidden Markov model (HMM). Moreover, it can be efficiently performed in the TT format, as shown in Algorithm 1. The PMF of $\mathbf{x}(t_k) \,|\, \mathbf{y}^{(0)}, \dots, \mathbf{y}^{(k-1)}$ is the forward message and is denoted by the tensor $\mathbf{a}^{(k)} \in \mathbb{R}^{n_1 \times \cdots \times n_N}$. The prediction step is performed by solving the CME over the interval $[t_{k-1}, t_k]$ with the initial condition $\mathbf{a}^{(k-1)}$. The observation is used to construct a tensor $\mathbf{p}_{\text{obs}} \in \mathbb{R}^{n_1 \times \cdots \times n_N}$ with $[\mathbf{p}_{\text{obs}}]_{\mathbf{i}} = p_{Y|X}\big(\mathbf{y}^{(k)} \,\big|\, \mathbf{x}(\mathbf{i})\big)$. If every species is observed independently, i.e. the observation model can be factorized, the tensor $\mathbf{p}_{\text{obs}}$ is rank-1 and can be expressed as a Kronecker product. For the backward pass, the message is $p\big(\mathbf{y}^{(k+1)}, \dots, \mathbf{y}^{(N_o)} \,\big|\, \mathbf{x}(t_k)\big)$ and is represented by the tensor $\mathbf{b}^{(k)}$. The CME is now solved using the transposed operator $\mathbf{A}^\top$ and with the initial condition $\mathbf{b}^{(k+1)} \ast \mathbf{p}_{\text{obs}}$, where $\ast$ denotes the elementwise multiplication operation. The last step is to multiply and normalize the forward and the backward messages in order to get the conditional PMF of $\mathbf{x}(t_k) \,|\, \mathbf{y}^{(0)}, \dots, \mathbf{y}^{(N_o)}$. In the presented framework, we get a smoothed distribution only at the observation points. In order to have information in between the observations, prediction steps must be added.

Algorithm 1 Forward-backward algorithm in the TT format.
Forward pass, for k = 1, ..., N_o:
1: Solve the CME with a^(k-1) as initial condition.
2: Compute p_obs for y^(k) in the TT format.
3: Set a^(k) to the normalized elementwise product of the CME solution with p_obs.
Backward pass, for k = N_o - 1, ..., 0:
4: Compute p_obs for y^(k+1) in the TT format.
5: Solve the CME with operator A^T and initial condition b^(k+1) * p_obs to obtain b^(k).
6: Multiply and normalize a^(k) * b^(k) for every k to obtain the smoothed distributions.
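A dense NumPy sketch of this forward-backward recursion for a one-species birth-death model with Gaussian observations is given below; the Crank-Nicolson propagator stands in for the TT solver, and all model values are illustrative assumptions.

```python
import numpy as np

n, birth, death = 20, 2.0, 0.5
states = np.arange(n)

# Birth-death generator on the truncated state space.
A = np.zeros((n, n))
for x in range(n):
    if x < n - 1:
        A[x + 1, x] += birth
        A[x, x] -= birth
    if x > 0:
        A[x - 1, x] += death * x
        A[x, x] -= death * x

def propagate(p, M, T, dt=0.01):
    """Crank-Nicolson CME solve, standing in for the TT solver."""
    L = np.linalg.solve(np.eye(n) - 0.5 * dt * M, np.eye(n) + 0.5 * dt * M)
    for _ in range(int(round(T / dt))):
        p = L @ p
    return p

obs_vals = [2.0, 3.0, 4.0]       # observations every 0.5 time units

def p_obs(y, sigma=1.0):         # Gaussian observation tensor
    return np.exp(-0.5 * ((y - states) / sigma) ** 2)

# Forward pass: predict with the CME, then update with the observation.
a_msgs = []
a = np.zeros(n)
a[0] = 1.0
for y in obs_vals:
    a = propagate(a, A, 0.5)
    a = a * p_obs(y)
    a /= a.sum()
    a_msgs.append(a)

# Backward pass: transposed operator, terminal condition of ones.
b_msgs = [np.ones(n)]
for y in reversed(obs_vals[1:]):
    b_msgs.append(propagate(b_msgs[-1] * p_obs(y), A.T, 0.5))
b_msgs.reverse()

# Combine and normalize the messages to get the smoothed marginals.
smoothed = [a * b / (a * b).sum() for a, b in zip(a_msgs, b_msgs)]

# At the last observation, the smoother coincides with the filter.
assert np.allclose(smoothed[-1], a_msgs[-1])
assert all(abs(s.sum() - 1.0) < 1e-12 for s in smoothed)
```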

C. Bayesian parameter inference in the TT format
Consider an observation sample $\{\mathbf{y}^{(k)}\}_{k=1}^{N_o}$ satisfying the assumptions detailed in section IV B, but now connected to a realization of the random process $\mathbf{X}(t, \tilde{\boldsymbol{\theta}})$, where $\tilde{\boldsymbol{\theta}}$ is the parameter vector governing the system. Given a prior distribution $\pi(\boldsymbol{\theta})$, we are interested in computing the Bayesian parameter posterior $\pi\big(\boldsymbol{\theta} \,\big|\, \mathbf{y}^{(0)}, \dots, \mathbf{y}^{(N_o)}\big)$. By viewing the parameters as part of an augmented process $\{\mathbf{X}(t), \boldsymbol{\theta}(t)\}_{t \ge 0}$ with static parameter dynamics, the distribution of the parameters over the parameter space can be obtained by performing filtering over the joint space of states and parameters. The prediction step is given by
$$\pi^-\big(\mathbf{x}, \boldsymbol{\theta}, t_k\big) = \sum_{\mathbf{x}'} \int_{\Theta} p_{k|k-1}\big(\mathbf{x}, \boldsymbol{\theta} \,\big|\, \mathbf{x}', \boldsymbol{\theta}'\big)\, \pi\big(\mathbf{x}', \boldsymbol{\theta}', t_{k-1}\big)\, \mathrm{d}\boldsymbol{\theta}',$$
where $p_{k|k-1}$ is the transition PDF; evaluating the prediction step amounts to solving the parameter-dependent CME from $t_{k-1}$ to $t_k$. Next, the update step reads
$$\pi\big(\mathbf{x}, \boldsymbol{\theta}, t_k\big) = \frac{1}{Z_k}\, \pi^-\big(\mathbf{x}, \boldsymbol{\theta}, t_k\big)\, p_{Y|X}\big(\mathbf{y}^{(k)} \,\big|\, \mathbf{x}\big).$$
In the TT format, this parameter inference procedure can be implemented as follows.
The posterior of $\mathbf{x}(t_k), \boldsymbol{\theta}(t_k) \,|\, \mathbf{y}^{(0)}, \dots, \mathbf{y}^{(k)}$ is represented by the tensor $\mathbf{p}^{(k)} \in \mathbb{R}^{n_1 \times \cdots \times n_N \times \ell_1 \times \cdots \times \ell_{N_p}}$, such that its entries hold the joint PMF evaluated on the state grid and the parameter quadrature grid. The prediction step involves solving the parameter-dependent CME with the posterior of $\mathbf{x}(t_k), \boldsymbol{\theta}(t_k) \,|\, \mathbf{y}^{(0)}, \dots, \mathbf{y}^{(k)}$ as initial condition, returning the predicted PMF $\mathbf{p}^{(\text{pred})}$ as a result. The resulting tensor is multiplied with the observation tensor at step $k+1$ and a normalization is performed to get the new joint distribution
$$\mathbf{p}^{(k+1)} = \frac{1}{Z}\, \mathbf{p}^{(\text{pred})} \ast \mathbf{p}_{\text{obs}},$$
where $Z = \sum_{\mathbf{i}, \mathbf{l}} \big(\mathbf{p}^{(\text{pred})} \ast \mathbf{p}_{\text{obs}}\big)_{\mathbf{i}, \mathbf{l}}\, w_{\mathbf{l}}$ is the normalization constant and comes from numerically integrating over the parameter space with the integration weights tensor $\mathbf{w}$. A prior distribution $\pi(\boldsymbol{\theta}^{(0)})$ and exact knowledge of the state for $t = 0$ are used for the first step, where $\pi\big(\mathbf{x}^{(0)}, \boldsymbol{\theta}^{(0)} \,\big|\, \mathbf{y}^{(0)}\big) = \delta_{\mathbf{x}^{(0)}}\, \pi\big(\boldsymbol{\theta}^{(0)}\big)$. Once all observations have been used, the state is marginalized to obtain the posterior over the parameter space as
$$\pi\big(\boldsymbol{\theta}(t_{N_o}) \,\big|\, \mathbf{y}^{(0)}, \dots, \mathbf{y}^{(N_o)}\big) = \sum_{\mathbf{i}} p^{(N_o)}_{\mathbf{i}, \mathbf{l}},$$
which is computationally efficient if performed in the TT format. The procedure is summarized in Algorithm 2.

Algorithm 2 Parameter identification for the parameter-dependent CME in the TT format.
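A dense, grid-based sketch of this joint filtering procedure for a one-species birth-death model with one unknown birth rate is given below; the parameter grid, observation model, and rate values are illustrative assumptions.

```python
import numpy as np

n = 16
states = np.arange(n)
theta = np.linspace(0.5, 3.0, 21)              # grid over the unknown birth rate
w = np.full(theta.size, theta[1] - theta[0])   # rectangle-rule quadrature weights

def generator(c):
    """Birth-death generator with birth rate c and unit death rate."""
    A = np.zeros((n, n))
    for x in range(n):
        if x < n - 1:
            A[x + 1, x] += c
            A[x, x] -= c
        if x > 0:
            A[x - 1, x] += 1.0 * x
            A[x, x] -= 1.0 * x
    return A

def propagate(p, A, T, dt=0.02):
    """Crank-Nicolson CME solve, standing in for the joint TT solver."""
    L = np.linalg.solve(np.eye(n) - 0.5 * dt * A, np.eye(n) + 0.5 * dt * A)
    for _ in range(int(round(T / dt))):
        p = L @ p
    return p

# Joint tensor p[i, l] ~ p(x_i, theta_l | data); uniform prior, x(0) = 0 known.
p = np.zeros((n, theta.size))
p[0, :] = 1.0
p /= (p.sum(axis=0) * w).sum()

obs = [2.0, 2.0, 2.0]                          # synthetic observations, sigma = 1
for y in obs:
    # Prediction: solve the CME for every parameter slice.
    for l, c in enumerate(theta):
        p[:, l] = propagate(p[:, l], generator(c), 3.0)
    # Update: multiply with the observation tensor and renormalize jointly.
    p *= np.exp(-0.5 * (y - states)[:, None] ** 2)
    p /= (p.sum(axis=0) * w).sum()

# Marginalize the state to obtain the parameter posterior.
post = p.sum(axis=0)
assert abs(post @ w - 1.0) < 1e-12             # posterior integrates to one
mode = theta[np.argmax(post)]
assert 1.2 < mode < 2.6                        # concentrates near a data-consistent rate
```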

V. NUMERICAL RESULTS
The following numerical experiments aim to showcase the advantages of the proposed framework in terms of accuracy and computational efficiency. With respect to the latter, storage requirements and computation times are reported for every individual test case. All tests were run on a workstation with a 10-core Intel Xeon processor with 64GB of RAM. For the TT operations, the ttpy Python package was used in combination with the Intel MKL library.

A. Validation of the TT-based CME solver

1. Two-dimensional simple gene expression model
We first validate the developed TT ODE solver and perform a convergence study based on the two-dimensional simple gene expression model 48 . The four reactions are presented in Table I. The initial state is x(0) = (2, 4) with probability 1. The CME is solved in the time interval [0, 1024] with a subinterval size of 128, where arbitrary time units are used.
The first validation test concerns the maximum relative error of the solution to the CME, computed with the method presented in section III D, as a function of the dimension $K$ of the temporal basis representation (18) inside one subinterval. The maximum relative error is given as $\max_{\mathbf{x}} \big| P^{(\text{ref})}_{\text{end}}(\mathbf{x}) - P_{\text{end}}(\mathbf{x}) \big| \,/\, \max_{\mathbf{x}} P^{(\text{ref})}_{\text{end}}(\mathbf{x})$, where $t_{\text{end}} = 1024$ and the reference solution $P^{(\text{ref})}_{\text{end}}(\mathbf{x})$ is computed by numerically solving the CME without the TT decomposition on a very fine time grid. We note that no truncation of the TT-rank was performed during this validation test, and the relative residual which signifies the convergence of the TT-solver was set to $10^{-13}$. The results of this first validation test are presented in Fig. 1, where the employed time-interpolation on the Chebyshev polynomial basis, as shown in (18), is compared against classical time-stepping methods, namely, the implicit Euler and Crank-Nicolson finite-difference schemes. As would be expected, irrespective of the time-stepping method, the TT-based CME solver yields increasingly more accurate results for finer discretizations of the time interval. Moreover, as expected from theory, the convergence of the implicit Euler scheme is $\mathcal{O}(\Delta t)$, accordingly $\mathcal{O}(\Delta t^2)$ for Crank-Nicolson 49 . In the case of the Chebyshev polynomials, exponential convergence is observed and the error stagnates already for a basis of only $K = 8$ polynomials.
The combined impact of the time step and the maximum residual of the TT solver is investigated next for the Chebyshev basis representation, with the results presented in Fig. 2. As can be observed, for a fixed time step, the accuracy of the TT-solver's solution stagnates after a certain value of the maximum residual. The stagnation point actually depends on the value of the time step, i.e. smaller time steps allow for smaller maximum residuals. Hence, the step size and maximum residual must be chosen according to the desired accuracy of the TT-solution, also taking into consideration the related computational cost.

2. Four-dimensional SEIR model
For the previously considered two-dimensional model, the reference solution could easily be obtained using an ODE solver. We now consider a four-dimensional virus spreading model, namely, the SEIR model 50 , whose reactions are presented in Table II. The initial condition is x(0) = (50, 4, 0, 0) and the state space is truncated to n = (n_1, n_2, n_3, n_4) = (128, 128, 64, 64). The simulation was performed over the time interval [0, 8]. For the TT-solver, the subinterval length is equal to 0.5 and the subinterval basis dimension is $K = 8$. The reference solution is computed by numerically solving the CME without the TT decomposition on a very fine time grid.
Even if a sparse format is employed, ≈ 1.3 GB of RAM are needed to store the CME operator for the reference solution. In comparison, using the TT-format, the CME TT-operator has the TT-ranks R = (1, 5, 6, 3, 1), resulting in a storage requirement of only ≈ 2.32 MB, i.e. 0.17% of the storage space needed by the standard solver. If the operator is reshaped in the QTT format, the storage requirement decreases further to ≈ 42 KB. Moreover, using the solver with the QTT format, the solution is obtained in ≈ 180 s, which is a considerably smaller computation time than the one needed for the reference solution, i.e. ≈ 12600 s. Without the QTT format, the number of solver iterations and the computation time increase by one order of magnitude. Fig. 3 shows the time evolution of the marginal distributions, as well as the pointwise error at the end of the simulation compared to the reference marginals. The relative errors at the final time remain small as well. Hence, the TT-solver yields accurate solutions at significantly reduced execution times and with tremendous storage savings compared to the standard solver. Indicatively, the solution at t = 8 requires only 2.5 MB of storage, which is approximately 0.4% of the storage requirement of the reference solution.
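The compression mechanism behind these storage figures can be illustrated in plain NumPy. The sketch below (an illustration only; the paper's solver relies on dedicated TT software) compresses a smooth, localized vector of length 2^10 by reshaping it into ten binary modes, the quantization idea behind the QTT format, and applying successive truncated SVDs (the classical TT-SVD):

```python
import numpy as np

def tt_svd(a, tol=1e-6):
    """Compress a full tensor into TT cores via successive truncated SVDs."""
    shape, d = a.shape, a.ndim
    cores, r = [], 1
    m = a.reshape(r * shape[0], -1)
    for k in range(d - 1):
        u, s, vt = np.linalg.svd(m, full_matrices=False)
        keep = max(1, int(np.sum(s > tol * s[0])))  # relative rank truncation
        cores.append(u[:, :keep].reshape(r, shape[k], keep))
        m = (s[:keep, None] * vt[:keep]).reshape(keep * shape[k + 1], -1)
        r = keep
    cores.append(m.reshape(r, shape[-1], 1))
    return cores

def tt_full(cores):
    """Contract TT cores back into the full tensor."""
    res = cores[0]
    for c in cores[1:]:
        res = np.tensordot(res, c, axes=(-1, 0))
    return res.squeeze(axis=(0, -1))

x = np.linspace(0.0, 1.0, 2 ** 10)
p = np.exp(-50 * (x - 0.3) ** 2)        # a smooth, localized "PMF"
qtt = tt_svd(p.reshape((2,) * 10))      # QTT-style binary reshaping
storage = sum(c.size for c in qtt)
print(storage, p.size)                  # far fewer stored entries than 1024
```

The smoother and more localized the PMF, the smaller the TT-ranks and hence the storage, which is the effect exploited at much larger scale in the experiments above.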
One issue is the ordering of the species. If species that are highly correlated lie far apart in the train, the ranks in between must carry the correlation information and the overall rank structure therefore grows. The same effect can be observed in the representation of the CME operator. In this example, the S, E, I, R ordering is chosen such that most of the reactions involve species that are neighbors in the chain.

B. Filtering and smoothing
As discussed in section IV B, state filtering and smoothing can be performed in the TT-framework. We consider here the SEIR model presented in section V A 2. We assume N_o = 33 equidistant observations with spacing Δt = 0.3125; the time interval is now chosen as [0, 10]. The realizations are obtained using the Stochastic Simulation Algorithm (SSA) 4 (blue continuous lines in Fig. 4). The observation noise is assumed to be lognormal, with species-dependent variance (0.1 for three of the species, 0.05 for the fourth; black × symbols in Fig. 4).
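A minimal version of the SSA used to generate the realizations can be sketched as follows (a generic Gillespie loop with hypothetical helper names, not the paper's implementation):

```python
import numpy as np

def ssa(x0, stoich, propensities, t_end, rng):
    """Minimal Gillespie SSA: stoich has shape (n_reactions, n_species),
    propensities maps a state to the vector of reaction propensities."""
    t, x = 0.0, np.array(x0, dtype=int)
    times, states = [t], [x.copy()]
    while True:
        a = propensities(x)
        a0 = a.sum()
        if a0 <= 0:                          # no reaction can fire anymore
            break
        t += rng.exponential(1.0 / a0)       # waiting time to the next event
        if t > t_end:
            break
        j = rng.choice(len(a), p=a / a0)     # which reaction fires
        x = x + stoich[j]
        times.append(t)
        states.append(x.copy())
    return np.array(times), np.array(states)

# Example: pure degradation X -> 0 with rate 0.5 per molecule.
rng = np.random.default_rng(0)
times, states = ssa([5], np.array([[-1]]), lambda x: 0.5 * x.astype(float),
                    t_end=100.0, rng=rng)
print(states[-1])  # all molecules eventually degrade
```

For the SEIR model, `stoich` and `propensities` would encode the reactions of Table II; the loop itself is unchanged.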
The TT-based forward-backward Algorithm 1 presented in section IV B is used to perform state filtering and smoothing. The state space is truncated to (128, 128, 64, 32) and the Chebyshev basis is used for the time dependency. The runtime of the SEIR experiment is 12 minutes for a solver accuracy of 10^−6 in terms of the relative TT-solver residual. As the estimated state, we compute the expected value of the conditional distribution of x(t) given the observations y(t_0), ..., y(t_{N_o}) (red dashed line in Fig. 4), together with the corresponding standard deviation (grey envelope in Fig. 4). In this case, the incorporation of the observation model in the TT-framework is beneficial for the reduction of the error, since it acts like a reset condition. This can be observed in the decrease of the TT-rank after the multiplication with the tensor corresponding to the observation operator. The numerical experiment shows a decrease in the rank by up to a factor of three, as can be observed in Fig. 5.
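The reason conditioning on an observation does not inflate the ranks is that a likelihood which factorizes over the species is a rank-1 tensor, and the Hadamard product of a TT tensor with a rank-1 tensor simply rescales each core slice-wise. A small NumPy sketch (illustrative only, with made-up core sizes) verifies this:

```python
import numpy as np

def tt_times_rank1(cores, factors):
    """Hadamard product of a TT tensor with a rank-1 tensor given by its
    univariate factors; the TT-ranks are left unchanged."""
    return [c * f[None, :, None] for c, f in zip(cores, factors)]

def tt_full(cores):
    """Contract TT cores back into the full tensor."""
    res = cores[0]
    for c in cores[1:]:
        res = np.tensordot(res, c, axes=(-1, 0))
    return res.squeeze(axis=(0, -1))

rng = np.random.default_rng(1)
shapes = [(1, 4, 3), (3, 5, 2), (2, 6, 1)]    # a random 3D TT tensor
cores = [rng.standard_normal(s) for s in shapes]
factors = [rng.random(s[1]) for s in shapes]  # e.g. univariate likelihoods

lhs = tt_full(tt_times_rank1(cores, factors))
rhs = tt_full(cores) * np.einsum('i,j,k->ijk', *factors)
print(np.allclose(lhs, rhs))  # → True
```

After this multiplication, rounding the TT representation can only decrease the ranks, consistent with the rank drop observed in Fig. 5.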
One more significant advantage of the TT representation is that the storage of the messages decreases dramatically compared to the full tensor representation: the total storage needed is ≈ 150 MB for the forward propagating messages and ≈ 190 MB for the backward propagating messages. In addition, computing the moments of the smoothed distribution consists of multiplications with rank-1 TT-tensors. Moreover, the lognormal observation model also translates to a rank-1 tensor. The presented results were obtained in the QTT format; however, the same test was also performed without quantization. For the given state truncation, using the QTT format results in an acceleration of the computation time by more than an order of magnitude compared to the standard TT-format.

Simple gene expression
We now use Algorithm 2 to identify all four parameters θ = (θ_1, θ_2, θ_3, θ_4) of the simple gene expression model from a noisy sample with N_o = 64 observations. In this case, the solver uses the QTT format, the observations are taken equidistantly every 4 time units, and the parameter priors are independent, truncated Gamma distributions, chosen such that they do not match the actual parameters. The parameter domain R^4_+ is restricted to [0, 6] in each dimension. As a reference, a sample of size 5 · 10^5 is drawn from the posterior using the Metropolis-Hastings algorithm. The CME is in this case solved in the full format with a built-in Python ODE solver.
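For readers unfamiliar with the reference method, a random-walk Metropolis-Hastings sampler can be sketched in a few lines. The example below targets a toy one-dimensional Gamma "posterior" over a single rate constant (a stand-in for the actual four-dimensional CME posterior, which requires a CME solve per likelihood evaluation):

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, step, rng):
    """Random-walk Metropolis-Hastings with Gaussian proposals."""
    x, lp = x0, log_target(x0)
    chain = np.empty(n_samples)
    for i in range(n_samples):
        prop = x + step * rng.standard_normal()
        lp_prop = log_target(prop)
        if np.log(rng.random()) < lp_prop - lp:   # accept/reject step
            x, lp = prop, lp_prop
        chain[i] = x
    return chain

# Toy target: a Gamma(3, 1) density, log p(x) = 2 log x - x + const.
log_gamma = lambda x: (2.0 * np.log(x) - x) if x > 0 else -np.inf
chain = metropolis_hastings(log_gamma, 1.0, 50_000, 1.0,
                            np.random.default_rng(2))
print(chain[10_000:].mean())  # should be near the Gamma mean of 3
```

The cost structure is the important point: each of the 5 · 10^5 MCMC steps requires a fresh forward solve of the CME, whereas the TT approach needs only one forward pass on the augmented state-parameter space.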
With respect to the parameter dependence approximation, the basis of choice is in this case quadratic B-splines with equidistant knots, scaled to the parameter range. The dimension of each univariate basis is 64. For the time integration, a Chebyshev basis of dimension 8 is used with a subinterval size of 0.5 time units. The runtime is in this case ≈ 21 minutes, with a maximum storage requirement of ≈ 9.2 MB for the joint distribution over states and parameters (represented by a 6D tensor with mode size 64). The storage requirement for the extended CME operator in the QTT format is only 128 KB. Storing the tensor in the full format is intractable on standard machines, even for this 2D example. For the given sample size, the Metropolis-Hastings algorithm ran for ≈ 1.5 days.
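A univariate quadratic B-spline basis with equidistant knots of the kind used here can be constructed with SciPy. This is only a sketch of the basis construction (the function name and interface are hypothetical, not the paper's code); it builds a clamped knot vector on the parameter range and checks the partition-of-unity property:

```python
import numpy as np
from scipy.interpolate import BSpline

def bspline_basis(n_basis, a, b, degree=2):
    """Clamped, equidistant B-spline basis of given degree on [a, b]."""
    n_knots = n_basis + degree + 1
    inner = np.linspace(a, b, n_knots - 2 * degree)   # equidistant knots
    knots = np.r_[[a] * degree, inner, [b] * degree]  # clamped ends
    return [BSpline(knots, np.eye(n_basis)[i], degree, extrapolate=False)
            for i in range(n_basis)]

basis = bspline_basis(64, 0.0, 6.0)   # e.g. one parameter axis on [0, 6]
theta = np.linspace(0.0, 6.0, 200, endpoint=False)
vals = np.array([[b(t) for t in theta] for b in basis])
print(np.allclose(vals.sum(axis=0), 1.0))  # partition of unity on [a, b)
```

The local support of the B-splines is what makes the basis able to resolve a posterior that is much narrower than the prior, provided enough basis functions are used, as discussed below.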
Since the posterior over the parameter space is four-dimensional and hence not easy to visualize, the marginals for the individual parameters are computed and compared to the 1D histograms of the posterior sample for the purpose of validation. The results are presented in Fig. 6, where it can be observed that the posterior modes offer a reasonable approximation of the true parameter values, the latter being marked with red vertical dashed lines. As a further characterization of the posterior, we also compute its expected value, its variance, and the mode of the PDF.

Since there is no analytical estimate for the combined error of the method, several runs with different hyperparameters are performed for this model. First, the accuracy of the solver in terms of the relative residual, here denoted with ε, is varied and the relative error of the TT-solver's solution is analyzed, first with respect to the MCMC solution and second with respect to the most accurate solution of the TT-solver, i.e. the one for ε = 10^−6. The corresponding results are presented in Table III, where the simulation time and the memory required for storing the joint distribution in the TT-format are also reported. As can be seen from the table, the accuracy of the MCMC solution is already reached for a TT-solver accuracy of ε = 10^−4. The memory consumption and the execution time both increase for a higher TT-solver accuracy, which is expected, since more solver iterations are needed to reach the desired residual.

Additionally, the dimension of the parameter basis has been investigated. If the tensor product basis is constructed using univariate B-spline bases of dimension 16, the prior can be well approximated. During the inference, however, the decrease of the variance of the joint distribution leads to an inability to resolve the posterior, since a finer basis is needed. If the discretization is increased to 32 basis functions per parameter, the oscillations become negligible.
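In the TT format, computing a 1D marginal never requires forming the full joint: every mode other than the one of interest is summed out by contracting its core with a vector of ones. A small NumPy sketch (illustrative core sizes, not the paper's 6D joint):

```python
import numpy as np

def tt_marginal(cores, dim):
    """Marginalize a TT-formatted joint onto mode `dim` by summing out all
    other modes (contracting every other core with a vector of ones)."""
    left = np.ones((1, 1))
    for c in cores[:dim]:
        left = left @ c.sum(axis=1)          # sum over the mode index
    right = np.ones((cores[-1].shape[-1], 1))
    for c in reversed(cores[dim + 1:]):
        right = c.sum(axis=1) @ right
    # left: (1, r_{dim-1}); right: (r_dim, 1)
    return np.einsum('a,anb,b->n', left[0], cores[dim], right[:, 0])

rng = np.random.default_rng(3)
cores = [rng.random((1, 4, 3)), rng.random((3, 5, 2)), rng.random((2, 6, 1))]
full = np.einsum('aib,bjc,ckd->ijk', *cores)   # reference full tensor

print(np.allclose(tt_marginal(cores, 1), full.sum(axis=(0, 2))))  # → True
```

The cost is linear in the number of modes and quadratic in the ranks, which is why the marginals in Fig. 6 are cheap to extract even from the high-dimensional joint.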
In conclusion, the limiting factor in the inference framework seems to be the accuracy of the TT-solver. For the purpose of inference, however, a relative residual of 10^−5 seems to be sufficient for obtaining an accurate approximation at an acceptable computational cost.

Gene expression model with feedback
The second model for which parameter identification is employed is the 3-stage gene expression model with feedback, whose reactions are listed in Table IV. A realization is drawn using the SSA and equidistant sampling is performed with additive Gaussian noise (see Fig. 8).
The parameters to be identified are in this case θ = (θ_1, ..., θ_5) and the parameter space is bounded to [0, 5] in each dimension. For the priors, we again choose independent Gamma distributions, truncated within the parameter space. The parameter dependence is approximated using a tensor product basis of univariate quadratic B-splines of dimension 64. The tolerance of the TT-solver is set to 10^−5 in terms of the relative residual.
The results are reported in Fig. 7, where we can see the visual match between the histograms and the 1D marginals on the diagonal. The expected value and the variance of the posterior are computed as well; the deviation from the MCMC estimates is of the order of 10^−2 for the variance. The limiting factor is in this case the small MCMC sample size. Using the TT-solver, the execution time for this test case is 50 minutes. Regarding storage, only ≈ 12 MB of RAM are used. As a comparison, the MCMC simulation took approximately 2.5 days to complete for a sample size of 5 · 10^5.

SEIQR model
The model considered now is an extension of the SEIR model presented in the filtering section, with one additional species: quarantined (Q). The modified reactions are found in Table V. We infer in this case the parameters θ = (θ_1, θ_2, θ_3, θ_4) from 45 observations affected by lognormal noise (see Fig. 9). The species Susceptible and Exposed are observed with a higher degree of uncertainty, while Quarantined and Recovered are observed exactly. The execution time for a TT-solver accuracy of 10^−5 is ≈ 55 minutes, with a maximum posterior size in the QTT-format of ≈ 30 MB. As a comparison, the chosen state truncation of (128, 64, 64, 32, 32) would require ≈ 4.2 GB only for storing the state for a single parameter realization. The storage requirement for the parameter-dependent CME operator in the QTT format is ≈ 200 KB.
For this setup, the variance of the approximated posterior (see Fig. 10) is two orders of magnitude lower than that of the prior for the first parameter, and one order of magnitude lower for the second and third parameters. This implies a higher confidence in the reconstruction of the parameters, which also indicates the need for a denser basis. In this case, choosing fewer than 64 points per parameter would lead to oscillations and an inability to infer the posterior. This can be overcome by adaptively reducing the bounds of the parameter domain and re-interpolating the posterior on the new basis.

VI. CONCLUSION
We presented a method based on the TT-decomposition to solve the CME, either in its standard form or including parameter dependencies, and to approximate the joint distribution over the state-parameter space, including the time dependency as well. Using the considered TT-framework, inference tasks such as state smoothing and parameter identification can be performed accurately and efficiently. The proposed TT-framework is also applied to solve the inverse problem of identifying the parameters that govern the system under investigation from noisy state observations. The resulting numerical approximation of the posterior PDF over the truncated parameter space can be efficiently stored and manipulated in the TT format.
A series of numerical experiments with a simple gene expression model and with the SEIR virus model clearly shows that the state-time TT-approximation reduces the storage needs of the CME to a mere fraction of what a standard CME solution method requires. Moreover, by performing multilinear algebraic operations in the TT-format, the execution time is significantly reduced as well. Similar benefits are observed in the context of inference tasks, where the proposed TT-based filtering, smoothing, and parameter identification approaches yield accurate results at a significantly reduced computational cost.
While standard inference procedures for the single-trajectory setting, such as MCMC, require repeated solutions of the CME for different parameter configurations, the joint approach presented here requires only one forward pass on the augmented state space. One drawback of the parameter space discretization is that it can cause problems when the posterior is much more concentrated than the prior. As demonstrated on the SEIQR model, this can be overcome by dynamically adapting the basis.
In this work, we have focused on inferring the rate constants of structurally known models from a single trajectory. An important direction for future research is to extend the approach to multiple trajectories with shared parameters. Another interesting direction is to consider different types of uncertainty, e.g. in the involved species or in the types of the reactions. Since the presented method is fully Bayesian, this could be realized by scoring different candidate structures via Bayesian model comparison.

DATA AVAILABILITY
The data and code that support the findings of this study are freely available at https://github.com/ion-g-ion/paper-cme-tt.