Estimation and uncertainty of reversible Markov models

Reversibility is a key concept in Markov models and master-equation models of molecular kinetics. The analysis and interpretation of the transition matrix encoding the kinetic properties of the model rely heavily on the reversibility property. The estimation of a reversible transition matrix from simulation data is, therefore, crucial to the successful application of the previously developed theory. In this work, we discuss methods for the maximum likelihood estimation of transition matrices from finite simulation data and present a new algorithm for the estimation if reversibility with respect to a given stationary vector is desired. We also develop new methods for the Bayesian posterior inference of reversible transition matrices with and without given stationary vector taking into account the need for a suitable prior distribution preserving the meta-stable features of the observed process during posterior inference. All algorithms here are implemented in the PyEMMA software — http: // pyemma.org — as of version 2.0. C 2015 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution 3.0 Unported License. [http: // dx.doi.org / 10.1063 / 1.4934536]


I. INTRODUCTION
Markov models, Markov state models (MSMs), or Master-equation models are a powerful framework to reduce the great complexity of bio-molecular dynamics to a simple kinetic description that represents the underlying transitions between distinct conformations [1][2][3][4][5][6][7] . These models allow us to analyze the longest-living (metastable) sets of structures 8 , the effective transition rates between them 9,10 , the kinetic relaxation processes and their relationship to equilibrium kinetics experiments 7,[11][12][13][14] , and the thermodynamics and kinetics over multiple thermodynamic states [15][16][17][18] . A key advantage of MSMs is that they are estimated from conditional transition statistics between states, and they thus do not require the data to be in global equilibrium across all states. As a result, they are an excellent tool to integrate the data of multiple simulation trajectories that have been run independently and from different initial states into a single informative model [19][20][21] .
A variety of complex molecular processes have been successfully described using MSMs.
There are two key steps in the construction of a MSM. At first a suitable discretization of the continuous conformation space has to be obtained. In most cases no good a-priori discretization is known and the discretization has to be found based on the simulation data. The appropriate choice of discretization is a topic of ongoing research 24,31,32 . The error incurred by the discretization and by the subsequent approximation of the jumpprocess as a Markov process can be systematically controlled and evaluated 7,33-35 . a) Equal contribution b) Electronic mail: frank.noe@fu-berlin. de In the second step one estimates the transition probabilities between pairs of states based on the transition statistics. The most common approach to estimating Markov models from data is by means of a Bayesian framework. One first harvests the transition counts, c ij (τ ) from the data, i.e. how often trajectories were found in discrete state i at some time t and in discrete state j at some later time t + τ . The parameter τ is called lag time and is crucial for the quality of the Markov model 7,33 . Next, one computes the transition matrix either by maximizing the likelihood, i.e. the probability over all possible Markov model transition (or rate) matrices that may have generated the observed transition counts 7,36,37 ; or by sampling Markov models from the posterior distribution [38][39][40][41][42] . A maximum likelihood estimate gives a single-point estimate, i.e. a single Markov model that is "most representative" given the data. However, if some transition events are rare compared to the total simulation length -and this is the typical case in molecular dynamics simulation -this maximum likelihood model might be very uncertain and thus far away from the model that the one would converge to by increasing amount of simulation data. The Bayesian posterior ensemble is a natural approach to quantify such statistical uncertainties and thus to make meaningful comparisons between a Markov models obtained from different sets of simulations, or to experimental data.
A key property of molecular dynamics at thermal equilibrium, and a necessary consequence of the second law of thermodynamics, is microscopic reversibility of the equations of motion. This property is ensured by many simulation procedures 43,44 and carries over to detailed balance between discrete states, i.e. formally leads to a (time-) reversible Markov model. A reversible Markov model is not only physically desirable, it offers statistical advantages as it has only about half as many independent parameters compared to a nonreversible model, and it allows the equilibrium kinetics to be analyzed in a straightforward and meaningful manner. Furthermore imposing detailed balance with respect to a given stationary vector can be used to aid the efficient estimation of rare-event processes from MSMs 45 .
Algorithms imposing detailed balance during likelihood maximization have been discussed in 7,16,37,46 . First methods for the sampling the posterior distribution of reversible transition matrices have been suggested in 39 and later in 47 . A method working with natural priors for reversible chains was proposed in 48 . The sampling of transition matrices reversible with respect to a fixed stationary distribution was also presented in 39 , while a Gibbs sampling algorithm with a significantly improved convergence rate has been developed in 42 . Ref. 49 discusses methods for goodness-of-fit tests for Markov chains.
This article is deliberately broad and presents new concepts, insights and algorithms for reversible Markov model estimation in general, maximum likelihood estimators and Bayesian estimators that mutually benefit from each other. For this reason we first give a survey of principles and consequences of reversible Markov models. We then extend the framework of maximum likelihood estimation of transition matrices by giving a simplified maximum likelihood estimator (MLE) for reversible transition matrices and a new estimator for reversible transition matrices with a fixed equilibrium distribution. The main part of the paper comprises new algorithms for the full Bayesian analysis of the posterior ensemble of reversible Markov models. As yet, three fundamental problems have not been satisfactorily solved: (i) How can one harvest statistically uncorrelated transition counts from trajectories in which subsequent transitions are correlated, so as to give rise to meaningful uncertainty intervals? (ii) Which prior should be used in a Bayesian analysis so as to get error intervals that envelop the true value even for Markov models with many states? (iii) How can we design efficient sampling algorithms for the reversible posterior ensemble, i.e. algorithms that allow to quickly compute reliable error bars for Markov models with many states? In this paper we discuss (i) give a rather complete treatment of problems (ii) and (iii). Efficient sampling algorithms are derived for reversible Markov models and reversible Markov models with fixed equilibrium distribution.

II. REVERSIBLE MARKOV MODELS
In this section we will show that microscopic reversibility carries over to the discretized situation and discuss the desirable properties of a reversible Markov state model.
A. From microscopic reversibility to discrete-state detailed balance Let µ(x) denote the equilibrium distribution on the microscopic degrees of freedom x ∈ Ω, e.g. all-atom coordinates of the system of interest, and let p τ (x, y) denote the conditional transition density of the MD implementation. p τ (x, y) is the probability that the system is found in state y at time t + τ given that it has been in state x at time t. The MD implementation fulfills microscopic reversibility if the following detailed balance equation holds for all pairs of states x, y ∈ Ω. Hence the terms "detailed balance" and "reversible" are equivalent in our context. Since µ(x) p τ (x, y) is the unconditional probability to find the transition (x, t) → (y, t + τ ), Eq (1) means that the system is on average time-reversible -the absolute number of transitions from x to y is equal to the reverse. Microscopic detailed balance is desirable to have in any MD implementation when the aim is to perform simulations in thermodynamic equilibrium. If the (1) would be violated, that would imply the existence of cycles x → y → z → x along which there is a net transport. Since such cycles could be used to generate work, their existence in a system that is driven by purely thermal energy would be inconsistent with the second law of thermodynamics. Dynamical models that are commonly employed in MD implementations fulfill detailed balance. Brownian (overdamped Langevin) dynamics fulfills detailed balance. Hamiltonian and non-overdamped Langevin dynamics fulfill generalized detailed balance with respect to momentum inversion in phase space, but when integrating over the distribution of momenta they do fulfill ordinary detailed balance in position space 50 . In practice, some finite time-stepping integrators do not obey exact detailed balance with respect to the Boltzmann distribution, but we here consider that the MD implementation has been chosen such that detailed balance is at least approximately fulfilled. Now suppose that the state space Ω is partitioned into non-overlapping subsets S 1 , ..., S n that we shall call discrete states here. Each set has an equilibrium probability given by and the transition density gives rise to a discrete state transition matrix P (τ ) with entries Using (2) and (3) and microscopic detailed balance, (1), it is straightforward to verify that Note that (4) holds independently of the choice of the lag time τ . Moreover, (4) implies that π is the equilibrium probability vector of P (τ ). By defining the diagonal matrix Π = diag(π 1 , ..., π n ), we can alternatively write (4) as a matrix equation: constraints dof ≈ none n(n − 1) n 2 reversible 1 2 n(n − 1) + n − 1 1 2 n 2 reversible, fixed π 1 2 n(n − 1)  As a result, if the microscopic dynamics are reversible, the Markov model transition matrix must also be reversible. However, a direct estimate of P from a finite amount of simulation data cannot be expected to fulfill (4) exactly. Thus, the principle validity of detailed balance motivates us to enforce (4) in the process of estimating P . Enforcing detailed balance helps to reduce the statistical error of an estimator for P because it reduces the number of independent variables roughly by approximately one half (see Table (I)). However, there are other consequences of having (4): Given detailed balance, we can compute the molecular equilibrium kinetics in a physically meaningful way and employ some useful analysis tools that are not defined for nonreversible Markov models. Moreover, we can employ more efficient and robust matrix algebra routines when exploiting that P is a reversible matrix.

B. Eigenvalues and eigenvectors
Many methods to analyze the molecular kinetics based on a Markov model transition matrix rely on the eigenvalue decomposition of P . Using the diagonal eigenvalue matrix Λ = diag(λ 1 , ..., λ n ) we can formulate a right eigenvalue problem with right column eigenvectors R = (r 1 , ..., r n ), r i ∈ R n and left row eigenvectors L = (l 1 , ... l n ) : From (7), we can obtain a generalized eigenvalue problem: Π is symmetric positive definite and as a result of detailed balance, ΠP is symmetric. Hence, all eigenvalues λ 1 , ..., λ n are real, and the eigenvectors are orthogonal with respect to the equilibrium distribution 51 : where we have used the weighted scalar product u, v π = i π i u i v i . We can make R orthonormal by scaling an arbitrarily obtained eigenvector r i by r i , r i Inserting the detailed balance formulation (6) into the decomposition (7) immediately gives: which is a left eigenvalue problem with the choice Thus, detailed balance establishes a 1-to-1 relation between the left and the right eigenvectors. We can decompose the transition matrix into its spectral components by just using one set of eigenvectors and the equilibrium distribution, such as: a. Example 1: Consider the following reversible 3× 3 transition matrix, Suppose we generate a Markov chain of length 20 starting from state 1, resulting in the count matrix at lag τ = 1: Now we conduct a nonreversible and a reversible maximum likelihood estimation of the transition matrix given C. Eigenvalues for the exact transition matrix in (11) and both nonreversible and reversible estimates for the given count matrix in 12 are shown in Fig. 1. It is seen that the nonreversible estimate contains complex eigenvalues. These generally come in complex conjugate pairs. Fig. 1 shows a much higher accuracy of the reversible estimate compared to the nonreversible estimate. In order to explore the statistical significance of this observation, we run N = 1000 chains of length L = 20 using transition matrix (11). The reversible and nonreversible estimation results, together with the true eigenvalues, are reported below: It is seen that the reversible estimates do not only have the correct real-valued structure, but can also have smaller uncertainties (here especially for λ 3 ). This is expected to be a general result due to the smaller number of degrees of freedom in the reversible estimate.  b. Example 2: Fig. 2 shows the distribution of eigenvalues from nonreversible and reversible Markov models from simulation data for the alanine-dipeptide molecule (see Sec. V B). The eigenvalues of the reversible estimate are purely real while the non-reversible estimate has eigenvalues with non-zero imaginary part.

C. Equilibrium kinetics analyses
Since detailed balance is a consequence of a system simulated at dynamical equilibrium, it is not surprising that detailed balance in the transition matrix P is a prerequisite to analyze the equilibrium kinetics given P . Since kinetics are related to slow processes we will here only consider the m largest eigenvalues λ 1 , ... λ m and assume that they positive. Here are a few examples for equilibrium kinetics properties computed from reversible transition matrices: 1. The dominant relaxation rates of the molecular sys-tem are: where i ≥ 2 (i = 1 has a relaxation rate of zero and corresponds to the equilibrium distribution). The inverse quantities are the relaxation timescales t i = κ −1 i . These rates or timescales are of special interest because they are often detectable in kinetic experiments such as fluorescence time-correlation spectroscopy, two-dimensional IR spectroscopy or temperature jump experiments -see 11 for a discussion.
2. The decomposition (10) can be used to write kinetic experimental observables in an illuminating form 11,13,14 . For example, the long-timescale part of the autocorrelation of a molecular observable a ∈ R n , e.g. containing the fluorescence values of every Markov state of a molecule, can be written as: 3. The PCCA+ method for seeking m metastable sets of Markov states and its variants 8,52 assumes m real-valued eigenvalues and eigenvectors. It is thus only reliably applicable to reversible transition matrices.
4. Discrete transition path theory 53-55 computes the statistics of transition pathways from a set of states A to a set of states B given a transition matrix. Discrete TPT can be used with nonreversible and reversible transition matrices. However, in the reversible case we get that the forward committor and the backward committor are complementary: i and the net fluxes, when ordering states such that q + i ≤ q + j are given by: which is analogous to an electric current I = U G where I = f + ij , U = (q + j − q + i ) is the potential difference and π i p ij is the conductivity 56 .

III. LIKELIHOOD, COUNTING, AND MAXIMUM LIKELIHOOD ESTIMATION
We restate the transition matrix likelihood and formulate the maximum likelihood estimation problem for Markov model transition matrices. We present new estimation algorithms for reversible Markov models with known or unknown equilibrium distribution π.

A. Likelihood
Suppose we have a discrete sequence S = {s 1 , ..., s N } with s i ∈ {1, ..., n}. If we assume that this sequence is the realization of a Markov chain with lag time τ = 1, the probability that a transition matrix P has generated X is proportional to the product of individual transition probabilities along the trajectory: We have neglected the proportionality constant because we won't need them in order to maximize or sample from (15). This is very handy because one component in this constant is the probability of generating the first state, p s0 , which is often unknown, but is constant for a fixed data set S. Suppose we have c ij transitions from i to j. Then we can group all p ij terms together and get a factor p cij ij . Doing this for all pairs results in the equivalent likelihood formulation: We can see that the count matrix C ∈ R n×n is a sufficient statistics for the Markov model likelihood P(S | P ) -it generates the same likelihood although we have discarded the information in which sequence the transitions have occurred.
If multiple trajectories are available, their count matrices are simply added up.

B. Counting
How should we count transitions for longer lag times τ > 1, or if S is not Markovian at lag time τ ? Regarding the first case, if S is Markovian at lag time τ , a safe approach seems to subsample the trajectory at time steps of τ and then treat the subsampled trajectory as above 7,9 . However this approach is statistically inefficient: If S is also Markovian for shorter lag times than τ , then we are using less information than we could. Even if S only becomes Markovian at lag times of τ or longer, transitions such as 1 → τ + 1 and τ /2 → τ + τ /2 are usually only partially correlated, such that discarding the second transition is also not fully exploiting the data. In practical MD simulations, the lag times required such that a Markov model is a good approximation need to be quite long (often in the the range of nanoseconds), such that subsampling the data at τ will create severe problems with data and connectivity loss. Regarding the second case, if S is not Markovian at lag time τ then treating every c ij as an independent count is incorrect.
Both cases can in principle be treated with the following formalism: We always obtain the count matrix c ij in a sliding window mode 7 , i.e. we harvest all N − τ available transition counts from time pairs (1 → τ ), (2 → τ + 1), ..., (N − τ → N ). Unless S is Markovian at lag time 1, we will now harvest more transition counts than are statistically independent. We can formally correct for this by introducing a statistical inefficiency I ij (τ ) for every count at a given lag time, such that c eff ij (τ ) = I ij (τ ) c ij (τ ) is the effective number of counts, resulting in the likelihood The determination of statistical inefficiencies for univariate signals is well established 57 . Determining I ij (τ ) for transition count matrices is an open problem. A first approach that allows for the first time to estimate consistent, although somewhat too small uncertainty intervals for practical MD data is discussed in 58 . Note that the validity of the estimation algorithms described in the present paper are independent of the choice of the count matrix, such that future methods for estimating the effective count matrix can be adopted without changing the estimation ablgorithms.

C. Maximum likelihood estimation
We will now assume that the effective counts are given. For better readability we will subsequently omit the superscript eff and just use C = (c ij ) to indicate counts. Now we ask the question what is the most likely transition matrix for the observation C, i.e. we seek the maximum likelihood estimate (MLE) that maximizes (15) over the set of transition matrices.

Non-reversible estimation
It is well known that the non-reversible MLE for the transition probability from state i to state j is simply given by the ratio of observed counts from i to j divided by the total number of outgoing transitions from state We use the hat in order to denote an estimator. The term non-reversible implies that reversibility has not been used as a constraint in the estimation ofP nonrev . Of coursê P nonrev can be coincidentally reversible and will be reversible if the count matrix C is symmetric. For this reason, some early contributions in the field forced symmetry in C by counting S forward and backward. This practice is strongly discouraged as it will create a large bias unless the trajectories used are very long compared to the slowest timescales of the molecule.

Reversible estimation
Now we consider the problem of finding the reversible MLEP rev by enforcing detailed balance (4) with respect to an unknown equilibrium distribution (π i ) as constraint in the estimation procedure. Note that the count matrix used for this approach is not modified, i.e. it comes from a forward-only or nonreversible counting and is generally not symmetric. The constraints (4) can be more conveniently handled by defining the new set of variables Note that We can thus recover the transition matrix from X = (x ij ) by: Inserting (21) into (16) and adding constraints for detailed balance and stochasticity leads to the reversible maximum likelihood problem: Ignoring the inequality constraints the optimality conditions are with c i = j c ij and x i = j x ij . There is no closed form solution when including the detailed balance constraint so that (22) has to be solved numerically. One option is to directly solve (23) for x ij and turn it into a fixed-point iteration, as first proposed in 37 : where k counts the iteration number in the algorithm. For a starting iterate x (0) ij fulfilling the constraints in (22), for example x , the iterates will be symmetric and fulfill the inequality constraints for all k > 0.
If we sum over j on both sides of (24) and use (20), we can instead reduce the problem to iterative estimation of the equilibrium distribution: The iteration is terminated when ||π (k+1) − π (k) || < . The final estimateπ is then inserted into (23) to recover the reversible transition matrix estimate: Note that both the optimum sought by Eq. (25,26) exhibitsp ij = 0 if c ij + c ji = 0. Thus, in both optimization algorithms, the sparsity structure of the matrix C + C T can be used in order to restrict all iterations to the elements that will result in a nonzero elementp ij > 0. Furthermore, note that (25,26) are special cases of the transition-based reweighing analysis (TRAM) methodsee Ref. 16 , Eqs (29-30) -for the special case of a single thermodynamic state. An example for the progress of the self-consistent iteration using an alanine dipeptide simulation is shown in Fig. 3.
A different method of iterative solution presented in 7 updates x ij with the exact solution to the quadratic problem arising from (23) while holding all other variables x kl fixed. As shown in Fig. 3, this approach can exhibit faster convergence properties than the fixed-point iteration (25).
Uniqueness of the estimator: The optimization problem (22) can be equivalently transformed into a convex optimization problem by replacing the decision variables x ij with z ij = log(x ij ) (see 17 for details), which implies the uniqueness of the maximum likelihood estimator.

Reversible estimation for given stationary vector
While unbiased MD simulations are useful to estimate state-to-state transition probabilities p ij , enhanced sampling algorithms such as umbrella sampling and replicaexchange MD can be much more efficient in order to gain insight of the equilibrium distribution π. Ref. 45 demonstrates how an uncertain estimate of π can be combined with unbiased "downhill" trajectories in order to estimate rare event kinetics. A key in such a procedure is a way to estimate a reversible Markov model that is most likely given transition counts observed from MD simulations, but at the same time has a fixed equilibrium distribution π. Here we derive a new, efficient estimation algorithm for this task.
Enforcing reversibility with respect to a given stationary vector results in the following constrained optimization problem subject to π can only be the unique stationary distribution of P if P is irreducible. To ensure irreducibility, we restrict the state space to the largest (weakly) connected set of the undirected graph that is defined by the adjacency matrix C + C T . For a system with n states, Eqs (27,28,29,30) is a convex minimization problem in O(n 2 ) unknowns with O(n 2 ) equality and inequality constraints. Solving this with a standard interior-point method requires the solution of a linear system with O(n 2 ) unknowns to compute the search direction at each step. The resulting computational effort of O(n 6 ) operations for solving the linear system quickly becomes unfeasible for increasing n. Therefore we will propose a fixed-point iteration that is also feasible for large values of n.
To solve the maximization problem, we ignore the inequality constraint (29) at first. The row-stochasticity constraint (28) is enforced by introducing Lagrange multipliers λ i and adding penalty terms λ i j p ij − 1 for all i = 1, ..., n to the objective function. The detailed balance constraint (30) is included into the likelihood explicitly by the change of variables These substitutions result in the Lagrange function: that we seek to maximize. By setting the gradient of F with respect to all p ij to zero and subsequently reversing the change of variables, we find the following expression for the maximum likelihood estimatê Note the similarity of this equation with the maximum likelihood result (26) where π has been self-consistently computed from the counts. The row counts c i are here replaced by the yet unknown Lagrange multipliers λ i . In order to find the Lagrange multipliers, we sum Eq. (32) over j: This doesn't give a closed-form expression for λ i . However, based on this equation, we propose the following fixed-point iteration for the Lagrange multipliers: Motivated by the analogy between Lagrange multipliers and row counts described above, we set the starting point to: Taking the limit λ i → 0 + in (34) still leads to a consistent solution. Choosing strictly positive starting parameters according to (35) results in valid iterates from (34). In analogy to the reversible case we iterate (34), (35) until λ (k+1) − λ (k) < . An example for the progress of the self-consistent iteration using alanine dipeptide simulation data is shown in Fig. 4. Note that the converges is nearly three orders of magnitude faster compared to the estimation with unknown equilibrium distribution (Fig.  3).
Given converged Lagrange multipliers, we can exploit (32) to find the maximum likelihood transition matrix P . For this algorithm the inequality constraints (29) are automatically fulfilled when c ij ≥ 0 for all i, j. Care must be taken in two situations: (i) λ i = λ j = 0 -one can show that the simultaneous limit λ i → 0 + and λ j → 0 + can only occur for c ij + c ji = 0, but then we know that Depending on the values of π, the solution λ i may take the value of zero such that equation (32) for i = j becomesp ii = c ii /λ i = 0/0 which is meaningless and is not the correct limit of p ii as c ii goes to zero. However this can be fixed easily by usinĝ p ii = 1 − j =ip ij . for the diagonal elements of P . In summary, we use the following equation for computingP from converged Lagrangian multipliers: Uniqueness of the estimator: Since the above estimation algorithm is iterative, it is fair to ask whether the estimatorP it converges to is unique, or whether there might be multiple local maxima that we could get stuck in. In this case, it is easy to show that the estimator is unique: Let P * be an optimal transition matrix. p * ij = 0 exactly if c ij +c ji = 0. Let Ω = {p ij |c ij +c ji > 0}. Then, the function f (P ) = i,j c ij log p ij is strictly convex on Ω and the constraints restrict the solution on a convex subsetΩ ⊂ Ω. The minimization of a strictly convex function over a convex set has a unique solution.   (25) from an n = 228 state count-matrix obtained from alanine-dipeptide simulation data. Convergence is shown for different total simulation lengths T . (b) Performance comparison of the direct fixedpoint iteration (25) and the quadratic optimizer described in Ref. 7 for reversible transition matrix estimation given the count matrix C = ((5, 2, 0), (1, 1, 1), (2,5,20)) . Shown is the difference of the current likelihood to the optimal likelihood. (c) Same as b, but using the 1734×1734 count matrix from Pin WW folding simulations used in Ref. 54 . .

IV. BAYESIAN ESTIMATION
We introduce new algorithms for sampling the full posterior probability distribution of Markov models, and in particular for estimating uncertainties of quantities of interest, such as relaxation timescales or mean first passage times. A key in these algorithms is the choice of a suitable prior which enforces the sampled matrices to have the same sparsity pattern as the transition count matrix, as this allows the credible intervals to lie around the true value even for large transition matrices. The relevance of the prior is first demonstrated for nonreversible Markov models, for which an efficient sampling algorithm is known. We then introduce new Gibbs sampling algorithms for reversible Markov models with and without constraints on the equilibrium distribution that vastly outperform previous algorithms for sampling reversible Markov models.

A. Bayes theorem and Monte Carlo sampling
Bayes' formula relates the likelihood of an observed effective count matrix C given a probability model P to the posterior probability of the model given the observation, The posterior accounts for the uncertainty coming from a finite observation. It incorporates a-priori knowledge about the quantity of interest using the prior probability P(P ). We will see that a suitable choice of the prior is essential for the success of a Bayesian description for high-dimensional systems.
In general, if we are interested in an observable that is a function of a transition matrix, f (P ), we would like to compute its posterior moments, such as the mean and the variance, While we usually use the maximum likelihood transition matrixP to provide "best" estimates, f (P ), the above integrals are of interest because σ(f ) = Var(f ) gives us an estimate of the statistical uncertainty of f . Alternatively, we might be interested in the credible intervals which encompass the true value of f with some probability, such as 0.683 (1σ intervals) or 0.95 (2σ intervals).
As the integrals (38,39) are high-dimensional, we need to use Monte Carlo methods to approximate them.
In Monte Carlo methods we generate a sample of transition matrices {P (k) } N k=1 distributed according to the posterior and evaluate f at each element P (k) in the ensemble. We then approximate the posterior expectation value (38) and the posterior variance (39) by Obtaining good and reliable samples of the posterior P(P |C) is very difficult. Previous approaches have suffered from some or all of the following difficulties, that are addressed here: 1. Choice of the prior: Given n Markov states (typically 100s to 1000s), transition matrices have on the order of n 2 elements and are thus extremely high dimensional. Most priors used in the past allow to populate all these elements p ij , including those for which no transition has been observed. Although the effect of the prior can be overcome by enough simulation data, for any practical amount of simulation data, such priors will lead to posterior distributions whose probability mass is far away from theP true model. This problem has been first addressed in Ref 20 by designing a prior that equates mean and MLE for nonreversible transition matrices, leading to credible intervals that nicely envelop the the true value. Here we design corresponding priors for reversible Markov models.
2. Uncorrelated transition counts C: As discussed in Sec. (III B), the likelihood, and thus the posterior depends on how transition counts are harvested from the discrete trajectories which are generally time-correlated and not exactly Markovian at any particular lag time τ . While the MLE is often not or little affected by the exact way of counting C, the uncertainties will be dramatically different if C is e.g. counted in a sliding window mode (using transitions starting at all times t = 0, 1, 2, ...), or by subsampling the trajectory (using transitions starting at all times t = 0, τ , 2τ ). Whereas the first approach underestimates the uncertainties, the second approach often overestimates them and is often not practical for large lagtimes τ . Here we suggest to use the effective number of uncorrelated transition counts, C = C eff , and a first approach to compute them is described in Ref. 58 .
3. Efficiency of the sampler: Finally, given a choice of prior and C, a sampling algorithm needs to explore the high-dimensional space of transition matrices in a reasonable time. This is especially problematic for reversible Markov models. The first Monte Carlo algorithm for sampling the reversible posterior, described in Ref. 39 , suffers from poor mixing due to small acceptance probabilities of the individual steps. In Ref. 42 , an improved sampler was proposed. Here, we propose sampling algo-rithms for reversible Markov models with and without fixed equilibrium distribution whose efficiencies go far beyond previous approaches.

B. Non-reversible sampling
Let us first illustrate the effect of prior choice on Bayesian estimation of nonreversible Markov models. A convenient functional form for the prior is the Dirichlet prior where B = (b ij ) is a matrix of prior-counts. For this choice, the posterior is given by z ij = c ij + b ij is the matrix of posterior pseudo-counts.
In the non-reversible case we can generate independent samples from (43) by drawing rows of P (k) independently from Dirichlet distributions j p Choosing a uniform prior, b ij = 0, assigns equal prior probability to all entries, p ij , in the posterior ensemble. But this a-priori assumption can lead to serious problems when estimating quantities for meta-stable systems.
Consider for example the following transition matrix for a birth-death chain consisting of two meta-stable sets A = {1, . . . , m}, B = {m + 2, . . . , n}, separated by a kinetic bottleneck in form of a single transition state, For barrier parameter b = 3 and sets with m = 50 and n = 101 the expected time for hitting B from state x = 1 is 2 · 10 5 steps. Now we are interested in the Bayesian estimator for a simulation of length L = 10 7 . The true distribution can be estimated with arbitrary precision by repeating the simulation many times. Here, 10 3 repetitions led to an estimate of the 90% percentile for the mean first passage time of [1.5, 2.7] · 10 5 (see Table below).
In practice, we cannot afford to repeat the simulation many times but would like to approximate the true value and its statistical uncertainty from the given simulation data. Sampling the nonreversible posterior given expected counts for a single chain of length L = 10 6 with a uniform prior, b ij = 0, results in non-zero transition probabilities for elements p ij which are zero in the true transition matrix. As a result artificial kinetic pathways circumventing the bottleneck are appearing in the posterior ensemble which lead to a dramatic underestimate of the mean first passage time. The Bayesian estimate with 90% credible interval obtained from 10 3 posterior samples is [1.9, 2.0] · 10 3 , and thus two orders of magnitude smaller than the true value 2 · 10 5 .
Using the prior b ij = −1 suggested in Ref. 20 results in 90% credible intervals, [1.5, 2.7] · 10 5 , which clearly cover the true value 2 · 10 5 . The choice b ij = −1 leads to a posterior distribution in which sampled transition matrices P have the same sparsity structure as the count matrix C, i.e. p ij = 0 if c ij = 0. As count matrices in the present context are generally sparse, we call this prior briefly sparse prior. Apparently the sparse prior leads to consistent credible intervals covering the true value. method estimate true 2.0 2.7 1.5 · 10 5 uniform prior b ij = 0 1.95 2.0 1.9 · 10 3 sparse prior b ij = −1 2.0 2.7 1.5 · 10 5 Fig. 5 shows the convergence of the 90% credible interval for the sparse and the uniform prior. The credible interval for the sparse prior envelopes the true value already given little data. To achieve consistency using the uniform prior requires simulations order of magnitudes longer than the timescale of the slowest process, thus rendering inference under this prior unpractical.
Note that our prior induces a fixed sparsity structure. This concept should not be confused with other sparsity inducing priors used i.e. in the context of Bayesian compressed sensing 61 , where the sparsity pattern is subject to uncertainty.

C. A prior for reversible Markov models
Now we will present a new method for the sampling of reversible transition matrices. In our new approach we replace the Dirichlet prior (42) by a new prior for reversible sampling.
Similarly as in reversible maximum likelihood estimation, we define our reversible transition matrix sampler in the space of unconditional transition probabilities x ij . For convenience we restrict ourselves to the independent set of variables with i ≤ j (remember that x ij = x ji for reversible matrices), and keep them normalized to 1: Although X is defined slightly different as in the maximum-likelihood case, the mapping from X back to P is still given by Eq. (21). We define a prior for reversible sampling on the set of X matrices rather than on P : Choosing x ij as the set of independent variables has the advantage that obeying detailed balance amounts to sampling symmetric matrices, X = X T .
The posterior for reversible sampling is then given by Below we will first consider how to sample from (48) using general prior counts b ij . Then we will consider the specific choice b ij = −1 for all i ≤ j and show that this choice has similar properties as the sparse prior in the nonreversible case.

D. Sampling reversible transition matrices
There is no known method to generate independent samples from the posterior under the reversibility requirement. Instead we will use a a Markov chain Monte Carlo (MCMC) method to generate samples from the posterior ensuring that each sampled transition matrix fulfills the detailed balance condition (4). Our Markov chain will generate the ensemble {X (k) } N k=1 via a set of updates advancing the chain from X (k) → X (k+1) starting from a valid initial state X (0) . We can do a simple row-normalization of the X matrices to obtain the desired ensemble {P (k) } N k=1 . Expectation values and variances will again be estimated using (40,41).
Similarly as in 39,42 we will construct our Markov chain using a Gibbs sampling procedure, where we sample a single element of X in each step while leaving the other elements unchanged. We repeat this sampling procedure for every element of X, thus completing a Gibbs sweep. As detailed in the appendix, we can use the following general Gibbs step to sample the posterior (48): 1. Select an arbitrary element x kl . Propose a new (unscaled) matrix X → X by sampling this element from the proposal density q(x kl |X): here q(x kl |X) is an arbitrary, scale-invariant density. Scale-invariance means that q(x kl |X) ∝ q(cx kl |cX) for any positive constant c.
2. AcceptX as a new step in our Markov chain with probability min{1, p acc } where where b 0 = k≥l b kl and γ is the marginal density: While this approach will work for any choice of prior counts, we will now use the sparse prior b ij = −1 for all i, j with the hope to achieve similarly good results as in the nonreversible case. For this choice, γ(x kl |X) is scaleinvariant, i.e. γ(x kl |X) = γ(cx kl |cX), and the Jacobian pre-factor in (50) is one. Thus we have: Thus the ideal choice of the proposal density is q ≡ γ, which would guarantee that the acceptance probability is always 1. This proposal density degenerates to a point probability at zero if c kl + c lk = 0, which implies b ij = −1 encodes a priori belief that any transition for which neither the forward direction nor the backward direction has ever been observed in the data has zero probability in the posterior ensemble. Thus, this prior enforces P to have the same sparsity structure as the count matrix, like the choice b ij = −1 for nonreversible sampling. Note that the reversible MLE has the same sparsity structure as can be seen from the update rule (24). We will choose proposal densities γ(x kl |X) that are also scale-invariant. In this case the normalization step 3 above (X →X ) can be omitted, i.e. if we accept X , we can directly set it as our new sample X (k) and obtain P (k) by row normalization. We will now outline how to design the proposal density γ such that the acceptance probability is 1 or nearly 1. For k = l, sampling x kk ∼ γ(x kk |X) is equivalent to sampling the following transformed variable (see Appendix): So we can simply define q(x kk |X) ≡ γ(x kk |X) and generate x kk by For k = l, there is no known way to draw independent samples, but γ(x kl |X) can be well approximated by a Gamma distribution by matching the maximum point and the second derivative at the maximum. A Gamma distribution can be efficiently sampled and we use it as proposal density and accept the resulting x kl with probability min{1, p acc }. Specifically, our proposal step is: with the parameters which matches the value and the first two derivatives of the true marginal density at the maximum (see Appendix for derivation), and leads to acceptance probabilities close to one for most values of x kl . However, if the current value of x kl is in one of the heavy tails of the distribution γ(x kl |X), the acceptance probability can be much less than 1 and the Markov chain can get stuck.
In order to avoid this problem, we utilize a second step to generate x kl : After we sample x kl from the proposal density (55) we sample x kl according to: where N (x|m, s) denotes the Normal distribution of x with mean m and standard deviation s. The proposal density defined by the above update is and the corresponding acceptance probability is: x kl x kl (65) In summary, the proposed Algorithm 1 is a Metropolis within Gibbs MCMC algorithm with adapted proposal probabilities for each Gibbs sampling step. For efficiency reasons, transition matrix elements (i, j) for which no forward-or backward transition counts have been observed, can be neglected in the sampling algorithm, in order to account for the effect of the sparse prior.
with probability min{1, pacc} Sample x kl by log x kl ∼ N (log x kl , 1).
x kl x kl Accept x kl as x (j+1) kl with probability min{1, pacc} end end

E. A prior for reversible Markov models with fixed equilibrium distribution
As before we will be working with variables x ij = π i p ij related to transition matrix entries p ij via (45). In contrast to the previous algorithm, π is not a function of P but fixed. Thus the single normalization condition (46) is replaced by a condition for each row: in order to ensure reversibility with respect to the given π. All x kl in the lower triangle (k > l) are used as independent variables. Given a valid X matrix, an update that respects the constraints is given by (67d) (67b) and (67d) ensure that the normalization condition (66) holds for the new sample and will thus keep π constant, while (67c) restores the symmetry and thus ensures reversibility of P . We will again use the prior (47) defined on the set of X matrices and sample from the posterior (48). The ideal proposal density of x kl is which is the conditional distribution density for given all off-diagonal elements of X (except x kl ), π and the counts C.
We have seen that a correct choice of prior parameters was essential in order to successfully apply the posterior sampling for meta-stable systems. As in the reversible case we will use b kl = −1 for k > l to enforce x kl = 0 whenever c kl + c lk = 0.
However, the choice of prior counts for the diagonal elements b kk is less straightforward. According to our experience, a good choice is to determine the value of b kk based on the maximum likelihood estimatep kk of the k-th diagonal element as which ensures that the posterior expectation of p kk is zero if and only ifp kk = 0, and the conditional expectation of (67e), matches the MLE of the one-dimensional likelihood function for x kl given X ifp kk > 0 and c kk = 0. (Note that for the MLE of P , there is at most one k which satisfieŝ p kk > 0 and c kk = 0 -see proof in Appendix.) However, in the case thatp kk = 0, the conditional (67e) would then degenerate so that x kk = 0 with probability one, and the k-th row and column of X would remain fixed in the sampling process. This effect can break ergodicity in the sampled Markov chain and therefore prevent convergence of the algorithm. This problem is avoided by regularizing the prior choosing the prior parameter as b kk = −1 + forp kk = 0 such that (67e) does not degenerate, where > 0 is a small number. In addition we need to ensure that the Markov chain is started from an initial state X (0) with x (0) kk > 0. In summary, we select the prior of X for reversible sampling with fixed π This choice of prior will again ensure that c kl + c lk = 0 results in p kl = 0 and p lk = 0 for all k < l and for all posterior samples, a property shared by the reversible MLE with fixed stationary vector. This ensures that the posterior mass is located around the maximum likelihood estimateP and again prevents the occurrence of artificial kinetic pathways in the posterior ensemble.

F. Sampling reversible Markov models with fixed equilibrium distribution
We now investigate how to efficiently sample the conditional (67e). Here we assume without loss of generality that x kk < x ll and transform x kl ∈ (0, The ideal proposal density of v is then −(c kl +c lk +c kk +c ll +b kl +b kk +b ll +2) with Like in the previous algorithm, we can approximate the conditional of v by a Gamma distribution as: with See Appendix for derivation. Then v can be sampled by a Metropolis sampling step with the proposal density Gamma(v |α, β) and the acceptance ratio min{1, p acc } = min 1, where v = x kl /x kk denotes the original value of v. In addition, in order to avoid the sampler from getting stuck at an extremely small or large value of v, we also utilize the same strategy as in Section IV D to generate v by .
The proposed Algorithm 2 for sampling of reversible transition matrices with fixed stationary vector can again be characterized as a Metropolis within Gibbs MCMC algorithm with adapted proposal probabilities.

A. Validation
We first demonstrate the validity of the reversible sampling algorithm for the following 2 × 2 count-matrix, In Fig. 6 we compare the sampled histograms using Algorithm 1 with analytical values for the non-reversible posterior with Dirichlet-prior-counts b ij = −1. Any 2 × 2 transition matrix automatically fulfills detailed balance, and therefore the analytical and sampled densities are expected to be equal. The histogram for samples of the reversible algorithm are indeed in agreement with the analytical posterior. In Fig. 7 the sampled histogram for count matrix (81) with fixed stationary distribution π = (0.25, 0.75) using Algorithm 2 is compared with the exact posterior distribution. The detailed balance relation (4) with fixed stationary vector enforces a linear dependency between the transition matrix element p 12 and p 21 . The resulting posterior is therefore restricted to the line π 1 p 12 = π 2 p 21 such that the projection on p 12 in Fig. 7 already contains the full information about the one-dimensional posterior. A comparison between histogram frequency and analytical density demonstrates the validity of the algorithm.

B. Applications
To demonstrate the usefulness of the proposed algorithms we apply them to molecular dynamics simulation data. Here, two systems are chosen to illustrate our methods: (1) The alanine-dipeptide molecule, and (2) the bovine pancreatic trypsin inhibitor molecule (BPTI).
We start by discussing the alanine dipeptide results. The system was simulated on GPU-hardware using the OpenMM simulation package 62 using the amber99sb-ildn forcefield 63 and the tip3p water model 64 . The cubic box of length 2.7nm contained a total of 652 solvent molecules. We used Langevin equations at T = 300K with a time-step of 2f s. A total of 10µs of simulation data was generated. The φ and ψ dihedral angles were discretized using a 20 × 20 regular grid to obtain a matrix of transition counts C = (c ij ), here by sampling one count per lag time τ . Below we will show histograms for two important observables, largest implied time-scales t i and expected hitting times, τ (A → B), for pairs A, B of meta-stable sets. We compute the posterior sample-mean and 90% credible intervals for 1µs of simulation data and show that the credible intervals nicely envelop a reference value obtained from the MLE transition matrix for the total simulation data, supporting the proposed prior as a 'good' choice for reversible sampling in meta-stable systems.  Tables II and III compare the reference values with the sample mean µ and sample standard deviation σ for each observable.
In order to gain a first impression of the efficiency of the sampling and the quality of our estimates, we compute the integrated autocorrelation time t corr for each quantity sampled (here implied timescales and hitting times). The error of the sample mean m[f ] compared to the true mean f can then be estimated as where N eff = N/(1 + 2t corr ) is the effective number of samples with N the total number of samples. See Ref. 65 for a thorough discussion. t corr and are also reported in Table (II). Fig. 8d-f) show histograms for expected hitting times for the three transitions C 5 → C ax 7 , C 5 → α L and C 5 → α R between meta-stable sets in the φ and ψ dihedral angle plane. Again, mean values are in good agreement with the corresponding reference values. the standard deviation σ, the estimated correlation time t corr and the error of the mean value, .

Alanine dipeptide, reversible sampling with fixed equilibrium distribution
Below we report results for sampling with fixed stationary distribution. The stationary distribution π i was, for sake of simplicity, computed using the relative frequencies of state occurrences, It should be noted that a more useful and independent source of π are enhanced sampling simulations targeted at rapidly generating a good estimate of the equilibrium probabilities alone. See Ref. 45 for methods and applications that combine MD simulations and enhanced sampling simulations in order to efficiently compute rareevent kinetics.
Results are shown in In Table IV and Fig. 9. The sample mean is again in good agreement with the reference value. For the computatio of the reference value we use the MLE transition matrix of the full simulation data reversible with respect to the input stationary distribution for the posterior sampling. The additional constraint imposed by fixing the stationary distribution is clearly reflected in smaller standard deviations for all shown observables compared to the reversible case.
Histograms for expected hitting times between metastable sets are shown in Fig. 9d-f). The sample mean is again in good agreement with the reference value. Again, we summarize our results, c.f. Table V.    Table IV for the definition of symbols.

Bovine pancreatic trypsin inhibitor, reversible sampling
For BPTI, we used the 1 ms simulation generated on the Anton supercomputer 66 . Please refer to that pa-per for system setup and simulation details. We prepared data as follows: C α atom positions were oriented to the mean structure and saved every 10 ns, resulting in about 100,000 configurations with 174 dimensions. Timelagged independent component analysis (TICA) 32,67 was applied to reduce this 174-dimensional space to the two dominant IC's as a spectral gap was found after the second nontrivial eigenvalue. k-means clustering with k = 100 was used to discretize this space.
Effective count matrices were obtained using the method described in 58 at a range of lag times up to 2 µs. Fig. 10 shows the implied relaxation timescales obtained from a maximum likelihood estimate with values comparable to the Hidden Markov model analysis in Ref. 10 . The figure also shows uncertainties computed from a reversible transition matrix sampling as described above with N = 1000 samples. Only every 20th transition matrix sample was used to compute timescales in order to reduce the computational effort to 50 eigenvalue decompositions. It is seen that the MLE is nicely contained in the 2σ (95%) credible interval. The entire transition matrix sampling for Fig. 10   analysis of the computational efficiency is made. Figure 10. Implied timescales for a Markov model of bovine pancreatic trypsin inhibitor (BPTI). The error bars are 95% confidence intervals estimated using the reversible transition matrix sampling algorithm described here using transition counts as described in Ref. 58 .

C. Efficiency
We compute acceptance probabilities of the Metropolis-Hastings steps and compare the statistical efficiency of the proposed sampling algorithm with the algorithm proposed in Ref. 39 that uses uniform proposal densities. Efficiency is measured in terms of achieved autocorrelation times for sampling of transition matrices with different sizes. As a representative observable we choose the largest relaxation time-scale t 2 for the alanine dipeptide molecule and compute autocorrelation functions and autocorrelation times from a large sample of size N = 10 6 . Two differently fine discretizations were used, resulting in n × n-shaped transition matrices with n = 258 and n = 1108. Fig. 11 shows autocorrelation functions for the second largest relaxation time-scale, t 2 , for a reversible posterior ensemble. The autocorrelation function for the reversible sampling algorithm with posterior adapted proposals shows a much faster decay than the autocorrelation function for the algorithm in Ref. 39 . Table VI compares acceptance probabilities and autocorrelation times. The present proposal steps lead to very high acceptance probabilities, p > 0.99, for the sampling off-diagonal entries. The main advance, however comes from the fact that the step for sampling diagonal transition matrix elements in Ref. 39 has suffered from a very poor acceptance probability. As that step was the only step that modified the equilibrium distribution, the sampler in Ref. 39 has very poor mixing properties. In contrast, our current algorithm generates independent samples for the diagonal elements, resulting in an acceptance probability of p = 1.0.
The autocorrelation times for the new sampler are more than a factor 5 shorter for the small (233 state) matrix and more than a factor 13 shorter for the large (1108 state) matrix, indicating a much higher efficiency of the new approach. The autocorrelation time of the new algorithm increases only mildly for matrices of increased dimension, indicating that the present algorithm will be useful for very large Markov models. Fig. 12    adapted proposals in reversible sampling algorithm with fixed stationary distribution again result in a much faster decay of the autocorrelation function than the uniform proposals of the algorithm in Ref. 39 . Table VII compares acceptance probabilities and autocorrelation times for the two algorithms. For sampling with fixed stationary vector there is no sampling step for the diagonal elements. Although the average acceptance probabilities are only a factor of 3-4 better for our new algorithm, the autocorrelation times are decreased by a factor 35 for the small system (233 states) and over a factor 300 for the large system (1108 states). Again, there is only a mild increase in autocorrelation time when the dimension of the sampled space is increased.

VI. CONCLUSION
In this work we have described and significantly extended the state of the art in reversible Markov model estimation. Reversible Markov models are expected to naturally arise from molecular dynamics implementations that fulfill microscopic reversibility. Reversibility is an essential property in order to analyze the equilibrium kinetics of a molecule. However, in order to have reversibility in a Markov model, it needs to be enforced in the estimation procedure. When done correctly, reversible estimation does not bias the model but rather reduces statistical errors as a result of a smaller number of degrees of freedom.
We have presented minor improvements to an existing self-consistent estimation algorithm for reversible Markov models. Then we have presented a new and efficient algorithm to estimate reversible maximum likelihood Markov models given a fixed equilibrium distribution.
The main part of the presented work focuses on the long-standing problem of Bayesian estimation of the posterior ensemble of reversible transition matrices. Although several algorithms to sample reversible Markov models have been presented in the past, they have been hampered by three fundamental problems, two of which are addressed here: (i) Which prior should be chosen such that the posterior is located around the true value rather than completely elsewhere? (ii) How should transition counts be obtained when time series are correlated and not really Markovian at any given lag time τ ? (iii) How can the sampling algorithm be made efficient such that also large transition matrices can be sampled in reasonable time?
To address problem (i), we develop priors that ensure that reference values and sample mean are similar. The key property of these priors is that they make the a priori assumption that transitions between states that have not been sampled in the trajectory in either direction, have zero probability. This is a sparse prior, i.e. an improper prior enforcing that the sampled transition matrix has the same sparsity structure as the maximum likelihood estimate and as induced by the observation. In contrast to most other priors that have been previously suggested, these sparse priors achieve the desired property of creating errors bars that nicely envelop the reference estimates.
For problem (ii), we have described the principles of how it can be addressed. We suggest that effective count matrices are obtained using the concept of statistical inefficients. A separate preprint 58 suggests an initial solution towards this aim that is successfully applied on simulation data of the bovine pancreatic trypsin inhibitor in the present paper. The solution of problem (ii) is still in its infancy and needs further investigation..
For problem (iii), we present highly efficient Gibbs sampling algorithms for reversible transition matrices and reversible transition matrices with fixed equilibrium distribution. Both methods are demonstrated to have acceptance probabilities close to 1 in their individual update steps. Autocorrelation times from samples of the slowest relaxation timescale are one or two orders of magnitude shorter than with a previous Gibbs sampling algorithm, indicating a high statistical efficiency of our sampler.
Implementations of all algorithms described here are available in PyEMMA 68 as of version 2.0 or laterwww.pyemma.org .