Optimal estimates of self-diffusion coefficients from molecular dynamics simulations

Translational diffusion coefficients are routinely estimated from molecular dynamics simulations. Linear fits to mean squared displacement (MSD) curves have become the de facto standard, from simple liquids to complex biomacromolecules. Nonlinearities in MSD curves at short times are handled with a wide variety of ad hoc practices, such as partial and piece-wise fitting of the data. Here, we present a rigorous framework to obtain reliable estimates of the self-diffusion coefficient and its statistical uncertainty. We also assess in a quantitative manner if the observed dynamics is indeed diffusive. By accounting for correlations between MSD values at different times, we reduce the statistical uncertainty of the estimator and thereby increase its efficiency. With a Kolmogorov-Smirnov test, we check for possible anomalous diffusion. We provide an easy-to-use Python data analysis script for the estimation of self-diffusion coefficients. As an illustration, we apply the formalism to molecular dynamics simulation data of pure TIP4P-D water and a single ubiquitin protein. In a companion paper [J. Chem. Phys. XXX, YYYYY (2020)], we demonstrate its ability to recognize deviations from regular diffusion caused by systematic errors in a common trajectory “unwrapping” scheme that is implemented in popular simulation and visualization software.


I. INTRODUCTION
Brownian motion is one of the pillars of biological physics, being observed on both microscopic and mesoscopic scales. Einstein's seminal work 1 established the mean squared displacement (MSD) as the central observable to characterize the jittering motion of microscopic objects. On sufficiently long time scales, the MSD of a freely diffusing tracer particle or macromolecule grows linearly in time with a slope directly proportional to its self-diffusion coefficient D. Accordingly, D is commonly estimated via linear fits to measured MSDs, 2 which at first glance may seem like a rigorous approach, because the resulting estimate is unbiased. However, the precision of the estimate suffers if too many MSD values are used for the fit. 3,4 This counter-intuitive behavior results from the fact that the most common estimator for the MSD of a finite time series {X 0 , X 1 , . . . , X N −1 , X N }, namely introduces correlations between the MSD i at different time lags t i = i∆t with i = 1, 2, . . . , M ≤ N . Furthermore, the underlying dynamics may not be purely diffusive, as is often the case with molecular dynamics (MD) simulations. Non-diffusive dynamics, arising, e.g., due to ballistic motion, 5 short-lived memory 6 or caging effects, 7 affect the MSD values at short times. As an example of a complex diffusion process where the transition to a) Electronic mail: gerhard.hummer@biophys.mpg.de regular diffusion is slow, Vargas and Snurr 8 analyzed the translational diffusion of alkanes in an anisotropic metalorganic framework. A common strategy to characterize non-diffusive short-time dynamics is to invoke elaborate non-Markov processes, 9 but these have to be specifically tailored to the data at hand. We therefore face the problem that diffusion coefficient estimates using short-time data are compromised by possible non-diffusive dynamics, and estimates using long-time data suffer from large statistical uncertainties.
To address the latter, various proposals for improved diffusion coefficient estimation have been published in recent years, 4,10-12 correcting for experimental artifacts and making more efficient use of the available data. Here, we introduce a rigorous framework that combines these sophisticated estimators with a sub-sampling procedure, where the length of the recording-time interval is varied to suppress nonlinearities in the MSD curves on short time scales. The resulting sub-sampled dynamics is then modeled via a diffusive process X combined with an instantaneous stepwise spread, satisfying whose details are presented at the beginning of Sec. II. The associated self-diffusion coefficient is given by where ∆t denotes the length of the time interval between two consecutive observations X i and X i+1 ∀i. By introducing the static noise parameter a 2 into our generic model of diffusion, we aim to account for "molecular events" such as correlated collisions and cage diffusion without having to use complex system-specific dynamical models. A price we pay is that the process X becomes valid only for time intervals ∆t that exceed the time scale of the molecular events. We note that ballistic motion, e.g., in an underdamped Langevin equation would lead to a 2 < 0. However, in practice, such down-shifts are normally not observed, because they get compensated by other effects that broaden the MSD.
The process X is compatible with the above-mentioned diffusion coefficient estimators, a few of which we briefly review and then compare performance-wise in Sec. II.
To determine the range of interval lengths where X appropriately describes the data at hand, we propose a quality factor based on χ 2 -statistics in Sec. III A to decide whether an estimate overfits or underfits the model to the data. As a probe of possible anomalous diffusion, the resulting diffusion coefficient estimate obtained at short times can then be compared to observed long-time dynamics, as described in Sec. III B. We illustrate our framework by applying it to various MD trajectories in Sec. IV: First on an ensemble of pure TIP4P-D water 13 (Secs. IV A-IV C) and then on a single ubiquitin protein solvated in TIP4P-D water (Sec. IV D). Although the analysis of the latter system reveals some ambiguities, our framework identifies the dynamics as diffusive with a self-diffusion coefficient that also properly captures ubiquitin's long-time behavior (Sec. IV E). In the companion paper, Ref. 14, we show that unreasonable quality-factor values arise due to previously overlooked shortcomings in a trajectory "unwrapping" scheme used widely in the analysis of MD simulations at constant pressure. Finally, Sec. V provides a summary of our results and the Appendix gives the interested reader further details on some of the more elaborate derivations.

II. DIFFUSION COEFFICIENT ESTIMATION
At short times, inertial dynamics and correlated local motions lead to deviations from simple diffusion typically resulting in a fast initial spread of the particle position. On longer time scales, the resulting MSD then resembles Eq. (2). Let us therefore consider two discrete-time Wiener processes, Z and X, where the latter process evolves on top of each realization Z i of the former. The dynamics of the two processes is captured by the following iterative equations, where R and S denote uncorrelated normal distributed random variables with zero mean and unit variance, and δ i,j is the Kronecker delta that evaluates to one if i = j and zero otherwise. By construction, we have Z i S j = 0 ∀i, j and Z i R j = 0 ∀i ≤ j. Because Z and X are also both normal distributed, their distributions are fully characterized by the mean and (co)variance, namely which result in Eq. (2) for the MSD of X.
In the following, we shall discuss a few established diffusion coefficient estimators within the context of the process X, and then compare them performance-wise.

A. Ordinary least squares estimators
A linear fit to the MSD corresponds to minimizing the sum of squared residuals between the data [Eq. (1)] and the model [Eq. (2)], thus giving rise to a set of ordinary least squares (OLS) estimators for some M ≤ N . In non-pathological cases, the OLS problem [Eq. (4)] is analytically tractable and results in the following expressions for the estimators, Although these estimators are unbiased, their precision drastically decreases for increased numbers M of MSD values used in the fit. This counter-intuitive behavior can be read off their corresponding variances (see Appendix A) and, to circumvent this shortcoming, it was suggested in Ref. 4 to vary the value of M to single out the estimators with the smallest variances. The associated standard deviation is typically much larger than ad hoc uncertainty estimates, such as those constructed from fits to handselected linear regimes in the data. 15

B. Covariance-based estimators
In Ref. 12, an alternative to the OLS estimators was proposed, which circumvents the computation of MSD i via Eq. (1) altogether. It makes use of the fact that (X n+1 − X n )(X n − X n−1 ) = − a 2 2 must hold ∀n, as well as Eq. (2) evaluated at i = 1. This results in the unbiased covariance-based estimators (CVE) whose variances are given by var(σ 2 CVE ) = 4 While Eqs. (6) and (7) are computationally inexpensive to evaluate, they are only guaranteed to be practically optimal for signal-to-noise ratios σ 2 /a 2 larger than one. 12

C. Generalized least squares estimators
To improve upon the OLS scheme, one can take into account correlations between the residuals, which are collected in the covariance matrix of the MSD i -values with elements (see Appendix B) for i, j = 1, 2, . . . , M and where Θ(z) denotes the Heaviside unit step function. Equation (4) is then replaced by (a 2 GLS , σ 2 GLS ) = arg min in a procedure commonly referred to as generalized least squares (GLS). Note that the covariance matrix is evaluated at (a 2 , σ 2 ) = (a 2 GLS , σ 2 GLS ), and thus remains fixed while a 2 and σ 2 are being varied. The a priori unknown a 2 GLS and σ 2 GLS are then found by requiring selfconsistency.
In general, there exist no closed-form expressions for the estimators of the GLS problem [Eq. (9)]. They satisfy the following transcendental coupled equations, which have to be solved numerically, e.g., via iterative root-finding methods. In Appendix C, we provide one such algorithm, based on fixed-point iteration, which we have implemented in a Python data analysis script. 16 Analogous to a 2 OLS and σ 2 OLS , the estimators in Eqs. (10) are asymptotically unbiased. Lower bounds for the variances of the GLS estimators can be inferred from the inverse of the Fisher information matrix I associated with the likelihood L(a 2 , σ 2 ) ∝ exp −χ 2 (a 2 , σ 2 )/2 , which is constructed from the χ 2 -statistic [Eq. (9b)]. In our case, the components of I read and result in the inverse matrix The variances of the GLS estimators are thus estimated from below by Whenever the estimators are Gaussian, equality holds in Eqs. (11).

D. Properties of the GLS estimators
It can be shown (see Appendix D) that Eqs. (10) are of the form which implies, on the one hand, that must hold for sufficiently large signal-to-noise ratios σ 2 /a 2 , and, on the other hand, that in the special case of a ≡ 0, the estimator becomes analytically tractable ∀M, N and simply reads In this case, Eqs. (11) reduce to Another special case for which Eq. (12c) becomes exact is given by M = 2. Equations (10) are then analytically tractable and give rise to the closed-form estimators with the following variances, These results coincide with the OLS estimators and their respective variances (given in Appendix A) for M = 2.
E. Application to three-dimensional time series Up until now, we have solely focused on one-dimensional time series, while two-and three-dimensional particle trajectories are recorded in experiments and MD simulations. For a three-dimensional time series, we can decompose the associated MSD in the following way, where the MSD d,i -values are obtained by evaluating Eq. (1) using the respective one-dimensional time series along each spatial dimension d ∈ {x, y, z}. If we furthermore model the stochastic dynamics along every Cartesian coordinate via our minimal diffusion process X, then the expectation value of Eq. (17) is simply given by with D = σ 2 3D /6∆t. The elements of the associated covariance matrix read Here, we have assumed no correlation between dimensions, i.e., MSD d1,i MSD d2,j = MSD d1,i MSD d2,j . This linear behavior propagates all the way up to the estimators θ 2 EST ∈ {a 2 EST , σ 2 EST } for EST ∈ {OLS, GLS, CVE} and their respective variances, resulting in The latter relation follows from the fact that holds for all the estimators considered in this paper. Finally, a χ 2 -statistic can be calculated for three-dimensional time series by replacing MSD i , a 2 , σ 2 and Σ i,j (a 2 GLS , σ 2 GLS ) in Eq. (9b) with their three-dimensional analogues, giving

F. Comparing the estimators
Because all the above-mentioned estimators are unbiased, we must compare their respective variances to rank them. Under the assumption that all four estimator pairs predict roughly the same values for a 2 and σ 2 , which is realized in practice for N 1, we can compare their corresponding variances as functions of the signal-to-noise ratio σ 2 /a 2 . In general, the GLS estimators outperform their counterparts at low and intermediate signal-to-noise ratios, as visualized in Fig. 1a. Only for σ 2 /a 2 1 do the CVE estimators eventually catch up, as seen by their intersection point with the closed-form GLS estimators. For M = 2, the estimator variances cross at For M > 2, the crossing point must be found numerically and behaves roughly like (see Fig. 1b) Therefore, for M > 2, the GLS estimators tend to be superior already for comparably short trajectory lengths.

III. STATISTICAL TESTING OF MODEL ASSUMPTIONS AND QUALITY OF FIT
Another advantage of the GLS estimators is their natural compatibility with the quality factor that is introduced  below and serves as a measure for the quality of diffusion coefficient estimates. This section also provides a test to probe for possible anomalous diffusion in the long-time limit.
A. Determining the optimal length of the recording-time interval If non-diffusive short-time dynamics are present in a time series, their influence can be suppressed via subsampling, where intermediate observations from the original time series {X 0 , X 1 , . . . , X N −1 , X N } are removed to generate a set of shorter time series with respective recording-time interval lengths ∆t 1 , ∆t 2 , ∆t 3 , . . . , ∆t n , where ∆t n = n∆t 1 . Alternatively, one can choose different starting points to sub-sample the time series; e.g., it is just as valid to use While a longer interval will generally result in a more linear MSD curve, at least for n N , the shortened length of the new time series adversely affects the uncertainty of the diffusion coefficient estimate. The estimator uncertainty can be improved somewhat by including in the analysis all n − 1 interspersed time series constructed from the discarded points between X i and X i+n . However, these additional time series are not fully independent, so their inclusion has only a modest effect and is therefore not considered in what follows.
We balance the competition between systematic and statistical errors at short and long ∆t n , respectively, by exploiting the fact that the GLS estimators originate from the are normal distributed with Σ −1/2 i,j denoting the elements of the inverse square root of the covariance matrix [Eq. (8)] evaluated at (a 2 , σ 2 ) = (a 2 GLS , σ 2 GLS ). If there are nonlinearities present at short times that skew the distribution of the residuals, we expect atypical χ 2 -values and the recording-time interval has to be lengthened. However, instead of focusing on the broadly distributed χ 2 -values, we instead consider the associated quality factor which coincides with the reciprocal cumulative distribution function (CDF) of the χ 2 -statistic and therefore only takes values between zero and one. Here, denote the lower incomplete and ordinary Γ-functions, respectively. The quality factor can be seen as the probability to observe a χ 2 -value greater than χ 2 (a 2 GLS , σ 2 GLS ).
If the average quality factor is significantly lower than 1/2, certain features of the data are not well captured by our diffusive model, thus requiring us to lengthen the recording-time interval. Conversely, Q > 1/2 hints at overfitting. The optimal interval length ∆t opt marks the instance, where the quality factor first reaches Q ≈ 1/2 and the residuals in Eq. (20) follow a normal distribution.
If the time series behind the Q (k) -values are sufficiently long, then the quality factor should remain approximately constant well beyond ∆t opt , before slowly drifting off towards unity.
As an alternative to the above procedure, one could use only parts of the data, e.g., the time series along x and y, to evaluate the GLS estimators. The quantities χ 2 and Q would then serve as tools to measure how well the estimators fit the remaining data. Such use of the above concepts would only require minor alterations to Eqs. (9b) and (21), in particular changing the number of degrees of freedom from M − 2 to M .
B. Comparing short-time predictions to observed dynamics at long times A common concern is that regular diffusion cannot fully account for the dynamics in complex molecular systems, which must then be described using more elaborate models that incorporate memory or trapping effects. To make sure our short-time diffusion coefficient estimate D(∆t n = ∆t opt ) can correctly describe the actual dynamics of our system at long times, we take advantage of the fact that typical MD trajectories are long compared to the time range M ∆t opt used to fit the diffusion coefficient. We therefore compare the statistics of the relative endpoints ∆X (k) = X Here, erf(z) denotes the error function. The discrepancy test above measures the absolute size of the largest difference between the two distribution functions, and the test statistic S can be used to compute a corresponding p-value, i.e., the probability that, if the endpoints had actually been drawn from F (∆X), the resulting KS statistic would have been greater or equal to S. Alternatively, one can vary D in Eq. (23) to find the diffusion coefficient that minimizes Eq. (24) and thus best describes the observed long-time dynamics. This is possible because a 2 (∆t opt ) 2D(∆t opt )N ∆t 1 for N 1, so a 2 can either be neglected or kept constant at a 2 = a 2 (∆t opt ) while D is varied.
In the following, we refer to the use of Eqs. (23) and (24) as a KS test.

A. MD simulation of TIP4P-D water
We tested the applicability of the GLS estimator for the diffusion coefficient by performing a 1 µs MD simulation of water at ambient conditions. We placed 4139 TIP4P-D water molecules 13 in a cubic simulation box with an approximate edge length of 5 nm and periodic boundary conditions. We recorded the coordinates of each atom in the system at intervals of ∆t 1 = 1 ps to capture possible non-diffusive short-time dynamics. The simulation was run at 300 K 17 and 1 bar 18 using Gromacs/2018.6 19 with temperature-and pressure-coupling constants τ T = 1 ps and τ p = 5 ps, respectively. After an initial 100 ps equilibration period in the N V T -ensemble, and a consecutive 5 ns run in the N pT -ensemble, the production run ensued at the same temperature and pressure. We made use of the leap-frog integrator with a 2 fs time step. All atomic bonds to hydrogens were treated with the LINCS constraint algorithm. 20 For data evaluation, we made use of an improved scheme 14 to "unwrap" the three-dimensional center-ofmass trajectories of every water molecule out of the simulation box, which we then split into three independent time series -one for every Cartesian coordinate. These onedimensional time series were used to estimate the MSD along the spatial dimensions d ∈ {x, y, z} via Eq. (1), after which the estimates were summed up to give the total MSD for every molecule k = 1, 2, . . . , 4139 according to Eq. (17). All trajectories displayed a clear non-diffusive initial regime, which became less pronounced as the length of the recording-time interval was increased, followed by a more-or-less linear growth at longer times.
whose uncertainty was determined from the sample standard deviation as follows,  13 It is slightly lower than the experimentally determined value of 2.3 nm 2 ns −1 , measured at 298 K. 21 The main reasons for this discrepancy are, on the one hand, that the TIP4P-D model slightly overestimates the viscosity of water 22 and, on the other hand, that finite-size effects in simulations using periodic boundary conditions further reduce the diffusion coefficient. Although closed-form corrections have been developed both for three-dimensional in-bulk diffusion 23 and for quasi two-dimensional diffusion within membranes, 24 we will not make use of them here, because finite-size effects are systematic and therefore not estimator-specific. The inbulk correction is, however, accounted for in our Python data analysis script. 16

C. Diffusion coefficient estimation and quality factor analysis for TIP4P-D water
Analogous to the previous section, we considered the center-of-mass trajectories of every molecule k along each spatial dimension d, for which we computed the associated MSDs via Eq. (1). These were substituted into Eqs. (10) to obtain the GLS estimates a 2 d,GLS which was evaluated with the help of Eqs. (11). The analytical and numerical estimates of the uncertainties, Eqs (25c) and (26), coincide near perfectly for most interval lengths. A lower bound ∆t opt to the range of viable interval lengths, where the trade-off between estimate quality and uncertainty is balanced, was determined via the quality factor [Eq. (21)], which we evaluated for all molecules present in the system using different recording-time interval lengths ∆t n . We thereby calculated a corresponding the diffusion coefficient estimate and the quality factor did not vary significantly with M .
As discussed in the companion paper, Ref. 14, we originally used the built-in tool trjconv in Gromacs to unwrap the simulation trajectories, which resulted in a highly oscillating diffusion coefficient estimate for short recordingtime intervals, and a slower convergence of the quality factor to Q ≈ 1/2. These irregularities intensified as the edge length of the simulation box was reduced, which made us wary of the unwrapping scheme implemented in trjconv, and prompted us to develop an improved unwrapping scheme appropriate for simulations at constant pressure. 14 Finally, in Fig. 3a, we give a comparison of all the different estimators discussed in this paper, evaluated for our TIP4P-D water trajectories with ∆t n = ∆t opt = 10 ps and, when appropriate, M = 20. At this interval length, we estimated an average noise parameter of a 2 GLS = (8.4 ± 0.5) × 10 −3 nm 2 for TIP4P-D water, which gives a fairly high signal-to-noise ratio (σ 2 /a 2 ≈ 15), where all estimators, aside from the OLS estimators, have essentially the same uncertainty (see Fig. 1). The ratio is still below the [σ 2 /a 2 ] CVE-M >2 -threshold, meaning that the GLS estimators outperform, if only marginally, the other three. Figure 3b demonstrates that the scaled residuals (D

D. MD simulation and data analysis for ubiquitin
We tested our framework in a more biological context by simulating a 2 µs trajectory of a single ubiquitin molecule (PDB identification code: 1ubq 25 ) using the Amber99SB*-ILDN-Q force field [26][27][28][29] in a cubic simulation box with periodic boundary conditions and an approximate edge length of 7.5 nm. The protein was solvated by 13347 TIP4P-D water molecules at a concentration of ∼150 mM NaCl. 30 The equilibration and production runs were otherwise performed in the same manner as described in Sec. IV A.
In contrast to pure TIP4P-D water, we had only a single trajectory for the ubiquitin protein, which we analyzed in two ways: First, we applied the GLS estimators [Eqs. (10)] to MSD values computed for the full trajectory at different interval lengths. Then, we split the trajectory into 150 segments of equal length and treated each segment as an individual trajectory in the data analysis, thus allowing for a similar data analysis as performed on the water simulation data. In both cases, M = 20 MSD values were considered for the fits. The results of both approaches are presented in Fig. 4a, where we compare   Fig. 2b), the lower bound ∆topt was read off the quality factor plot, giving ∆topt = 6 ps. However, unlike for water, the average quality factor (solid blue line) is consistently above 1/2 for ∆tn > ∆topt, which is indicative of the trajectory segments being too short (see Fig. 5). The shaded area represents the uncertainty of our estimate, which was determined from the sample standard deviation.
the single-trajectory diffusion coefficient estimate D GLS to a sample average D GLS over the above-mentioned segments. The two estimates coincide almost perfectly and possible deviations are contained within the uncertainty interval of D GLS .
Yet, unlike for pure TIP4P-D water, the diffusion coefficient estimate for ubiquitin did not converge to a constant value at sufficiently large ∆t n , but instead crept to ever lower values. This might indicate some underlying non-diffusive dynamics on long time scales, but an analysis of the quality factor, depicted in Fig. 4b, showed that it already reached Q = 1/2 at ∆t opt = 6 ps and then remained essentially flat, thus implying a good fit to our diffusion model. At the optimal interval length ∆t opt , the noise parameter for ubiquitin was estimated as a 2 GLS = (1.9 ± 0.6) × 10 −4 nm 2 , corresponding to a signal-to-noise ratio of σ 2 /a 2 ≈ 11. A closer inspection revealed that Q consistently took values slightly greater than 1/2, indicating that we overestimated δD prediction GLS in this regime. This can also be seen in Fig. 4a, where the black dotted lines [Eq. (26)] clearly envelop the green shaded area [Eq. (25c)]. The elevated Q-values beyond ∆t opt are due to the trajectory segments being too short to produce Gaussian-distributed estimates for D (k) GLS . We also observed this effect for synthetic data generated via Eqs. (3), as well as our TIP4P-D water data when we drastically shortened the underlying time series. The effect is demonstrated in Fig. 5, where we plot the average quality factor computed from time series of different lengths. For short time series, the quality factor increases with the interval length both in TIP4P-D water data and in synthetic data for an ideal diffusion process. As shown in the inset of Fig. 5b, this is associated with deviations from Gaussian statistics for small sample sizes. Further-more, in the case of the TIP4P-D water data, the lower bound ∆t opt also seems affected by the shortening of the time series. This casts doubt on our initial decision to set ∆t opt = 6 ps for ubiquitin, because the optimal value is probably a few picoseconds higher.

E. Verifying short-time predictions on long time scales
Another indication that ∆t opt might be too low for ubiquitin is given by the fact that the uncertainty bounds at ∆t opt = 6 ps from the single-trajectory analysis do not account for all diffusion coefficient estimates obtained at longer recording-time intervals. To test whether this discrepancy is due to insufficient statistics or because our model is not entirely adequate to describe the translational dynamics of ubiquitin, we turned to the KS statistic [Eq. (24)], which we evaluated for our water data and the segmented ubiquitin trajectory, respectively. Figure 6a presents the KS test results for TIP4P-D water, which, unsurprisingly, confirm that the system dynamics remains diffusive on long time scales and is well characterized by the optimal diffusion coefficient D opt ≡ D(∆t opt ) = (2.086 ± 0.010) nm 2 ns −1 . For ubiquitin, we observed a significant discrepancy between the global minimum of S √ N s and D opt , as depicted in Fig. 6b. Yet, due to the large p-value associated with D opt , we could not reject the null hypothesis that the long-time dynamics of ubiquitin is diffusive with D ≈ D opt = (0.060 ± 0.002) nm 2 ns −1 . According to the KS test, the value of D opt estimated from the 150 segments, each of length 13.3 ns, is consistent with trajectory data at longer recording-time intervals of 28.5 ns and 66.7 ns, with p-values exceeding 0.5 (see Fig. 6b).

V. CONCLUSIONS
We have proposed a robust framework to extract reliable self-diffusion coefficients and their uncertainties from molecular dynamics simulation trajectories. We fit a diffusion process to the observed MSD curves using GLS estimators [Eqs. (10)], which account for strong correlations in the errors of MSD values at different time lags, and generally outperform other estimators at high and low signal-to-noise ratios. We allow for possible non-diffusive dynamics at short times by including an instantaneous Gaussian spread in the diffusion process. By sub-sampling the time series, we identify remaining deviations from regular diffusion at intermediate times. We extend the recording-time interval until the sub-sampled trajectories become statistically consistent with our diffusion process. The quality factor Q [Eq. (21)] used to quantify the fit consistency arises naturally in the context of GLS fitting procedures and provides a measure on whether the data are overfitted or underfitted. In this way, we obtain an estimate of the translational diffusion coefficient that optimally trades off possible systematic errors from non- diffusive dynamics at short times and statistical errors from increasing uncertainties in the MSD values at longer times.
Our framework can be readily applied to either a single trajectory or a set of trajectories with the help of our Python data analysis script. 16 In the former case, the full trajectory must be split into multiple segments of equal length to perform the quality factor analysis. As a proof of principle and to demonstrate its use in practice, we applied our framework to molecular dynamics simulation data. We estimated the translational diffusion coefficient of TIP4P-D water from an ensemble of 4139 water molecules and compared our results to literature values. We then went on to a system with much sparser statistics, namely a single ubiquitin molecule solvated in TIP4P-D water, where we performed a single-trajectory analysis on the full ubiquitin trajectory and compared the results to statistics obtained by analyzing multiple segments of said trajectory. We found that both approaches give almost identical diffusion coefficient estimates, albeit with very different uncertainties. An inspection of the quality factor for the segments revealed that our predictions for the uncertainty were adequate, but systematically a bit too high due to the segments being too short. With the help of a Kolmogorov-Smirnov test, we confirmed that our optimal diffusion coefficient estimates correctly predict the long-time dynamics observed in our trajectories. In this way, we effectively ruled out possible anomalous diffusion of the protein ubiquitin on intermediate and long time scales. Finally, our framework allowed us to identify systematic errors caused by well-established software packages for unwrapping particle trajectories from simulations at constant pressure, as detailed in the companion paper, Ref. 14.
All in all, our framework supplies the practitioner with an optimal diffusion coefficient estimate and an associated uncertainty with high precision. It furthermore provides evidence on how to optimize the quality of the fit by subsampling, thereby trading off systematic and statistical uncertainties, and whether the predicted overall uncertainty is of appropriate size or not.

DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.
(iM − α)(jM − α) (M β − α 2 ) 2 Σ i,j (a 2 OLS , σ 2 OLS ) . Including more data points therefore worsens the OLS estimate of the diffusion coefficient. By contrast, at low signal-to-noise ratios σ 2 /a 2 other terms dominate the covariance, thus resulting in an uncertainty comparable to the GLS estimators for M N (see Fig. 1).

Appendix B: MSD covariance matrix
While the mean of MSD i is trivially given by the calculation of its second moment, with the shorthand notation δY n,i = Y n+i − Y n , is more involved. It essentially reduces to a sum of moments of the form A i B j C k D l , where A, B, C, D ∈ {Z, X, R, S}, which evaluate to for normally distributed processes. 31 Because of it becomes apparent that the only non-zero terms of (X n+i − X n ) 2 (X m+j − X m ) 2 , when expanded with respect to Z and S, are of even order, i.e., (B2) which we will come back to later, and use linear combinations of Z i Z j S k S l = σ 2 min(i, j)δ k,l , S i S j S k S l = R i R j R k R l = δ i,j δ k,l + δ i,l δ j,k + δ i,k δ j,l to calculate the remaining terms, giving δZ 2 n,i δS 2 m,j = 2iσ 2 , δZ n,i δS n,i δZ m,j δS m,j = σ 2 (δ m,n − δ m,n+i −δ m+j,n + δ m+j,n+i ) ×[min(m, n) − min(m, n + i) − min(m + j, n) + min(m + j, n + i)] , δS 2 n,i δZ 2 m,j = 2jσ 2 , δS 2 n,i δS 2 m,j = 2[2 + (1 + 2δ i,j )δ m,n + δ m,n+i +δ m+j,n + δ m+j,n+i ] .
Substituting all of these into our expression for the covariance matrix results in Eq. (8a).
For the special case of a ≡ 0, our covariance matrix reduces to Eq. (B2), where the ensemble average can be rewritten as follows, follows for arbitrary b i . For large signal-to-noise ratios, we thus have which can be used to derive Eqs. (12), because