Model selection of chaotic systems from data with hidden variables using sparse data assimilation

Many natural systems exhibit chaotic behaviour, including the weather, hydrology, neural dynamics and population dynamics. Although many chaotic systems can be described by relatively simple dynamical equations, characterizing these systems can be challenging, due to sensitivity to initial conditions and difficulties in differentiating chaotic behavior from noise. Ideally, one wishes to find a parsimonious set of equations that describes a dynamical system. However, model selection is more challenging when only a subset of the variables are experimentally accessible. Manifold learning methods using time-delay embeddings can successfully reconstruct the underlying structure of the system from data with hidden variables, but not the equations. Recent work in sparse-optimization-based model selection has enabled model discovery given a library of possible terms, but regression-based methods require measurements of all state variables. We present a method combining variational annealing -- a technique previously used for parameter estimation in chaotic systems with hidden variables -- with sparse optimization methods to perform model identification for chaotic systems with unmeasured variables. We applied the method to experimental data from an electrical circuit with Lorenz-system-like behavior and successfully recovered the circuit equations with two measured and one hidden variable. We discuss the robustness of our method to varying noise and manifold sampling using ground-truth time-series simulated from the classic Lorenz system.


Introduction
Hypothesis generation through data-driven model identification has the potential to revolutionise science. Uncovering the interactions, structure, and mechanisms that determine the behaviour of chaotic systems in particular could improve scientific understanding in almost every discipline with dynamical systems [30] including climate science [60], hydrology [59], population dynamics [32], and neuroscience [53]. Many chaotic systems can be informatively described by relatively simple dynamical equations. However, characterization and control of these systems can be challenging [11], due to sensitivity to initial conditions and difficulties in differentiating chaotic behavior from noise [64]. Characterization through statistical, geometric, or model-based means becomes more challenging when only a subset of the variables are experimentally accessible. Our goal is to identify a parsimonious set of equations to describe a chaotic system from measurements with hidden variables.
Much data analysis for chaotic systems has focused on learning the attracting manifold structure from time-series. In the early 1980s, Takens's theorem [65] described the conditions under which one can use the time-delay embedding of a single variable to construct a manifold that preserves the topological properties of the full system. Takens's result formalized the idea that the information about the manifold structure, and therefore the chaotic dynamics, can be recovered from the time-history of a single state variable. Manifold reconstruction methods [50,28,41] based on partial information provide insight into the structure, dimensionality, and statistics of chaotic systems. By constructing manifolds from time-delays, Sugihara et al. developed methods for discriminating chaos from noise [64] and detecting causality between measured variables [63].
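The delay-embedding construction underlying these methods is straightforward to sketch. The helper below is an illustrative implementation; the function name and the toy signal are ours, not from the cited works:

```python
import numpy as np

def delay_embed(x, dim, tau):
    """Time-delay embedding of a scalar time-series.

    Row i is the delay vector (x[i], x[i+tau], ..., x[i+(dim-1)*tau]).
    By Takens's theorem, for suitable dim and tau this reconstructs a
    manifold that preserves the topology of the full system's attractor.
    """
    n = len(x) - (dim - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(dim)])

# Embed a scalar signal in 3 dimensions with a delay of 5 samples.
t = np.arange(0.0, 50.0, 0.01)
x = np.sin(t) + 0.5 * np.sin(2.3 * t)   # stand-in for one measured variable
M = delay_embed(x, dim=3, tau=5)
```

Each row of `M` is a point on the reconstructed manifold; delay-coordinate methods such as convergent cross mapping operate on nearest neighbours in this space.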
Methods including reservoir computing [67,27], other deep learning frameworks [72], data assimilation combined with neural networks [15], support vector machine [48], and nearest neighbours [5] can accurately predict the dynamics of chaotic systems using a data-trained model with no specific physical knowledge of the system. For a review of predictive methods see [6]. Assuming a reasonable model structure is known, data-assimilation methods [8,9] including variational annealing [70] can estimate model parameters for chaotic systems from incomplete, indirect, and noisy measurements. Although these methods are designed to assimilate information from data-streams with hidden variables and learn about chaotic systems, they are not designed for the purpose of hypothesizing parsimonious models or identifying model structure.
Data-driven discovery of parsimonious dynamical systems models to describe chaotic systems is by no means new. Early on, least-squares fitting of combinatorial sets of polynomial basis functions to time-series data, followed by information-theory-based selection, produced models that reproduced the manifold structure and statistics of the system [21]. Symbolic regression demonstrated successful recovery of the widely accepted equations for the chaotic double-pendulum system [58]. More recently, sparse regression [68,17,34] motivated model selection techniques such as SINDy [16], which recover the ground-truth equations for chaotic systems from a relatively large library of functions without needing a computationally intensive combinatorial search. Other sparsity-promoting frameworks have improved the robustness of equation recovery for chaotic systems through integral formulations [57,51,47], data assimilation methods [12], Bayesian frameworks [13], and entropic regression [4]. However, all these methods require measurements of all state variables that significantly impact the desired dynamics. Notably, Champion et al. recently used an autoencoder framework for automatic discovery of system coordinates and equations, but required input time-series of higher dimension than the intrinsic dimension of the system [18].
Model selection with hidden variables requires different methodology. By 'hidden variables' we mean that the number of measured variables is smaller than the intrinsic dimension of the system. Measured variables are not considered hidden if they are corrupted by noise or indirectly sampled through a measurement function. A few methods address the problem of model selection with hidden variables, but they have not been demonstrated for chaotic systems. For example, Daniels et al. [22,23] combinatorially fit each model in a predefined model space using data assimilation and subsequently use Bayesian inference to select the best model. Successful recovery of mass-action kinetic systems for chemical reactions with hidden variables was demonstrated using a neural network approach [37]. A recent method uses LASSO to select predictive models for chaotic systems from a library with higher-order derivatives given a single state variable [61]. This method effectively finds a higher-order ODE representation of the Lorenz and Rössler systems, but it is unclear how the recovered structures relate to the ground-truth models.
In this paper we present a new method to perform model selection in dynamical systems with hidden variables. This method combines the data assimilation technique variational annealing, which has been used to estimate parameters when the structure of the system is known, with sparse model selection via hard thresholding. We call this method Data Assimilation for Hidden, Sparse Inference (DAHSI). To demonstrate that our method could identify interpretable models for chaotic systems, we followed the philosophy of earlier works [58,16] and demonstrated recovery of accepted parsimonious models from experimental data and simulated time-series where the ground truth is known. In the Results section DAHSI successfully selected a set of models for a circuit that has Lorenz-like behaviour from experimental data of two state variables (one hidden). One of the identified models has the same structure as the Lorenz system. The other identified models with high AIC/BIC support exhibit nearly indistinguishable dynamics and suggest novel terms which may better represent the experimental circuit system. Moreover, we used ground truth simulations of the canonical Lorenz system to study how our method performs with varying data size and noise. In the Materials and Methods section we describe the DAHSI algorithm for model selection with hidden variables.
Results: Model selection for chaotic systems

Identification of models for the Lorenz circuit from experimental data

The Lorenz system [42] was originally developed to forecast the weather and has become a canonical example when developing new methods to characterize chaotic systems. To demonstrate model selection on experimental data with hidden variables, we considered high-quality data from the electrical circuit in Blakely et al. [10] (Fig. 1(a)). This system exhibits similar structure and behavior to the highly studied Lorenz system and is well described by relatively simple circuit equations, (1)-(3). The structure of this system is similar to the Lorenz system, but in the standard Lorenz formulation ε = η = γ. Here, X = (x, y, z) denotes the voltages across the capacitors C_1, C_2 and C_3 in the circuit (Fig. 1(a)). The measured variables are x and z, and y is unmeasured or hidden. We denote the noisy measurements of x and z by x_e and z_e, respectively, and the measurement function h((x, z)) = (x_e, z_e) = Y. The experimental sampling rate is ∆t_e = 80 ns, resulting in 55,000 time points. A low-pass filter was applied to remove high-frequency measurement error [10]. We re-scaled the experimental time by ∆t = ∆t_e/(1.6 × 10^{-5} s) = 0.005 so that the right-hand-side terms of (1)-(3) are around O(1). We trained our method with N = 501 time points (Fig. 1(a)), at a sampling rate 2∆t = 0.01 (re-scaled). The attractor is reasonably well sampled with 501 points (SI Appendix, Fig. 2), and we retain the remaining data for validation.
We demonstrated model identification with hidden variables of the Lorenz-like system ((1)-(3)) using DAHSI (Fig. 1). First, we constructed a model library based on domain-knowledge.
In this case we used monomials up to degree two in three variables, representing approximately 10^9 possible models composed of subsets of possible terms. From this library we generated a generic governing equation for each variable via the linear combination of all the candidate functions (Fig. 1(b1)).
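The library size quoted above can be verified directly. This short count enumerates the monomials up to degree two in three variables (the symbolic term names are illustrative):

```python
from itertools import combinations_with_replacement

# Monomials up to degree two in the state variables (x, y, z): the constant,
# 3 linear terms, and 6 quadratic terms -- 10 candidate functions per equation.
variables = ["x", "y", "z"]
library = ["1"]
for deg in (1, 2):
    library += ["*".join(c) for c in combinations_with_replacement(variables, deg)]

n_params = 3 * len(library)   # one coefficient per term per equation
n_models = 2 ** n_params      # each coefficient either active or zero
print(len(library), n_params, n_models)  # -> 10 30 1073741824
```

With 30 coefficients, the number of distinct sparsity patterns is 2^30 ≈ 10^9, matching the count in the text.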
Our goal was to find a small subset of candidate functions within the library which describe the dynamics in the data. We did not assume that we knew the "correct" model complexity a priori, and searched for the set of models which balance error and simplicity.
To perform model selection, we minimised a cost function composed of the measurement error, A_E, the model error, A_M, and a sparse penalty, λ‖p‖_1, as a function of the parameters p and the library of functions Θ (Fig. 1(b), and Materials and Methods section). The model error contains the coupling between variables, taking advantage of the information about hidden variables in the time-history of the measured variables. The measurement error depends only on the measurements and the measured variables estimated from the model. Model selection is enabled through the sparse penalty, which determines which parameters, p_{k,j}, will be active in the model and which will be zero.
To minimize the cost function, we used variational annealing (VA) [70], a data-assimilation technique for non-convex parameter estimation in nonlinear, chaotic systems. The problem is highly non-convex, with many local minima, due to the incoherence between the data and the model [52,1]. Decreasing the information or measurements [38] and increasing the number of terms in the library both increase the number of local minima (SI Appendix, Fig. 1). VA works by varying R_f, which sets the balance between model error and measurement error (Fig. 1(b2)). When R_f = 0, only the measurement error contributes, leading to a convex cost function with an easy-to-find global minimum. As the model is enforced by gradually increasing R_f, the landscape increases in complexity and many local minima appear. By initialising the search near the minimum for the previous R_f, the solution remains near the first minimum found. Varying λ leads to different model structures or candidate models. As the penalty strength λ increases, the global minimum moves to zero in a larger number of parameters (Fig. 1(b2)). Because there are many local minima, we chose N_I = 500 random initial guesses to fully explore the landscape. The sparse variational-annealing process generated 169 candidate models, which must be further down-selected and validated to complete the model-selection process. We down-selected to the 25 models (SI Appendix) with a cost-function value less than 10^{-3} (Fig. 1(c)). In our system there is a clear gap in cost-function value at this threshold, but the criterion and gap size will be system dependent. To ensure we had the best parameter fit for each down-selected model, we performed parameter estimation via VA without the sparsity constraint.
To validate the models, we needed to estimate an initial condition for the hidden variable y, for which there is no experimental data. We used an 8th-order finite-difference approximation of the time derivative of x for each model structure and solved the resulting algebraic equation for y_0 (SI Appendix). We used the dynamic equation for x since all down-selected models contain y but no higher-order y terms. Estimation of the initial condition for hidden variables is only possible after the candidate models are found, and must be done for the initial condition of each segment of validation data. This procedure takes advantage of Takens's theorem: the information in y is available in the time-delay of x.
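A minimal sketch of this step, assuming the x-equation has the Lorenz-like form ẋ = σ(y − x) so that y can be isolated (other down-selected structures would be solved for y analogously; the function name and window handling are ours):

```python
import numpy as np

# 8th-order central finite-difference weights for the first derivative.
FD8 = np.array([1/280, -4/105, 1/5, -4/5, 0.0, 4/5, -1/5, 4/105, -1/280])

def estimate_hidden_y0(x, dt, sigma):
    """Estimate the hidden y at the centre of a 9-sample window of x,
    assuming xdot = sigma*(y - x), i.e. y = xdot/sigma + x."""
    xdot = FD8 @ x[:9] / dt      # 8th-order approximation of dx/dt
    return xdot / sigma + x[4]   # invert the assumed x-equation

# Illustration on an analytic signal where the relation is known exactly.
dt = 0.001
xs = np.sin(np.arange(9) * dt)
y0 = estimate_hidden_y0(xs, dt, sigma=10.0)
```

For x(t) = sin(t) and σ = 10 the recovered value should match cos(t)/10 + sin(t) at the window centre to high accuracy, since the finite-difference error is O(∆t^8).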
Validation within the Lyapunov time ensures that the time-series do not diverge due to the inherent sensitivity to differences in initial conditions introduced by measurement and numerical error. All down-selected models have a similar Lyapunov time around 0.9 time units. We considered S = 1083 segments of the experimental data (excluding the training set), each of length 1/4 of a Lyapunov time to calculate the sum of the average error for each model ( Fig. 1(d)).
We discarded the first four points of each time segment as these points were used to predict the initial condition for y. The average error for the s-th time segment of the m-th model is defined as the average absolute deviation of the model's measured components from the data, E^s_{av,m} = (1/N_s) Σ_i (|x_i − x_{i,e}| + |z_i − z_{i,e}|), where x_i and z_i are the model predictions, x_{i,e} and z_{i,e} are the x and z components of the experimental data, respectively, and i is the time index. The sum of all average errors over the S time segments of the m-th model is E_{av,m} = Σ_{s=1}^{S} E^s_{av,m}. The candidate models on the Pareto front (Fig. 1(d), and SI Appendix, Table S1) best balance model complexity and error (Fig. 1(e)). We successfully recovered the Lorenz-like structure derived by Blakely et al. [10], which has the lowest average error of recovered models with 7 active terms.
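The validation metric can be sketched as follows, taking the per-segment error to be the mean absolute deviation of the measured components (consistent with the average absolute error defined in the SI Appendix); helper names and toy data are illustrative:

```python
import numpy as np

def segment_error(x_pred, z_pred, x_e, z_e):
    """Average error of one validation segment: the mean absolute
    deviation of the model's measured components from the data."""
    return np.mean(np.abs(x_pred - x_e) + np.abs(z_pred - z_e))

def total_error(segments):
    """Sum of the per-segment errors over all S validation segments."""
    return sum(segment_error(*s) for s in segments)

# Toy segments standing in for quarter-Lyapunov-time validation windows.
seg = (np.array([1.0, 2.0]), np.array([0.0, 1.0]),   # model x, z
       np.array([1.5, 2.0]), np.array([0.0, 0.0]))   # measured x_e, z_e
E = total_error([seg, seg])
```

Summing per-segment errors rather than fitting one long trajectory keeps each comparison within a fraction of the Lyapunov time, where prediction is meaningful.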
For λ = 3.9 the system presented in [10] is selected for 10.6% of the N_I = 500 randomly chosen initialisations. However, we have no guarantee that this model is the "true model" for the circuit system. All models have a similar manifold structure (Fig. 1(e)) and low error within a Lyapunov time (Fig. 1(f)). We believe the main limitation on our prediction window is the uncertainty introduced by the hidden variable into the parameter estimation during VA. This uncertainty then propagates into the y_0 estimate required for each validation data set and magnifies noise (SI Appendix, Figs. S4 and S5). Given the difficulty in selecting between proposed chaotic models that exhibit such similar behaviour [2,3], the primary goal of DAHSI as a model identification method is to generate possible models. However, we were able to consistently identify a unique model (salmon with 11 terms, Fig. 1(f)) with the most support using the Akaike information criterion, as done in [45], and the Bayesian information criterion (SI Appendix, Fig. 6), as well as identifying a weakly supported model (gold with 10 terms, Fig. 1(f)). By generating multiple models that lie near the Pareto front, DAHSI has effectively generated hypotheses for additional terms, which could be tested with further experimentation.
While DAHSI identified the same equation terms as Lorenz and the circuit formulation from [10], the parameters fit through the final step of VA are not the same. We compare the ability to predict the experimental data of the classical Lorenz system, the circuit formulation from [10], and the DAHSI-recovered models, each of which has a different number of free parameters (Table 1). We perform parameter estimation via VA for each model and use the validation data set described above to calculate E_av, ∆AIC, and ∆BIC. Although the average error is similar for the VA-estimated circuit formulation and all DAHSI models, the DAHSI-recovered models with 10 and 11 terms have substantially more ∆AIC and ∆BIC support. The classical Lorenz parameter structure, which has only 4 free parameters, is unable to capture the dynamics of the system. The parameters estimated via VA for the circuit model with 6 free parameters perform much better than those estimated from first principles [10]. Notably, the parameters estimated for the 7-term DAHSI model are very close to the parameters estimated for the original circuit model. Further experimentation is needed to determine if the coefficients in the ẋ equation should be equal, p_{1,2} = p_{1,3}, and if the coefficient on the y term in the ẏ equation, p_{2,3}, should be positive, negative, or zero (Fig. 1(e)). The additional terms suggested by the 10- and 11-term DAHSI-recovered models are strongly supported by the AIC/BIC calculations, but would require further experimentation to conclusively validate. They may represent parasitic resistances or other physical effects which have a small but real impact on the circuit dynamics and were neglected during the original derivation by Blakely et al. [10]. Recovery of the Lorenz-like model and identification of other models with AIC/BIC support demonstrates that DAHSI can successfully identify parsimonious models for chaotic systems.
Table 1: Parameter estimation for the classical Lorenz formulation (4 free parameters, p_{1,2} = p_{1,3}, p_{2,3} = p_{2,7} = −p_{3,6}); the circuit formulation in [10] (6 free parameters, p_{1,2} = p_{1,3}); and the DAHSI-recovered models.

Robustness study on the simulated Lorenz system
To study the robustness of our method to varying noise and manifold sampling, we used ground-truth time series simulated from the classic Lorenz system with σ = 10, ρ = 28, and β = 8/3. We numerically simulated the system using a fourth-order Runge-Kutta scheme with a time step of ∆t = 0.01 and N = 501 points, producing time-series similar to the experimental data set. As in the experimental data set, we considered y to be the hidden variable. We studied the recovery rates of DAHSI as a function of the VA tuning parameter, α, and found trends similar to previous work [55] (SI Appendix, Table S3).
First, we studied the robustness of our method to measurement error, modeled as additive Gaussian noise with mean zero and varying standard deviation ω; the measurement function is then h(X) = X + N(0, ω). We expect that different noise instances, controlled by the random-number-generator seed, will change our recovery rate due to random corruption of essential parts of the data or overall poor manifold sampling.
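The synthetic data generation described here can be reproduced in a few lines; the initial condition and seed below are illustrative choices rather than the paper's exact values:

```python
import numpy as np

def lorenz(X, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x, y, z = X
    return np.array([sigma * (y - x), x * (rho - z) - y, x * y - beta * z])

def rk4(f, X0, dt, n):
    """Fourth-order Runge-Kutta integration of dX/dt = f(X)."""
    X = np.empty((n, 3))
    X[0] = X0
    for i in range(n - 1):
        k1 = f(X[i]); k2 = f(X[i] + 0.5 * dt * k1)
        k3 = f(X[i] + 0.5 * dt * k2); k4 = f(X[i] + dt * k3)
        X[i + 1] = X[i] + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)
    return X

rng = np.random.default_rng(0)       # noise seed, varied in the study
X = rk4(lorenz, np.array([1.0, 1.0, 1.0]), dt=0.01, n=501)
omega = 0.01                         # noise standard deviation
Y = np.delete(X, 1, axis=1) + rng.normal(0.0, omega, size=(501, 2))  # x, z only
```

Dropping the middle column of `X` makes y the hidden variable, so `Y` plays the role of the noisy two-component measurement.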
We calculated recovery for 3 different standard deviations of noise with 20 noise seeds each and computed the cumulative distribution function of the recovery rate (Fig. 2(a)). The random noise seeds produced wide variation in recovery rate, between 10-90% for the lowest noise, indicating that the minimal data set used here is not very robust. As the noise strength increased, the cumulative distributions shifted left as more seeds had lower recovery. Setting ω = 0.01 produced a bimodal distribution, with either a high recovery rate (> 80%) (the majority of simulations) or a low recovery rate (< 15%). For ω = 0.05 there were some seeds with intermediate recovery rates, more with low recovery rates, and a few seeds with a very high recovery rate. The noise level dramatically affected the recovery rate for ω = 0.1: the vast majority of simulations led to less than 10% recovery, more than half had 0% recovery, and only one had higher than 85% recovery.
Next, we investigated how manifold sampling affected the recovery rate of our system. We chose 3 different noise seeds and varied the number of time points N by increasing the length of the time-series (Fig. 2(b)). Varying the length of the time-series changed the sampling of the manifold, demonstrating that sampling lobe transitions is crucial for accurate model recovery.
For one seed (light blue line) the recovery was high from N = 501 down to N = 401. There were sharp drops in recovery of ≈ 60% and ≈ 15% when the data set lost a lobe crossing in the attractor, as happens at N = 351 and N = 301, respectively. Sharp drops for another seed (dark blue line) also occurred when the sampling of the crossing between lobes was reduced, at N = 460 and 301.
Decreased sampling of each lobe did not appear to have as dramatic an effect (N = 401 to 460).
The increase in recovery rate for the dark-blue noise instance at N = 351 suggests that optimal sampling requires some nontrivial balance of different dynamic regions. The specific corruption of each noise instance had a large impact on how many crossings were needed for high recovery; the recovery was consistently high for one seed (cyan). These results suggest that optimal manifold sampling to counter noise corruption would vastly improve DAHSI performance on data sets with high noise.

Discussion
In this paper we have presented DAHSI, a method to identify non-linear dynamical systems from data with hidden variables. DAHSI combines variational annealing, a data assimilation technique, with sparse thresholding. We applied DAHSI to an experimental data set from a circuit that exhibits Lorenz-like dynamics [10]. The outcome is a set of candidate models, including a model with the same Lorenz-like structure derived by Blakely et al. from circuit equations [10]. Two additional parsimonious models with strong support from AIC/BIC-based validation were also identified. The unanticipated terms suggested by these models may represent real physical processes in the circuit, such as parasitic resistances or other factors not included in the idealized model derivation. Through this example, we demonstrated that DAHSI works as an effective tool for generating models and functional hypotheses from data.
To analyze recovery and the effects of noise and manifold sampling in a system where we know the ground truth, we studied the performance of DAHSI applied to simulated time-series from the classical Lorenz system. Notably, we successfully selected the ground-truth model as most likely among those generated by DAHSI using information-criteria-based validation techniques (SI Appendix, Fig. 3). Our noise studies showed recovery rates of 80% for noise ∼ N(0, 0.01) and 10% or lower for ∼ N(0, 0.1). Therefore we anticipate that the current formulation of DAHSI will have reasonable recovery rates for noise levels < 10% of the signal value. Further robustness to noise could be achieved through integral formulations similar to those used for sparse regression, rather than the discretized mapping between time-points used here [57,51,47]. Manifold sampling impacts recovery, and we conclude that recovery is especially sensitive to sampling at the saddle-point transition between the lobes. Moreover, the noise seed used to generate the synthetic data impacts recovery, and we suspect this is due to random corruption of measurements from different regions of the manifold. For chaotic systems, increasing the duration of the experiment will eventually ensure robust sampling of the manifold. However, the computational time of DAHSI scales with the length of the input time-series [31]. Therefore, we anticipate that short bursts of time-series designed to sample the manifold well would balance sampling quality and computational efficiency. Further metrics for analyzing the information content of our data and minimal data requirements for recovering models [35] would lead to optimal manifold sampling.
One of the main benefits of a sparse model selection framework is that we identify likely model structures while avoiding combinatorial testing and validation of all potential models. For example, the number of possible models describing three variables with monomials up to degree two is approximately 10^9. Performing parameter estimation on each of these models and validating them would be computationally intractable, taking at least 10^5 processor-days with our setup. The computational time of DAHSI, however, does not scale monotonically with library size. Instead, we find that a library with 10 terms can take 100 times longer to run than the library of 30 monomials. We suspect that the variation in optimization time depends on correlations between library functions [43], model symmetries, and other structural features.
In addition to the chaotic systems presented in the results, we have applied DAHSI to two non-chaotic systems: time-series data from a Lotka-Volterra-like system with no hidden variables and simulated time-series for a mass-action kinetics system with hidden variables. Although DAHSI recovered reasonable models for both systems, there are several caveats. Recovery of Lotka-Volterra required an iterative formulation (SI Appendix, Fig. 15). We also compared DAHSI to SINDy [16] for the Lotka-Volterra system and found that SINDy was far superior in speed when all variables are observed. Recall that a comparison between DAHSI and SINDy is not possible for the Lorenz-like circuit system, as SINDy requires access to the unmeasured y variable. The mass-action kinetic system modeled a semiconductor with two trap levels differing by one electronic unit of charge (SI Appendix). The recovery rate for the ground-truth model was low, around 3%.
Unlike chaotic systems, which are highly non-convex, the mass-action kinetic system has a very flat cost function due to structural parameter identifiability issues (SI Appendix, Figs. S11-S14) [7,29,46,25]. Gradient-based algorithms such as IPOPT are known to perform poorly on flat cost functions, so switching to an optimiser designed for such systems [40] may improve recovery. Other data-assimilation methods for parameter estimation with hidden variables, such as 3D-Var, 4D-Var, Kalman filtering, and hybrid methods [8], may be more cost-effective when VA is unnecessary to navigate to the global minimum of a highly non-convex function.
The formulation of the cost function and sparsity constraint also likely impacts recovery. Different methods for sparse model selection include stepwise and all-subsets regression, ridge regression [36], LASSO [68], least angle regression [24], and SR3 [73]. SR3 accelerates convergence and has been shown to outperform the other methods, but introduces an extra tuning parameter. The parameter paths of the first four methods are shown to differ in [34]; therefore, we expect that different regularisation methods will lead to different model identification.
Comparison between different sparsity-enforcement mechanisms within DAHSI framework could improve recovery but may be somewhat system dependent.
We anticipate many future applications and extensions of DAHSI. The framework for DAHSI does not have any intrinsic restrictions about the functional form of the equations, in particular the function library need not be linear in the unknown parameters. Variational annealing is designed to handle stochasticity through the model error. In addition, data assimilation is commonly used for PDE systems, including PDE discovery [19]. Therefore, we anticipate we can apply or extend our framework to broader applications, without reformulation as was needed in sparse-regression based frameworks for rational functions [44], stochastic systems [14], and PDEs [56,39]. Modifications to the optimization methodology and further investigation of optimal data-sampling strategies could improve the computational efficiency of DAHSI, opening up higher dimensional problems to model selection with hidden variables.

Methods: Mathematical formulation of cost function and algorithm
The dynamics of many physical systems can be described by models with only a few terms. Our goal is to retrieve the sparse representation of such systems given measurements of some, but not all, of the state variables. We consider a dynamical system with unknown governing equations, dX/dt = F(X(t); p), where X = (x_1, x_2, . . . , x_D) ∈ R^D are the state variables, F = (F_1, F_2, . . . , F_D) are the unknown functions that govern the dynamics of the system, and p is a set of unknown parameters. The function capturing the nonlinear dynamics of each state variable, F_k, is assumed to be sparse in function space, as has been done previously [33,16]. Given a library of possible functions Θ = (θ_1, θ_2, . . . , θ_q), we can write a candidate function F̂_k as F̂_k(X; p) = Σ_{j=1}^{q} p_{k,j} θ_j(X), for k = 1, 2, . . . , D. There is no inherent restriction that the functions be linearly additive. The set of p_{k,j} defines the vector p ∈ R^P, where P = Dq is the total number of unknown parameters.
We want to estimate the unknown parameters p_{k,j} and all state variables X using only the measurements Y, with the constraint that p is sparse. This is equivalent to minimising the negative log likelihood

A(X, p) = A_E + R_f A_M + λ‖p‖_1,   (9)

where A_E = Σ_i ‖h(X(t_i)) − Y(t_i)‖^2 is the measurement error and A_M = Σ_i ‖X(t_{i+1}) − f(X(t_i), p, F̂)‖^2 is the model error. Here, f(X(t_i), p, F̂) = X(t_{i+1}) defines the discrete-time model dynamics and is obtained by discretising the continuous-time model using a Hermite-Simpson collocation. We note that if λ = 0 in (9) we obtain the cost function used in VA. Following the statistical derivation in [66,26,1], the measurement error assumes Gaussian noise and the model error assumes a relaxed delta function. We assume that the state at time t_{i+1} depends only on the state at t_i, and that each element of p follows a Laplace distribution with mean 0 (SI Appendix). The details and necessary background to minimise (9) are presented in the following sections.
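The structure of this cost can be sketched in code. For brevity the model map below advances the state with a single explicit Euler step rather than a Hermite-Simpson collocation, and the helper names and the choice of measured components (x and z) are ours:

```python
import numpy as np

def action(X, p, Y, f, Rf, lam, measured=(0, 2)):
    """Discrete cost A_E + Rf*A_M + lam*||p||_1.

    X : (N, D) estimated state trajectory; Y : (N, L) measurements of the
    `measured` components; f(X_i, p) maps the state at t_i to t_{i+1}.
    """
    idx = list(measured)
    A_E = np.mean((X[:, idx] - Y) ** 2)              # measurement error
    X_pred = np.array([f(Xi, p) for Xi in X[:-1]])   # f(X(t_i), p)
    A_M = np.mean((X[1:] - X_pred) ** 2)             # model error
    return A_E + Rf * A_M + lam * np.abs(p).sum()

# Trivial illustration: constant states and an identity model give zero
# measurement and model error, leaving only the sparsity penalty.
X = np.ones((5, 3)); Y = np.ones((5, 2))
p = np.array([0.5, -0.5])
A = action(X, p, Y, lambda Xi, p: Xi, Rf=2.0, lam=0.1)
```

Varying `Rf` in this function reproduces the trade-off that variational annealing sweeps through: small `Rf` weights the data, large `Rf` enforces the model.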

DAHSI: Data Assimilation for Hidden Sparse Inference
Our algorithm, Data Assimilation for Hidden Sparse Inference (DAHSI), performs model identification for chaotic systems from data with hidden variables. It combines the data assimilation technique VA with sparse thresholding (Fig. 3(a)). The code base for DAHSI can be found at [54].
As the desired model complexity is unknown ahead of time, DAHSI sweeps through different hard-threshold values, λ. For each λ, the cost function (9) is minimized by iterating between VA [70,71] and hard thresholding of the parameters. We chose the iterative framework over direct incorporation of the ℓ_1 penalty into the minimized cost function, based on results showing that least squares with thresholding converges locally, often outperforming convex variants [73,20], and recent demonstrations that LASSO makes mistakes early in the recovery pathway [62].
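In the fully measured setting, the analogue of this VA-plus-thresholding iteration is sequential thresholded least squares. The sketch below (ours, in the spirit of [73,20]) illustrates why zeroing small coefficients and refitting the survivors recovers an exactly sparse model on noiseless data:

```python
import numpy as np

def threshold_fit(Theta, dXdt, lam, n_iter=10):
    """Sequential least squares with hard thresholding: coefficients with
    magnitude below lam are zeroed and the survivors are refit."""
    p = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(p) < lam
        p[small] = 0.0
        for k in range(p.shape[1]):
            keep = ~small[:, k]
            if keep.any():
                p[keep, k] = np.linalg.lstsq(Theta[:, keep], dXdt[:, k],
                                             rcond=None)[0]
    return p

# Recover a sparse coefficient vector from noiseless synthetic data.
rng = np.random.default_rng(1)
Theta = rng.normal(size=(100, 5))
p_true = np.array([[2.0], [0.0], [0.0], [-3.0], [0.0]])
p_fit = threshold_fit(Theta, Theta @ p_true, lam=0.5)
```

DAHSI replaces the inner least-squares solve with the VA minimisation over both states and parameters, which is what makes the hidden-variable case tractable.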
At each VA step, we minimize A_E + R_f A_M, which is 4DVar in its "weak" formulation [66,26], over X and p for a given R_f using IPOPT, a gradient-based interior-point optimisation package [69]. The state variables X_ini are initialized as Y for the measured states and as random values from a uniform distribution within specified bounds for the unmeasured states. Since we expect the parameter vector p to be sparse, it is initialized as p_ini = 0.
Initially R_f takes some small value R_{f,0}, as R_f = 0 would lead to an unconstrained solution on the unmeasured states and p. At each step β = 0, 1, 2, . . . , β_max of VA, R_f is updated to R_f = R_{f,0} α^β, for α > 1. After each step β of VA, we enforce sparsity by applying a hard threshold, λ, to p^(β). The solution, {X^(β), p^(β)}, at each step of the VA process is used as the initialization for the next step. We choose β_max so that the cost function plateaus (Fig. 3(b)), and our final solution is {X_fin, p_fin}. Because there are many local minima, we run N_I different initial guesses to fully explore the landscape of A_E + R_f A_M. It is important to note that the same λ yields multiple models due to the N_I different initializations of the unmeasured states. For example, if we consider N_I = 500 with a fixed λ = 3.9 in our Example 2.1, we find a total of 20 models (Fig. 3(b)).
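The annealing loop just described can be sketched as follows; `minimise` is a placeholder for the IPOPT solve of A_E + R_f A_M, and all names and default values here are ours:

```python
import numpy as np

def dahsi_sweep(minimise, X0, p0, lam, Rf0=1e-2, alpha=1.5, beta_max=20):
    """One lambda of the DAHSI loop: anneal R_f while hard-thresholding p.

    minimise(X, p, Rf) should return the minimiser of A_E + Rf*A_M
    warm-started from (X, p); here it is supplied by the caller.
    """
    X, p = X0, p0
    for beta in range(beta_max + 1):
        Rf = Rf0 * alpha ** beta                 # gradually enforce the model
        X, p = minimise(X, p, Rf)                # warm start from previous step
        p = np.where(np.abs(p) < lam, 0.0, p)    # hard threshold
    return X, p

# Dummy "solver" returning fixed parameters, to show the thresholding step.
fake = lambda X, p, Rf: (X, np.array([3.0, 0.01]))
Xf, pf = dahsi_sweep(fake, np.zeros((4, 3)), np.zeros(2), lam=0.1)
```

Running this loop from N_I random initialisations of the unmeasured states, and sweeping λ, generates the family of candidate models described above.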
To produce candidate models with varying sparsity, the entire β sweep with VA and thresholding is repeated for each λ. As with other model identification methods, different λ yield different models (for the same initialisation of unmeasured states). For one particular initialisation in Example 2.1, with λ = 3.8 the term z is selected in the first equation of the system; with the larger λ = 3.9, the term z is no longer selected (Fig. 3(c)). Although the same λ yields multiple models due to the different initial choices of unmeasured states, higher values of λ produce models with fewer active terms, as expected (Fig. 3(d)).

Algorithm 1 (DAHSI, excerpt):
Input: measurements Y, generic model library Θ, λ_max, β_max, α
3: Calculate discrete function F̂ from Θ
4: for l = 1 : L do
5:   x_l = y_l                                ▷ Fit measurements to data
6: Randomly initialise unobserved variables {x_{L+1}, ..., x_D}
7: Initialise p_ini = 0                       ▷ Force sparsity
9: Assemble pair {X_ini, p_ini}
10: while λ < λ_max do
12:   for β = 0 : β_max do                    ▷ Variational Annealing
13:   ...
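The VA-plus-thresholding loop for a single λ can be sketched in a toy setting. This is a schematic stand-in, not the released DAHSI code [54]: all states are assumed observed, so IPOPT's joint state-and-parameter solve collapses to a ridge-regularised linear solve whose data weight R_f is annealed upward, and all names and values are illustrative.

```python
import numpy as np

def dahsi_sweep_sketch(Theta, dXdt, lam, R_f0=1e-3, alpha=2.0, beta_max=20):
    """Toy stand-in for one lambda of the DAHSI loop. The annealed action
    is replaced by a ridge problem whose data weight R_f grows each beta
    step; unlike the real method, p is re-estimated from scratch rather
    than warm-started from the previous {X, p}."""
    q = Theta.shape[1]
    p = np.zeros(q)                       # p_ini = 0 (sparsity prior)
    for beta in range(beta_max + 1):
        R_f = R_f0 * alpha**beta          # anneal the model-error weight
        # prior toward p = 0 dominates for small R_f and fades as R_f grows
        G = R_f * Theta.T @ Theta + np.eye(q)
        p = np.linalg.solve(G, R_f * Theta.T @ dXdt)
        p[np.abs(p) < lam] = 0.0          # hard threshold p^(beta)
    return p

# usage: recover a sparse combination of library columns
rng = np.random.default_rng(0)
Theta = rng.normal(size=(200, 6))
p_true = np.array([0.0, 5.0, 0.0, -2.0, 0.0, 0.0])
dXdt = Theta @ p_true + 0.01 * rng.normal(size=200)
p_hat = dahsi_sweep_sketch(Theta, dXdt, lam=0.5)
```

With a large final R_f the solve approaches ordinary least squares, so the surviving support identifies the active terms.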

S1 Cost function analysis
Our aim is now to explore the landscape of the cost function, in order to understand the problem that we are solving and why it is so challenging. For illustrative purposes, in the following discussion we consider only two dimensions of the cost function Â = A_E + R_f A_M. We use the classical Lorenz system, keep all parameters in the structure fixed, and add two extra parameters. We then vary these two parameters and plot the cost function for three different values of R_f. The cost function that we want to minimise is the one with a large R_f value (Fig. 1, right). The cost function Â is highly non-convex, and finding its global minima a priori is difficult.

S3 AIC calculation for synthetic data
We want to find which model best represents the synthetic data generated (to which we added noise ∼ N(0, 0.01)). Since we are working with chaotic systems, we only expect prediction up to the Lyapunov time of the system. We consider 1/4 of the shortest Lyapunov time out of all the down-selected models for the synthetic data, t_M ≈ 0.3. We use S = 300 time series of length t_M as our validation set, but discard the first four points, as they are used to predict the initial condition for y_0 (as shown in the following section). To calculate the AIC score, we define the residual sum of squares of the m-th model, RSS_m, as the sum over the S validation time-series of the squared deviations between the synthetic data Y_s = [x_e, z_e]_s of time-series s and the trajectory simulated from the governing equations F_m of the m-th model, where p_m denotes the parameters found via parameter estimation for the m-th model. E^s_{av,m} denotes the average absolute error over time-series s, defined analogously. Finally, we define the AIC of the m-th model as

AIC_m = S ln(RSS_m / S) + 2 N_{p,m},

where N_{p,m} is the number of free parameters in the m-th model.
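The scoring step can be sketched as follows; it assumes the standard least-squares AIC form AIC_m = S ln(RSS_m/S) + 2 N_{p,m}, and the RSS values in the usage line are purely illustrative, not computed from the paper's data.

```python
import numpy as np

def aic_scores(rss, n_params, S):
    """Delta-AIC for a set of candidate models.
    AIC_m = S * ln(RSS_m / S) + 2 * N_{p,m}  (least-squares form),
    shifted by the minimum so Delta_AIC_m = AIC_m - AIC_min."""
    rss = np.asarray(rss, dtype=float)
    k = np.asarray(n_params, dtype=float)
    aic = S * np.log(rss / S) + 2.0 * k
    return aic - aic.min()

# toy usage: three candidate models; the 5-parameter model fits nearly as
# well as the 9-parameter one, so the parsimony penalty decides
delta = aic_scores(rss=[12.0, 4.0, 3.9], n_params=[3, 5, 9], S=300)
```

The model with ΔAIC = 0 is preferred; near-ties (small ΔAIC) indicate models the data cannot distinguish.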
We finally shift by the minimum AIC value, denoted AIC_min, so that ∆AIC_m = AIC_m − AIC_min.

S3.1 Initial condition choice for unmeasured y
We need an initial condition for each time series in order to simulate each model. The experimental data provides initial conditions for both x and z, but we have no information about the y component. We cannot use VA to estimate y_0 and the parameters simultaneously (which would lead to better prediction windows; see next section), because our validation data would then have been used for training. Instead, we consider the 8th-order finite difference approximation of dx/dt at the initial time. For each model, the first governing equation provides a second expression for dx/dt. Putting (7) and (8) together yields a single equation, which we solve for y(0). We note that for the down-selected models in Example A in our manuscript, the only term involving y in the first equation of every model is the first-order term, so in this case the equation is particularly simple to solve.
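This y_0 estimate can be sketched numerically. The stencil below contains the standard 8th-order central-difference weights for a first derivative, and the model form dx/dt = c[0]x + c[1]y + c[2]z is a hypothetical stand-in for the first equation of a down-selected model; only the linear-in-y structure matters.

```python
import numpy as np

# standard 8th-order central-difference weights for a first derivative
W8 = np.array([1/280, -4/105, 1/5, -4/5, 0.0, 4/5, -1/5, 4/105, -1/280])

def estimate_y0(x, z, dt, c):
    """Estimate the unmeasured y at the centre of the 9-point stencil
    (index 4) by matching the finite-difference estimate of dx/dt to a
    hypothetical first model equation dx/dt = c[0]*x + c[1]*y + c[2]*z,
    i.e. y enters only through a linear term, as for the down-selected
    models in Example A."""
    dxdt = W8 @ x[:9] / dt                 # 8th-order estimate at index 4
    return (dxdt - c[0] * x[4] - c[2] * z[4]) / c[1]

# synthetic check: x = cos(t), y = sin(t), z chosen so the model holds
dt = 0.01
t = np.arange(9) * dt
x = np.cos(t)
z = (-np.sin(t) + np.cos(t) - 2.0 * np.sin(t)) / 0.5  # from dx/dt = -x + 2y + 0.5z
y0_est = estimate_y0(x, z, dt, c=(-1.0, 2.0, 0.5))    # should be close to sin(t[4])
```

Because the stencil is centred at index 4, the first four data points of each validation segment are consumed by this estimate, which is why they are discarded from the scoring above.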
The results on the synthetic data indicate that only four candidate models represent the data well. Even though the ∆AIC of the incorrect models (Fig. 3, red, green and yellow lines) does not increase as we add more time series S to the AIC calculation, we consistently pick the correct model structure (blue line) as the one with the lowest ∆AIC.

S3.2 Prediction window
We compare how the prediction window changes from two observed variables to three. For noise ω = 0.01, with one hidden variable (Fig. 4, top row), using the real value of y_0 leads to no prediction at all, whereas using the estimated y_0 calculated as in the previous section leads to a prediction window of about 3.5 Lyapunov times. This shows that the parameter estimates and the y_0 estimate compensate for each other. For the case of all variables observed (Fig. 4, bottom row), using the real value of y_0 leads to a prediction window of about 6 Lyapunov times; if we instead estimate y_0, the prediction window reduces to about 3.5 Lyapunov times. For a higher noise ω = 0.1, with one hidden variable (Fig. 5, top row), using the real value of y_0 again leads to no prediction at all, and using the estimated y_0 leads to a shorter prediction window than for the lower noise. (Figs. 4 and 5 use measurement noise N(0, ω) with ω = 0.01 and ω = 0.1, respectively.)

S4 Down-selected models
We present the structure of the 25 down-selected models, but not the parameter estimates (the parameter values can be found in the code package [54]).

S4.2 AIC and BIC on the 25 down-selected models
The Bayesian information criterion (BIC) is defined as

BIC_m = S ln(RSS_m / S) + N_{p,m} ln S.

In the same way as we defined ∆AIC in a previous section, we shift by the minimum BIC value, denoted BIC_min, so that ∆BIC_m = BIC_m − BIC_min.
We now calculate how AIC ((6)) and BIC ((85)) change as we add more time series into the calculation. For each S used to calculate both AIC and BIC (S ≤ 1083, the total number of available time segments of length 1/4 of a Lyapunov time), we pick S random time-segments to ensure that the segments used in the calculation are independent samples.
For both ∆AIC m and ∆BIC m we are able to consistently identify a unique model (Fig. 6).
If we just look at the Pareto front (Fig. 1(d) in the main text), one might ask whether the decrease between 9 and 10 terms is meaningful. Both AIC and BIC indicate that it is.

Figure 6: ∆AIC_m and ∆BIC_m for the different models DAHSI found using the experimental data in [10].

S5 Action derivation
We consider a dynamical system with unknown governing equations dX/dt = F(X, p), where X = (x_1, x_2, ..., x_D) ∈ R^D are the state variables, F = (F_1, F_2, ..., F_D) are the unknown functions that govern the dynamics of the system, and p is a set of unknown parameters. The measurements Y = (y_1, y_2, ..., y_L) ∈ R^L are of lower dimension (L ≤ D) than the underlying variables.
Our goal is to find X and p that maximise the probability P(X, p | Y, F̂). By Bayes' rule [1], this posterior is proportional to the likelihood of the measurements times the prior on the states and parameters. We make the following assumptions: (1) the dynamics are Markov, so the state at t_{i+1} depends only on the state at t_i; (2) the measurement noise is Gaussian; (3) the parameters follow a Laplace prior. With assumption 1, it can be shown that the posterior factorizes into a product of measurement terms and one-step transition terms. For the second term in the sum, we need to find an expression for P(X(t_{i+1}), p | X(t_i), F̂).
Let us now focus on the k-th component of X(t_{i+1}); our goal is to find an expression for P(x_k(t_{i+1}) | X(t_i), p, F̂_k). We consider the library of q possible functions θ_1, ..., θ_q and the generic expression for each equation of our model: F̂_k := F̂_k(X, p) = p_{k,1} θ_1(X) + p_{k,2} θ_2(X) + · · · + p_{k,q} θ_q(X), for k = 1, 2, ..., D.
We can rewrite the probability we are seeking using Bayes' rule, and each term on the right-hand side can in turn be rewritten, so that (91) becomes (94), a product of a likelihood term and a prior on the parameters. We rewrite the first term on the right-hand side of (94) as a likelihood by assuming that the next state follows a normal distribution with mean f_k and some fixed standard deviation. With assumption 3, we know that each p_{k,j} follows a Laplace distribution, contributing a factor proportional to exp(−|p_{k,j}|/b) for scale b. With this we can write (94) as a product of Gaussian and Laplace factors. Note that since we will be minimising the action A_0 ((88)), we can drop the constant term P(X(t_i), F̂_k) in the denominator, leaving a proportionality instead of an equality.
Note that, because the k-th current state depends only upon the previous one, the transition terms factorize over time, and so, finally, we can write the posterior as a product of Gaussian transition factors and Laplace prior factors. Taking the logarithm of this expression yields the measurement and model error terms plus an ℓ1 penalty on the parameters, where λ = q/b.
We have thus seen that (88) becomes the measurement-error plus model-error action with an ℓ1 penalty on p, which is what we wanted to show.
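Collecting the Gaussian transition factors, the measurement likelihood, and the Laplace prior, the resulting action can be summarised compactly. The 1/N normalisations below follow the standard weak-4DVar convention [66,26]; this display is a restatement under those conventions rather than a verbatim copy of the manuscript's equation:

```latex
A(\mathbf{X},\mathbf{p})
  = \underbrace{\frac{1}{N}\sum_{i=1}^{N}\sum_{l=1}^{L}
      \bigl(x_l(t_i) - y_l(t_i)\bigr)^2}_{A_E\;(\text{measurement error})}
  + R_f\,\underbrace{\frac{1}{N}\sum_{i=1}^{N-1}\sum_{k=1}^{D}
      \bigl(x_k(t_{i+1}) - f_k(\mathbf{X}(t_i),\mathbf{p})\bigr)^2}_{A_M\;(\text{model error})}
  + \lambda \sum_{k=1}^{D}\sum_{j=1}^{q} \lvert p_{k,j}\rvert,
  \qquad \lambda = \frac{q}{b}.
```

The Gaussian transition terms produce A_M, the Gaussian measurement noise produces A_E, and the Laplace prior produces the ℓ1 penalty with λ set by the prior scale b.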

S6 Computational time
We use the Lorenz system with all variables observed, N = 1001 time points, ∆t = 0.01, and no noise. The more terms in our library Θ, the longer it takes to evaluate the associated cost function, its Jacobian, and its Hessian (Fig. 7, left). However, due to model symmetries and other structural features, the time to run our algorithm does not increase monotonically with the number of terms in the library: a library with 10 terms can take 100 times longer to run than the full library of 30 monomials (Fig. 7, right).

S7 Semiconductor
We consider this semiconductor model (T trap levels with two possible states differing by one electronic unit of charge):

dx/dt = e_{n,01} y − R_{n,10} xz,
dy/dt = −e_{n,01} y + R_{n,10} xz,
dz/dt = e_{n,01} y − R_{n,10} xz.
Here x denotes the number of electrons in the conduction band, y the number of traps with 2 electrons, and z the number of traps with 1 electron. We chose e_{n,01} = 0.5 and R_{n,10} = 0.25. Instead of using the library of all monomials in three variables up to degree two, we note that only a few terms make sense physically. Our generic model for this example is

dx/dt = p_{1,1} + p_{1,2} x + p_{1,3} y + p_{1,4} z + p_{1,5} x² + p_{1,7} xz,
dy/dt = p_{2,1} + p_{2,2} x + p_{2,3} y + p_{2,4} z + p_{2,5} x² + p_{2,7} xz,
dz/dt = p_{3,1} + p_{3,2} x + p_{3,3} y + p_{3,4} z + p_{3,7} xz.
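The synthetic time series for this example can be generated with a standard RK4 integrator. The initial condition below is a hypothetical choice (the text does not specify one); the model conserves x − z and y + z, which serves as a sanity check on the integration.

```python
import numpy as np

def semiconductor_rhs(u, e=0.5, R=0.25):
    """dx/dt = e*y - R*x*z, dy/dt = -(e*y - R*x*z), dz/dt = e*y - R*x*z.
    A single flux moves charge between the three populations."""
    x, y, z = u
    flux = e * y - R * x * z
    return np.array([flux, -flux, flux])

def rk4(rhs, u0, dt, n):
    """Fixed-step 4th-order Runge-Kutta, returning the full trajectory."""
    out = np.empty((n, len(u0)))
    out[0] = u0
    for i in range(1, n):
        u = out[i - 1]
        k1 = rhs(u)
        k2 = rhs(u + 0.5 * dt * k1)
        k3 = rhs(u + 0.5 * dt * k2)
        k4 = rhs(u + dt * k3)
        out[i] = u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return out

# hypothetical initial condition; N = 101 points, dt = 0.01 as in the text
traj = rk4(semiconductor_rhs, np.array([1.0, 2.0, 1.5]), dt=0.01, n=101)
```

Because dx/dt = dz/dt and dy/dt = −dz/dt, the combinations x − z and y + z are exact invariants of the model, so any drift in them flags an integration bug.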
We first consider three observed variables, D = L = 3, and a time series of N = 101 equally spaced time points with ∆t = 0.01. The λ sweep results in a different number of active terms for each value (Fig. 9, left). Since we know the model our data comes from, we check whether the model with the right number of terms (highlighted in red) corresponds to the original one, which it does.

S7.1 1 hidden variable
We consider two observed variables, L = 2, and pick x and y. We run N_I = 1,000 different initialisations. In this particular case there is a question of how the initial guess should be picked (see Algorithm 2). We do a λ sweep from λ = 0.1 through λ = 0.3. Out of the 1,000 different initialisations, we recover the right sparsity pattern 68 times. The optimal λ = 0.19, for which we recover the right sparsity pattern 33 times (see Fig. 10).
Table 2: Recovery of the semiconductor system with one hidden variable.

observed | hidden | N   | ∆t   | β_max | λ    | recovery
2        | 1 (z)  | 101 | 0.01 | 30    | 0.19 | 3.3%

Algorithm 2 (initial guess for the unobserved variable, excerpt):
   Pick at random one of the observed variables.
3: dX_d ← Calculate gradient vector from its time series.
4: while Z_d out of bounds do                 ▷ Make sure unmeasured variable is within bounds
5:   Z_d(t_1) ← Random initial condition for unobserved variable within bounds.

S7.2 Parameter identifiability
There are two main reasons why a parameter might not be identifiable: the parameter does not influence the model output; or there is an interdependence among parameters, that is, a change in one parameter (which would influence the model output) can be compensated by changes in other parameter(s) so that the output is unchanged. In this section, we focus on the latter.
One way to detect pairwise interplay is to plot contours of the cost function versus pairs of parameters. Highly eccentric contours or valleys show that the cost function is almost unchanged in one direction, and that the two parameters are highly correlated. The main drawback in our case is that this limits us to finding relationships between pairs of parameters rather than higher-dimensional interactions.
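The valley signature can be illustrated on a toy least-squares cost with two nearly collinear library terms; all names and values here are illustrative and are not taken from the Lorenz library. For a quadratic cost, the contour eccentricity is the ratio of the largest to smallest eigenvalue of the Hessian.

```python
import numpy as np

t = np.linspace(0.0, 1.0, 200)
theta1 = np.sin(t)                    # first candidate library term
theta2 = np.sin(t) + 1e-3 * t         # nearly collinear second term
target = 2.0 * theta1                 # data generated by theta1 alone

def cost(p1, p2):
    """Least-squares cost over the pair of parameters being probed."""
    r = p1 * theta1 + p2 * theta2 - target
    return r @ r

# for a least-squares cost the Hessian is 2*G, with G the Gram matrix
G = np.array([[theta1 @ theta1, theta1 @ theta2],
              [theta2 @ theta1, theta2 @ theta2]])
eigs = np.linalg.eigvalsh(2.0 * G)
eccentricity = eigs.max() / eigs.min()   # >> 1 signals a valley
```

Here (p1, p2) = (2, 0) fits the data exactly, but (0, 2) fits it almost as well: the cost is nearly flat along the compensating direction, which is exactly the eccentric-contour signature described above.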
We now add only two extra parameters (two of the black terms) at a time. Each of Figs. 11-14 is obtained by picking one term (parameter 1, the first extra term in the system) and then studying the cost function after adding another term (parameter 2, the second extra term in the system). We do this for all possible choices of parameter 2.
These four figures already show identifiability problems.

S8 Predator-Prey
Although pure predator-prey interactions are not very common in nature, there is a classical data set from the Hudson Bay Company recording the numbers of snowshoe hares and Canadian lynxes trapped in Canada, which in turn reflects the relative populations of both [49]. The data is recorded yearly, so ∆t = 1; we use data between 1900 and 1920, thus N = 21. In this case we do not truly know the dynamics behind the system, although we know that the snowshoe hare is the primary food of the lynx. We can therefore assume a predator-prey system, for which the classical Lotka-Volterra model describes the dynamics. We consider L = D = 2 (Fig. 15(a)). We build the library of functions with all monomials up to degree two in two variables and construct our generic model from it (Fig. 15(b)). Running our algorithm while varying λ yields a list of possible models; comparing their AIC values, we find that the model with 7 active terms is the best (Fig. 15(d)). We then take this 7-term model as the new generic model and run the algorithm again, finding that the best model contains only 5 terms (Fig. 15(f-h)). Iterating once more with the 5-term model as the generic one, we find that the best model contains 4 terms (Fig. 15(i-k)). This identified model corresponds to the Lotka-Volterra system. After a final parameter estimation, we obtain the dynamical system shown in Fig. 15(k); comparing the original data (dashed) with the resulting model (solid) shows an excellent match.

Figure 15: Recovery of the Lotka-Volterra system required an iterative formulation: down-selecting relevant monomials via AIC at the end of the variational annealing and restarting the algorithm with fewer terms in the generic model description.
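A minimal sketch of the identified four-term structure follows. The coefficients and the 1900 initial populations below are illustrative placeholders, not the fitted values of Fig. 15(k) (those are in the code package [54]); the point is only the H, HL, L, HL sparsity pattern.

```python
import numpy as np

def lotka_volterra_rhs(u, a=0.55, b=0.028, c=0.84, d=0.026):
    """Four-term Lotka-Volterra structure:
    dH/dt = a*H - b*H*L,  dL/dt = -c*L + d*H*L,
    with H the hare and L the lynx population. Coefficients are
    illustrative placeholders, not the fitted parameters."""
    H, L = u
    return np.array([a * H - b * H * L, -c * L + d * H * L])

def rk4_path(rhs, u0, dt, n):
    """Fixed-step RK4 integration returning the full trajectory."""
    out = np.empty((n, len(u0)))
    out[0] = u0
    for i in range(1, n):
        u = out[i - 1]
        k1 = rhs(u)
        k2 = rhs(u + 0.5 * dt * k1)
        k3 = rhs(u + 0.5 * dt * k2)
        k4 = rhs(u + dt * k3)
        out[i] = u + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return out

# 20 years of dynamics (yearly data, Delta t = 1), 20 substeps per year
traj = rk4_path(lotka_volterra_rhs, np.array([30.0, 4.0]), dt=0.05, n=401)
```

Lotka-Volterra conserves V(H, L) = dH − c ln H + bL − a ln L along trajectories, which gives a convenient check that the fixed-step integration is accurate enough for the yearly sampling.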

S9 α parameter in VA algorithm
We study how the parameter α, used to increase R_f = R_{f,0} α^β during the VA algorithm, affects recovery. We use the classic Lorenz system with σ = 10, ρ = 28, and β = 8/3. We numerically simulate the system using 4th-order Runge-Kutta with a time step of ∆t = 0.01, producing time series similar to the experimental data set, and add error modeled as additive Gaussian noise with mean zero and standard deviation ω = 0.01; the measurement function is therefore h(X) = X + N(0, ω). We consider N = 501 and take y to be the hidden variable.
As we increase α the recovery rate decreases, and for α ≥ 1.3 the recovery is 0% (Table 3).
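The trade-off has a simple geometric reading: under the schedule R_f = R_{f,0} α^β, the number of annealing steps needed to reach a given model-error weight scales like 1/ln α, so a small α implies many more, gentler VA solves. A minimal sketch (the R_f values are illustrative):

```python
import math

def anneal_schedule(R_f0, alpha, beta_max):
    """R_f values used across the VA sweep: R_f(beta) = R_f0 * alpha**beta."""
    return [R_f0 * alpha**beta for beta in range(beta_max + 1)]

def steps_to_reach(R_f0, alpha, R_target):
    """Number of annealing steps before R_f first reaches R_target."""
    return math.ceil(math.log(R_target / R_f0) / math.log(alpha))

# a gentle alpha = 1.1 needs far more steps than a coarse alpha = 2.0
n_gentle = steps_to_reach(1e-2, 1.1, 1e4)   # many small increments
n_coarse = steps_to_reach(1e-2, 2.0, 1e4)   # few large jumps
```

A larger α therefore reaches the strongly model-constrained regime in fewer solves, but each jump in R_f moves the cost landscape further from the previous warm start, which is consistent with the recovery rate dropping to 0% for α ≥ 1.3.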