Combining Machine Learning with Knowledge-Based Modeling for Scalable Forecasting and Subgrid-Scale Closure of Large, Complex, Spatiotemporal Systems

We consider the commonly encountered situation (e.g., in weather forecasting) where the goal is to predict the time evolution of a large, spatiotemporally chaotic dynamical system when we have access to both time series data of previous system states and an imperfect model of the full system dynamics. Specifically, we attempt to utilize machine learning as the essential tool for integrating the use of past data into predictions. In order to facilitate scalability to the common scenario of interest where the spatiotemporally chaotic system is very large and complex, we propose combining two approaches:(i) a parallel machine learning prediction scheme; and (ii) a hybrid technique, for a composite prediction system composed of a knowledge-based component and a machine-learning-based component. We demonstrate that not only can this method combining (i) and (ii) be scaled to give excellent performance for very large systems, but also that the length of time series data needed to train our multiple, parallel machine learning components is dramatically less than that necessary without parallelization. Furthermore, considering cases where computational realization of the knowledge-based component does not resolve subgrid-scale processes, our scheme is able to use training data to incorporate the effect of the unresolved short-scale dynamics upon the resolved longer-scale dynamics ("subgrid-scale closure").


Abstract
We consider the commonly encountered situation (e.g., in weather forecasting) where the goal is to predict the time evolution of a large, spatiotemporally chaotic dynamical system when we have access to both time series data of previous system states and an imperfect model of the full system dynamics. Specifically, we attempt to utilize machine learning as the essential tool for integrating the use of past data into predictions. In order to facilitate scalability to the common scenario of interest where the spatiotemporally chaotic system is very large and complex, we propose combining two approaches:(i) a parallel machine learning prediction scheme; and (ii) a hybrid technique, for a composite prediction system composed of a knowledge-based component and a machine-learningbased component. We demonstrate that not only can this method combining (i) and (ii) be scaled to give excellent performance for very large systems, but also that the length of time series data needed to train our multiple, parallel machine learning components is dramatically less than that necessary without parallelization. Furthermore, considering cases where computational realization of the knowledge-based component does not resolve subgrid-scale processes, our scheme is able to use training data to incorporate the effect of the unresolved short-scale dynamics upon the resolved longer-scale dynamics ("subgrid-scale closure").

I. INTRODUCTION
In recent years, machine learning techniques have been used to solve a number of complex modeling problems ranging from effective translation between hundreds of different human languages [1] to predicting the bioactivity of small molecules for drug discovery [2]. Typically, the most impressive results have been obtained using artificial neural networks with many hidden neural states [3]. These hidden layers form a "black box" model where internal parameters are trained given a set of measured training data but, after which, only the final model output is observed. This formulation using measured training data contrasts with how models used for forecasting physical spatiotemporally chaotic processes are formulated, which is typically done using the available scientific knowledge of the underlying mechanisms that govern the system's evolution. For example, in the case of forecasting weather, this knowledge includes the Navier-Stokes equations, the first law of thermodynamics, the ideal gas law, and (see Sec. IV) simplified representations of physics at the unresolved spatial scales [4].
In this paper, focusing on the key issues of scalability and unresolved subgrid physics, we consider the general problem of forecasting a vary large and complex spatiotemporally chaotic system where we have access to both past time series of measurements of the system state evolution and to an imperfect knowledge-based prediction model. We present a method for combining machine learning prediction with imperfect knowledge-based forecasting that is scalable to large systems with the aim that the resulting combined prediction system can be significantly more accurate and efficient than either a pure knowledge-based prediction or a pure machine-learning-based prediction. A main source of difficulty for scalability of the machine learning is that the dimension of the state of the systems we are interested in can be extremely large. For example, in state-of-the-art global numerical weather models the state dimension (number of variables at all grid points) can be on the order of 10 9 . Thus, both the machine learning input (the current atmospheric state) and output (the predicted atmospheric state) have this dimensionality. (In contrast to the description of some machine learning techniques as "deep", one might refer to the situations we address as "wide".) The prediction method that we propose for such large complex systems builds on the previous work on parallelizable machine learning prediction [5] and hybridization of knowledge-based modeling with machine learning [6]. We call our technique Combined Hybrid/Parallel Prediction (CHyPP, pronounced "chip"). Although the general method we propose is applicable to different kinds of machine learning, the numerical examples presented in this paper use a machine learning method known as reservoir computing [7,8]. Jaeger and Haas [9] described the effectiveness of reservoir computing for predicting low-dimensional chaotic systems. Research surrounding this technique has since expanded [10,11], and it has recently been shown that reservoir computing using recurrent neural networks can produce similar quality predictions for chaotic systems to those of other recurrent architectures, such as LSTM's [12] and GRU's [13], while often requiring much less computational time to train [14]. Reservoir computing techniques can additionally be extended to physical implementations using, e.g., photonics [15] and Field Programable Gate Arrays (FPGA's) [16,17].
The rest of the paper is organized as follows. In Sec. II, we first review a simple version of reservoir computing [7,8] and discuss its shortcomings for forecasting high-dimensional spatiotemporal chaos. We next describe the hybrid reservoir prediction technique (Refs. [6] and [18]), as well as previous work on how machine learning can be parallelized for prediction of spatiotemporal systems [5]. We then present our proposed CHyPP architecture combining the two. In Sec. III, we demonstrate how the CHyPP methodology improves on each of the component prediction methods. For these demonstrations we use the paradigmatic example of the Kuramoto-Sivashinky model as our test model of the spatiotemporally chaotic system that we aim to predict. We highlight the scalability of the proposed method to very large systems as well as its efficient use of training data, which we view as the crucial issues for the general class of applications in which we are interested. In Sec. IV, we consider a situation with multiple time and space scales and show by numerical simulation tests, that CHyPP can, through its use of data, effectively account for unknown subgrid scale processes. The main conclusion of this paper is that our CHyPP methodology provides an extremely promising framework, potentially facilitating significant advances in the forecasting of large, complex, spatiotemporally chaotic systems. We believe that, in addition to weather, the method that we propose may potentially be applicable to a host of important areas, enabling currently unattainable capabilities. Some speculative examples of potential applications are forecasting of ocean conditions, forecasting conditions in the solar wind, magnetosphere and ionosphere (also known as 'space weather', important for its impact on Earth-orbiting satellites, GPS accuracy, and high frequency communications), forecasting the evolution of forest fires and their response to mitigating interventions, forecasting the responses of ecological systems to climate change, analysis of neuronal activity, etc.

A. A Simple Machine Learning Predictor
To begin, we initially consider the goal of a generic machine learning system for timeseries prediction of an unknown dynamical system evolving on an attractor of that system.
Later, we will consider that the machine learning system is a reservoir computer and that the unknown chaotic system is not completely unknown, and we will try to make use of that knowledge. Given a finite duration time series of the unknown system state's evolution up to a certain time t 0 , where the state at each time is represented by the K-dimensional vector u(t) = [u 1 (t), u 2 (t), . . . , u K (t)] T , our goal is to predict the subsequent evolution of the state.
As illustrated in Fig. 1(a), in the initial training phase, at each time t = n∆t ≤ t 0 , u(t) is input to the machine learning system (u in (t) = u(t)), which is trained to output a time ∆t prediction of the dynamical system state u(t + ∆t) (u out (t + ∆t) u(t + ∆t)). We refer to the just-described input-output configuration as the "open-loop" system ( Fig. 1a). To ensure an accurate representation of the true dynamics with a reservoir of limited size, ∆t is typically short compared to natural time scales (such as the correlation time or the "Lyapunov time") present in the unknown dynamical system whose state is to be predicted. Once trained, the machine learning system can be run in a "closed-loop" feedback configuration ( Fig. 1(b)) to autonomously generate predictions over a finite duration of time. That is, with t 0 representing the time at the end of the training data, we replace the former input from the training data by the output by inserting an output-to-input feedback loop, shown by the dashed line in Fig. 1(b). Then, when u in (t 0 ) = u(t 0 ) is the input, the reservoir computer produces an output prediction for u(t 0 + ∆t), which we refer to asũ(t 0 + ∆t) = u out (t 0 +∆t). When this predicted state is then used as the input (u in (t 0 +∆t) =ũ(t 0 +∆t)), the reservoir computer produces an output prediction for u(t 0 + 2∆t), denotedũ(t 0 + 2∆t) (u out (t 0 + 2∆t) =ũ(t 0 + 2∆t)). This process is then iterated to produce predictions of the system state at t = t 0 + m∆t for m = 1, 2, 3, . . . (Fig. 1b).

ML ML (a) (b)
Prediction In the rest of this section, we first present background from previous work (Secs. II B-II D), then introduce our CHyPP method for combining a knowledge-based model with reservoir-based machine learning to form a scalable system concept suitable for state prediction of very large, spatiotemporally chaotic systems. Specifically in Sec. II B, we review a basic reservoir computing setup based on the methods of [7,8] along with the proposal for its use as a predictor carried out in Ref. [9]. In Sec. II C, we build upon the simple setup of Sec II B and describe the methodology from Ref. [5] for hybrid forecasting of the dynamical system using a single reservoir computing network and an imperfect model. In Sec. II D, the reservoir computing forecasting technique of Sec. II B is extended via parallelization of the machine learning with multiple parallel reservoir computers, in order to predict highdimensional spatiotemporally chaotic systems, as was first described in [5] (but without the incorporation of a knowledge-based model). Finally, in Sec. II E, we present our proposed CHyPP architecture and technique for combining the parallel reservoir method of Sec. II D with the hybridization of a knowledge-based predictor and a parallel reservoir-based machine learning prediction of Sec. II C. It is our belief that it is only by means of such a combination that the most effective application of machine-learning-enabled prediction can be realized for large, complex, spatiotemporally chaotic dynamical systems.

B. Basic Reservoir Computing
We now consider that the ML device shown in Fig. 1 is a reservoir computer which, as shown in Fig. 2 (and further discussed subsequently), consists of a linear input coupler ( W in ) that couples the state u in (t) into the reservoir (the circle in Fig. 2). The state of the reservoir is given by a high-dimensional vector r(t), which is then linearly coupled by W out to produce an output vector u out (t + ∆t) which, through the training, is a very good approximation to u(t + ∆t). In this paper, our implementation of the reservoir is an artificial recurrent neural network with a large number of nodes. The artificial neural network that forms our basis u out (t + ∆t) Figure 2. Diagram of the reservoir computer setup.
In the "open-loop" training phase (analogous to Fig. 1(a)), the dashed line representing coupling from the output back to the input is absent. In the "closed-loop" prediction phase (analogous to Fig. 1(b)), the coupling from the output back to the input (dashed line) is activated.
of the reservoir computing implementation is illustrated in Fig. 2. The reservoir network adjacency matrix is chosen to be a randomly generated, sparse matrix A that represents a directed graph with weighted edges. The adjacency matrix A has dimensions D r × D r , where D r is the number of nodes in the network. Elements of A are randomly generated such that the average number of incident connections per node (average number of nonzero elements of the matrix in each row) is set to a chosen value d , the "average in-degree", while the nonzero elements of A are chosen from a uniform distribution over the interval 1]. Once generated, A is re-scaled (i.e., multiplied by a scalar) so that the its largest absolute eigenvalue is a prescribed value ρ, called the spectral radius. Each node i in the network has an associated scalar state r i (t). The state of the network is represented by the D r dimensional vector r(t), whose elements are r i (t) where i = 1, 2, 3, . . . , D r .
The reservoir network state r(t) evolves dynamically while receiving input through a K × D r input coupling matrix, W in . We choose the matrix W in to contain an equal number of nonzero elements in each column, which corresponds to coupling each element of the reservoir input to an equal number of reservoir node states. Nonzero elements of this matrix are selected randomly and uniformly from the interval [-σ, σ], where σ is referred to as the input weight. Given the current state of the reservoir r(t), the reservoir state is advanced at each time using a hyperbolic tangent activation function, Before prediction begins, the reservoir computer is trained in the "open-loop" configuration. During this training phase, u in (t) = u(t) + sη(t). Here, u(t) is the measurement of dynamical system state at time t in the form of a K-dimensional vector. As in Ref. [7], we add a small, normally distributed K-dimensional vector sη(t) of mean 0 and standard deviation s to the input dynamical system state during training. The elements of the vector η(t) are chosen randomly and independently at each time t. The function of this added "stabilization noise" is to allow the reservoir computer to learn how to return to the true trajectory when the input trajectory has been perturbed away from it. We find that, in many cases, this additional small noise input beneficially promotes stability of the closed-loop prediction configuration once training has been completed.
The adjustable constants characterizing the overall prediction system (e.g., D r , d , σ, s, and ∆t) are referred to as "hyperparameters", and it is important that they be chosen carefully in order for the reservoir computer to predict accurately. For example, as explained in previous literature, the hyperparameters can be chosen so that the reservoir system has the so-called "echo-state property" (see, e.g., Ref. [7]) whereby, when in the open-loop training phase, the reservoir state r(t), aside from an initial transient and with the random sequence η(t) fixed, the reservoir state r(t) becomes uniquely determined by the reservoir input sequence u(t) (and hence independent of the initial values of r). Accordingly, prior to initiation of the training, we ignore and discard the reservoir and input states for the first few time steps. The state of the reservoir at the end of this transient nullification period is labelled r(0). Starting with r(0), the training system states u(j∆t) (j an integer, j∆t ≤ t 0 ) and the resulting reservoir states, r((j + 1)∆t), are recorded and saved. We then desire to the use these saved states to produce an output,ũ(t + ∆t), when u(t) is the input, which we desire to be very close to u(t + ∆t). To do this, we find it useful to perform an ad-hoc operation on the reservoir state vectors that squares the value of half of the node states. Specifically, we definer(j∆t) such that,r As surmised in footnote [16] of Ref. [5], this operation improves prediction by breaking a particular odd symmetry of the reservoir dynamics that is not generally present in the dynamics to be predicted. We next couple the transformed reservoir stater(t + ∆t) via a K × D r output coupling matrix W out to produce an output u out (t + ∆t), and we endeavor to choose ("train") the matrix elements of W out so that u out (t + ∆t) is close to u(t + ∆t). In general, this will require that D r K. To accomplish this, we try to minimize the L 2 difference between u(t + ∆t) and u out (t + ∆t). To prevent overfitting, we insert a Tikhonov regularization term to penalize very large values of the matrix elements of W out [19]; that is, we find over the KD r scalar values of the matrix W out . Here, β is a small regularization parameter and . . . denotes the L 2 norm. This technique is commonly known as ridge regression. In our subsequent numerical experiments (Secs. III and IV), we use the "matrix solution" for the minimization problem to determine the trained W out . In particular, we proceed as follows.
We first form a matrixR where the j th column is the j th transformed reservoir stater(j∆t).
We define a target matrix U consisting of the time series of training data such that the j th column of U is u(j∆t). We then determine a matrix W out that satisfies the following linear system, We note that methods of solving Eq.(5) for W out other than direct matrix solution are also available and may sometimes be advantageous (e.g., GMRES [20], stochastic gradient descent [21], etc.). By means of this minimization, it is hoped that u out (t) ∼ = u(t) is achieved.
This completes the training process, following which we can switch to the closed-loop configurations ( Fig. 1b and the dashed line in Fig. 2) and attempt to predict the subsequent evolution of u(t). Prediction can then proceed via Eqs.(1), (2) and (6) where the prediction of the dynamical system stateũ(t) = u out (t) and the reservoir input is received from the feedback loop (u in (t) = u out (t)).
The closed-loop configuration system can be regarded as a surrogate dynamical system that mimics the original unknown dynamical system. As such, if the original unknown dynamical system is chaotic, the closed-loop predictor system will also be chaotic. Due to the exponential growth of small errors implied by chaos, we cannot expect prediction to be good for more than several Lyapunov times (the Lyapunov time is the typical e-folding time for error growth in a chaotic system). Thus we will regard our predictions to be successful when they are good for a few Lyapunov times.
Now consider that we have made a prediction for u(t), and, at some later time, we wish to perform another prediction of the same spatiotemporally evolving system with unknown dynamics. It is not necessary to retrain our predictor; we can, instead, re-use the previously obtained W out [5]. To do so, we re-initialize the reservoir state to zero, switch the reservoir computer into its open-loop configuration, and allow it to evolve given input states of the unknown dynamical system measured at times t p − T S ≤ t ≤ t p (i.e., u in (t) = u(t)). T S is some synchronization time that is sufficiently longer than the characteristic memory of the reservoir computer but, importantly, is much shorter than the necessary training time needed to determine W out . t p is the time at which we want to begin our prediction. After this synchronization period, the reservoir computer is switched to its standard closed-loop prediction configuration and is used to make predictions at later times.
If the original system is very high dimensional (i.e., the dimension K of the measure vector u(t) is very large), then D r K must be exceedingly large. This can make the training to determine W out infeasible. For example, if we solve Eq.(5) by the direct matrix method, Eq.(6) shows that we must solve a D r × D r linear system of equations. For our computational resources, we find that this becomes impossible as D r approaches 2 × 10 4 .
Due to this and other similar considerations, we deem the method discussed in this section to be untenable for the prediction of large, spatiotemporally chaotic systems of the type we are interested in.  In the "open-loop" training phase (analogous to Fig. 1(a)), the dashed line representing coupling from the output back to the input is absent. In the "closed-loop" prediction phase (analogous to Fig. 1(b)), the coupling from the output back to the input (dashed line) is activated.

C. Hybrid Reservoir Computing
In this section, we briefly review a hybrid scheme proposed in Ref. [6] for combining reservoir computing with an imperfect knowledge-based model of the dynamical system of interest. We again assume access to time series data of measurements of the state of the dynamical system. We further assume that an imperfect knowledge-based model of the system producing the measurements is available and that this imperfect model is capable of forecasting the state of the dynamical system with some degree of accuracy, which we wish to improve upon. In the hybrid setup of Ref. [6] (Fig. 3) described below, it has been shown that the machine learning method and the knowledge-based model augment each other and, in conjunction, can provide a significantly better forecasts than either the knowledge-based model or the pure machine learning model acting alone.
As in Sec. II, we assume that the data used for training is given by K measurements of the state of the dynamical system at equally spaced increments in time, ∆t, forming a vector time series u(t). The imperfect knowledge-based model M is an operator that maps the state u(t) to a forecast of the state at time (t + ∆t).
We advance the reservoir state in time using the same activation function as described in Sec. II B, Once again, during the training phase, u in (t) = u(t) + sη(t). The training process is similar to the one employed in Sec. II for the basic reservoir computer but with the addition of the knowledge-based prediction (as illustrated in Fig. 3). Using ridge regression, we find a linear mapping W out fromr(t) and M[u(t) + sη(t)] to produce an approximate prediction of u(t + ∆t), Here, u in (t) is the same as that input to the reservoir, u in (t) = (u(t)+sη(t). We again include the small sη(t) vector in the knowledge-based model input during training to improve the stability of the method. Additionally, recall thatr is related to r by Eq.(2). In the prediction phase, we run the hybrid system in a closed loop feedback configuration (Fig. 3 with the dashed line feedback connection present) using Eqs. (7), (2), and the following equation, During the prediction phase, the hybrid forecastũ H (t) = u out (t) and the hybrid input is received from the feedback loop (u in (t) = u out (t)). Note that, in this scheme, the output is a linear combination of the reservoir state and the knowledge-based model output that optimizes the agreement of the combined system output with the training data. Thus, we can regard the result as being an optimum combination of the reservoir and knowledgebased components. Hence, we expect that if one component is superior for some aspect of the prediction, then it will be weighted more highly for that aspect of the prediction. This suggests that predictions by this method may be greatly improved over those available from either the knowledge-based component or the reservoir component acting alone (e.g., see Fig. 7 of Ref. [6]).
In addition to the hybrid configuration shown in Fig. 3, we have also tested a modified configuration in which there is an additional input to the reservoir component from the output of the knowledge-based model M. We have empirically found that this modification sometimes yields a small positive improvement in prediction; however, for simplicity, we henceforth only consider the configuration in the figure.

Spatial Grid
Reservoir Forecast To obtain a good prediction of a chaotic dynamical system state using reservoir computing, the reservoir dimensionality must be much greater than that of the dynamical system (i.e., D r K) so that there are enough free parameters available in W out for fitting the reservoir output state to the measured dynamical system state at time (t + ∆t). This can cause the computational cost of determining an optimum output matrix to become unfeasibly high for large dynamical systems, e.g. because implementation of this step by the method of Eq. (6) involves solving a D r × D r linear system. As a point of reference, we note that the dimension of the state vector of a current typical operational global weather forecasting models is on the order of 10 9 . A method to make consideration of such problems feasible for machine learning approaches was proposed in Ref. [5]. The idea is to exploit the short range of causal interactions over a small time period in many spatiotemporally chaotic dynamical systems.
This was shown to allow the use of multiple relatively small reservoir computers that each make predictions of a part of the full dynamical system state as in a local region, illustrated in Fig. 4 and explained below. This method has the advantage that, in the training phase, all of the relatively small output matrices W out,p of each reservoir computer can be trained independently in parallel, thus greatly reducing the difficulty of training.
For illustrative purposes, consider a spatiotemporal dynamical system in one spatial dimension with periodic boundary conditions. Let the dynamical system state be represented T where each scalar component u j (t) represents the time series at a single spatial grid point. We divide the system state into P equally sized, contiguous regions containing Q system variables, where P Q = K. We denote the system variables in these regions as u where 1 ≤ p ≤ P . Each local region in space is predicted by a reservoir R p , each of which has internal reservoir states r p (t) and adjacency matrix A p generated via the process described in Sec. II B. Each reservoir is coupled to its input, u in,p (t), via a matrix of input weights, W in,p . This input corresponds to a local region of the system that contains the region to be predicted by that reservoir as well as a size overlap region on either side, Each reservoir state is advanced using the following equation, During the training phase (Fig. 4 with the dashed output-to-input connection absent), Here, the (2 + Q) dimensional vector η(t) p is the p th local region of a global vector of normally distributed random variables, η(t), chosen independently at each time t, After a suitably long transient nullification period, we determine the output matrices W out,p for each reservoir that solve the least squared optimization problem using ridge regression, min Wout,p 0≤t<t 0 W out,prp (t + ∆t) − u p (t + ∆t) 2 + βTrace(W out,p W T out,p ) .
where, in Eq.(13), U p andR p are analogous to U andR in the single reservoir prediction (see Eq. (6)). U p is the target matrix such that the j th column is u p (j∆t), whileR p is obtained from R p analogous to the single reservoir case as described in Eq. (2). Each W out,p is calculated by solving the following equation, Note that, as previously claimed, the minimization problem for each p, Eq. (13), is completely independent, and can be solved for the different p in parallel. As in Sec. II B, in the prediction phase and, after a period of synchronization, we produce a full state predictionũ(t) by running the system in a closed loop feedback configuration (i.e., Fig. 4 with the dashed output-to-input feedback connection present). This is done by concatenating the local predictions from each reservoirũ p (t) = u out,p (t) (where u out,p (t) is the output state from each reservoir reservoir). Reservoir p then receives inputs from its own outputs in addition to the left and right overlap zone inputs from the left grid points, and the right grid points. The entire system thus evolves as follows, . . .
r p (t) is obtained from r p (t) using Eq. (2).

E. Combined Hybrid/Parallel Prediction (CHyPP)
Spatial Grid Hybrid Forecast W out,p−1 W out,p W out,p+1 In our previous work [6], we constructed a hybrid prediction method using a single reservoir. In this section, we consider a combination of the parallel approach of Sec. II D (which enables computationally efficient scaling to high-dimensional spatiotemporal chaos) and the hybrid approach of Sec. II C (which allows us to utilize partial prior knowledge of the dynamical system) where we assume that the knowledge-based system provides global predictions over the entire spatial domain. While the approach is easily generalized to 2-and 3-dimensional spatial processes, in order to most simply demonstrate our proposed methodology, we again consider a one-dimensional, spatiotemporally chaotic dynamical system with periodic boundary conditions, with a state represented by a K-dimensional vector time series Our approximate knowledge-based prediction operator M gives a global prediction of the full system state for a time ∆t: M[u(t)] =û(t + ∆t). As in our parallel reservoir computer prediction described in Sec. II D, we partition the system state into P equally sized, continuous regions containing Q variables, where P Q = K and each such region is predicted by a reservoir R p , p = 1, 2, . . . , P .
Each reservoir R p input is coupled to a local region of the system states as in Sec. II D, and the reservoir state r p (t) is advanced using the following equation, During the initial training phase, u in,p (t) = v p (t) + sη(t) p . In Eq.(18), W in,p is the input coupling matrix for the local system states, analogous to that described in Sec. II B. As in Sec. II D, v p (t) is the state measurements at grid points within the local region to be predicted along with grid points to either side and η(t) p is the p th local region of the global vector of normally distributed random numbers η(t). Each reservoir is trained independently in parallel using a set of training data consisting of an equally spaced time series of measured states of the large scale dynamics beginning at t = 0 after some initial transient nullification period. Again, we solve the least squares optimization problem with ridge regression to determine an output mapping for each reservoir (analogous to Eq.(13)), In Eq. (19),Û p is a matrix whose j th column isû p (j∆t), whereû p (j∆t) is the knowledgebased prediction of the p th local region of the system, The solution to Eq.(19) by the direct matrix method is In the prediction phase, we run the system in a closed-loop configuration according to Eqs. (22) to predict each local region of the system given each reservoir state and knowledgebased model prediction, obtainingũ p (j∆t). The local predictions are appropriately concatenated to form a full state prediction, which is then used as input for the next prediction step, as follows, . . .
In the above equations, we once again calculater p (t) from r p (t) using Eq.(2).

III. TEST RESULTS ON THE KURAMOTO-SIVASHINSKY EQUATION
We test the effectiveness of our proposed CHyPP method (Sec. II E) by forecasting the state evolution of the Kuramoto-Sivashinsky equation with periodic boundary conditions [22,23], where y = y(x, t) and y(x + L, t) = y(x, t). This spatially one-dimensional model generally produces spatiotemporally chaotic dynamics for periodicity length L 50. For the purpose of comparing various methods, we regard Eq.(26) as generating the state measurements of a putative system that we are interested in, while the imperfect prediction model we use is a modified version of this same equation, where an error term is introduced in the coefficient of the second derivative term, We have also investigated the case where the error is introduced by multiplying the y∂y/∂x term in Eq.(26) by (1 + ). For the latter case, the results of our method are qualitatively similar to those for Eq.(27). We form our simulated measured time series u(t) by taking the i th element of u(t) to be y(i∆x, t), where ∆x = L/K is the grid spacing used for our numerical solutions of Eq. (26). As a metric for how long a prediction is valid, we calculate the normalized root-mean-square error (NRMSE) between the true and predicted system states. We define the length of valid prediction, or "valid time", to be the time at which the NRMSE exceeds 0.2. Since the NRMSE saturates at sqrt(2), we consider this to be the point when error in the prediction reaches about 15% of its saturation value. In Sec. III A, we demonstrate that our CHyPP methodology can scale to predict very large systems. In Sec. III B, we show that the CHyPP method requires significantly less training data than the parallel scheme of Sec. II D (using reservoirs without a knowledge-based component).
We then discuss, in Sec. III C, the sensitivity of the CHyPP and parallel machine learning (Sec. II D) methods to the local overlap length .
For all of our numerical experiments in this paper, in addition to our solution of the imperfect model, we use a digital computer to implement the machine learning. However, if, in the future, CHyPP is applied to very large systems (e.g., weather forecasting) requiring a much larger number of parallel machine learning units, then we envision that it may prove useful to perform the parallel machine learning using a special purpose physically implemented reservoir computing array, e.g., based on FPGA's or photonic devices [15][16][17].
Such implementations, called "AI hardware accelerators", show great promise with respect to low cost, speed, and compactness.  We first test the ability of our CHyPP method to scale to large system sizes. We consider the Kuramoto-Sivashinsky equation where we fix the number of system variables (grid points) each reservoir is trained to predict to Q = 8, as well as the reservoir spatial density to P/L = 16/100 (P is the number of reservoirs), while varying the periodicity length L. For all tests in this section, we use the hyperparameters in Table I unless otherwise specified. We additionally fix the error in the incorrect model to = 0.1. Fig. 6 shows the resulting NRMSE between the true state and our hybrid parallel prediction averaged over 100 prediction periods predicted state orbit). The NRMSE is relatively unchanged as the value of L is increased, indicating that the CHyPP prediction, like the parallel reservoir-only prediction in [5], can be scaled to very large systems by the addition of more reservoirs.
Finally, we note that our test system, the Kuramoto and W out,p to be the same, as would be possible for a homogeneous system).

Duration of Training Data
The prediction results shown in Figs. 7 and 8 use the parameters contained in Table I. result of this, the length of training data before which we would not benefit from increasing the size of the reservoir is also much shorter in the CHyPP prediction. For example, consider the plots in (b) and (f) in Fig. 8, where each prediction uses 4 reservoirs. If we have a 500 Lyapunov time length of training data, we would not benefit from increasing the number of nodes per reservoir above 2000 in the reservoirs-only prediction, but could obtain a significant performance improvement if we did so using CHyPP. CHyPP exhibits its most impressive performance for very short lengths of training data. With only 22.5 Lyapunov times worth of training data, the parallel reservoir-only method is able to predict for only 0.44 Lyapunov times, whereas CHyPP with 16 reservoirs of 2000 nodes each predicts for 3.35 Lyapunov times on average, matching the saturated single reservoir-only prediction that is trained using 150 times more training data.

C. Prediction Quality dependence on local overlap length
In this section, we investigate the dependence of CHyPP performance on the local overlap length . Figure 9 shows that when the local overlap length is zero or close to zero,

PROBLEM OF SUBGRID-SCALE CLOSURE
A common difficulty in numerical modeling arises because many physical processes involve dynamics on multiple scales. As a result, fundamentally crucial subgrid scale dynamics is often only crudely captured in an ad-hoc manner. The formulation of subgrid scale models by analytical techniques has been extensively studied (e.g., see Ref. [24]) and is sometimes referred to as "subgrid-scale closure". In this section, we show that our CHyPP methodology provides a very effective data-based (as opposed to analysis-based) approach to subgrid-scale closure.
In particular, we test our CHyPP prediction method on a dynamical system with multiple spatial and temporal scales. The particular system we choose to predict is the multiscale "toy" atmospheric model formulated by E.N. Lorenz in his 2005 paper [25]. We refer to this model as Lorenz Model III. This model is a smoothed extension of Lorenz's original "toy" atmospheric model described in his 1996 paper [26] (hereafter referred to as Lorenz In Eq.(28), X is a smoothed version of Z, and Y is the difference between Z and X, Here, I denotes the smoothing distance. Thus, X describes the large-spatial-scale, long-time-  For the purposes of testing our CHyPP method, we also introduce another model formulated by Lorenz, which we will refer to as Lorenz Model II [25]. This model is equivalent to Lorenz Model III with no distinct small scale wave component or smoothing present in the equation,   For the following results in this section, we use the parameters in Table II. In addition, the value of s used in each of the reservoir computing-based prediction methods has been selected for each method to maximize the valid prediction time.  Figure 11. We plot the predictions of the normalized true dynamics from Lorenz Model III (displayed in Fig. 10) where we have measured every spatial grid point of Model III during training. We plot the spatial grid point along the vertical axis and the model time on the horizontal axis. The plots on the left show the predictions made using each of the specified methods, while the plots on the right show the difference between the true dynamics and the prediction (i.e., the prediction error). The parallel reservoir-only prediction used its optimized value of s = 0.1, while CHyPP used its optimized value of s = 0.085.
caption. As seen in Fig. 10, there is a wave-like motion with a dominant wavenumber of ≈ 7 (7 oscillations along the vertical periodicity length). This corresponds to Lorenz's design of the model to mimic atmospheric dynamics, which has a predominant wavenumber for Rossby waves of this order as one goes around a mid-latitude circle. Figures 11-14 show predictions of the true Lorenz Model III dynamics (in Fig. 10) for grid resolutions of varying coarseness.
In particular, while the truth (Fig. 10) Figure 12. We plot the predictions of the normalized true dynamics from Lorenz Model III (displayed in Fig. 10), where, corresponding to the Model II grid, we have measured only every second spatial grid point of Model III during training. We plot the spatial grid point along the vertical axis and the model time on the horizontal axis. The plots on the left show the predictions made using each of the specified methods, while the plots on the right show the difference between the true dynamics and the prediction (i.e., the prediction error). The parallel reservoir-only prediction used its optimized value of s = 0.25, while CHyPP used its optimized value of s = 0.08.
We find that CHyPP significantly outperforms each of its component methods, and that this is true for all of the grid resolutions we tested. We note that, unlike in the Model II and parallel reservoir-only predictions, the prediction error in CHyPP seems to appear locally in a small region and manifests as slanted streaks in the error plots in Figs Figure 15 shows  Figure 14. We plot the predictions of the normalized true dynamics from Lorenz Model III (displayed in Fig. 10), where, corresponding to the Model II grid, we have measured only every eighth spatial grid point of Model III during training. We plot the spatial grid point along the vertical axis and the model time on the horizontal axis. The plots on the left show the predictions made using each of the specified methods, while the plots on the right show the difference between the true dynamics and the prediction (i.e., the prediction error). The parallel reservoir-only prediction used its optimized value of s = 0.25, while CHyPP used its optimized value of s = 0.0155.

V. CONCLUSION
In this paper, we address the general goal of utilizing machine learning to enable expanded capability in the forecasting of a large, complex, spatiotemporally chaotic systems for which an imperfect knowledge-based model exists. Some typical common sources of imperfection in such a knowledge-based model are unresolved subgrid scale processes and lack of first principles knowledge or computational ability for modeling some necessary aspect or aspects of the physics. The hope is that these "imperfections" can be compensated for by use of computational cost and necessary amount of training data. Note that addressing the first of these issues necessarily lessens the difficulty of dealing with the second issue, since good use of any valid scientific knowledge of the system being forecast can potentially reduce the amount of learning required from the machine learning component.
To address these two issues, we propose a methodology (CHyPP) that combines two previously proposed techniques: (i) a hybrid utilization of an imperfect knowledge-based model with a single machine learning component [6,18], and (ii) a parallel machine learning scheme using many spatially distributed machine learning devices [5]. We note that (ii) applies to large spatial systems with the common attribute of what we have called 'local short-time causal interactions'. Numerical tests of our proposed combination of (i) and (ii) are presented in Secs. III and IV and demonstrate good quality prediction that is scalable with size (Figs. 6 and 7), reduced required length of training data (Fig. 8), and ability to compensate for unresolved subgrid-scale processes (Figs. 10-15).
In this paper, our proof-of-principle problems have been one-dimensional in space with relatively simple mathematical formulations. We note, however, that the CHyPP methodology is readily applicable to higher dimensions and more complicated situations. For example, we are currently in the process of applying CHyPP to global atmospheric weather forecasting for which the atmospheric state is spatially three-dimensional and the system is strongly inhomogeneous, e.g., due to the presence of complex geographic features (including continents, mountains, and oceans), as well as the latitudinal variation of solar heating.
In conclusion, we have shown that data-assisted forecasting via parallel reservoir computing and an imperfect knowledge-based model can significantly improve prediction of a large spatiotemporally chaotic system over existing methods. This method is scalable to even larger systems, requires significantly less training data than previous methods to obtain high quality predictions, is able to effectively utilize a knowledge-based forecast of information propagation between local regions to improve prediction quality, and effectively provides a means of using data to compensate for unresolved subgrid scale processes (playing a role akin to traditional analysis-based closure schemes).