Embedded model discrepancy: A case study of Zika modeling

Mathematical models of epidemiological systems enable investigation of and predictions about potential disease outbreaks. However, commonly used models are often highly simplified representations of incredibly complex systems. Because of these simplifications, the model output, of say new cases of a disease over time, or when an epidemic will occur, may be inconsistent with available data. In this case, we must improve the model, especially if we plan to make decisions based on it that could affect human health and safety, but direct improvements are often beyond our reach. In this work, we explore this problem through a case study of the Zika outbreak in Brazil in 2016. We propose an embedded discrepancy operator---a modification to the model equations that requires modest information about the system and is calibrated by all relevant data. We show that the new enriched model demonstrates greatly increased consistency with real data. Moreover, the method is general enough to easily apply to many other mathematical models in epidemiology.

Potential epidemics of communicable diseases are a major health concern of the modern world, especially as city density, air and water pollution, and worldwide travel steadily increase. A stark example of this is the global Coronavirus outbreak, already responsible for more than 2000 deaths around the world at time of this article's submission, and more than 13,000 at time of revision, approximately one month later. When faced with a potential outbreak, decision-makers such as health officials and medical professionals rely on mathematical models to aid their decision-making processes. But oftentimes these models are not consistent with the dynamical system they are designed to represent. The discrepancy between the output of a model and the real system is then a serious impediment, as it may decrease our confidence in the model, or even invalidate it entirely, so that it can no longer be used to aid in decision-making. When such a discrepancy is observed, we, as modelers, must either improve the model or somehow account for the discrepancy itself. While a direct model improvement is usually the most desirable solution, how to do so may be infeasible because of computational reasons, time constraints, or lack of domain knowledge. This paper provides a systematic method to instead account for the discrepancy itself, explored via a case study of the Brazilian Zika epidemic of 2016. The method is not a correction of the model output to data, but rather a modification of the model equations themselves by a so-called embedded discrepancy operator. The operator is designed with three critical properties in mind: interpretability, domain-consistency, and robustness. We show that including the embedded discrepancy operator greatly increases the fidelity of the model, so that model output and real data are now in fact consistent.

I. INTRODUCTION
Mathematical models of scientific systems necessarily include simplifications about the actual system they aim to represent. In some cases, these simplifications do not preclude the use of the model to understand, investigate, and make decisions and predictions about the system. The quintessential example of this comes from the domain of classical mechanics: Newtonian mechanics ignores quantum and relativistic effects but, over a wide domain of masses and energies, provides a completely adequate model to describe the motion of macroscopic objects. Outside this domain, however, quantum or relativistic effects are no longer negligible, and Newtonian mechanics fails.
Just as classical mechanics is insufficient to describe a quantum system, the simplifications of a modern mathematical model may yield a discrepancy between the model and the system at hand too great to be ignored. This discrepancy is revealed during model validation, a process by which we check that the mathematical model is a reliable representation of reality. Without accounting for the discrepancy, one cannot trust the model output, much less use it to make predictions or decisions. In this case, there are two immediate options: (1) Improve the model directly, i.e., from first principles or by including additional information; and (2) Represent the model discrepancy itself. While option 1 is usually desirable, it may not be feasible due to computational constraints, or because we in fact lack the knowledge to directly improve the model. Then we are left with option 2-represent the model discrepancy.
A common approach to account for model discrepancy is through a response discrepancy function 1 , also called a bias function. A response discrepancy function corrects model output (or response) to data. Typically, an additive function on the model output is calibrated to data, either point-wise or with a parametric form. An advantage of this approach is that it can be implemented even if the model is a black box, that is, one only needs access to model output, not the model itself. There are also disadvantages. In essence, a response discrep-ancy function builds a better interpolation to a single dataset, over the range of usable data. Thus, this approach provides no basis for extrapolation, to, for example, make a prediction about the probability of an epidemic next year. Furthermore, the action of this bias function is not interpretable, as it lies outside the model equations.
Instead, in this paper, we show how to modify equations directly to account for the model error with an embedded discrepancy operator. The advantages of this approach are threefold: 1. Interpretability: As the embedded operator appears within the model equations, and acts on state variables, the action of this operator is interpretable.
2. (Domain-)Consistency: Information or constraints about the system can be incorporated into the discrepancy operator.
3. Robustness: Discrepancy parameters can, and should, be calibrated over all available data. This can include data from multiple scenarios or initial conditions.
These three properties-interpretability, consistency, and robustness-are designed to allow for decisions or extrapolative predictions. The inclusion of the embedded discrepancy operator into the original, or reduced model, yields an enriched model. In essence, the enriched model takes advantage of both mechanistic and statistical modeling: it retains the reduced mechanistic model, and incorporates a general, statistically calibrated discrepancy model. Of course, this intrusive approach highly depends on the context. Here, we investigate the value of an embedded discrepancy operator in the context of epidemiology modeling. Mathematical modeling of disease spread and outbreaks has a long and rich history; see [2][3][4][5] , to name just a few. One of the most common classes of these models consists of coupled ordinary differential equations (ODEs), whose state variables include populations of the host (here, humans) and the disease carrier, or vector. These populations are further specified as either Susceptible, Exposed, Infected, or Recovered, leading to thus-named SEIR models. 6 These models are relatively simple to implement and understand. Model parameters allow for the specification of transmission rates, incubation times, etc.
In particular, we investigate the model discrepancy of a well-studied SEIR-SEI model of the Zika outbreak in Brazil in 2016 7 . In previous work, after calibration of model parameters, the reduced model captured major tendencies of the outbreak. This was a major improvement compared to the reduced model with parameter values as suggested by current literature, which bore almost no usable resemblance to the real epidemic data. However, the calibrated model was still insufficient to precisely capture the dynamical behavior of the Zika outbreak. The current work extends previous works of embedded model discrepancy, used in the contexts of combustion 8 and ecological models 9 , to the current domain of epidemiology. The discrepancy model is embedded within the coupled model differential equations, and the introduced discrepancy parameters are calibrated to available data. The enriched model is shown to greatly outperform the original model.
To differentiate the current article from 9 , note that that study was primarily a numerical study over a constrained set of scenarios. The interaction matrices (determining reduced and true model coefficients) follow a number of assumptions, such as negative-definiteness, yielding highly well-behaved models. In addition, the actual discrepancy was known exactly between each reduced model and the corresponding data-generating model; some of that information was used to further constrain the discrepancy model parameters. Thus, the previous paper provided no guarantee that this method would work in a highly applied, real-world model scenario without those strong assumptions.
Although the current modeling scheme only describes a single outbreak 10 , and thus is not suitable to describe multiple incidences of the disease, this type of model is useful for guiding decision and policy makers. For example, 11 describes common questions faced by decision makers, such as how many total people will be infected, or even how to slow or prevent the outbreak from occurring. The objective of the present article is to reproduce with reasonable precision the data of a real epidemic. An appropriate enrichment and calibration of the model can then yield useful predictions about the dynamical system.
A final complication of this field is that new disease cases are often not reported, causing the outbreak numbers to appear artificially low. A study from 2018 estimates that as much as 90% of the cases are not reported 12 . However, as we will see shortly, the issue of faulty data is insufficient to account for discrepancy between the original model and observations. At the same time, the issue of under-reported cases does obviously play an important role during the model validation process. We try to disentangle the two problems-observational error and model error-by first considering only model error, and later allowing for both model error and significant under-reporting. We consider how the enriched model performs in different possible under-reporting scenarios, such as 10 or 50% percent.
The rest of the paper is organized as follows. Section II describes the specific model of the 2016 Brazilian Zika outbreak, and reconstructs previous results as a reference. Section III presents the formulation and calibration of the embedded discrepancy operator, and also corresponding numerical results. Section IV explores the issue of under-reporting and how well the enriched model might perform given some sample underreporting scenarios. A brief concluding discussion is given in Section V.

II. ZIKA DISEASE MODELING
As mentioned in § I, a typical approach to model the spread of infectious disease is with a set of coupled ordinary differential equations. Here we consider the well-known SEIR − SEI model, which describes coupled growth rates of species of interest, namely, susceptible, exposed, infected, and recovered humans, as well as susceptible, exposed, and infected vector. In the case of Zika (also, Dengue and Yellow fever) in Brazil, this vector is the Aedes aegypti mosquito.
In this section and the next, we assume that the data represents the actual truth. That is, the modeling objective is to achieve a model consistent with the given data.

A. Model specification
We follow the SEIR-SEI model discussed by Dantas, Tosin, and Cunha 7 , which included species Subscripts h and v indicate human and vector, respectively. The model also includes a state variable C(t), which counts cumulative new cases over time. Then, the eight coupled equations are: where N h represents Brazil's human population and N v represents the vector population. Nominal values of the interaction rates were determined by a careful literature study. These rates are: Extrinsic incubation period: 1 Intrinsic incubation period: 1 Human infectious period: 1 Vector lifespan: 1 Mosquito to human infection time: 1 Human to mosquito infection time: 1 Let us collect these model parameters into the vector θ , and let θ n refer to the nominal values given above.
To fully specify this model, it remains to provide initial con- ditions. These are: 13 Finally, let us call the above model Z and the state vector x, where x is ordered in the same way as equations 1a-1h (x 1 = S h , x 2 = E h , and so on). Then we may refer to the above model as: We may also refer to this model as the original model, or reduced model.

B. Previous results compared to data
The work in 7 presents a detailed approach to calibrate this model to data. The data is made available by the Brazilian Ministry of Health 14 and is available as supplementary material in 7 . Each data point d i , i = 1, . . . , 52, gives the recorded cumulative number of Zika cases at epidemiological week i of the year 2016. In this section, we re-plot the results from that paper to serve as an immediate reference and comparison.
First, Figure 1 compares the model output to data, using the above model with nominal parameters. The model with nominal parameter values, Z (x; θ = θ n ), severely underestimates the outbreak. Note that under-reporting cannot explain the observed discrepancy: higher reporting rates would only increase this discrepancy.
Clearly the reduced model, given θ n parameter values, is a poor representation of reality. After observing such a discrepancy, the authors of 7 performed a sophisticated calibration of the model parameters θ , using the Trust-Region-Reflective (TRR) method 15,16 . Following that method, and using the public data to calibrate, two slightly different results are obtained by imposing a different set of constraints on possible parameters values. The model outputs after both calibrations are shown in Figure 2. Although the model output is now much closer to the data, still, a detectable inconsistency persists. To be specific, note that after about week 30, the difference is tens of thousands of new cases of humans infected by Zika. From a modeling perspective, the salient point is that the model is unable to capture the dynamical behavior of the outbreak, even after calibration. Assuming (as we are for now), that the given data is correct, this suggests that the problem lies with model itself. Indeed, there are several possible sources of model error, that may impact not only this specific Zika model but many other epidemiological models. First, these SEIR models are not built from first principles, but rather from assumptions about interactive behavior, empirical information, and domain scientists' intuition and experience. Second, these models provide a continuous, deterministic description of discrete interactions, which naturally involve some stochasticity 17 . With large enough populations, though, this should not be a problem. Third, diseases do not spread in a closed system of host and vector. Rather, the spread of a disease involves other species such as livestock and non-human primates 18 . Fourth, other modes of transmission are possible, such as sexual interaction 19,20 and blood transfusion 21 . Finally, there could certainly be additional time-or spatially-dependent effects, such as migrations and local dynamics 22,23 , collective behavior and timedelayed synchronization dynamics 24 . Some modelers assume power-law dynamics of the networks 25 , while others use fractional derivatives to describe relevant dynamics 26 . Instead, this model assumes time-independent parameters, and only models populations over time, not space. In summary, the spread of a contagious disease is an incredibly complex problem, and it remains unclear what is critically missing from the model or how to best improve it directly from epidemiological information. Here, then, we can turn to the field of model

III. EMBEDDED DISCREPANCY OPERATOR
Before describing the embedded discrepancy operator, we illustrate the overall relationship between the different models considered in this paper. A schematic diagram is shown in Figure 3. As seen in the previous section, after calibrating the model parameters to data, there is still a significant discrepancy between the model output and the data. That is, the answer to Q2 (and Q1) in Figure 3 is "No," and so we move to state (M3): we model this discrepancy with the goal of reaching states (U1) and ultimately (E3).

A. Proposed approach: Embedded discrepancy operator
Previous work has shown that missing dynamics on the right-hand side (RHS) of differential equations can be approximated with "extra" information about the existing state variables [27][28][29] , such as memory or derivative information. Exploiting this, we pose the following enriched model: where with κ = (κ 1 , . . . , κ 7 ) and λ = (λ 1 , . . . , λ 7 ). That is, the differential equation for x i , i = 1, . . . , 7 in the reduced model is modified by two additional terms, one linear in x i and the other linear in dx i /dt. The discrepancy parameters are collected into the vector φ : Note, the RHS for dC(t)/dt is not modified because this function simply counts the exposed cases as given by the model. A change here would be analogous to modifying model output itself, not interpretable, and not reliable for any type of decision or prediction. As mentioned in Section I, this type of discrepancy model can be constrained to available information about the system. For example, the discrepancy operator for a combustion reaction in 8 is constrained to satisfy conservation of atoms and conservation of energy. In this scenario we do not have such strict constraints; see 9 for constrained operators in similar Lotka-Volterra models.
All together, the enriched model is We set θ = θ n and use the same initial conditions as in equations 3a-3h. The final step to fully specify the enriched model is to calibrate the discrepancy parameters φ ; this step is explained in the following subsection.

B. Calibration details
In contrast with the calibration process of 7 described in § II B, here we use a Bayesian framework 30,31 to calibrate the discrepancy parameters φ given the data d. This allows for the representation of uncertainty about these parameters, and also how this uncertainty propagates to model output.
Recall that the observations are cumulative cases at each epidemiological week, d = {d i }, i = 1, . . . , 52. For each d i , let y i be the corresponding model output. We assume that the measurements are independent and that the measurement error is additive and Gaussian as such: with standard deviation σ ε = 5 × 10 3 . This standard deviation value seems reasonable as the uncertainty in reported values is high, and because the observations are on the order of tens to hundreds of thousands.
In the Bayesian framework, the conditional probability density of φ given the data d, p po (φ |d), is called the posterior and given as: We specify each term on the RHS above: • Prior: The prior density p pr (φ ) collects the prior knowledge we have about the parameters. Specifically, these parameters are assumed independent and uniform in the prior, where each p pr (φ i ) = U (−0.3, 0.15), and so • Likelihood: The likelihood p li (d|φ ) tells us how likely it is to observe d, given a particular value of φ . The measurement error model in Eq. (11) yields the likelihood function where Σ = σ 2 ε I.
• Evidence: The evidence p ev (d) = p li (d|φ )p pr (φ )dφ gives the probability of observing the data d. This is typically difficult to compute, but note that it is not a function of φ . With a Markov chain Monte Carlo (McMC) approach, the posterior is found by computing ratios of the RHS in equation 12 (for different values of φ ), and so fortunately this term cancels.
Under this framework, the discrepancy parameters φ are calibrated using the DRAM method, developed by 32 and implemented through the library QUESO 33 .
The complete code for this project is available here: https://github.com/rebeccaem/zika 34 . Table I presents posterior means and standard deviations for each of the fourteen marginal posterior densities of the discrepancy parameters (as histograms).  Figure 4 shows the enriched model response compared to the data. Uncertainty in discrepancy parameters φ is propagated through to model output: the thick center line shows the median response, the darker band shows the 50% confidence interval, and the lighter band the 95% confidence interval (CI). Importantly, all observations are in fact captured by the 95% CI.
For comparison's sake, Figure 5a presents at once all model responses considered in this paper, and Figure 5b shows the same, but zoomed into weeks 12-30. (For visualization purposes, only the median line is shown for the enriched model.) The enriched model is clearly an improvement.

D. Interpretation
The embedded discrepancy operator can be interpreted from two different points of view. The first is a more mathematical lens, although not disconnected from the physics: we interpret the discrepancy operator as a linear feedback signal. The second, in contrast, relies on an epidemiological basis: here we interpret the corrections made by the discrepancy operator as effects due to causes of biological origin.
This second point of view is especially interesting for the goal of elucidating potential deficiencies in the baseline model. As a wide range of issues must yet be explored to obtain a consistent epidemiological interpretation, this line will not be addressed in this manuscript, but will be the topic of future work. Instead, the first perspective is explored further below.
In light of theory of systems with linear feedback, the discrepancy operator is a linear combination of the system state and its first order time-derivative, thus defining a signal that feeds the original nonlinear system with information from the present state and its rate of change. Roughly speaking, the parameters of the enrichment can be seen as "gains" that adjust to drive the epidemic curve generated by the model towards the real observational curve. These parameters are identified via Bayesian inference, with prior distributions that ad-   To understand more deeply why this competition between corrective signals produces such an effective correction, consider the injection and removal of information (i.e., energy) in the system, as well as its flow between the different coordinates of the state (groups of human and mosquito populations). To make an analogy with the dynamics of mechanical oscillators, the feedback effects proportional to the state derivative produce a kind of "viscous force," which introduces (via positive feedback) or removes (via negative feedback) information into or from the epidemiological state variables. Furthermore, the terms proportional to the system state correspond to a kind of "restoring force," which redistributes information among the different population groups. The intensity of this additional information flow between the different coordinates of the system state is controlled by the new time It should be noted that the approach presented in the paper is not control theory in the literal sense, since no information from the biological system is obtained in real-time, nor is an action signal sent to the real system to adjust its epidemic curve trajectory. Therefore, the observability and controllability issues related to the real system are not taken into account. The notion of duality between parameters and gains is explored above only as a way to pave an initial reasonable interpretation of how the discrepancy operator acts to correct the model's response, by promoting additional information flows between the different compartments of the populations.

IV. EFFECTS OF UNDER-REPORTING
Now let us also consider the scenario that the data is in fact under-reported. First, suppose 10% of cases are not reported, so that d i = .9d * i , where d * i represents the value of observations we would expect without under-reporting. (This is not claiming d * i is the exact true value, as we still expect unbiased measurement error.) The discrepancy parameters are recalibrated, and the corresponding model response is shown in Figure 6. Again, all (modified) observations are captured by the enriched model response.
Finally, we suppose that only 50% of cases are reported, so that d i = 0.5d * i . These results are shown in Figure 7. Even here, the enriched model adapts to this scenario and covers the dynamical behavior of the outbreak in this highly underreported scenario. discrepancy operator greatly improves the consistency between model output and available observations.
The general applicability of this method to other epidemiological models is best understood in two parts. One one hand, the formulation of the enriched model equations is immediately applicable to another model, if that model is also comprised of a set of ODEs. That is, nothing prevents a modeler from testing the proposed model enrichment framework in that case. On the other hand, the particular details of the calibration, and whether or not this approach is in fact able to capture the discrepancy between the original model and the data may depend on domain specific information. Future studies will test this approach in other outbreak data sets.
Many other open questions remain. In Section III D, we discuss two possible interpretations of the embedded discrepancy operator. First, the linear terms added into the differential equations resemble a type of linear feedback, with one term proportional to the state variable, and one a differential "control" term. In this case, the discrepancy operator is not actually controlling the real system itself, but driving the model system to the target (epidemic data). Second, and perhaps more importantly, would be a physiological interpretation of the discrepancy terms. While beyond the scope of this paper, a deep exploration of interpretability, connections to linear feedback theory, and an explanation of these discrepancy terms in a physiological sense will be the subject of immediate future work.
Related to the point above, we would also like to understand what the calibrated discrepancy operator implies about the missing dynamics of the reduced model. That is, can we use the learned discrepancy model to infer what the reduced model is most critically lacking? This question is currently under study, also in the context of ecological models (which have a similar structure, as sets of coupled ODEs). Doing so would allow the use of these embedded operators to function as a type of modeling tool, as opposed to only a model correction.
Finally, this study would be perhaps more convincing with more trust-worthy data. How to achieve this, though, is just as complex a problem as the epidemiological system itself, as it involves accessibility to healthcare in remote regions, public awareness of mandatory reporting policies, and incentives and rewards for timely reporting of a communicable disease.