Semi Conditional Variational Auto-Encoder for Flow Reconstruction and Uncertainty Quantification from Limited Observations

We present a new data-driven model to reconstruct nonlinear flow from spatially sparse observations. The model is a version of a conditional variational auto-encoder (CVAE), which allows for probabilistic reconstruction and thus uncertainty quantification of the prediction. We show that in our model, conditioning on the measurements from the complete flow data leads to a CVAE where only the decoder depends on the measurements. For this reason we call the model the Semi-Conditional Variational Auto-encoder (SCVAE). The method, reconstructions and associated uncertainty estimates are illustrated on velocity data from simulations of 2D flow around a cylinder and bottom currents from the Bergen Ocean Model. The reconstruction errors are compared to those of the Gappy Proper Orthogonal Decomposition (GPOD) method.


Introduction
Reconstruction of non-linear dynamic processes based on sparse observations is an important and difficult problem. Traditionally, the problem requires knowledge of the governing equations or processes in order to generalize from the sparse observations to a wider area around, in-between and beyond the measurements. Alternatively, it is possible to learn the underlying processes or equations from the data itself, with so-called data-driven methods. In geophysics and environmental monitoring, measurements are often only available at sparse locations. For instance, within the field of meteorology, atmospheric pressure, temperature and wind are only measured at a limited number of stations. Producing accurate and general weather predictions requires methods that both forecast into the future and reconstruct where no data is available. Within oceanography one faces the same problem: in-situ information about the ocean dynamics is only available at sparse locations such as buoys or sub-sea sensors.
Both weather and ocean currents can be approximated with models that are governed by physical laws, e.g. the Navier-Stokes equations. However, to obtain accurate and reliable reconstructions and forecasts it is of crucial importance to incorporate observations. Reconstruction and inference based on sparse observations is important in many applications in both engineering and physical science [1,2,3,4,5,6]. Bolton et al. [3] used convolutional neural networks to hindcast ocean models, and in [7] K. Yeo reconstructs time series of nonlinear dynamics from sparse observations. Oikonomo et al. [8] proposed a method for filling data gaps in groundwater level observations, and Kong et al. [2] used reconstruction techniques to model the characteristics of cartridge valves.
The above mentioned applications are just some of the many examples of reconstruction of a dynamic process based on limited information. Here we focus on reconstruction of flow. This problem can be formulated as follows. Let w ∈ R d , d ∈ N, represent a state of the flow, for example velocity, pressure, temperature, etc. Here, we will focus on incompressible unsteady flows and w = (u, v) ∈ R 2 where u and v are the horizontal and vertical velocities, respectively. The velocities w are typically obtained from computational fluid dynamic simulations on a meshed spatial domain P at discrete times T = {t 1 , ..., t K } ⊂ R.
Let P = {p_1, ..., p_N} consist of N grid points p_n, n = 1, ..., N. Then the state of the flow w evaluated on P at a time t_i ∈ T can be represented as a vector x^(i) ∈ R^{2N},

x^(i) = (u(p_1, t_i), ..., u(p_N, t_i), v(p_1, t_i), ..., v(p_N, t_i))^T. (1)

The collection of x^(i), i = 1, ..., K, constitutes the data set X. In order to account for incompressibility, we introduce a discrete divergence operator L_div, given by an N × 2N matrix associated with a finite difference scheme, and require

L_div x^(i) ≈ 0. (2)

Further, we assume that the state can be measured only at specific points in P, that is, at Q = {q_1, ..., q_M} ⊂ P, where M is typically much smaller than N. Hence, there is M = {m^(i) ∈ R^{2M} : m^(i) = C x^(i), ∀ x^(i) ∈ X}, where C ∈ R^{2M×2N} is a sampling matrix. More specifically, C is a two-block matrix whose rows select the measured components of u and v. Several approaches have been proposed for the resulting reconstruction problem, e.g. [9,10,11,12,13,14]. In particular, use of proper orthogonal decomposition (POD) [9] techniques has been popular.
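For concreteness, the discrete divergence operator can be assembled as an N × 2N matrix of finite-difference weights. The sketch below is illustrative only: the paper does not specify its scheme, and the function name, the choice of central differences in the interior with one-sided differences at the boundaries, and the row-major grid ordering are all assumptions. The state vector follows Equation (1): first all u values, then all v values.

```python
import numpy as np

def divergence_operator(nx, ny, h=1.0):
    """Assemble a hypothetical discrete divergence L_div as an N x 2N matrix.
    Central differences in the interior, one-sided at the boundaries.
    State ordering as in Eq. (1): x = (u at all grid points, v at all grid points)."""
    N = nx * ny
    def idx(i, j):  # row-major: i along horizontal, j along vertical
        return i * ny + j
    L = np.zeros((N, 2 * N))
    for i in range(nx):
        for j in range(ny):
            r = idx(i, j)
            # du/dx contribution (columns 0 .. N-1)
            if 0 < i < nx - 1:
                L[r, idx(i + 1, j)] += 0.5 / h
                L[r, idx(i - 1, j)] -= 0.5 / h
            elif i == 0:
                L[r, idx(1, j)] += 1.0 / h
                L[r, idx(0, j)] -= 1.0 / h
            else:
                L[r, idx(nx - 1, j)] += 1.0 / h
                L[r, idx(nx - 2, j)] -= 1.0 / h
            # dv/dy contribution (columns N .. 2N-1)
            if 0 < j < ny - 1:
                L[r, N + idx(i, j + 1)] += 0.5 / h
                L[r, N + idx(i, j - 1)] -= 0.5 / h
            elif j == 0:
                L[r, N + idx(i, 1)] += 1.0 / h
                L[r, N + idx(i, 0)] -= 1.0 / h
            else:
                L[r, N + idx(i, ny - 1)] += 1.0 / h
                L[r, N + idx(i, ny - 2)] -= 1.0 / h
    return L
```

Both stencils differentiate linear fields exactly, so a linear divergence-free field such as (u, v) = (y, x) is annihilated by the operator.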
POD [9] is a traditional dimension reduction technique in which a set of basis functions is constructed from a data set. The key idea is that a linear combination of the basis functions can reconstruct the original data within some error margin, efficiently reducing the dimension of the problem. In a modified version of the POD, the Gappy POD (GPOD) [10], the aim is to fill the gaps between sparse measurements. Given a POD basis, one can minimize the L2-error at the measurement locations and find the linear combination of the POD basis that interpolates between the measurements. If the basis is not known, an iterative scheme can be formulated to optimize the basis based on the measurements. The original application of GPOD [10] was the reconstruction of human faces, and it has later been applied to fluid flow reconstruction [4]. We will use the GPOD approach for comparison later in this study.
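A minimal GPOD-style reconstruction can be sketched in a few lines of numpy. This is not the implementation used in the paper; the function name and interface are illustrative, and the divergence regularization discussed later is omitted. The idea is simply: build a POD basis from training snapshots via SVD, then fit the modal coefficients to the sparse measurements by least squares.

```python
import numpy as np

def gpod_reconstruct(X_train, mask, m, r):
    """Minimal Gappy POD sketch (illustrative, not the paper's code).
    X_train: (n_snapshots, n_state) training snapshots.
    mask:    indices of the measured state entries (the rows of C).
    m:       measured values at those indices for one new snapshot.
    r:       number of POD modes, r <= len(mask)."""
    mean = X_train.mean(axis=0)
    # POD basis = leading right singular vectors of the centered snapshots
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    Phi = Vt[:r].T                                   # (n_state, r) POD modes
    # least-squares fit of the modal coefficients to the sparse measurements
    a, *_ = np.linalg.lstsq(Phi[mask], m - mean[mask], rcond=None)
    return mean + Phi @ a                            # reconstructed full state
```

If the flow truly lies in an r-dimensional subspace and the measurements are consistent, the reconstruction is exact; for real data the fit degrades as r approaches the number of measurements, which motivates the regularized variant in Appendix A.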
A similar approach is the technique of Compressed Sensing (CS) [11]. As for the GPOD method, we want to solve a linear system. However, in the CS case this is an under-determined linear system. That is, we need some additional information about the system to be able to solve it; typically this is a condition or constraint related to the smoothness of the solution. The core difference between CS and GPOD is, however, the sparsity constraint: instead of minimizing the L2-norm, we minimize the L1-norm. Minimizing the L1-norm favours sparse solutions, i.e. solutions with a small number of nonzero coefficients.
Another reconstruction approach is Dynamic Mode Decomposition (DMD) [12]. Instead of using principal components in the spatial domain, DMD seeks modes or representations that are associated with a specific frequency in the data, i.e. modes in the temporal domain. Again, the goal is to find a solution to an under-determined linear system and reconstruct from the measurements by minimizing the error at the observed values.
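A compact sketch of exact DMD (again illustrative, not the paper's code): the best-fit linear map between consecutive snapshots is projected onto the leading POD subspace, and its eigendecomposition yields the temporal eigenvalues and spatial modes.

```python
import numpy as np

def dmd_modes(X, r):
    """Minimal exact-DMD sketch. X: (n_state, n_time) snapshot sequence.
    Returns r eigenvalues and modes of the best-fit linear map X_{k+1} ~ A X_k."""
    X1, X2 = X[:, :-1], X[:, 1:]
    U, s, Vt = np.linalg.svd(X1, full_matrices=False)
    U, s, Vt = U[:, :r], s[:r], Vt[:r]
    B = X2 @ Vt.conj().T / s                 # X2 V S^{-1}
    Atilde = U.conj().T @ B                  # r x r projected operator
    eigvals, W = np.linalg.eig(Atilde)
    modes = B @ W                            # exact DMD modes
    return eigvals, modes
```

For data generated by a genuinely linear system, the DMD eigenvalues coincide with those of the underlying map; for nonlinear flows they approximate dominant oscillation and growth/decay rates.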
During the last decade, data-driven methods have become tremendously popular, partly because of the growth and availability of data, but also driven by new technology and improved hardware. Modeling non-linear relationships with linear approximations is one of the fundamental limitations of DMD, CS and GPOD. Recently we have seen the development of methods where an artificial neural network is informed by a physical law, the so-called physics-informed neural networks (PINNs) [14]. In PINNs the reconstruction is informed by a Partial Differential Equation (PDE) (e.g. Navier-Stokes), so the neural network can learn to fill the gaps between measurements in compliance with the equation. This is what Raissi et al. [15] have shown for benchmark examples such as flow around a 2D and 3D cylinder. Although PINNs are showing promising results, we have yet to see applications to complex systems such as atmospheric or oceanographic systems, where other aspects have to be accounted for, e.g. in large-scale ocean circulation models that are driven by forcing such as tides, bathymetry and river influx. That being said, these problems may be resolved through PINNs in the future. Despite the promise of PINNs, they will not be a part of this study, as our approach is without any constraint related to the physical properties of the data.
Other non-linear data-driven approaches for reconstruction of fluid flow are variations of auto-encoders [13,16]. An auto-encoder [17] is a special configuration of an artificial neural network that first encodes the data by gradually decreasing the size of the hidden layers. Through this process, the data is represented in a lower-dimensional space. A second neural network then takes the output of the encoder as input and decodes the representation back to its original shape. These two neural networks together constitute an auto-encoder. Principal Component Analysis (PCA) [18] also represents the data in a different and more compact space. However, PCA reduces the dimension of the data by finding orthogonal basis functions or principal components through singular value decomposition. In fact, it has been shown that with linear activation functions, PCA and auto-encoders produce the same basis functions [19]. Probabilistic versions of the auto-encoder are called Variational Auto-Encoders (VAEs) [20]. CVAEs [21] are conditional probabilistic auto-encoders, that is, the model is conditioned on some additional information such that it is possible to create representations that depend on this information.
Here, we address the aforementioned problem from a probabilistic point of view. Let x : P → R^{2N} and m : Q → R^{2M} be two multivariate random variables associated with the flow on P and on Q, respectively. Then the data sets X and M consist of the realizations of x and m, respectively. Using X and M, we intend to approximate the probability distribution p(x|m). This would not only allow us to predict x^(i) given m^(i), but also to estimate an associated uncertainty. In this paper, we use a variational auto-encoder to approximate p(x|m). The method is a Bayesian Neural Network [22] approximated through variational inference [23,24], which we call the Semi-Conditional Variational Auto-encoder (SCVAE). A detailed description of the SCVAE method for reconstruction and associated uncertainty quantification is given in Section 3.2.
Here we focus on fluid flow, the main driving mechanism behind transport and dilution of tracers in marine waters. The world's oceans are under tremendous stress [25], the UN has declared 2021-2030 the ocean decade, and an ecosystem-based Marine Spatial Planning initiative has been launched by the IOC [26]. Local and regional current conditions determine the transport of tracers in the ocean [27,28]. Examples are accidental releases of radioactive, biological or chemical substances from industrial complexes, e.g. organic waste from fish farms in Norwegian fjords [29], plastic [30], or other contaminants that might have adverse effects on marine ecosystems [31].
Predicting the environmental impact of a release, i.e. concentrations as a function of distance and direction from the source, requires reliable current conditions [32,33]. Subsequently, these transport predictions support the design of marine environmental monitoring programs [34,35,36,37]. The aim here is to model current conditions in a probabilistic manner using SCVAEs. This allows for predicting footprints in a Monte Carlo framework, providing simulated data for training networks used for, e.g., analysing environmental time series [38].
In this study we compare results with the GPOD method [39]. We are aware that there are recent methods (e.g. PINNs and traditional auto-encoders) that may perform better on the specific data sets than the GPOD; however, the GPOD's simplicity, versatility and not least its popularity [40,41,42] make it a natural method for comparison.
The remainder of this manuscript is outlined as follows. Section 2 presents a motivating example for the SCVAE method in comparison with the GPOD method. In Section 3 we review both the VAE and the CVAE and present the SCVAE. Results of experiments on two different data sets are presented in Section 4. Section 5 summarizes and discusses the method, experiments, drawbacks and benefits, and potential extensions and further work.

A Motivating Example
Here we illustrate the performance of the proposed method versus the GPOD method in order to motivate this study. We use simulations of a two-dimensional viscous flow around a cylinder at a Reynolds number of 160, obtained from https://www.csc.kth.se/~weinkauf/notes/cylinder2d.html. The simulations were performed by Weinkauf et al. [43] with the Gerris Flow Solver software [44]. The data set consists of horizontal u and vertical v velocities on a uniform 400 × 50 × 1001 grid over the spatial-temporal domain [−0.5, 7.5] × [−0.5, 0.5] × [15, 23]. In particular, we have 400 points in the horizontal and 50 points in the vertical direction, and 1001 points in time. The cylinder has a diameter of 0.125 and is centered at the origin, see Figure 1. The homogeneous Neumann boundary condition is given at the right boundary (outlet), and homogeneous Dirichlet conditions on the remaining boundaries. At the start of the simulations, t = 0, both velocities were equal to zero. We plot the velocities at time t ≈ 19 (time step 500) in Figure 2.
For simplicity, in the experiment below we extract the data downstream from the cylinder, that is, from grid point 40 to 200 in the horizontal direction, and keep all grid points in the vertical direction. Hence, P contains N = 8000 points, 160 in the horizontal and 50 in the vertical direction. The temporal resolution is kept as before, that is, the number of time steps in T is K = 1001. For validation purposes, the data set was split into train, validation and test data sets. The train and validation data sets were used for optimization of the model parameters. For both the SCVAE and the GPOD, the goal was to minimize the L2 error between the true and the modeled flow state. A restriction of the GPOD is that the number of components r can be at most 2M. To deal with this problem, and to account for the flow incompressibility, we added the regularization term λ‖L_div x^(i)‖, λ > 0, to the objective function, see Appendix A. For the GPOD method, the parameters r and/or λ were optimized on the validation data set in order to obtain the smallest mean error. We give more details about the objective functions for the SCVAE in Section 3.2. For now we mention that there are two versions, one of which uses an additional divergence regularization term similar to the GPOD.
In Figure 3 we plot the mean of the relative L2 error calculated on the test data for both methods, with and without the div-regularization. The results are presented for 3, 4, and 5 measurement locations, that is, M = 3, 4, 5. For each of these three cases, we selected 20 different configurations of M. In particular, we created 20 subgrids Q, each containing 5 randomly sampled spatial grid points. Next we removed one and then two points from each of the 20 subgrids Q, to create new subgrids of 4 and 3 measurements, respectively.

Figure 3: The mean relative error for two reconstruction methods. The orange and blue labels correspond to the SCVAE with (div-on) and without (div-off) additional divergence regularization. The green and red labels correspond to the GPOD method.

As can be seen in Figure 3, both methods perform well for the 5 measurements case. The resulting relative errors have comparable mean and variance. When reducing the number of observations, the SCVAE maintains low errors, while the GPOD error increases. The SCVAE seems to benefit from the additional divergence regularization, in terms of lower error and less variation in the error estimates. The effect is more profound with fewer measurements.
The key benefit of the SCVAE is that its predictions are optimal for the given measurement locations. In contrast, the POD-based approaches, and in particular the GPOD, create a set of basis functions (principal components) from the training data independently of the measurements. While this has an obvious computational advantage, the number of principal components for complex flows can be high and, as a result, many more measurements are needed [39,6,45]. There are a number of algorithms that aim to optimize the measurement locations to achieve the best performance of the POD-based methods, see e.g. [40,39,46]. In practice, however, the locations are often fixed and other approaches are needed. The results in Figure 3 suggest that the SCVAE could be one of these approaches.

Methods
Before we introduce the model used for reconstruction of flows, we give a brief introduction to VAEs and CVAEs. For a detailed introduction, see [47]. VAEs are neural network models that have been used for learning structured representations in a wide variety of applications, e.g., image generation [48], interpolation between sentences [49] and compressed sensing [16].

Preliminary
Let us assume that the data X is generated by a random process that involves an unobserved continuous random variable z. The process consists of two steps: (i) a value z^(i) is sampled from a prior p_θ*(z); and (ii) x^(i) is generated from a conditional distribution p_θ*(x|z). In the case of flow reconstruction, z could be thought of as unknown boundary or initial conditions, tidal and wind forcing, etc. Generally, however, z is just a convenient construct to represent X, rather than a physically explained phenomenon. Therefore it is for convenience assumed that p_θ*(z) and p_θ*(x|z) come from parametric families of distributions p_θ(z) and p_θ(x|z), and that their density functions are differentiable almost everywhere w.r.t. both z and θ. A probabilistic auto-encoder is a neural network that is trained to represent its input X as p_θ(x) via the latent representation z ∼ p_θ(z), that is,

p_θ(x) = ∫ p_θ(x|z) p_θ(z) dz.

As p_θ(z) is unknown and observations z^(i) are not accessible, we must use X in order to generate z ∼ p_θ(z|x). That is, the network can be viewed as consisting of two parts: an encoder p_θ(z|x) and a decoder p_θ(x|z). Typically the true posterior distribution p_θ(z|x) is intractable, but it can be approximated with variational inference [23,24]. That is, we define a so-called recognition model q_φ(z|x) with variational parameters φ, which aims to approximate p_θ(z|x). The recognition model is often parameterized as a Gaussian. Thus, the problem of estimating p_θ(z|x) is reduced to finding the best possible estimate for φ, effectively turning the problem into an optimization problem.
An auto-encoder that uses a recognition model is called a Variational Auto-Encoder (VAE). In order to get good predictions we need to estimate the parameters φ and θ. The marginal likelihood is equal to the sum over the marginal likelihoods of the individual samples, that is,

log p_θ(x^(1), ..., x^(K)) = Σ_{i=1}^{K} log p_θ(x^(i)).

Therefore, we further on present estimates for an individual sample. The Kullback-Leibler divergence between two probability distributions q_φ(z|x^(i)) and p_θ(z|x^(i)), defined as

D_KL(q_φ(z|x^(i)) ‖ p_θ(z|x^(i))) = E_{q_φ(z|x^(i))}[log q_φ(z|x^(i)) − log p_θ(z|x^(i))],

can be interpreted as a measure of distinctiveness between these two distributions [50]. It can be shown, see [47], that

log p_θ(x^(i)) = D_KL(q_φ(z|x^(i)) ‖ p_θ(z|x^(i))) + L(θ, φ; x^(i)),

where

L(θ, φ; x^(i)) = E_{q_φ(z|x^(i))}[log p_θ(x^(i), z) − log q_φ(z|x^(i))].

Since the KL-divergence is non-negative, we have log p_θ(x^(i)) ≥ L(θ, φ; x^(i)), and L(θ, φ; x^(i)) is called the Evidence Lower Bound (ELBO) for the marginal likelihood log p_θ(x^(i)). Thus, instead of maximizing the marginal probability, one can maximize its variational lower bound, to which we also refer as an objective function. It can be further shown that the ELBO can be written as

L(θ, φ; x^(i)) = −D_KL(q_φ(z|x^(i)) ‖ p_θ(z)) + E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z)].

Reformulating the traditional VAE framework as a constrained optimization problem, it is possible to obtain the β-VAE [51] objective function if p_θ(z) = N(0, I),

L_β(θ, φ; x^(i)) = E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z)] − β D_KL(q_φ(z|x^(i)) ‖ N(0, I)),

where β > 0. Here β is a regularisation coefficient that constrains the capacity of the latent representation z.
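Under the common choice of a diagonal-Gaussian recognition model and a standard-normal prior (the setting used later in this paper), the KL term of the ELBO has a closed form. A sketch, with illustrative naming:

```python
import numpy as np

def kl_diag_gauss_std_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ):
    0.5 * sum( sigma^2 + mu^2 - 1 - log(sigma^2) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
```

The divergence vanishes exactly when the recognition model matches the prior (mu = 0, sigma = 1), and grows as the encoder pushes the latent distribution away from it; this is the term the β coefficient scales.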
Conditional Variational Auto-encoders (CVAEs) [21] are similar to VAEs, but differ by conditioning on an additional property of the data (e.g. a label or class), here denoted c. Conditioning both the recognition model and the true posterior on both x^(i) and c results in the CVAE ELBO

L(θ, φ; x^(i), c) = −D_KL(q_φ(z|x^(i), c) ‖ p_θ(z|c)) + E_{q_φ(z|x^(i), c)}[log p_θ(x^(i)|z, c)]. (7)

In the decoding phase, the CVAE allows for conditional probabilistic reconstruction and permits sampling from the conditional distribution p_θ(z|c), which has been useful for generative modeling of data with known labels, see [21]. Here we investigate a special case of the CVAE where c is a partial observation of x. We call this the Semi-Conditional Variational Auto-encoder (SCVAE).

Semi Conditional Variational Auto-encoder
The SCVAE takes the input data X, conditioned on M, and approximates the probability distribution p_θ(x|z, m). We can then generate x^(i) based on the observations m^(i) and the latent representation z. As m^(i) = Cx^(i), where C is a non-stochastic sampling matrix, we have

q_φ(z|x^(i), m^(i)) = q_φ(z|x^(i)).

Therefore, from Equation (7) the ELBO for the SCVAE is

L(θ, φ; x^(i), m^(i)) = −D_KL(q_φ(z|x^(i)) ‖ p_θ(z|m^(i))) + E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z, m^(i))], (8)

where p_θ(z|m^(i)) = N(0, I). Similarly as for the β-VAE [51], we can obtain a relaxed version of Equation (8) and treat it as a constrained optimization problem. That is,

max_{θ,φ} E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z, m^(i))] subject to D_KL(q_φ(z|x^(i)) ‖ p_θ(z|m^(i))) < ε. (9)

Equation (9) can be expressed as a Lagrangian under the Karush-Kuhn-Tucker (KKT) conditions [52,53]. Hence,

L(θ, φ; x^(i), m^(i)) = E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z, m^(i))] − β (D_KL(q_φ(z|x^(i)) ‖ p_θ(z|m^(i))) − ε). (10)

According to the complementary slackness KKT condition, β ≥ 0, and we can rewrite Equation (10) as

L_β(θ, φ; x^(i), m^(i)) = E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z, m^(i))] − β D_KL(q_φ(z|x^(i)) ‖ p_θ(z|m^(i))). (11)

The objective functions in Equation (8) and Equation (11), and later Equation (13), show that if we condition on a feature which is a known function of the original data, such as the measurements, we do not need to account for it in the encoding phase. The measurements are then coupled with the encoded data in the decoder. We sketch the main components of the SCVAE in Figure 4.

Figure 4: A sketch of the model used to estimate p_θ(x|m^(i)). During training both the observations m^(i) and the data x^(i) are used. After the model is trained, we can predict using only the decoder part of the neural network. The input to the decoder is then only the observations and random samples from the latent space.

In order to preserve some physical properties of the data X, we can condition on yet another feature.
Here we utilize the incompressibility property of the fluid, i.e., d^(i) = L_div x^(i) ≈ 0, see Equation (2).
We intend to maximize the log-likelihood under an additional constraint involving d^(i), compared to Equation (9). That is,

max_{θ,φ} E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z, m^(i))] subject to D_KL(q_φ(z|x^(i)) ‖ p_θ(z|m^(i), d^(i))) < ε and −E_{q_φ(z|x^(i))}[log p_θ(d^(i)|z, m^(i))] < δ, (12)

where ε, δ > 0 are small. Equation (12) can be expressed as a Lagrangian under the Karush-Kuhn-Tucker (KKT) conditions as before, and as a consequence of the complementary slackness conditions λ, β ≥ 0 we obtain the objective function

L_λ(θ, φ; x^(i), m^(i), d^(i)) = E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z, m^(i))] + λ E_{q_φ(z|x^(i))}[log p_θ(d^(i)|z, m^(i))] − β D_KL(q_φ(z|x^(i)) ‖ p(z|m^(i), d^(i))), (13)

where p(z|m^(i), d^(i)) = N(0, I). For convenience of notation we refer to the objective function in Equation (11) as the case λ = 0, and to the objective function in Equation (13) as the case λ > 0. Observe that under the Gaussian assumptions on the priors, Equation (13) is equivalent to Equation (11) if λ = 0. Thus, from now on we will refer to it as a special case of Equation (13) and denote it L_0.
Similarly to [20] we assume q_φ(z|x^(i)) = N(µ^(i), (σ^(i))^2 I), that is, φ = {µ, σ}. This allows us to express the KL-divergence terms in closed form and avoid issues related to differentiability of the ELBOs. Under these assumptions, the KL-divergence terms can be integrated analytically, while the terms E_{q_φ(z|x^(i))}[log p_θ(x^(i)|z, m^(i))] and E_{q_φ(z|x^(i))}[log p_θ(d^(i)|z, m^(i))] require estimation by sampling

z^(i,l) = g_φ(ε^(l), x^(i)), ε^(l) ∼ p(ε), l = 1, ..., L.

Here ε^(l) is an auxiliary (noise) variable with independent marginal p(ε), and g_φ(·) is a differentiable transformation of ε, parametrized by φ, see [20] for details. We denote Equation (13) with the approximation above by L̃_λ, λ ≥ 0, that is,

L̃_λ(θ, φ; x^(i), m^(i), d^(i)) = (1/L) Σ_{l=1}^{L} [log p_θ(x^(i)|z^(i,l), m^(i)) + λ log p_θ(d^(i)|z^(i,l), m^(i))] − β D_KL(q_φ(z|x^(i)) ‖ p(z|m^(i), d^(i))). (16)

The objective function L̃_λ can be maximized by gradient-based optimization. Since the gradient ∇_{θ,φ} L̃_λ cannot be calculated for large data sets, Stochastic Gradient Descent methods, see [54,55], are typically used, where

L̃_λ(θ, φ; X) ≈ (K/R) Σ_{r=1}^{R} L̃_λ(θ, φ; x^(i_r), m^(i_r), d^(i_r)).

Here X_R = {x^(i_r)}_{r=1}^{R}, R < K, is a minibatch consisting of randomly sampled datapoints. After the network is optimized, the posterior predictive distribution p_θ(x|m, d) can be approximated with a Monte Carlo estimator.
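In the Gaussian case the transformation g_φ is simply a shift and scale of the noise variable; this is the reparametrization trick of [20], which lets gradients flow through the sampling step. A minimal sketch (names are illustrative):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparametrization trick: z = g_phi(eps, x) = mu + sigma * eps,
    with eps ~ N(0, I), so z ~ N(mu, sigma^2 I) while mu and log_var
    remain differentiable parameters."""
    eps = rng.normal(size=mu.shape)
    return mu + np.exp(0.5 * log_var) * eps
```

Repeated draws reproduce the intended mean and standard deviation, which is what makes the Monte Carlo estimate of the expectation terms unbiased.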

Uncertainty Quantification
Let θ̂ and φ̂ be estimates of the generative and variational parameters, as described in Section 3.2. Then the decoder can be used to predict the posterior as

p_θ̂(x|m*, d*) ≈ (1/N_MC) Σ_{j=1}^{N_MC} p_θ̂(x|z^(j), m*, d*), z^(j) ∼ N(0, I). (17)

While sampling from the latent space has typically been viewed as an approach for generating new samples with similar properties, here we use it to estimate the prediction uncertainty of the trained model. From Equation (17) we are able to estimate the mean prediction x̄* and the empirical covariance matrix Σ̂ using a Monte Carlo estimator. We get

x̄* = (1/N_MC) Σ_{j=1}^{N_MC} x^(j), Σ̂ = (1/(N_MC − 1)) Σ_{j=1}^{N_MC} (x^(j) − x̄*)(x^(j) − x̄*)^T, (18)

where x^(j) ∼ p_θ̂(x|m*, d*). The empirical standard deviation is then simply σ̂ = sqrt(diag(Σ̂)). To estimate the confidence region we assume that the predicted p_θ̂(x|m*, d*) is well approximated by a normal distribution N(µ, Σ). Given that x̄* and Σ̂ are approximations of µ and Σ, obtained from N_MC samples as above, a confidence region estimate for a prediction x* can be given as

(x* − x̄*)^T Σ̂⁺ (x* − x̄*) ≤ χ²_k(p), (19)

where χ²_k(p) is the quantile function for probability p of the chi-squared distribution with k = min{N_MC, 2N} degrees of freedom, and Σ̂⁺ is the pseudoinverse of Σ̂. Using the singular value decomposition Σ̂ = U S U^T, the corresponding interval for (x̄*)_n, n = 1, ..., 2N, is

(x̄*)_n ± sqrt(χ²_k(p) u_n^T S u_n), (20)

where u_n^T is the nth row of the matrix U.
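The Monte Carlo estimates above can be sketched as follows (illustrative; note that u_n^T S u_n in the interval formula equals the nth diagonal entry of the empirical covariance, which is what the sketch uses):

```python
import numpy as np
from scipy.stats import chi2

def mc_uncertainty(samples, p=0.95):
    """Mean prediction, empirical covariance and per-component confidence
    intervals from Monte Carlo decoder samples.
    samples: (N_MC, dim) array of draws x^(j) from the predictive distribution."""
    n_mc, dim = samples.shape
    x_mean = samples.mean(axis=0)                      # Eq. (18), mean
    Sigma = np.cov(samples, rowvar=False)              # Eq. (18), covariance
    k = min(n_mc, dim)                                 # degrees of freedom
    half = np.sqrt(chi2.ppf(p, df=k) * np.diag(Sigma)) # interval half-widths
    return x_mean, Sigma, x_mean - half, x_mean + half
```

For a well-trained decoder the interval width directly reflects the spread of the sampled reconstructions, i.e. the model's predictive uncertainty at each grid point.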

Experiments
We will present the SCVAE method on two different data sets. The first one is the 2D flow around a cylinder described in Section 2, and the second is ocean currents at the seafloor generated by the Bergen Ocean Model [56]. The data X consists of the two-dimensional velocities w = (u, v). To illustrate the results we will plot the u and v components of x^(i) ∈ X, see Equation (1). For validation of the models, the data X is split into train, test and validation subsets, which are subscripted accordingly where necessary. The data sets, splitting and preprocessing for each case are described in Sections 4.1 and 4.2.
We use a deliberately simple architecture to explore the SCVAE. The main ingredient of the encoder is a convolutional neural network (CNN) [57,58], and for the decoder we use transposed CNN layers [59]. The SCVAE has a slightly different architecture in each case, which we present in Appendix C.
The SCVAE is trained to maximize the objective function in Equation (16) with the backpropagation algorithm [60] and the Adam optimizer [61]. We use an adaptive approach for weighing the reconstruction term against the KL-divergence and/or divergence terms [62], that is, for finding the regularization parameters β and λ. Specifically, we calculate the proportion each term contributes to the total value of the objective function, and scale the terms accordingly. This approach prevents posterior collapse, which occurs if the KL-divergence term becomes too close to zero, resulting in a non-probabilistic reconstruction. Weighing the terms proportionally iteratively adjusts the weight of the KL-divergence term, β, such that posterior collapse is avoided. For the results shown here, we trained the SCVAEs with an early stopping criterion of 50 epochs, i.e., the optimization is stopped if we do not see any improvement after 50 epochs, and the best model is returned. We use a two-dimensional Gaussian distribution for p_θ(z|m^(i), d^(i)) in all the experiments.
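One possible reading of the proportional weighting scheme is sketched below. This is our illustrative guess at the rule, not the authors' exact implementation: each regularization term is rescaled so that its magnitude matches that of the reconstruction term, which keeps the KL term from shrinking to irrelevance.

```python
def proportional_weights(recon_term, kl_term, div_term=0.0, eps=1e-8):
    """Hypothetical proportional weighting of the objective terms:
    choose beta (and lambda) so each regularizer's weighted magnitude
    matches the reconstruction term's magnitude."""
    beta = abs(recon_term) / (abs(kl_term) + eps)
    lam = abs(recon_term) / (abs(div_term) + eps) if div_term else 0.0
    return beta, lam
```

In training, such weights would be recomputed every few epochs from the current term values, so the balance adapts as the reconstruction loss decreases.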
Let the test data X_test consist of n instances x^(i), i = 1, ..., n, and let x̂^(i) denote the prediction of the true x^(i) given m^(i). In the case of the SCVAE, x̂^(i) is the mean prediction obtained as in Equation (18). For the GPOD method, x̂^(i) is the deterministic output of the optimization problem, see Appendix A. In order to compare the SCVAE results with the results of the GPOD method, we introduce the following measures: the mean of the relative error for the prediction,

E = (1/n) Σ_{i=1}^{n} ‖x^(i) − x̂^(i)‖₂ / ‖x^(i)‖₂, (21)

and the mean of the absolute error for the divergence,

E_div = (1/n) Σ_{i=1}^{n} ‖L_div x̂^(i)‖₁. (22)
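Both measures are straightforward to compute from the test snapshots; a sketch under the assumption that the divergence measure averages the absolute discrete divergence of the predictions:

```python
import numpy as np

def mean_relative_error(X_true, X_pred):
    """Mean relative L2 error over test snapshots (rows)."""
    num = np.linalg.norm(X_true - X_pred, axis=1)
    den = np.linalg.norm(X_true, axis=1)
    return np.mean(num / den)

def mean_divergence_error(X_pred, L_div):
    """Mean absolute discrete divergence of the predicted states
    (sketch: averages |L_div x| over all snapshots and grid points)."""
    return np.mean(np.abs(X_pred @ L_div.T))
```

A prediction that merely rescales the true field by a factor 1 + e yields a relative error of exactly |e|, which makes the measure easy to sanity-check.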

2D Flow Around a Cylinder
Here we return to the example in Section 2. Below we give some additional details of the data preprocessing and model implementation.

Preprocessing
The data is reduced as described in Section 2. We assess the SCVAE with a sequential split into train, validation and test sets, defined as follows: the last 15% of the data is used for testing, the last 30% of the remaining data is used for validation, and the first 70% for training. To improve the conditioning of the optimization problem we scale the data as described in Appendix B. The errors E (Equation (21)) and E_div (Equation (22)) are calculated after scaling the data back. The input x^(i) to the SCVAE was shaped as (160 × 50 × 2) in order to apply convolutional layers. Here we use 5, 4, 3 and 2 fixed spatial measurements, that is, four different subgrids Q, see Equation (23). The flow state at these specific locations constitutes M.
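The sequential split described above can be sketched as follows (function name illustrative); applied to the K = 1001 snapshots it yields 596 training, 255 validation and 150 test instances, with the temporal ordering preserved:

```python
import numpy as np

def sequential_split(X, test_frac=0.15, val_frac=0.30):
    """Sequential train/validation/test split: last 15% -> test,
    last 30% of the remainder -> validation, the rest -> training."""
    n = len(X)
    n_test = int(round(n * test_frac))
    rest = n - n_test
    n_val = int(round(rest * val_frac))
    return X[: rest - n_val], X[rest - n_val : rest], X[rest:]
```

Splitting sequentially rather than randomly avoids leaking near-duplicate (temporally adjacent) snapshots between the training and test sets.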

Model
A schematic description of the model is given in Appendix C. The first layer of the encoder is a zero-padding layer that expands the horizontal and vertical dimensions by adding zeros at the boundaries. Here we add zero-padding of four in the horizontal and three in the vertical direction. The subsequent layers are two convolutional layers, where the first and second layer have 160 and 200 filters, respectively. We use a kernel size and stride of 2 in both convolutional layers and ReLU activation functions. This design compresses the data into a (42 × 14 × 200) shape. The compressed representation from the convolutional layers is flattened and further compressed into a 64-dimensional vector through a traditional dense layer. Two output layers are defined to represent the mean and log-variance of the latent representation z. The reparametrization trick is realized in a third layer, a so-called lambda layer, which takes the mean and log-variance as input and generates z. The outputs of the encoder are the samples z^(i) and the mean and log-variance of z.
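The spatial shapes quoted above can be verified by tracing the encoder layer by layer; a small sketch under the assumption that each stride-2 convolution exactly halves the padded spatial dimensions:

```python
def encoder_shapes(h=160, w=50, pad_h=4, pad_w=3, n_conv=2, stride=2):
    """Trace the spatial shapes through the encoder described above:
    zero-padding on both sides, then stride-2 convolutions that each
    halve the spatial dimensions."""
    shapes = [(h, w)]
    h, w = h + 2 * pad_h, w + 2 * pad_w   # zero-padding layer
    shapes.append((h, w))
    for _ in range(n_conv):               # stride-2 convolutions
        h, w = h // stride, w // stride
        shapes.append((h, w))
    return shapes
```

This also explains the purpose of the padding: (160, 50) is padded to (168, 56) precisely so that two halvings give the integer shape (42, 14).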
The decoder takes the latent representation z^(i) and the measurements m^(i) as input. The input m^(i) is flattened and then concatenated with z^(i). The next layer is a dense layer with shape (42 × 14 × 200). Afterwards there are two transposed convolutional layers with 200 and 160 filters, respectively. The stride and kernel size are the same as for the encoder. The final layer is a transposed convolutional layer with the same dimension as the input to the encoder, that is, the dimension of x^(i). A linear activation function is used for this output layer. The last layer of the model is a lambda layer that removes the zero-padding. In the next section we show statistics of the probabilistic reconstruction and compare with the GPOD method.

Results
In Figure 5 we plot the true velocity field, the reconstructed velocities, the standard deviation of the velocities, and the absolute error between the true and reconstructed velocity fields. The measurement locations are shown as stars (black and white). The SCVAE with the objective function L_0, see Equation (7), was used for this prediction. To generate the posterior predictive distribution, Equation (17), we sample 100 realizations of z ∼ N(0, I), which allows for calculating the mean prediction and uncertainty estimates, see Equation (18).
We emphasise here again that the SCVAE with L_0 and with L_λ, λ > 0, are two different models. For notation's sake, we refer to λ = 0 when we mean the model with the objective function in Equation (7), and to λ > 0 for that in Equation (13). The same holds for the GPOD method, see Appendix B. When λ = 0, the number of principal components r is less than 2M. The number r is chosen such that the prediction on the validation data has the smallest possible error on average. If λ > 0, no restrictions on r are imposed. In this case both λ and r are estimated from the validation data.
The general observation is that the SCVAE reconstruction fits the data well, with associated low uncertainty. This can be explained by the periodicity in the data. In particular, the training and validation data sets represent the test data well enough. In Figure 6 we have plotted four time series of the reconstructed test data at two specific grid points, together with the confidence regions constructed as in Equation (20) with p = 0.95. The two upper panels represent the reconstruction at the grid point (6, 31), and the lower at (101, 25), for u and v on the left and right side, respectively. The SCVAE reconstruction is significantly better than the GPOD, and close to the true solution for all time steps.

Figure 7: The difference between the true and predicted estimates for the SCVAE (blue) and for the GPOD (orange). The light blue shaded region represents the difference marginals, obtained from the confidence region in Figure 11. The estimates are based on a model trained with λ = 0 and Q_3 measurement locations. Upper panels: the difference between the true and predicted estimates at grid point (6, 31) for u (left) and v (right). Lower panels: the difference at grid point (101, 25) for u (left) and v (right).

Figure 7 shows the difference between the true values and the model prediction in time for the same two locations. This figure has to be seen in context with Figure 5. In Table 1 we display the relative errors, Equation (21), for the SCVAE and the GPOD method, both with and without divergence regularization, for 5, 4, 3, and 2 measurement locations, given in Equation (23).
The results of the SCVAE depend on two stochastic inputs: (i) randomness in the initialization of the prior weights and (ii) random mini-batch sampling. We trained the model with each measurement configuration 10 times and chose the model that performed best on the validation data set. Ideally we would run test cases where we used all the values as measurements, i.e., M = X, and test how well the model reconstructs in this case. This would give us a lower bound on the reconstruction error that is achievable for this specific architecture and hyperparameter setting. However, this scenario was not possible to test due to memory limitations on the GPU. Therefore we used an M that is as large as possible while still allowing us to run the model. In particular, we used every fifth and every second pixel in the horizontal and vertical directions, respectively, which resulted in a total of 32 × 25 = 800 measurement locations, i.e., M = 800. We believe that training the model with these settings gave a good indication of the lower bound of the reconstruction error. The error observed was of the order of 10^−3.
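The dense measurement grid used for this lower-bound check can be sketched as below. The 160 × 50 grid size is an assumption chosen so that every fifth/second pixel yields the 32 × 25 = 800 locations quoted above; the function name is illustrative, not from the paper's code.

```python
import numpy as np

def dense_measurement_mask(nx, ny, step_x=5, step_y=2):
    """Boolean mask selecting every step_x-th horizontal and
    every step_y-th vertical grid point as a measurement location."""
    mask = np.zeros((nx, ny), dtype=bool)
    mask[::step_x, ::step_y] = True
    return mask

# With an assumed 160 x 50 grid this yields 32 x 25 = 800 locations,
# matching M = 800 in the text.
mask = dense_measurement_mask(160, 50)
print(mask.sum())  # 800
```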
This lower bound has been reached for all measurement configurations Equation (23). However, larger computational cost was needed to reach the lower bound for fewer measurement locations. Figure 8 shows the number of epochs as a boxplot diagram. In comparison with GPOD, the SCVAE error is 10 times lower than the GPOD error, and this difference becomes larger with fewer measurements. Note that adding regularization did not have much effect on the relative error. From the motivating example we observed that regularizing with λ > 0 is better in terms of a more consistent and low variable error estimation. Here we selected from the 10 trained models the one that performed best on the validation data set. This model selection approach shows that there are no significant differences between the two regularization techniques. The associated error in the divergence of the velocity fields is reported in Table 2.   (21)) for the SCVAE prediction and the GPOD prediction with or without div-regularization, and different number of measurements.

Current data from Bergen ocean model
We tested the SCVAE on simulations from the Bergen Ocean Model (BOM) [56]. BOM is a three-dimensional terrain-following nonhydrostatic ocean model with capabilities of resolving mesoscale to large-scale processes. Here we use velocities simulated by Ali et al. [32]. The simulations were conducted on the entire North Sea with a horizontal grid resolution of 800 m and 41 vertical layers, for the period from the 1st to the 15th of January 2012. The model forcing consists of wind, atmospheric pressure, harmonic tides, rivers, and initial fields for salinity and temperature. For details of the model setup, forcing and the simulations we refer to [32].
Here, the horizontal and vertical velocities of a 25.6 × 25.6 km² excerpt of the bottom layer, centered at the Sleipner CO2 injection site (58.36° N, 1.91° E), are used as the data set for reconstruction. In Figure 9 we have plotted the mean and extreme values of u and v for each time t in T.

Preprocessing
We extract a 32 × 32 central grid from the bottom layer velocity data. Hence, P contains N = 1024 points, 32 in the horizontal and 32 in the vertical direction. The original data consist of 105000 time steps with 1 minute between each step. We downsample the temporal dimension uniformly such that the number of time steps in T is K = 8500. We train and validate the SCVAE with two different data splits: randomized and sequential in time. For the sequential split we use the last 15% for testing, the last 30% of the remaining data for validation, and the first 70% of the remaining data for training. In Figure 9, the red and blue vertical lines indicate the data split for this case. For the random split, the instances x^(i) are drawn randomly from X with the same percentages. The data were scaled as described in Appendix B. The input x^(i) to the SCVAE was shaped as (32 × 32 × 2) in order to apply convolutional layers. We use 9, 5 and 3 fixed spatial measurement locations. In particular, the subgrid Q is given as

Q_9 = {(6, 6), (6, 17), (6, 27), (17, 17), (17, 27), (17, 6), (27, 6), (27, 17), (27, 27)},
Q_5 = {(6, 6), (17, 17), (27, 27), (6, 27), (27, 6)},
Q_3 = {(6, 27), (17, 17), (27, 6)}. (24)

As before, the values of u and v at these specific locations constitute the measurements m^(i) ∈ M.
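The sequential split and measurement extraction described above can be sketched as follows. Function names are illustrative, not from the paper's code; `split_sequential` applies the 15%/30% fractions from the text.

```python
import numpy as np

def split_sequential(X, test_frac=0.15, val_frac=0.30):
    """Last 15% -> test; last 30% of the remainder -> validation;
    the first 70% of the remainder -> train."""
    K = len(X)
    n_test = int(K * test_frac)
    rest = X[:K - n_test]
    n_val = int(len(rest) * val_frac)
    train = rest[:len(rest) - n_val]
    val = rest[len(rest) - n_val:]
    test = X[K - n_test:]
    return train, val, test

def extract_measurements(x, Q):
    """Pick the two velocity components at the fixed sensor locations Q."""
    rows, cols = zip(*Q)
    return x[list(rows), list(cols), :]   # shape (len(Q), 2)

# Example with the 3-sensor layout Q3 on a single (32, 32, 2) snapshot
Q3 = [(6, 27), (17, 17), (27, 6)]
x = np.random.rand(32, 32, 2)
m = extract_measurements(x, Q3)           # shape (3, 2)
```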

Model
A schematic description of the model is given in Appendices C.3 and C.4. The first two layers of the encoder are convolutional layers with 64 and 128 filters, strides and kernel size of 2, and ReLU activation functions. This compresses the data into a shape of (8 × 8 × 128). The next layers are a flattening layer and a dense layer, where the latter has 16 units and ReLU activation. The subsequent layers define the mean and log-variance of the latent representation z, which are input to a lambda layer realizing the reparametrization trick. The encoder outputs the samples z^(i) and the mean and log-variance of z.
The input to the decoder is the output z^(i) of the encoder and the measurements m^(i). To concatenate the inputs, m^(i) is flattened. After concatenation of z^(i) and m^(i), the next layer is a dense layer reshaped to (8 × 8 × 128) with ReLU activation. This allows the use of transposed convolutional layers to recover the original shape of the data. Hence, the following layers are two transposed convolutional layers with 64 and 128 filters, strides and kernel size of 2, and ReLU activations. The final layer is a transposed convolutional layer with linear activation functions and output of shape (32 × 32 × 2), i.e., the same shape as x^(i).
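The encoder/decoder layout described above can be sketched in Keras roughly as follows. Layer sizes follow the text; the latent dimension of 2, the flattened measurement size for Q_3, the variable names, and the final kernel size are assumptions, so this is a sketch rather than the authors' implementation.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, Model

latent_dim = 2   # assumed; matches the 2-dimensional latent space used later
n_meas = 3 * 2   # Q3: three sensors, two velocity components, flattened

# Encoder: x -> (z_mean, z_log_var, z)
x_in = layers.Input(shape=(32, 32, 2))
h = layers.Conv2D(64, kernel_size=2, strides=2, activation='relu')(x_in)
h = layers.Conv2D(128, kernel_size=2, strides=2, activation='relu')(h)  # (8, 8, 128)
h = layers.Flatten()(h)
h = layers.Dense(16, activation='relu')(h)
z_mean = layers.Dense(latent_dim)(h)
z_log_var = layers.Dense(latent_dim)(h)

def reparameterize(args):
    # Reparametrization trick: z = mu + sigma * eps, eps ~ N(0, I)
    mu, log_var = args
    eps = tf.random.normal(tf.shape(mu))
    return mu + tf.exp(0.5 * log_var) * eps

z = layers.Lambda(reparameterize)([z_mean, z_log_var])
encoder = Model(x_in, [z_mean, z_log_var, z])

# Decoder: (z, flattened measurements m) -> reconstruction
z_in = layers.Input(shape=(latent_dim,))
m_in = layers.Input(shape=(n_meas,))
d = layers.Concatenate()([z_in, m_in])
d = layers.Dense(8 * 8 * 128, activation='relu')(d)
d = layers.Reshape((8, 8, 128))(d)
d = layers.Conv2DTranspose(64, kernel_size=2, strides=2, activation='relu')(d)
d = layers.Conv2DTranspose(128, kernel_size=2, strides=2, activation='relu')(d)
x_out = layers.Conv2DTranspose(2, kernel_size=1, activation='linear')(d)  # (32, 32, 2)
decoder = Model([z_in, m_in], x_out)
```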

Results
We illustrate the obtained posterior predictive distribution in terms of the predictive mean and standard deviation of the prediction at a specific time. The SCVAE is compared with the GPOD method, both with λ > 0 and λ = 0, for the measurement locations given in Equation (24) for the random and sequential split cases. To generate the posterior predictive distributions, Equation (17), we sample 200 realizations from z ∼ N(0, I), which allows for calculation of the mean prediction and uncertainty estimates, see Equation (18). Figure 10 shows the results of the prediction at time step 1185 for both the u and v components and associated uncertainty estimates, for a model trained with λ = 0 and Q_3 measurement locations (see Equation (24)). In Figure 11 we plot the true solution and the predicted mean velocity, Equation (18), with the associated uncertainty, see Equation (20), for two grid points, (26, 6) and (4, 1). One location is approximately 5.1 km from the nearest observation, the other about 16.1 km away. We plot only the first 600 time steps for readability. [Figure: The difference between the true and predicted estimate for the SCVAE (blue) and for the GPOD (orange) based on the λ = 0 model and Q_3 measurement locations. The light blue shaded region represents the difference marginals, obtained from the confidence region in Figure 11. Upper panels: grid point (26, 6) for u (left) and v (right). Lower panels: grid point (4, 1) for u (left) and v (right).]
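The sampling procedure above can be sketched as follows. Here `decode` is a stand-in for the trained SCVAE decoder (in practice it would be `decoder.predict([z, m])`), so only the shapes and the summary statistics are meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)

def decode(z, m):
    # Placeholder for the trained decoder; returns a (32, 32, 2) velocity field.
    return np.tensordot(z, np.ones((2, 32, 32, 2)), axes=1) + m.mean()

def predictive_moments(m, n_samples=200, latent_dim=2):
    """Draw z ~ N(0, I), decode each sample conditioned on the fixed
    measurements m, and return the pointwise mean and standard deviation."""
    zs = rng.standard_normal((n_samples, latent_dim))
    fields = np.stack([decode(z, m) for z in zs])   # (n_samples, 32, 32, 2)
    return fields.mean(axis=0), fields.std(axis=0)

m = np.zeros(6)                  # flattened Q3 measurements (assumed shape)
mean, std = predictive_moments(m)
```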
Integrating over the latent space generates a posterior distribution of the reconstruction, as described in Section 3.2.1. It is also possible to use the latent space to generate new, statistically sound versions of u and v. This is presented in Figure 13, where we sample over the 2-dimensional latent space, z ∼ N(0, I), and the results show how different variations can be created with the SCVAE model given only the sparse measurements. [Figure 13: The left panels show 9 different generated velocity-field snapshots for the u and v components for test sample number 1185. The predictions are generated from the model with λ = 0 and Q_3 measurement locations, by sampling over the latent space and predicting with the decoder given the measurements.]
These sampled velocities could be used for ensemble simulations when estimating uncertainty in a passive tracer transport, see e.g., [37].
The SCVAE results are compared with results of the GPOD method, see Tables 3 and 4. The tables show the errors calculated as in Equation (21) and Equation (22) on the test data set for both the sequential and random splits. For the sequential split, the SCVAE is better for 3 measurement locations, while the GPOD method performs better for 9 and 5 locations. From Figure 9, we observe that the test data set seems to arise from a different process than the training and validation data (especially for v). Thus, the SCVAE generalizes worse than a simpler model such as the GPOD [63]. For the 3-location case, the number of components in the GPOD is not enough to compete with the SCVAE.
With a random split into training, test and validation data, we see that the SCVAE is significantly better than the GPOD. The training data and measurements represent the test data and test measurements better with random splitting. This highlights the importance of large data sets that cover as many outcomes as possible. Demanding that λ > 0 in Equation (16) does not improve the result: the SCVAE models with λ = 0 learn that the reconstructed representations should have low divergence without explicitly demanding it during optimization. However, as discussed in the 2D flow around cylinder experiment, demanding λ > 0 seems to improve the conditioning of the optimization problem and gives more consistent results.

Discussion
We have presented the SCVAE method for efficient data reconstruction based on sparse observations. The derived objective functions for the network optimization show that the encoding is independent of the measurements. This allows for a simpler model structure with fewer model parameters than a CVAE, and results in an optimization procedure that requires fewer computations.
We have shown that the SCVAE is suitable for reconstruction of fluid flow. The method is showcased on two different data sets, velocity data from simulations of 2D flow around a cylinder, and bottom currents from the BOM. The fact that the fluids studied in the experiments are incompressible served as a motivation for adding an extra term to the objective function, see Equation (16) with λ > 0.
Our investigation of additional regularization showed that the mean reconstruction error over all models was lower with λ > 0 than with λ = 0, but the best reconstruction error was similar for λ = 0 and λ > 0. In Section 4 we optimized 10 models for every experiment and chose the model that performed best on the validation data sets. With this approach we did not observe significant differences between optimizing with λ = 0 and λ > 0. However, the reconstruction became less sensitive to the stochasticity involved in optimization (mini-batch selection, network weight priors) when the regularization was used, see Section 2.
The SCVAE is a probabilistic model, which allows us to make predictions, estimate their uncertainty (see Section 3.2.1), and draw multiple samples from the predictive distribution (see Figure 13). The last two properties make the SCVAE especially useful when the predictions are used in another application, e.g., ensemble simulation of tracer transport. Motivated by [46], we compared the SCVAE predictions with the predictions of a modified GPOD method, see Appendix A.
Unlike the GPOD method, the SCVAE scales well to larger data sets. Another aspect, as the experiments in Section 4 suggest, is that the GPOD seems more sensitive to the number of measurement locations than the SCVAE. On the other hand, the experiments suggest that the GPOD is better than the SCVAE with a larger number of measurement locations if the training data and the test data are too different, see the BOM experiment with sequential splitting, Section 4.2.3. Essentially, the SCVAE overfits to the training data and as a result performs poorly on the test data set. This shows the importance of training the SCVAE on large data sets that cover as many potential flow patterns as possible. Further, the results show that the GPOD is more sensitive to the choice of measurement locations than the SCVAE, see Section 2, and the GPOD method is not expected to perform well on a complex flow with very few fixed measurement locations.
VAEs have been used for generating data in, e.g., computer vision [20], and auto-encoders are natural to use in reconstruction tasks [13]. Many reconstruction approaches, including the GPOD approach, first create a basis, then use the basis and minimize the error at the observations [39,64]. This makes the GPOD suitable for fast optimization of measurement locations that minimize the reconstruction error. The SCVAE, on the other hand, optimizes the basis functions given the measurements, i.e., assuming they are known and fixed. This makes it challenging to use the framework for optimizing the sensor layout. But if the measurement locations are fixed and large amounts of training data are available, the SCVAE outperforms the GPOD for reconstruction. The SCVAE optimizes the latent representation and the neural network model parameters, both variational and generative, given the measurements. This ensures that the reconstruction is adapted to the specific configuration of measurements.
A limitation of our experiments is that we used only 100 and 200 samples and constructed the confidence region under further simplifying assumptions. The uncertainty estimates could be improved by increasing the sample size and using a better model for the confidence region.
Natural applications for the SCVAE are related to environmental data, where measurements are often sparse. It is, for example, possible to optimize the sensor layout to best detect unintentional discharges in the marine environment by using a simple transport model [37]. Oleynik et al. used deterministic flow fields to transport the contaminant and thus obtain a footprint of the leakage. The SCVAE can be used to improve that method and efficiently generate probabilistic footprints of discharges. This may be important as input to design, environmental risk assessments, and emergency preparedness plans.
We have highlighted the SCVAE through the reconstruction of currents and flow fields; however, the method is not limited to fluid flow problems. For instance, the same principles could be used in computer vision to generate new pictures based on sparse pixel representations, or in time series reconstruction.
A natural extension of the SCVAE is to set it up as a partially hidden Markov model, that is, to predict the current state p_θ(x_t | m_t, x_{t−1}) given the measurements and the reconstruction from the previous time step. This could potentially improve the reconstruction further.

B Scaling of the data
Let T_train contain the times t_{l_i}, i = 1, ..., n, corresponding to the training data. We define u_max = max_{p,t} u(p, t) and u_min = min_{p,t} u(p, t), and similarly v_max = max_{p,t} v(p, t) and v_min = min_{p,t} v(p, t), as the largest and smallest values of u and v on P and T_train. Then the middle points are given as

u_mid = (u_max + u_min)/2,  v_mid = (v_max + v_min)/2,

and the half lengths as

u_len = (u_max − u_min)/2,  v_len = (v_max − v_min)/2.

The whole data set is then scaled as

ũ = (u − u_mid)/u_len,  ṽ = (v − v_mid)/v_len,

and the divergence operator L_div is scaled accordingly. After the optimization is completed, the data are scaled back, i.e., u = u_len ũ + u_mid and v = v_len ṽ + v_mid. The relative errors in Equation (21) are calculated on the scaled data. The divergence error in Equation (22) is unaffected by the scaling.
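A minimal numpy sketch of this scaling, assuming the middle-point/half-length construction above (which maps the training extrema to [−1, 1]); function names are illustrative:

```python
import numpy as np

def fit_scaler(u_train):
    """Middle point and half length of the training-data range."""
    u_max, u_min = u_train.max(), u_train.min()
    mid = 0.5 * (u_max + u_min)
    half = 0.5 * (u_max - u_min)
    return mid, half

def scale(u, mid, half):
    return (u - mid) / half       # training extrema map to -1 and 1

def unscale(u_tilde, mid, half):
    return u_tilde * half + mid   # invert the scaling after optimization

u_train = np.array([-2.0, 0.0, 6.0])
mid, half = fit_scaler(u_train)        # mid = 2.0, half = 4.0
u_scaled = scale(u_train, mid, half)   # [-1.0, -0.5, 1.0]
```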

C Details on the Experiments
We use Keras [65] in the implementation of the SCVAE for all experiments. In this appendix we present details on the architecture of the decoders and encoders for the different experiments. We have optimized the SCVAE models with different numbers of measurements; that is, the shape of the input layer to the decoder (the sensor-input layer) depends on the measurements.
Here we present details on the architecture of the encoders and decoders with the largest number of measurements for the SCVAE models in both experiments. There is one extra dimension in the figures showing the encoders and decoders. This dimension is one here, but the framework is implemented to allow for more dimensions in time.