Separating internal and externally-forced contributions to global temperature variability using a Bayesian stochastic energy balance framework

Earth's temperature variability can be partitioned into internal and externally-forced components. Yet, underlying mechanisms and their relative contributions remain insufficiently understood, especially on decadal to centennial timescales. Important reasons for this are difficulties in isolating internal and externally-forced variability. Here, we provide a physically-motivated emulation of global mean surface temperature (GMST) variability, which allows for the separation of internal and external variations. To this end, we introduce the ``ClimBayes'' software package, which infers climate parameters from a stochastic energy balance model (EBM) with a Bayesian approach. We apply our method to GMST data from temperature observations and 20 last millennium simulations from climate models of intermediate to high complexity. This yields the best estimates of the EBM's forced and forced + internal response, which we refer to as emulated variability. The timescale-dependent variance is obtained from spectral analysis. In particular, we contrast the emulated forced and forced + internal variance on interannual to centennial timescales with that of the GMST target. Our findings show that a stochastic EBM closely approximates the power spectrum and timescale-dependent variance of GMST as simulated by modern climate models. Small deviations at interannual timescales can be attributed to the simplified representation of internal variability and, in particular, the absence of (pseudo-)oscillatory modes in the stochastic EBM. Altogether, we demonstrate the potential of combining Bayesian inference with conceptual climate models to emulate statistics of climate variables across timescales.

Understanding the statistical properties and sources of Earth's surface temperature variations is of great importance in climate science. To this end, we analyze the variability of global mean surface temperature (GMST) with a simple stochastic energy balance model (EBM). With Bayesian methods and spectral analysis, we separate internally-generated and externally-forced contributions to GMST variations on different timescales in stateof-the-art climate model simulations. Our results show that a stochastic EBM can emulate the variability of more complex climate models. The combined use of Bayesian inference and conceptual climate models therefore provides a versatile tool to advance the understanding of internal and forced variability in Earth's dynamical system.

I. INTRODUCTION
Climate variability describes the spatial and temporal variations in the mean and higher order statistics of climate parameters, and is of vital importance for living conditions on Earth 1 . While many sources of natural variability exist, anthropogenic influences clearly dominate the recent trend in global mean surface temperature (GMST). To characterize a) Electronic mail: beatrice-marie.ellerhoff@uni-tuebingen.de variability, it is typically partitioned into internal and external components. Internal variability arises from intrinsic climate system processes such as oceanic and atmospheric circulation. External sources include changes in radiative forcing, for example, from solar irradiance, volcanic eruptions, and greenhouse gases. Despite a general agreement of the total simulated and observed GMST variability over the Common Era (0-2000 CE) 2,3 , uncertainties remain about the mechanisms and magnitude of internal and external variations [4][5][6] , especially on decadal to centennial timescales 3,7 .
Simple mathematical models help understand climate variability [8][9][10] and can be used to emulate climate variables from more complex model simulations. Most general, the time evolution of a forced climate parameter X(t) is described byẊ(t) = A (t, X(t), F(t)) for an arbitrary operator A and external driver F(t). We consider X as the GMST, for which many studies have formulated physically-motivated approximations of A . One pivotal approach is centered around the idea of balancing incoming and outgoing radiation 8,9 , later extended to a stochastic energy balance model (EBM) by Hasselmann 10 . This approach assumes the climate system close to equilibrium, showing a linear and stationary response to perturbations. Then, A can be approximated by a linear stochastic operator 8,9 : Formula (1) describes the GMST anomaly T (t) with respect to the equilibrium state, given Earth's effective heat ca-pacity C, a radiative forcing anomaly F(t) and a term ε(t), representing stochastic dynamics such as weather fluctuations. The response parameter λ :=λ /C is the reciprocal of the characteristic timescale 1/λ . The response to radiative forcing F(t) determines forced temperature variations. The response to the stochastic term ε(t) approximates internal variability.
The stochastic EBM (1) is too simplistic to accurately represent long-term responses and, therefore, has been extended to so-called multibox EBMs [11][12][13][14][15] . The latter are based on multiple ocean layers, referred to as boxes. The layers serve to approximate the vertical heat transfer and the integrated response to forcing over long periods. The EBM (1) laid the basis for attributing anthropogenic warming 16,17 . It was applied and modified to study climate sensitivity 14,[18][19][20] , climate and ice cap stability 21-24 , regional temperatures [25][26][27] , Glacial/Interglacial cycles 22,28 , and future projections 29,30 . Key advantages of EBMs are their computational efficiency and comparatively easy interpretation.
To estimate uncertain parameters of conceptual climate models from data, Bayesian frameworks have become increasingly popular [31][32][33][34][35] . In comparison to other methods for inferring climate parameters, such as maximum likelihood estimation, Bayesian approaches have the advantage of providing full posterior distributions. The methods compute the posterior means and credible intervals (CIs) of uncertain parameters θ conditioned on target data y while including prior knowledge on θ . Central to this framework is applying Bayes theorem p(θ |y) = p(y|θ )p(θ ) p(y) , with likelihood p(y|θ ), prior p(θ ), marginal p(y) and posterior p(θ |y). We denote all probability densities by p and distinguish them by their arguments. Combining Bayesian inference with conceptual climate models typically also yields the posterior of the model's fit to the data. With the ability to quantify fluctuations across timescales, power spectral analysis has improved the understanding of climate variability 3,[36][37][38][39] . Simple climate models have been combined with spectral analysis to explain timescaledependent variability. For example, Fredriksen and Rypdal 15 use a multibox EBM to study temporal scaling of temperature time series. Related works examine future projections 29 and climate sensitivity 40 . Soldatenko and Colman 41 study the sensitivity of the power spectrum on uncertainties in the parameters of a two-box EBM, considering stochastic noise but neglecting deterministic forcing. Yet, the potential of combining stochastic multibox EBMs, Bayesian inference and spectral analysis to study the magnitude of unforced and forced variability across timescales remains untapped.
Here, we examine and separate timescale-dependent internal and externally-forced contributions to the GMST variations. In particular, we analyze GMST variability during the last millennium (850-1850 CE) as simulated by 20 climate models of intermediate to high complexity. To this end, we combine a stochastic two-box EBM (section III A) with Bayesian inference (section III B) and spectral analysis (sec-tion III C). We present the "ClimBayes" software package 42 for Bayesian inference of climate parameters, which fits the stochastic EBM to GMST data. This results in the best estimate of the forced and samples of the forced + internal EBM's temperature response. First, we demonstrate our analysis on the example of historical observations (section IV A) and then apply it to the considered set of last millennium simulations (section IV B). Section IV C contrasts power spectra of the fitted EBM with and without internal variations. Comparing the internal and forced variance on interannual to centennial timescales (section IV D), a stochastic two-box EBM captures most variations of more comprehensive model simulations. We summarize and discuss the potential for physics-informed emulation of GMST data and separation of variance contributions across timescales in sections V and VI.

II. DATA
Our study relies on annual GMST and corresponding radiative forcing time series. We use full-forced last millennium runs from climate models of varying complexity (Tab. I). We analyze 10 simulations with atmosphere-ocean general circulation models (AOGCMs), considered in the Coupled Model Intercomparison Project 5 (CMIP5) 43 . Moreover, we use 10 simulations with Earth system models of intermediate complexity (EMICs) that are part of the IPCC's Fifth Assessment Report (AR5) 44 and described by Eby et al. 45 . The AR5 EMICs represent single simulations, except for CLIMBER2 and LOVECLIM V.1.2, which are ensemble means and denoted by "(mean)" in the following. To compare variability in single ensemble members to that of the ensemble mean, we use the five available ensemble members LOVECLIM V.1.2 (E1-E5).
The transient radiative forcing applied to these simulations follows the Paleoclimate Modelling Intercomparison Project Phase III (PMIP3) protocol 46 . For AR5 EMICs, we take the total estimated radiative forcing provided by Eby et al. 45 . For CMIP5 simulations, we use the radiative forcing from reconstructions of well-mixed greenhouse gases (CO 2 , CH 4 , and N 2 O), volcanic aerosols, total solar irradiance, and land use changes as provided by Schmidt et al. 46 . We neglect orbital forcing, which is assumed to play a negligible role for GMST variability over the last millennium. To remove potential unforced drifts of the simulated background climate 47 , the simulated GMST is linearly detrended prior to analysis. For consistency, detrending is also applied to the corresponding forcing time series. This does not affect our results, as the simulations' forcing input exhibits no trend for the last millennium (850-1850 CE). Both, temperature and forcing time series are considered as anomalies with respect to the starting year.
We use the GMST from HadCRUT5 48 observations (1850-2000 CE) to demonstrate the developed workflow of our Bayesian stochastic energy balance framework. As estimates for radiative forcing during the historical period, we consider the "PEA" land use (Pongratz et al. 49 ), "CEA" volcanic forcing (Crowley et al. 50 ), "SBF" solar irradiance reconstruction (Steinhilber, Beer, and Fröhlich 51 ) patched into Wang, Lean,

III. METHODS
Our analysis combines stochastic multibox EBMs, Bayesian inference and spectral analysis. We introduce the approach implemented in "ClimBayes" 42 for the most generic case of a stochastic EBM with N boxes in section III A. All results are obtained from the special case N = 2.

A. Stochastic two-box energy balance model (EBM)
The stochastic multibox EBM 13,15,29 extends the onedimensional linear operator from equation (1) by multiple vertical layers, approximating the heat exchange between surface and deep ocean layers. In matrix notation, the model reads 15 For N boxes, T (t) is an N-dimensional vector, describing the temperature of each box. By convention, T 1 corresponds to the temperature of the uppermost and T N to the temperature of the lowermost box. Accordingly, C is a diagonal matrix with the effective heat capacity C ii of each layer (i = 1, ..., N). K is a N-dimensional tridiagonal matrix, parameterizing the surface temperature response and vertical heat transfer (Appendix A). The time-dependent radiative forcing F (t) is only applied to the uppermost box, such that F 1 = F(t) and F k = 0 for k = 2, ..., N. The stochastic forcing (t) is likewise implemented with non-zero entry ε 1 (t) = σ W ξ (t). We motivate the white noise process ξ (t) with standard deviation (SD) σ W by the found impact of weather fluctuations 10,14,78 . Integrating equation (3) yields the solution of the surface temperature T 1 (t), given by a sum of the forced response T 1,F (t) and internal variations T 1,I (t) where C 1 = C 11 is the heat capacity of the uppermost box. This assumes no interaction between forced and internal variability on a global scale. The response function is uniquely defined 13,15 by the response parameters λ k and weights w k (k = 1, ..., N) with ∑ N k=1 w k = 1, which depend on the entries of C and K (Appendix A). The internal fluctuations T 1,I (t) in formula (4) represent an Itô-integral over the Wiener process W (s). Therefore, T 1,I (t) can be written as a weighted sum of Ornstein-Uhlenbeck (OU) processes, where the k-th OU process solves the stochastic differential equation dU(t) = −λ k U(t)dt + σ W C 1 dW (t) and receives weight w k . Accordingly, T 1,I (t) is normally distributed with mean zero. Its covariance matrix is determined by σ W , C 1 , w k , and λ k (Appendix B).

Joint emulation of forced and internal variations
To separate the internal and forced contributions to the GMST variance, we introduce the Bayesian inference algorithm implemented in "ClimBayes". We fit the linear stochastic two-box EBM to GMST data (Tab. I), as illustrated in Figure 1. The Bayesian inference algorithm relies on annuallyresolved temperature and forcing time series as input data. Moreover, it requires physics-informed prior information on θ = (λ 1 , λ 2 , w 1 ,C 1 , T 0 , F 0 ). The parameters λ 1 , λ 2 , and w 1 correspond to free parameters of the response function in equation (5). C 1 is the heat capacity of the upper ocean box. T 0 and F 0 are initial free parameters. T 0 allows for small deviations of the EBM solution to the input data in the starting year and is expected to be close to zero. Similarly, the additional parameter F 0 is needed to compensate for an initial forcing anomaly with respect to the equilibrium state.
We infer the posterior distributions of the uncertain parameters θ conditioned on target data y via Bayes theorem (2), using a Markov chain Monte Carlo (MCMC) algorithm. To this end, we assume that the target data can be described by a deterministic model Φ F and stochastic measurement or intrinsic noise Z that is also allowed to depend on θ : Formula (6) yields the likelihood p(y|θ ). Combined with prior information p(θ ), Bayes theorem (2) defines the posterior distribution p(θ |y). In our case, the deterministic model Φ F (θ ) is given by a discretization of the temperature responses T 1,F (t). The noise term Z(θ ) corresponds to the internal fluctuations T 1,I (t).
This approach provides a joint estimate of the internal and externally-forced response, based on the same physicsinformed response function and parameters. The best estimates of θ and the forced response T 1,F (t) are defined as their posterior means E[θ |y] and E[Φ F (θ )|y]. The SD σ I of the internal variability Z(θ ) determines σ W (Appendix B) and is approximated by the residuals' SD, that is, the difference between the target data and the forced response. Samples of the internal variations T 1,I (t) can be drawn from its covariance matrix, using the best estimates of θ (Appendix B). The forced + internal variations T 1,F (t) + T 1,I (t) represent the full response of the stochastic two-box EBM and, thus, provide a model for the target variability. To streamline our discussion, we will refer to the modeled forced and forced + internal response as emulation.

Numerical implementation
The "ClimBayes" package provides the numerical implementation of our approach (Fig. 1) and allows for straightforward adjustments via a configuration file. Most importantly, this includes specifications of the number of boxes, prior distributions, MCMC sampling properties, and fixed parameters. We choose N = 2 boxes (Appendix C) in line with Held et al. 12 and Geoffroy et al. 13 . Our experiments use independent prior distributions. We choose beta distributions with shape parameters α = β = 2 for the marginal priors of λ 1 , λ 2 as well as the initial parameters T 0 and F 0 . This allows for a physics-informed, fixed parameter range, where the mean is preferred over boundary values. The algorithm's convergence is improved compared to uniform priors. The intervals for the response parameters λ 1 : (0.2, 2) yr −1 and λ 2 : (0.005, 0.2) yr −1 are similar to those from Fredriksen and Rypdal 15 . Tailored to our goal to emulate interannual to centennial variability from last millennium simulations, our choice of priors assumes characteristic response times 1/λ k smaller than 200 years. These response times implicitly set a characteristic depth scale for the two ocean boxes of our stochastic EBM.
The prior of C 1 : (4, 11)Wyrm −2 K −1 follows previous findings 13,15,79 . Initial values T 0 : (−0.5, 0.5) K and F 0 : (−2, 2) Wm −2 are centered around zero. The weight w 1 : (0, 1) has an uniform prior. We consider no measurement noise, which is inherently fulfilled for simulated GMST. For observed GMST, we assume measurement errors to be small FIG. 1. Workflow of our Bayesian inference algorithm to emulate forced and internal GMST fluctuations. a The workflow builds on a linear stochastic two-box EBM, here in matrix notation. The solution for the surface temperature T 1 (t) is given by an integral with exponential response function R(t). b Required input data include annually-resolved GMST and time-dependent forcing, as well as physics-informed prior information p(θ ) on uncertain parameters θ . c We infer the parameters of the two-box EBM using a Markov chain Monte Carlo (MCMC) algorithm and Bayes theorem, assuming that the target GMST can be described as a deterministic model Φ F (θ ) and stochastic noise Z(θ ). d The workflow yields posterior distributions p(θ |y) of uncertain parameters conditioned on temperature data y and a physically-motivated emulation of the forced and internal variations. compared to internal fluctuations 48 . We verified that our findings are robust against reasonable variations of the MCMC and prior specifications.
To obtain the best parameter estimates, "ClimBayes" uses a Metropolis Hastings (MH) algorithm from the family of MCMC methods. To this end, we discretize the forward operator and compute annual temperature anomalies relative to the starting point t = 0 by a midpoint rule 29 : Here, the time step t j corresponds to the j-th year, and F(t j ) are forcing anomalies with respect to the starting point t = 0. F 0 represents an initial forcing anomaly. Discretizing T 1,I leads to a normally-distributed weighted sum of AR(1)processes with covariance matrix entries depending on λ k and w k (Appendix B). The likelihood p(y|θ ) is given by the normal distribution of T 1,F (t) + T 1,I (t). This requires the calculation of the covariance matrix for each sample in the Markov chain. For numerical robustness and computational efficiency, however, we approximate the likelihood function using an iterative scheme (Appendix B). The MCMC algorithm uses four chains with 25,000 samples each, from which the first 5,000 are discarded as burn-in. The proposal distribution is initially set to a normal distribution with mean zero and variances (0.2, 2, 1, 10, 1, 2) · 10 −5 for θ = (λ 1 , λ 2 , w 1 ,C 1 , T 0 , F 0 ). After 2,500 samples, the proposals are distributed according to the weighted sum of the initial normal proposal distribution and the empirical covariance matrix of previous samples.
We check the convergence of the Markov chains following two performance measures: First, the Gelman-Rubin diagnostics 80,81 compares the inter-chain and between-chain variances. It is ≤ 1.1 for most models, complying to recommendations 82,83 . Second, we use the Monte Carlo standard error 84 , which constructs an asymptotic confidence interval for the posterior mean 84,85 . In our experiments, the halfwidth of this interval is smaller than 5 % of the prior mean. We found that this criterion guarantees robustness of the parameter estimates when the same run is repeated multiple times or additional samples are added. DCESS ESM v1 is the only outlier, showing a tendency for bimodal posterior densities which lead to less stable estimates with wide CIs. However, we confirmed that convergence can be achieved by fixing the heat capacity to the estimated parameter.

C. Spectral analysis and variance ratios
Given a temperature time series T (t), the power spectral density (PSD) at frequency f corresponds to the Fourier transform of the autocovariance (8) with lag k = t 2 − t 1 and mean µ := E[T (t)]. This assumes X(t) to be weakly stationary, which is reasonably fulfilled after linear detrending the GMST data. Following Ellerhoff and Rehfeld 3 , we use the multitaper method with three windows to compute the PSD and χ 2 −distributed uncertainties. Mean spectra are obtained after interpolation to the lowest resolution and binning into equally spaced log-frequency intervals 86 . The spectra are visualized over periods τ = 1/ f and logarithmically smoothed using a Gaussian kernel of 0.04 decibels.
We compare the PSD for three types of time series: (1) the target temperature data from historical observations or climate model simulations, AR5 EMIC and CMIP5 models, (2) the emulated forced variations T 1,F (t), and (3) the emulated forced + internal temperature variations T 1,F (t) + T 1,I (t) from the stochastic two-box EBM. To compute the forced + internal PSD, we first sample 1000 realizations of the internal response T 1,I (t). We add these to T 1,F (t) and compute the PSD for all samples. Mean spectra and 95% confidence bands are obtained from this ensemble.

A. Example application to historical observations
We demonstrate the application of the Bayesian inference algorithm and our spectral analysis on the example of GMST observations from HadCRUT5 48 . Figure 2a shows the forcing and temperature time series together with the best estimate of the forced response T 1,F (t). The forced response follows the global warming trend and shows cooling periods after volcanic eruptions. Credible intervals (CIs) capture the uncertainties of the forced response, but not those due to internal variability. As a result, observations partly lie outside CIs. Uncertainties of the two-box forced response are largest at the time series' start. The power spectral density (Fig. 2c) of the forced response alone is smaller than that of the target temperature. Conversely, the magnitude of the emulated forced + internal PSD agrees with the target PSD within uncertainties, except for the interannual scale. While the emulated PSD constantly increases from interannual to multidecadal scales, HadCRUT5 shows a modulation with increased power on periods of two to ten years, which is not captured by the emulated response. Figure 2d compares the variance on interannual to multidecadal timescales. The variance ratios are formed by dividing the emulated forced or forced + internal variance by the variance obtained from the HadCRUT5 target. The forced variance is smaller than the target variance on all timescales. Incorporating internal variability reduces this mismatch strongly, yet, is not enough to capture all fluctuations on interannual and multidecadal scales. On decadal scales, the emulated forced + internal variability agrees well with the observations.

B. Parameters estimated from last millennium simulations
We use Bayesian inference to fit the stochastic twobox EBM to GMST simulations from CMIP5 models and AR5 EMICs (section II). Table II shows the best estimates, that is the posterior means, and 95% CIs of θ = (λ 1 , λ 2 , w 1 ,C 1 , T 0 , F 0 ) as well as the SD σ I of the internal variability T 1,I (t). Across all simulations, the short-term response parameter λ 1 varies between 1.91 and 0.31 yr −1 , spanning the full prior range between six months and five years. The longterm response λ 2 varies between 0.19 and 0.01 yr −1 , corresponding to characteristic timescales of approximately five to one hundred years.
Differences between AR5 EMICs and CMIP5 models are most pronounced for λ 1 . CMIP5 simulations exhibit larger CIs and inter-model differences, while AR5 EMICs show similar λ 1 , except for CLIMBER 2 (mean) and LOVECLIM (E1 and mean). The weight w 1 > 0.5 is typically larger than w 2 = 1 − w 1 , emphasizing the relative importance of the fast compared to the slow response. w 1 often is closer to unity for AR5 EMICs than for CMIP5 models. CLIMBER2 (mean) shows exceptionally large λ 1 and small w 1 . The heat capacity C 1 varies from 4.7 to 10.96 Wyrm −1 K −1 . It spans the full prior range for both AR5 EMICs as well as CMIP5 models and shows no strong dependence on other parameters. The initial temperature T 0 is well constrained and close to zero. Inter-model differences are also large for F 0 and linked to varying temperature amplitudes at the beginning of the time series with respect to the mean (Fig. 7). The SD σ I of the internal variability from CMIP5 models lies between 0.09 and 0.15 K and is larger than for AR5 EMICs (0.01 to 0.09 K).
C. Emulation of power spectral density Figure 3 compares the target PSD to the emulated forced and forced + internal PSD. For most simulations, the emulated forced + internal PSD agrees with the target within uncertainties. AR5 EMICs show no major differences between forced and forced + internal PSD above decadal scales, except for the LOVECLIM ensemble members. Hence, the forced response is sufficient to emulate the long-term variability of most AR5 EMIC simulations. On interannual scales, considering internal variability in the emulation compensates for the mismatch between the forced and target PSD. Differ-TABLE II. Posterior means and 95% CIs of uncertain parameters θ for the stochastic two-box EBM fitted to GMST from climate model simulations. The SD σ I of the internal variability equals the residual of the forced response (Appendix B) and is therefore given without CIs. Ensemble means are denoted by "(mean)". We exemplary show the LOVECLIM ensemble member "E1" as there are no major differences across the ensemble. Models with GRA 56 volcanic forcing are marked with an asterisk (*), all others use the CEA reconstruction 50 (Tab. I). The weight w 2 = 1 − w 1 (not shown) is uniquely defined by w 1 . ences between emulated and target PSD are most pronounced for CLIMBER2 (mean), showing an overestimation on multidecadal to centennial scales by the emulation. For CMIP5 simulations, the emulated forced PSD underestimates the target PSD on all timescales. Conversely, the forced + internal variability matches the target well for almost all models. Minor differences are found on interannual scales (Fig. 3(l) and 3(q)-3(u)). Here, the target PSD of MPI-ESM-P and HadCM3 exhibit increased power on periods of two to eight years. CCSM4, FGOALS-s2 and IPSL-CM5A-LR overestimate the PSD on the shortest timescales of approximately two years. The emulated forced + internal PSD for BCC-CSM1-1 deviates from that of the target by showing increased power on interannual and decreased power on multidecadal to centennial scales. Figure 4 shows the mean and spread of variance ratios on interannual to centennial scales for the considered model types. Variance ratios smaller than one indicate less emulated forced or forced+internal than target variance. We add a comparison of the variance from the five LOVECLIM V.1.2 ensemble members (E1-E5) (Fig. 4b) to that of the remaining AR5 EMICs (Fig. 4a) and CMIP5 models (Fig. 4c). Here, AR5 EMICs explicitly include the LOVECLIM ensemble mean, but not its members. For all model types, the relative contribution of internal variability decreases with increasing timescale, as the ratios for forced and forced + internal variance become more similar. The contribution of internal variability is larger in LOVECLIM ensemble members and CMIP5 simulations compared to AR5 EMICs.

D. Separating internal and externally-forced variance
The emulated variance of AR5 EMICs (Figure 4a) is dominated by forced variations and matches the target variance on interannual and decadal scales. On longer timescales, the emulated variance tends to overestimate the target variance. This is mostly due to outliers, which correspond to CLIMBER2 (mean), in line with the overestimated PSD in Figure 3b. Members of the LOVECLIM V.1.2. ensemble (Fig. 4b) exhibit more internal contributions to the variance on all timescales compared to AR5 EMICs. The ensemble's emulated forced + internal variance approximates the target well on the interannual and decadal scale. On larger timescales, we find an overestimation of the target variance.
The mean emulated forced + internal variance of CMIP5 simulations (Fig. 4c) is close to one on all timescales. The relative contribution of internal compared to forced variations is similar to that of LOVECLIM ensemble members on interannual and decadal timescales. However, there is a larger spread of variance ratios. Moreover, we find a small tendency of the emulation to overestimate the interannual and centennial variance of CMIP5 models. BCC-CSM1-1 represents an outlier, with the uppermost variance ratio on the shorter and lowermost on the longer timescales, in line with the spectral analysis (Fig. 3l).

V. DISCUSSION
We demonstrate the emulation of GMST variability as simulated by state-of-the-art climate models using a linear stochastic two-box EBM and Bayesian inference. Our analysis builds on the same, physically-motivated response function for internal and external processes, and allows for consistent separation of internal and externally-forced variability. Esti-  Fig. 1d). Bars indicate the mean variance ratio over the considered simulations for one model type. Circles correspond to individual simulations. Confidence bands for individual ratios (as in Fig. 2) are not shown for better visibility.) mates of the timescale-dependent variance show that the relative contribution of internal variability increases with model complexity and decreases with timescale.
Building on previous studies 29,31 , the strength of our Bayesian framework is that it yields the posterior means and CIs for the uncertain parameters of the stochastic two-box EBM fitted to GMST simulations. Due to our choice of priors and the convergence of the fit, the estimated heat capacity C 1 agrees by construction with previous findings 13,15,79 . Our response parameters λ 1 and λ 2 are consistent with results from Fredriksen and Rypdal 15 , estimated from observational data. However, the estimates differ from those obtained in 4xCO 2 experiments 13,79 . This is because our framework accounts for high-frequency pulses and, thus, estimates response parameters associated with faster dynamics. Furthermore, our findings reveal a dependence of the estimated response parameters on the imprint of intermittent volcanic eruptions on simulated temperatures. This is reflected in consistently high values for w 1 , emphasizing the fast feedback. The inter-model spread of λ 1 in CMIP5 simulations suggests a link to the implemented volcanic forcing: CMIP5 simulations driven by a comparatively weak reconstruction ("GRA") tend to show higher values for λ 1 compared to those driven by "CEA" (Tab. I), which has greater forcing amplitudes. λ 1 and w 1 are particularly high for BCC-CSM1-1, indicating a fast and weak forced response of the fitted EBM (Fig. 7). This is consistent with a weak forced response in BCC-CSM1-1 87 . The parameter estimates for CLIMBER2 (mean) differ from the remaining AR5 EMICs. In particular, λ 1 is poorly constrained, as CIs span the full prior range. We find that the temperature response to volcanic eruptions in CLIMBER2 (mean) is delayed. The estimated λ 1 and w 1 can be reconciled with those of the other AR5 EMICs if the temperature data were shifted by 1 yr. Altogether, the sensitivity of the fit to intermittent volcanic forcing suggests that high-frequency forcing plays a crucial role for simulating temperature variability across scales correctly.
Different methods have been developed to isolate forced and internal variations based on detrending 88 , single-model ensembles 5,[89][90][91] , and deterministic EBMs 92,93 , among others. The application of a Bayesian energy balance framework to the timescale-dependent quantification of forced and internal variance is novel. Our method provides a robust and joint separation of the variations at every step in time in a statistically sound way. Using data from the CESM Large Ensemble Community project 94 , we verify the robustness of our forced response (Fig. 8). The latter agrees well with the ensemble mean, which shows higher variability due to remaining internal variations that have not been averaged out over the 13 members. Hence, our method provides a robust tool to estimate the forced response when large ensembles are not available. Moreover, we find a wide agreement of the emulated forced + internal variance with that from CMIP5 simulations. The fact that the stochastic two-box EBM mimics the temperature variations well is in line with previous findings on a linear relation between external forcing and GMST 3,13,15,95 . Small differences between the emulated forced+internal and target PSD for CMIP5 models on interannual timescales can be attributed to the simplified representation of inter-nal variability as a weighted sum of AR(1) processes by the stochastic EBM. The latter cannot represent (pseudo-)oscillatory climate modes or modulations of internal variability by external forcing 96,97 , which could result in the observed underestimation of the PSD by the emulation on these timescales. For HadCM3 and MPI-ESM ( Fig. 3s and u), deviations on the interannual scale are similar to those in Had-CRUT5 and likely due to the spectral imprint of the El Niño-Southern Oscillation 3,98 . Moreover, our approach assumes that the covariance structure of the internal variations is determined by the estimated feedback parameters. In CMIP5 simulations with "GRA" volcanic forcing, particularly high values for the fast response parameter λ 1 and the weight w 1 lead to internal variations with autocorrelations on short timescales. This can cause an overestimation of the emulated PSD (Fig. 3 l, q, r, and t). Similarly, the overestimation of the target PSD on longer timescales (Fig. 3b) in CLIMBER 2 can be explained by its estimated slow response. The latter is due to the biased delay between forcing and temperature time series (Tab. II). Additionally, mismatches on short timescales can propagate to longer timescales, as in the case of BCC-CSM1-1 (Fig. 3l). Hence, explaining spectral properties of temperature time series not only requires consideration of stochastic noise 41 , but also precise knowledge of its correlation structure and the forced response. This highlights a need for simple, stochastic dynamical models 99 to simulate temperature fluctuations on long timescales.
The components of AR5 EMICs show a reduced number of scales compared to the AOGCMs, which simplifies the complexity of the processes contributing to variability. Therefore, the relatively strong forced variability in EMICs ( Fig. 4a and  b) is not a main deficiency, but serves to explore the longterm coupling between different Earth system components in response to radiative forcing. Compared to other AR5 EMICs, LOVECLIM V.1.2 features a more complex, threelayered atmosphere, which likely explains increased internal variations in the ensemble members. However, this variability is predominately short-term correlated. Due to the EBM's covariance structure, this can lead to an overestimation of the emulated forced+internal variability (Fig. 4b) on longer timescales. On interannual and decadal scales, the variance ratios based on the emulated forced and forced + internal variability from the LOVECLIM ensemble members (Fig. 4b) are similar to CMIP5 simulations (Fig. 4c). The similarity indicates that EMICs with a more realistic representation of atmospheric variability might better capture the relative contribution of forced and internal temperature variations on these timescales.
The contributions of internal variations on multidecadal scales and longer remain the largest in CMIP5 models, likely due to long-term variability mechanisms from the comprehensive ocean dynamics of AOGCMs. Compared to observations (Fig. 2 d), however, internal variability on decadal and multidecadal scales is smaller in CMIP5 simulations (Fig. 4 c). This is particularly interesting given the agreement of observed and simulated total GMST variability on these scales 2,3 . Smaller low-frequency internal variability in climate models than in observations 100 could be off-set by enhanced forced variability in response to volcanic eruptions 101,102 , such that the overall variance is largely conserved. We suspect that an incorrect ratio between internal and external variability could impact the long-term variability of simulated local temperatures. However, uncertainties in the interpretation of our HadCRUT5 findings arise due to the comparatively short time-span of the instrumental record and a possible change of the forced response under global warming 103 . Further investigating the spatial variability structure 104 and the link between local and global variability across climate states could help resolve mismatches between observed and simulated local variability on decadal and multidecadal scales.
One limitation of our study arises from the fact that the developed framework targets interannual to centennial timescales, and is therefore designed for annually-resolved GMST and forcing data. As a result, it cannot be readily applied to much shorter or longer timescales. Investigating the immediate effects of radiative forcing, for example, necessitates an extension to sub-annual resolution, and treatment of the seasonal cycle. An extension to coarser resolutions could be beneficial to study long-term changes such as millennialscale variability. However, such applications require careful examination of the underlying assumptions of the current framework, and an extension and validation of the estimation algorithm. The forced response is likely sensitive to modelspecific rapid adjustments 105 and uncertainties in the forcing. Applying the workflow to paleoclimate reconstructions could therefore be challenging. On the one hand, "ClimBayes" does not yet run at the best possible speed, as there are faster Bayesian algorithms 29 . On the other hand, "ClimBayes" represents an accessible, transparent, and well-documented numerical framework that can be easily adapted and extended, for example by integrating "Rstan" 106 or multilevel delayed acceptance MCMC 107 . Similar to Fredriksen and Rypdal 15 , we use a response function of exponential form, solving the ordinary differential equation (1). Future research could investigate the potential of Bayesian methods to find response functions describing the effects of climate forcing on different observables 108,109 . Furthermore, future studies could test our findings with more advanced climate models including better representation of, for example, land surface processes, atmospheric dynamics and chemistry, and sea ice. The presented framework can be also applied to single forcing experiments 71 for quantifying the contribution of single forcings to the spectrum. This will help better understand the climate system's response and interplay of intrinsic and external components in driving climate variability.

VI. CONCLUSION
We presented a physically-motivated emulation of GMST data using Bayesian inference and a stochastic energy balance model. Analyzing AR5 EMICs and CMIP5 simulations for the last millennium, we found that the power spectral density of the combined forced + internal response approximates the target spectrum well. We show that our emulation can be used to separate internal and forced contributions to GMST variability across timescales. The relative contribution of internal dynamics increases with model complexity and decreases with timescale. While AR5 EMICs predominately exhibit forced variations, simulations from CMIP5 models and the LOVECLIM ensemble members exhibit major contributions from the forced and internal response. This suggests that EMICs with more realistic atmospheric variability can simulate statistical properties of interannual to decadal climate fluctuations more reliably. Our results show that precise knowledge of the forced response and correlation structure of internal variability is necessary to explain variability across scales, needed to assess future variability and potentially associated risks with long-term projections. Our developed framework is robust, readily available and can thus be widely applied to describe, emulate and diagnose observed and simulated temperature variability.

DATA AVAILABILITY STATEMENT
The ClimBayes 42  The tri-diagonal matrix K of the multibox EBM in matrix notation (3) is given by The parameterλ controls the feedback of the surface layer. The coefficients κ 2 , ..., κ N > 0 describe the vertical heat transfer between ocean layers. The full solution to the multibox EBM 15 reads K is symmetric and negative definite, and, thus, diagonalizable. Multiplication with the positive diagonal matrix C −1 does not change this property. Accordingly, the matrix exponential exists for an orthonormal matrix V and eigenvalues −λ k of C −1 K. Since only the first component of the forcing vector F (t) is non-zero, the matrix entry (e C −1 K ) 11 defines the surface temperature response. The response function reads The normalization of the weights results from ∑ N k=1 w k = ∑ N k=1 (V k1 ) 2 = 1 for any orthonormal matrix V . The noise term in equation (4) is a weighted sum of Ornstein-Uhlenbeck (OU) processes. Consequently, its covariance structure results in: In the special case of N = 1, the covariance reduces to the covariance of a single OU process 40 Cov(T 1,I (t), T 1, follows from a generalization of this special case to arbitrary N. Discretizing the noise term in equation (4) results in a weighted sum of AR (1) processes. This sum Z is normally-distributed with mean zero and covariance matrix For given λ k , w k and C 1 , the SD of the stochastic forcing, σ W , uniquely defines the SD of the internal variations σ I := Cov(Z i , Z i ), and vice versa. It is possible to estimate σ I within the Bayesian framework. However, additional uncertainties arise from the fact that the covariance matrix in equation B2 is only an approximation to the true correlation structure of the residuals. As a result, the Bayesian estimation of σ I might not preserve the total variance. Therefore, we determine σ I from the residuals, that is the data minus the estimated forced response.

Iterative computation of the likelihood
Theoretically, the likelihood is given by a normal distribution with mean T 1,F and covariance matrix Cov(Z), depending on θ . However, computing the covariance matrix dynamically for each sample in the Markov chain can lead to difficulties: In particular, Cov(Z) needs to be inverted for every sample, which is computationally expensive. Moreover, the determinant det(Cov(Z)) can be close to zero, which can make numerical calculations unstable. Potential biases include decreasing goodness of fit and accuracy of estimated posteriors.
To solve this problem, we propose an iterative approach. This keeps λ k and w k in the covariance matrix fixed for each iteration of the algorithm. The first iteration uses the prior means for λ k and w k as well as a starting value for the ratio σ W /C 1 . It is not necessary to consider σ W and C 1 separately, since equations (B1) and (B2) depend only on their ratio. This ratio is chosen such that σ I = 0.1K for CMIP5 simulations and LOVECLIM ensemble members, and σ I = 0.05 K for AR5 EMICs. For the second iteration, the estimated posterior means of λ k and w k define the covariance matrix entries. Additionally, σ I is set to the SD of the residuals, which defines σ W /C 1 . The results of this second iteration are the posterior distributions for λ 1 , .., λ N , weights w 2 , ..., w N , heat capacity C 1 , initial forcing F 0 , and initial temperature T 0 . These iterations can be repeated and adjusted with "ClimBayes". We find that two iterations are enough to fit the forced + internal response to the considered data well, and that further iterations do not improve the goodness of fit.

Sampling from internal variability
Sampling internal variations T 1,I (t) requires the values of λ k , w k , and σ W in Cov(Z) (equation B1). λ k and w k are set to the posterior means. We calculate σ W from σ I , which we assume to equal the SD of the residuals. Appendix C: Comparison of one-, two-and three-box EBM We have verified that our results are robust against reasonable variations of the number of boxes. Here, we examine the difference between N ∈ (1, 2, 3) boxes on the example of the HadCRUT5 GMST (Fig. 6). We choose the priors of the response parameters to cover the same overall range as for the two-box model (N = 1: λ 1 ∈ (1/200, 2) yr −1 , and N = 3: λ 1 ∈ (1/5, 2)yr −1 , λ 2 ∈ (1/50, 1/5)yr −1 , λ 3 ∈ (1/200, 1/50)yr −1 ).
The stochastic two-box EBM fits the data more accurately (root mean square error: RMSE = 0.114 K) than the one-box EBM (RMSE = 0.133 K) (Appendix Fig. 6). The three-box EBM yields only minor improvements (RMSE = 0.0111 K). This pattern is consistent for AR5 EMICs and CMIP5 simulations, and reflected in similar forced responses and power spectral densities for N ∈ (1, 2, 3). Adding boxes, however, increases the risk of overfitting due to increasing degrees of freedom. This is reflected in increasing CIs for the forced response and parameters (Tab. III) with more boxes. That is FIG. 6. HadCRUT5 target observation (grey), the forced response from the one-, two-and three-box EBM fit to the data. We show their posterior means and the CIs (shaded) as well as their root mean square errors (RMSE).
why N = 2 represents the best compromise between goodness of fit, identifiability of parameters, and number of free parameters in our experiments.
Appendix D: Emulated forced temperature response for considered simulations Figure 7 shows the best estimate of the EBM's forced response, fitted to the target simulations from all considered models. CIs are much narrower and almost vanishing compared to HadCRUT (Fig. 2a). This is due to the fact that with increasing length of the time series the posterior uncertainties of the parameters and the forced response decrease. Figure 8 compares the emulated forced response against simulation data from the Last Millennium Ensemble of the Community Earth System Model (CESM) 110 . Forming the ensemble mean over the available 13 members serves to average out uncorrelated internal variability. Our emulated forced response, fitted to the first ensemble member (E1), shows a large overlap with the ensemble mean despite remaining internal variability that has not been averaged out.