Causal network reconstruction from time series : From theoretical assumptions to practical estimation

Causal network reconstruction from time series is an emerging topic in many fields of science. Beyond inferring directionality between two time series, the goal of causal network reconstruction or causal discovery is to distinguish direct from indirect dependencies and common drivers among multiple time series. Here, the problem of inferring causal networks including time lags from multivariate time series is recapitulated from the underlying causal assumptions to practical estimation problems. Each aspect is illustrated with simple examples including unobserved variables, sampling issues, determinism, stationarity, nonlinearity, measurement error, and significance testing. The effects of dynamical noise, autocorrelation, and high dimensionality are highlighted in comparison studies of common causal reconstruction methods. Finally, method performance evaluation approaches and criteria are suggested. The article is intended to briefly review and accessibly illustrate the foundations and practical problems of time series-based causal discovery and stimulate further methodological developments.


I. INTRODUCTION
Reconstructing the causal relations behind the phenomena we observe is a fundamental problem in all fields of science.The traditional approach is to conduct active experiments, but in many fields such as Earth system science or neuroscience, manipulations of the complex system under study are either impossible, unethical, or very expensive.On the other hand, modern science generates an ever-growing amount of data from these systems, in particular time series data.Concurrently, novel computing hardware today allows efficient processing of massive amounts of data.These developments have led to emerging interest in the problem of reconstructing causal networks or causal discovery from observational time series.
In the past few decades, a number of original causality concepts have been developed, such as Granger causality (Granger, 1969) or transfer entropy (Schreiber, 2000).Since the 1990s, computer scientists, statisticians, and philosophers have grounded causal reasoning and inference in a robust mathematical framework (Pearl, 2000;Spirtes et al., 2000).
The (quite natural) definition of causality underlying this framework is that X → Y if and only if an intervention or manipulation in X has an effect on Y (Pearl, 2000;Spirtes et al., 2000).This effect may be in changing Y 's mean or any change in its post-interventional distribution denoted P[Y | do(X = x)] which is different from the conditional distribution P(Y | X = x).Unfortunately, all we can measure from observational data are statistical dependencies.These can be visualized in a graphical model (Lauritzen, 1996) or time series graph (Eichler, 2011) that represents the conditional independence relations among the variables and their time lags (Fig. 1).The theory of causal discovery lays out the assumptions under which the underlying causal dependencies can be inferred from observational data.
There are different sets of assumptions that allow us to identify a causal graph.Here, we focus on time-lagged causal discovery in the framework of conditional independence testing using the assumptions of time-order, Causal Sufficiency, the Causal Markov Condition, and Faithfulness, among others, which are all discussed thoroughly in this paper.But some of these assumptions can be replaced.Recent work (Peters et al., 2017) shows ways to use assumptions on the noise structure and dependency types in the framework of structural causal models which can complement the approach studied here and we will include references to recent work from this framework throughout the sections.
The paper is organized as follows: In Sec.II, we relate Granger causality and similar concepts to the conditional independence-framework (Spirtes et al., 2000).Section III provides the necessary definitions and notation and in Sec.IV we recapitulate the assumptions underlying timelagged causal discovery from time series alongside illustrative examples.The practical estimation aspect from introducing 075310-2 J. Runge Chaos 28, 075310 (2018) FIG. 1. Causal network reconstruction.Consider a time series dataset (panel A) from a complex system of which we try to reconstruct the underlying causal dependencies (panel B), accounting for linear and nonlinear dependencies and including their time lags (link labels).Causal discovery aims to unveil spurious associations (gray arrows) which necessarily emerge due to common drivers (e.g., X 1 ← X 2 → X 4 ) or transitive indirect paths (e.g., X 3 → X 2 → X 1 ).Correlation matrices are, therefore, often very dense, while causal networks are typically sparse.(C) The time series graph defined in Definition 1 resolves also the time-dependence structure up to some maximum time lag τ max .A link X i t−τ → X j t (black edge) exists if X i t−τ and Y j t are not independent conditionally on the past of the whole process (gray boxes).some causal discovery algorithms to significance testing is presented in Sec.V, while Sec.VI discusses suggestions for performance evaluation.In Sec.VII, we present some comparison studies of common causal methods and conclude the paper with a brief discussion (Sec.VIII).The paper is accompanied by a python jupyter notebook on https://github.com/jakobrunge/tigramiteto reproduce some of the examples.Granger (1969), based on work by Wiener (1956), was the first to propose a practical, operational definition of causality based on prediction improvement.The underlying idea of measuring whether X Granger causes Y is that there is some unique information in X relevant for Y that is not contained in Y 's past as well as the past of "all the information in the universe" (Granger, 1969).In practice, typically only Y 's past is used (bivariate Granger causality).Measuring prediction improvement can be operationalized in different ways.The most common framework are vector autoregressive models (VAR),

II. FROM GRANGER CAUSALITY TO CONDITIONAL INDEPENDENCE
where X t = (X 1 t , . . ., X N t ), (τ ) is the N × N coefficient matrix at lag τ , τ max some maximum time lag, and η denotes an independent noise term.Here, X i Granger-causes X j if any of the coefficients ji (τ ) at lag τ is non-zero.A non-zero ji (τ ) can then be denoted as a causal link X i t−τ → X j t at lag τ .Another option is to compare the residual variances of the VAR fitted with and without including the variable X i .The use of VARs restricts this notion of causality to a causality in mean (Granger, 1969).A more general definition is that of (bivariate) transfer entropy (Schreiber, 2000;Barnett et al., 2009) where I(X ; Y | Z) denotes the conditional mutual information (CMI).Bivariate TE is a common term, another naming option would be bivariable TE since X and Y could also be multivariate variables.Transfer entropy can also be phrased in a multivariate (or multi-variable) lag-specific version (Runge et al., 2012a).Many current methods are advancements of the concept of transfer entropy (Wibral et al., 2013;Staniek and Lehnertz, 2008;Vejmelka and Palus, 2008), in particular in its multivariate version (Sun and Bollt, 2014;Sun et al., 2015;Runge et al., 2012b;2012a;Runge, 2015).Tests for causality are then based on testing whether a particular CMI is greater than zero.Looking at the definition of CMI, I(X ; Y | Z) = p(x, y, z) log p(x, y|z) p(x|z) • p (y|z) dx dy dz.
(3) TEbiv and its advancements essentially test for conditional independence of X and Y given Z, denoted X ⊥ ⊥ Y |Z since X ⊥ ⊥ Y |Z ⇐⇒ p(x, y|z) = p(x|z)p(y|z) ∀ x, y, z (4) (5) Z then represents Y 's past and other included variables.The lag-specific generalization of the VAR model (1) then is the full conditional independence (FullCI) approach where ).I can be CMI or any other conditional dependence measure.In the case of partial correlation, a non-zero entry ji (τ ) corresponds to a non-zero I FullCI i→j (τ ).A general concept to represent conditional independence relations among multiple variables and their time lags is that of time series graphical models (Eichler, 2011).

A. Definition of time series graphs
Consider a multivariate process X of dimension N. We define the time series graph G = (V × Z, E) of X as follows 075310-3 J. Runge Chaos 28, 075310 (2018) [Fig.1(c)]: The set of nodes in that graph consists of the set of components V at each time t ∈ Z.That is, the graph is actually infinite, but in practice defined up to some maximum time lag τ max .Compared to the general concept of graphical models (Lauritzen, 1996) for data without time-ordering, for time series graphs the time-dependence is explicitly used to define directional links in the set of edges E (Eichler, 2011).For convenience, we treat X, X t , and X − t as sets of random variables here and use the difference symbol "\" for sets.
Variables X i t−τ and X j t are connected by a lag-specific directed link " i.e., if they are not independent conditionally on the past of the whole process denoted by X − t = (X t−1 , X t−2 , . ..), which implies a lag-specific conditional dependence with respect to X.
For i = j, we call links autodependencies.For τ = 0, the same definition can also be used to define undirected contemporaneous links (see Eichler, 2011;Runge, 2015), which would lead to the time series graph being a mixed graph instead of a directed acyclic graph.The arrow of time is a convenient way to disambiguate independence relationships: If we do not have access to the time ordering of the variables (or there is none) and observe as the only conditional independence relation X ⊥ ⊥ Y | Z while all other relations are dependent, then this relation can be generated by any of the three causal motifs Here, we define links with lags t − τ relative to some time point t, but throughout this paper we assume stationarity (discussed in Sec.IV E).Then a link is repeated for every t < t if a link exists at time t.Alternatively, the links can be estimated from different realizations at time t.In practice, however, the links will mostly be estimated from single time series realizations requiring stationarity.
The parents of a node X j t are defined as P(X In the following, A, B, S denote nodes or sets of nodes in the graph and X t−τ , Y t , Z t−τ , U t−τ , Z random variables of the process X, sometimes dropping the subscript.We will denote a general conditional dependence measure as I(X ; Y |Z) which can be CMI or also some other measure such as partial correlation, depending on the context.

B. Separation
When considering the dependency between two variables X , Y given a set of conditions Z as in I(X ; Y |Z), the idea of open and blocked paths or separation between the corresponding nodes in the time series graph G [Fig. 1(c)] is important.A directed path is a sequence of linked nodes containing only motifs → • →.But there are also other paths on which information is shared even though no causal interventions could "travel" along these.In general (Eichler, 2011) Intuitively, if two nodes are separated, no information is shared between the two.For example, in Fig. 1 t is still open.These definitions are important for the relations between the graph and the underlying process.

IV. ASSUMPTIONS OF CAUSAL DISCOVERY FROM OBSERVATIONAL TIME SERIES
Causal information cannot be obtained from associations of measured variables without some assumptions.A variety of different assumptions have been shown to be sufficient to estimate the true causal graph (Spirtes et al., 2000;Peters et al., 2017).Here, we focus on three main assumptions under which the time series graph represents causal relations: Causal Sufficiency, the Causal Markov Condition, and Faithfulness.For time-lagged causal discovery from observational time series, we also need the assumptions of no instantaneous effects and stationarity.Further added to these are dependence type assumptions (e.g., linear or nonlinear) and no measurement error, and we will also assume that the joint distribution of the process has a positive density.All of these are discussed in the following.
For illustrating some of the assumptions in this paper, we will estimate the time series graphs by directly testing Definition 1 via Eq.( 6) [Fig. 1(c)], mostly in the partial correlation version.

A. Causal sufficiency
As Granger (1969) already notes, "[t]he one completely unreal aspect of the above definitions is the use of the series U − t representing all available information [in the universe]."This definition makes sure that the measured variables include all of the common causes of X and Y .However, we do not always need the whole universe.If we have available only a limited number of measured variables, we only need to assume that there exist no other unobserved (or latent) variables that directly or indirectly influence any other pair of our set of variables which is the assumption of Causal Sufficiency (Spirtes et al., 2000).
A set W ⊂ V × Z of variables is causally sufficient for a process X if and only if in the process every common cause of any two or more variables in W is in W or has the same value for all units in the population.

Example 1. (Unobserved variables)
What happens if such an unobserved (latent) confounder exists?Consider the example depicted in Fig. 2 where we assume no autodependencies.Here, U is an unobserved variable and drives both X and Z.With the time lags considered, this common driver leads to an association between X FIG. 3. Sub-sampled time series.If the time series of the true underlying process shown in the top panels (time series graph on the left, aggregated process graph on the right) is sampled at t = 2, the time series graph of the sub-sampled process (bottom left panel, note that t here refers to the subsampled time) here even has a reversed causal loop.and Z.The estimated time series graph [via Eq. ( 6)] among the observed variables (X , Y , Z) will then contain a link X t−1 → Z t even though it is a spurious association and any manipulation of X would not have any effect on Z. Here, U acts as a direct common driver leading to an induced association.But the estimated time series graph additionally contains a link Y t−2 → Z t even though there is no direct confounder between Y and Z.The reason is that Definition 2).The Fast Causal Discovery (FCI) algorithm (Spirtes et al., 2000;Zhang, 2008) does not assume causal sufficiency and allows us to partially identify which links are spurious due to unobserved confounders and also for which links confoundedness cannot be determined.The underlying idea is that if conditional independence holds between two variables for any subset (including the empty set) of W , then these variables are not linked.In the example above, this idea can be used to remove the link Y t−2 → Z t since Y t−2 ⊥ ⊥ Z t , i.e., Y t−2 and Z t are unconditionally independent.Latent causal discovery is further addressed, for example, in Entner and Hoyer (2010), Eichler (2013), Ramb et al. (2013), Smirnov (2013), Hyttinen et al. (2014), andGeiger et al. (2015).

Example 2. (Sub-sampled time series)
Causal sufficiency can also be violated if all variables are observed, but they are sampled at too coarse time intervals relative to the causal links.Consider a process with time series graph depicted in the top panel of Fig. 3 featuring a causal loop X → Y → Z → X with all causal links at lag τ = 1.If we sub-sample the time series with an original resolution of t = 1 at t = 2, we would estimate the causal graph from ( X , Ỹ , Z) as shown in the bottom panel of Fig. 3 that has a completely reversed causal loop.Looking at the top panel time series graph again, this spurious reversal can be understood: For example, in the path Z t−2 → X t−1 → Y t the node X t−1 is not sampled and, thus, unobserved, leading to a spurious link Zt−1 → Ỹt in the sub-sampled time series graph in the bottom panel (note that t is measured with twice the sampling rate then).Subsampled time series are an active area of research (Smirnov, 2013;Barnett and Seth, 2015;Spirtes and Zhang, 2016), to some extent sub-sampled time series graphs can be identified as addressed in Gong et al. (2015) and Hyttinen et al. (2016).

B. Causal Markov condition
All independence-based causal discovery methods necessitate the Causal Markov Condition (Spirtes et al., 2000) which constitutes a close relationship between the process X and its graph  (12).Since the present is not independent of the past given its parents, there are many spurious links if the graph is estimated with FullCI [Eq.( 6)].
This includes its contraposition from dependence follows connectedness.
Note that if Causal Sufficiency is not fulfilled, then also generally the Markov condition will not hold (Spirtes et al., 2000).Intuitively, the Causal Markov Condition implies that once we know the values of Y t 's parents, all other variables in the past (t − τ for τ > 0) become irrelevant for predicting Y t .Of course, Y 's descendants at future time points can also "predict" Y t .

Example 3. (Non-markovian processes)
A typical example of a non-markovian process is an autoregressive process driven by 1/f noise where the power spectrum is inversely proportional to the frequency of the signal.Consider a process generated by where η i t for i = X , Y , Z is 1/f noise.Such noise terms are not independent in time anymore, even though the noise terms between each individual variable are still independent, i.e., η Here, P Z t = {Y t−2 }, but still we observe many more links in the time series graph (estimated with FullCI in Fig. 4), both within one variable and between variables.This means that separation in the graph does not imply independence in the process, and the Markov condition is violated.

Example 4. (Time aggregation)
Another example where noise terms become dependent is time-aggregation.Consider the causal chain of processes X t−2 → Y t−1 → Z t shown in the top panel of Fig. 5. Time aggregation of a time series realization with t = 2 is here done by constructing the new time series Xt = 1 2 (X t + X t−1 )∀t and correspondingly for Y , Z. Now, we observe additional contemporaneous links next to the directed links Xt−1 → Ỹt and Ỹt−1 → Zt in the time series graph of the aggregated process due to the too coarse time resolution.But, furthermore, we also get spurious directed links, for example between X and Z.In general, the causal structure of an aggregated process may be very different from the original process.
Time aggregation is an important issue in many applied fields.For example, in climate research time series are frequently measured daily and then aggregated to a monthly scale to investigate dependencies (Runge et al., 2014).Ideally, the time resolution is at least as short as the shortest causal time lag (assuming no instantaneous effects, see Sec.IV D).See Breitung and Swanson (2002) and Barnett and Seth (2015) for a discussion on temporal aggregation in time series models.Ignoring time order, in some cases the recent methods discussed in Peters et al. (2017) can help.

C. Faithfulness
The Causal Markov Condition guarantees that separation in the graph implies independence in the process.But what can be concluded from an estimated conditional independence relation, that is, the reverse direction?Faithfulness guarantees that the graph entails all conditional independence relations that are implied by the Markov condition.FIG. 5. Time aggregation.If the time series of the true underlying process shown in the top panel is aggregated at t = 2, the time series graph of the aggregated process (bottom panel, note that t here refers to the aggregated time) has contemporaneous dependencies due to the too coarse time resolution compared to the lags of the causal links, but also many spurious directed links.

The joint distribution of a time series process X with graph G fulfills the Faithfulness condition if and only if for all disjoint subsets of nodes (or single nodes) A, B, S ⊂
that is, from independence follows separation, which includes its logical contraposition from connectedness follows dependence.

The combination of Faithfulness and the Markov property implies that
Both conditions are an important assumption for causal discovery algorithms as discussed in Spirtes et al. (2000).Intuitively, Faithfulness together with the Causal Markov Condition allow us to conclude that (in the limit of infinite sample size) a measured statistical dependency is actually due to some (not necessarily direct) causal mechanism and, conversely, a measured independence (given any set of conditions) implies that no direct causal mechanism exists (see also Remark 1 in the Discussion).

Example 5. (Counteracting mechanisms)
In a linear model [e.g., Eq. ( 1)], the coefficient values form a real space and the set of points in this space that create vanishing partial correlations not implied by the Causal Markov Condition have Lebesgue measure zero (Spirtes et al., 2000).One can, thus, argue that non-faithful distributions arise from an unrealistic fine-tuning of dependence parameters.However, approximately vanishing partial correlations despite connectedness in the graph can also occur for a distribution that is faithful, but almost unfaithful, if we have only a limited sample size available as discussed in Uhler et al. (2013).An example of an unfaithfully fine-tuned process is As shown in Fig. 6, here X influences Z directly as well as indirectly through Y .Now simple algebra shows that implying that Z and X are unconditionally independent since both mechanisms counteract each other even though there is a link in the graph.In Runge (2015) such counteracting interdependencies are analyzed information-theoretically.

Example 6. (Determinism)
One may argue that we live in a deterministic world and the assumption of "an independent noise term" that is pertinent to statistics is unrealistic.On the other hand, for a FIG. 6. Example of unfaithfully fine-tuned process where X and Z are independent even though they are connected in the graph.Here, the indirect mechanism X → Y → Z with a positive effect counteracts the direct link X → Z with a negative effect.
given observed variable Y , the complexity of the underlying processes will almost always imply that Y does not deterministically depend on its parents, but some unresolved processes constitute "intrinsic" or "dynamical" noise that is also driving Y .
Determinism violates Faithfulness as follows.Consider the model with c > 0 and some functions f , g.Here, we have Cover and Thomas, 2006)} implying X ⊥ ⊥ Y | Z even though Y depends on X in the model.One can argue that X should not be considered as an autonomous causal variable in this example and instead consider Z → Y as the causal graph for this model writing the model above as Determinism in causal inference can to some extent be addressed in the conditional independence framework (Spirtes et al., 2000) or using structural causal models in Janzing et al. (2012) and Daniusis et al. (2012).
The former example only illustrated a static case of determinism.The field of nonlinear dynamics studies the properties of nonlinear and chaotic dynamical processes which has led to a plethora of nonlinear time series analysis methods (Kantz and Schreiber, 2003), often from an information-theoretic perspective (Hlavácková-Schindler et al., 2007) including transfer entropy.Many of these methods built on the assumption that no system is perfectly deterministic, for example, due to the coarse-graining of the system's phase-space in the measurement process.In Sec.VII A, we study the effect of dynamical noise on several common time-series based causal discovery approaches for chaotic systems.

Example 7. (Non-pairwise dependencies)
Next to the fine-tuned example on counteracting mechanisms, Faithfulness can also be violated for a dependency realized in an XOR-gate.Suppose, as shown in Fig. 7 among several "faithful" pairwise dependencies, we have that Y = X 1 ⊕ X 2 + η Y with X 1 , X 2 being binary random processes with equal probability for all values of X 1 and X 2 .Then I(X 1 ; Y ) = I(X 2 ; Y ) = 0 even though both are connected to Y -a violation of Faithfulness.Here, the full information is ).Thus, the MI is zero, but the CMI is not and there is, hence, a link in the time series graph.
Note that already if P(X 1 ) = P(X 2 ), Faithfulness is not violated anymore as analyzed in Sun et al. (2015), showing that Faithfulness violations are rather pathological.Another form of synergy without violation of Faithfulness is the case that Y = X 1 X 2 + η Y , where we have a rather weak MI I(X 1 ; Y ), but again a much larger I[(X 1 , X 2 ); Y ].In Runge et al. (2015a), synergy is analyzed in the context of optimal prediction schemes.
As pointed out in James et al. (2016) for synergistic dependencies, the problem is that the concept of a pairwise dependency graphical model does not apply, but hyper-graphs are needed to represent such dependencies.Causal discovery of such graphs, however, carries the problem of combinatorial explosion if links between sets of nodes are considered.

D. Instantaneous effects
Granger causality and the definition of time series graphs are examples for lagged definitions of causality.To guarantee that the lagged parents defined in Eq. ( 8) are sufficient for the Causal Markov Condition to hold, we need to assume that there are no instantaneous (contemporaneous) causal effects, i.e., X i t → X j t .One may argue that causality between dynamical systems cannot have instantaneous effects because the speed of light is finite and, if the process is sampled with sufficient resolution (otherwise, see the Examples of sub-sampling and aggregation), we only need to consider lagged causal effects.However, we often do not have a sufficiently sampled time series.Here, recent developments in causal inference theory (Zhang and Hyvärinen, 2009;Peters et al., 2013;Lopez-Paz et al., 2015;Spirtes and Zhang, 2016;Peters et al., 2017) address instantaneous causality within the framework of structural causal models which can be applied to determine causal directionality for contemporaneous links.These models work under assumptions on the noise in the model such as non-Gaussianity.Also, the logical causal orientation rules of the causal discovery approaches in Spirtes et al. (2000) can be used to partially orient contemporaneous links.

E. Stationarity
To estimate the time series graph defined in Definition 1 from time series data, we assume stationarity.Another option would be to utilize independent ensembles of realizations of lagged processes.Here, we define stationarity with respect to a time index set T .For example, T can contain all time indices belonging to a certain regime of a dynamical process, e.g., only winter months in the climate system.Definition 6. (Causal stationarity).
The time series process X with graph defined in Definition 1 is called causally stationary over a time index set T if and only if for all links X i t−τ → X j t in the graph This constitutes actually a weaker form of stationarity than the common definition of stationarity in mean, variance, spectral properties, or of the value of individual coefficients in a linear model.For example, one could require that all CMIs are stationary, which is a much stronger statement.The strength of causal mechanisms may fluctuate over time and the causal stationarity assumption only requires conditional independence to be stationary.

Example 8. (Non-stationarity due to confounding)
Consider the data shown in Fig. 8 and suppose we first only have access to the variables (X , Y , Z).Clearly, the time series of this subprocess are nonstationary in a classical sense, varying over time not only in their mean but also in their spectral properties.Estimating the time series graph on these three variables results in the graph shown in the bottom left panel of Fig. 8, where the common nonstationary trend leads to an almost fully connected graph.
A typical example of a common nonstationarity, albeit not the same as in our example, is found in climate time series which are usually all driven by solar forcing leading to a common seasonal signal.In climate research the time series are typically anomalized, that is, the seasonal signal is estimated and subtracted from the data (Storch and Zwiers, 1999) which is equivalent to regressing out its influence.But this is not always possible, in our example the common signal is not purely periodic and cannot easily be estimated from the data.Another option for the case of piecewise stationary processes is to include background knowledge on the stationary regimes and estimate the graphs separately for the stationary subsets of T .For example, the climatic seasons El Niño and La Niña lead to different causal directions of surface temperature anomalies in the tropical Pacific (Philander, 1985).Prior knowledge of when the seasons start and end allow us to restrict the estimation of time series graphs to samples within a particular season.Now suppose we actually have access to the common signal U and include it in our analysis (but without estimating the parents of U, i.e., treating U as exogenous).Then, as shown in the bottom right panel of Fig. 8, we can recover the true causal structure where U is a confounder of the three variables while they are connected through the motif X ← Z → Y .Thus, here nonstationarity is a result of confounding and can be removed if we have access to the underlying trend.
The key point is that the causal structure, that is, the time series graph, of the whole process (X , Y , Z, U) is invariant in time.One may argue that causal laws are generally invariant and non-stationarity is simply a problem of violation of causal sufficiency.The idea of finding invariant predictors for causal inference is explored in Peters et al. (2016).

F. Dependency type assumptions
To test conditional independence hypotheses X ⊥ ⊥ Y | Z, different test statistics can be utilized.These are typically based on making certain assumptions about the type of the underlying dependency structure.While classical statistical methods are often based on the assumption of linearity (which allows us to derive rigorous results), modern statistics, the physics community, and the recent field of machine learning have developed non-parametric or model-free methods that allow us to better capture the nonlinear reality of many dynamical complex systems-at the cost of weaker theoretical results.Conditional independence testing can be classified into regression-based and model-free approaches.Here, we only discuss tests for continuously valued variables, for discrete variables one can, for example, use methods based on contingency tables (Spirtes et al., 2000) or discrete CMI estimation (Cover and Thomas, 2006).
Regression-based conditional independence tests of X ⊥ ⊥ Y |Z are based on first regressing out the influence of Z from X and Y and then testing the dependence between the residuals.We first fit a model assuming for centered variables X , Y and independent and identically normally distributed X ,Y .Now further restrictions can be laid upon the functional form of f X ,Y .For example, the partial correlation test assumes linearity, while a non-parametric regression can be based on Gaussian Process regression (Rasmussen and Williams, 2006).
Secondly, from the estimated functions f , the residuals are formed as Finally, the dependence between the residuals can be tested with different pairwise association tests.For partial correlation this is a t-test, while the dependence between the residuals of a non-parametric regression can be tested with non-parametric tests (Gretton et al., 2008;Székely et al., 2007) such as the distance correlation coefficient R(r X , r Y ) (Székely et al., 2007) (see also Sec.V C).Note that these models all make parametric assumptions and, thus, do not estimate conditional independence in its most general form.
The other extreme to partial correlation are model-free methods that directly test conditional independence.The most prominent test statistic is CMI as defined in Eq. ( 3), for which non-parametric estimators based on nearest-neighbor statistics exist (Kraskov et al., 2004;Frenzel and Pompe, 2007;Vejmelka and Palus, 2008;Póczos and Schneider, 2012) [see also Gao et al. (2015) and Lord et al. (2018) for recent progress on nearest-neighbor entropy estimators].Other possible conditional independence tests are Kernel Conditional Independence Tests (Zhang et al., 2011;Strobl et al., 2017) which essentially test for zero Hilbert-Schmidt norm of the partial cross-covariance operator or conditional distance correlation (Wang et al., 2015).Some new recent tests are based on neural networks (Sen et al., 2017) or decision tree regression (Chalupka et al., 2018).In Runge (2018), a conditional independence test based on CMI is introduced.

Example 9. (Nonlinearity)
Figure 9 gives an overview over different types of linear and nonlinear relationships of the form In all cases, we have X ⊥ ⊥ Y | Z.
For the linear case (first row in Fig. 9), we consider f • = cZ t−1,2 + η • t and the regression-based techniques correctly fit the dependencies of the pairs (X , Z) and (Y , Z) (red fit lines in gray scatterplots), and, thus, correctly identify the independence of the residuals (black scatterplot).A modelfree test also correctly identifies the common driver motif here.
For the nonlinear additive noise case with a quadratic dependency f Finally, if the dependencies are multiplicative (bottom row) as in f t , both regression methods fail.Then the residuals are nonlinearly related which is not detected with a partial correlation test (here two errors somewhat cancel each other out).A non-parametric test on the residuals, on the other hand, then wrongly estimates the spurious link X → Y (gray dashed in center bottom row).
Model-free methods in principle can deal with all these cases, which might lead to the conclusion that they are superior.But the "no-free-lunch-theorem" tells us that such generality has a price and model-free methods are very data-hungry and computationally expensive.If expert knowledge pointing to a linear or otherwise parametric dependency is available, then regression-based methods will typically greatly outperform model-free methods.

G. Measurement error
Measurement error, unlike dynamical noise, contaminates the variables between which we seek to reconstruct dependencies and constitutes a difficult problem in causal network reconstruction (Scheines and Ramsey, 2016).

Example 10. (Observational noise)
Here, we only discuss measurement error in its simple form as observational noise, which can be modeled as Z = Z + Z .Such observational noise presents at least two sorts of problems for causal discovery.
Firstly, observational noise attenuates true associations and, therefore, lowers detection power.This is because in general I( X ; Ỹ ) = I(X + X ; Y + Y ) ≤ I(X ; Y ) which is a consequence of the data processing inequality (Cover and Thomas, 2006): Manipulating a variable can only reduce its information content.In Fig. 10, we added normal observational noise with σ = 20 to Z. Then the links Z → X and Z → Y cannot be reconstructed anymore.Secondly, too much noise on conditioning variables makes it impossible to preserve conditional independence.In Fig. 10 effect of observational noise on causal discovery is discussed further in Smirnov (2013); Scheines and Ramsey (2016) and Runge et al. (2018) give some numerical experiments for high-dimensional causal discovery.

V. PRACTICAL ESTIMATION
The previous sections concerned fundamental assumptions of time-lagged causal discovery based on conditional independence relations.We now turn to the topic of practical estimation where we introduce several common causal discovery methods and discuss their consistency, significance testing, and computational complexity.We restrict the analysis to the class of conditional independence approaches, which flexibly allows us to use different independence tests.But graphical models, in general, can also be estimated with scorebased Bayesian methods, e.g., the max-min hill-climbing algorithm (Tsamardinos et al., 2006).

A. Causal discovery algorithms 1. Granger causality/transfer entropy/FullCI
Transfer entropy, as introduced by Schreiber ( 2000), is a direct information-theoretic implementation of Granger causality (Barnett et al., 2009).In a lag-specific implementation, as given by FullCI [Eq.( 6)], it tests for conditional independence between each X i t−τ and X j t conditioned on the entire past X (t−1,...,t−τ max ) t (excluding X i t−τ ).As the experiments in Sec.VII will demonstrate, this approach strongly suffers from the curse of dimensionality.Sun and Bollt (2014) and Sun et al. (2015) developed a discovery algorithm based on the information-theoretic optimal causation entropy principle (algorithm abbreviated as OCE) which reconstructs the lagged parents of a variable X j t by an iterative procedure alleviating the curse of dimensionality: Starting with an empty parent set P OCE (X j t ) = ∅, first the MIs I(X i t−τ ; X j t ) for all X i t−τ ∈ X − t are estimated.As the first parent X (1) , the one with the largest MI with X j t is selected.The next parent X (2) , however, is chosen according to the largest CMI I(X i t−τ ; X j t |X (1) ) among all remaining variables, the third parent is the one with largest CMI conditional on the two previously selected parents, etc.The process is continued the CMI of a selected parent is non-significant.This forward-selection stage is followed by a backward elimination whereby the significance of each of the parents X i t−τ ∈ P OCE (X j t ) is tested conditional on the remaining parents: OCE:

Optimal causation entropy
The significance of CMIs can be tested with a nearestneighbor CMI estimator (Kraskov et al., 2004;Frenzel and Pompe, 2007;Vejmelka and Palus, 2008) in combination with a permutation test where X i t−τ is randomly shuffled.Of course, the conditional independencies in Eq. ( 26) can also be tested with other test statistics.

PC algorithm
An alternative to this forward-backward scheme is the PC algorithm (named after its inventors Peter and Clark) (Spirtes and Glymour, 1991).The original PC algorithm was formulated for general random variables without assuming a time order.It consists of several phases where first, in the skeletondiscovery phase, an undirected graphical model (Lauritzen, 1996) is estimated whose links are then oriented using a set of logical rules (Spirtes and Glymour, 1991;Spirtes et al., 2000).A later improvement led to the more robust modification called PC-stable (Colombo and Maathuis, 2014).
For the case of time series, we can use the information of time order which naturally provides an orientation rule for links.The algorithm then is as follows: For every variable X j t ∈ X t it starts by initializing the preliminary parents P(X j t ) = (X t−1 , X t−2 , . . ., X t−τ max ).In the first iteration (p = 0), we remove a variable X i t−τ from P(X j t ) if the null hypothesis cannot be rejected at a significance threshold α.Then, in each next iteration, we increase p → p + 1 and remove a variable The forward-backward scheme of OCE conducts conditional independence tests only using the conditions with highest CMI in the preceding stage and quickly increases the number of conditions.This can lead to wrong parents being kept in POCE (X j t ) (see Sec. V B) which are only removed in the backward stage where the dimensionality of the set POCE (X j t ) can be already quite high.High dimensionality, in principle, leads to lower detection power (Example VII C).The PC algorithm conducts conditional independence tests not only using the condition with highest association, but it goes through (in theory all) combinations of conditions S which can help to alleviate the curse of dimensionality regarding the estimation dimension of I(X i t−τ ; X j t | S) compared to OCE.On the other hand, the PC algorithm conducts many more tests which increases other problems (see Sec. V C).

PCMCI
A more recent approach that addresses some of the shortcomings of the PC algorithm above is PCMCI (Runge et al., 2018).PCMCI is a two-step approach which uses a version of the PC-algorithm only as a condition-selection step (PC 1 algorithm) to obtain P(X j t ) for all X j t ∈ X t , followed by the momentary conditional independence (MCI) test for 075310-11 J. Runge Chaos 28, 075310 ( 2018) The main difference between PC and PC 1 is that PC 1 tests only the condition subset S with largest association instead of going through all possible combinations.To this end P(X j t ) is sorted after every iteration according to the test statistic value and S is determined by the first p variables in P(X j t ) (excluding X i t−τ ).This leads to less tests, but still provably removes incorrect links.
The MCI test is the most important difference to the PC algorithm and the approach by Sun and Bollt (2014).The additional conditioning on the parents P(X i t−τ ) in MCI accounts for autocorrelation leading to well-controlled false positive rates at the expected level (Runge et al., 2018).A variant (PCMCI 0 ) where the condition on the parents P(X i t−τ ) is dropped leads to a very similar approach to OCE.PCMCI, like FullCI, OCE, and PC, can be implemented with different conditional independence tests.For further details on PCMCI, see Runge et al. (2018).In Sec.VII we compare FullCI, OCE, and PCMCI in a number of numerical comparison studies.

B. Consistency
Consistency is an important property of causal methods that tells us whether the method provably converges to the true causal graph in the limit of infinite sample size.Consistency concerns the conditional independence tests on the one hand, but also the causal algorithm in the case of iterative approaches such as those discussed in Sec.V A.
For example, for the consistency of the non-parametric regression independence tests in Eq. ( 23), we need to assume that the function estimator converges to the true function, that the noise in the model is additive and independent, and finally that we have a consistent unconditional independence test for the residuals.With a consistent test, the time series graph can be directly estimated based on Definition 1.For iterative causal algorithms, we can define universal consistency as follows.

Denote by G n the estimated graph of some causal estimator from a sample of a distribution P with sample size n and by G the true causal graph. Then a causal estimator is said to be universally consistent if G n converges in probability to G for every distribution P,
That is, the probability of estimating the wrong graph becomes arbitrarily small if enough data is available, for any distribution P (hence "universal").Consistency has been proven for classical causal discovery algorithms such as the PC-algorithm (Spirtes et al., 2000), the optimal causation approach Sun and Bollt (2014), Sun et al. (2015) and PCMCI Runge et al. (2018), as an approach based on PC.
However, universal consistency is a weaker statement than, for example, uniform consistency which bounds the FIG.11.Example where for certain a, b the MI I(X ; Y ) can be larger than any of the MIs I(Z 1 ; Y ) or I(Z 2 ; Y ).Thus, the most strongly associated variable with Y is actually not a causal driver.
error probability as a function of the sample size n giving a rate of convergence.Thus, for a non-uniform, but only universally consistent method, the sample size at which a given error can be guaranteed and can be different for every distribution P. Robins et al. (2003) showed that no uniformly consistent causal discovery algorithm from the class of independence-based approaches (Spirtes et al., 2000) exists since the convergence can always be made arbitrarily slow by a distribution that is almost unfaithful with some dependencies made arbitrarily small.Uniform consistency for conditional-independence based algorithms can only be achieved under further assumptions such as having strong enough dependencies (Kalisch, 2008).

Example 11. (An inconsistent causal algorithm)
Consider again the forward-selection stage of the OCE algorithm (Sun and Bollt, 2014;Sun et al., 2015) introduced in Sec.V as a standalone method to reconstruct parents of a variable Y t ∈ X t .Even though the scheme sounds appealing and efficient, the scheme alone is not a consistent estimator of causal graphs.It yields a superset of the parents (Sun et al., 2015) which may also contain false positives: Consider the example graph shown in Fig. 11.Here, the causal parents of Y are Z 1 , Z 2 (dropping time subscripts t here).If forward-selection alone was a causal approach, then in each step the variable with strongest association would also need to be a causal parent.But in this example the MI between X and Y can be larger than the MIs of Z 1 and Z 2 with Y .For example, for a = 0.5, b = 2 we have I(X ; Y ) ≈ 0.13 nats while I(Z 1 ; Y ) = I(Z 2 ; Y ) ≈ 0.06 nats.See Appendix A for an information-theoretic analysis.Hence, the wrong parent X is selected.This scheme, thus, requires the second step of the OCE approach, given by Eq. ( 26).

C. Significance testing
How can we assess the statistical significance of conditional independence tests on which the causal algorithms in Sec.V A are based, such as the tests discussed in Sec.IV F? Using a test statistic I n (x; y | z) (I here stands not only for CMI, but any conditional independence test statistic) for the observed samples (x, y, z) = {x i , y i , z i } n i=1 we wish to test the hypothesis 075310-12 J. Runge Chaos 28, 075310 (2018) versus the general alternative To assess the significance of an outcome of such a test using p-values, we need to know the distribution Pr(I n | H 0 ) of the test statistic under the null hypothesis.For partial correlation tests, exact analytical expressions of the null distribution exist under certain assumptions, but for nonlinear tests (such as CMI or also non-parametric regression tests) these are typically not available except for some asymptotic large sample size cases (Strobl et al., 2017).The alternative then are permutation tests as discussed in Example 12.
Given the null distribution, the p-value is defined as the probability-given H 0 -of observing a value of the test statistic that is the same or more extreme than what was actually observed.If our test statistic is non-negative (such as CMI), the p-value for an observed test statistic value I n is defined as p = Pr(I n ≥ I n | H 0 ).Choosing a significance level α, we reject the null hypothesis if p < α.
There are two types of errors we can make.Rejecting H 0 when H 0 is true is called a type I error or false positive.Retaining H 0 when H 1 is true is called a type II error or false negative.We also call one minus the type II error rate of a test the true positive rate or detection rate.
If the test statistic has a continuous distribution, then under H 0 the p-value has a uniform (0, 1) distribution (Wasserman, 2004).Therefore, if we reject H 0 when the p-value is less than α, the probability of a type I error is α.For a well-calibrated test under H 0 , we thus expect to measure on average a false positive rate of α.If this is not the case, the test is ill-calibrated indicating that we got the null distribution wrong because certain assumptions, such as independent samples, are violated.In Examples 12 and 13, we discuss such cases.

Example 12. (Permutation testing)
Permutation testing is straightforward in the bivariate independence test case.To create an estimate of the null distribution of a test statistic To achieve a test under the correct null hypothesis, we can use a local permutation scheme that preserves the associations between x and z.Runge (2018) suggests such a scheme depicted in the bottom panel of Fig. 12 which only permutes those x i and x j where z i ≈ z j .This scheme can be used for CMI conditional independence testing or also other test statistics.Other schemes are discussed in Doran et al. (2014) and Sen et al. (2017).

Example 13. (Non-independent samples)
A basic assumption underlying many conditional independence tests is that the samples are independent and 075310-13 J. Runge Chaos 28, 075310 (2018) identically distributed (i.i.d.).Unfortunately, time series are typically dependent in time.To take this dependence into account, one can either adapt the distribution under the null hypothesis, for example, in partial correlation t-tests by estimating the effective degrees of freedom, which is, however, difficult in a multivariate setting.Or one can modify the test statistic to explicitly account for autocorrelation (Runge et al., 2018).Also, a permutation scheme needs to be adapted to preserve auto-dependencies, for example, by shuffling blocks of samples (Peifer et al., 2005).For bivariate tests such an approach is again straightforward, but not for the multivariate case as analyzed in the autocorrelation comparison study in Sec.VII.

Example 14. (Sequential testing of causal links)
The preceding discussion concerned tests of an individual conditional independence relationship.Directly testing causal links via Definition 1, thus, gives us a well-calibrated test if all assumptions are fulfilled.
However, in iterative causal algorithms (Sec.V A) such as the PC algorithm, OCE, or the first step of PCMCI (PC 1 ) multiple tests on a particular link X t−τ → Y t are conducted with different condition sets that are determined by the outcome of previous tests.If a link is found non-significant in any of these tests, this link is removed.Since the tests are not independent of each other (since they are typically based on the same data sample), it is almost impossible to derive a combined p-value of all these tests.
Figure 13 depicts an illustrative numerical example where the combined false positive rate is much lower than the 0.05 of each individual test and the true positive rate is lower than the minimal true positive rate among the individual tests.In summary, even though all assumptions may be valid, sequential testing makes a significance assessment difficult.These issues are further discussed in Tsamardinos and Brown (2008) and Strobl and Spirtes (2016) and references therein where False Discovery Rate (Benjamini and Hochberg, 1995) approaches are discussed.
As a side remark on the previously discussed OCE and PCMCI causal discovery approaches, in the backwardelimination stage [Eq.( 26)] OCE tests only the significance of each of the parents X i t−τ ∈ P OCE (X j t ) conditional on the remaining parents.PCMCI (Runge et al., 2018), on the other hand, in the second MCI step tests all links again [Eq.( 29)] which makes the MCI test slightly less dependent on the sequential testing issue of the condition-selection step PC 1 since parents that have been removed in PC 1 (false negatives) are tested again in the MCI test.In particular, the false positives are as expected as demonstrated in the comparison studies in Sec.VII.

D. Computational complexity
Application areas of causal discovery methods vary in the typical numbers of variables as well as available sample sizes n.Next to the properties discussed before, an important issue then is how a method scales with dimensionality and sample size.High-dimensionality arises from the number of included variables N and the maximum time lag τ max [see Fig. 1(c)] and has at least two consequences: (1) higher computational complexity leading to longer runtimes and (2) typically lower detection power.Independence tests may also become ill-calibrated in high-dimensions.
For directly testing causal links via Definition 1 (FullCI), the computational complexity depends on the complexity of a single high-dimensional conditional independence test.In the linear partial correlation case, OLS regression scales ∼ O(n(Nτ max ) 2 ).FullCI estimated using nearest-neighbor estimators of CMI (Kraskov et al., 2004;Frenzel and Pompe, 2007), on the other hand, will scale ∼ O(n log n) regarding time complexity while the complexity in Nτ max will depend on algorithmic details such as using efficient KD-tree nearestneighbor search procedures (Maneewongvatana and Mount, 1999).
The methods PC, OCE, or PCMCI (Sec.V A) based on a condition-selection step avoid high-dimensional conditional independence estimation by conducting more tests with lower dimensional conditioning sets.Their theoretical complexities are difficult to evaluate, for numerical evaluations see Sun et al. (2015) and Runge et al. (2018), but typically they scale polynomially in time.
The other major challenge with high dimensionality is detection power as analyzed in Example VII C.

VI. PERFORMANCE EVALUATION CRITERIA
How can causal discovery methods, such as those described in Sec.V be evaluated?Typically, we want to know which method performs best on data from the kind of system we wish to study.Ideally, we would like to compare different methods on a data sample where the underlying causal truth is known or evaluate methods by experimentally manipulating a system, i.e., actually performing the do-experiment (Pearl, 2000) mentioned in the introduction which forms the theoretical basis of the present concept of causality.Since both of these options are mostly not available, an alternative is to construct synthetic model data where the underlying ground truth is known.These can then be used to study the performance of causal methods for realistic finite sample situations.

A. Models
To evaluate causal methods on synthetic data, several aspects for constructing model systems are relevant: 1. Model realism: The model systems should mimic the domain-specific properties of real data in terms of nonlinearity, autocorrelation, spectral properties, noise structure (dynamical as well as observational), etc. 2. Model diversity: To avoid biased conclusions, a large number of different randomly selected connectivity structures should be tested [including link density as well as properties such as small-worldness (Watts and Strogatz, 1998)].For example, the aforementioned forward-selection approach failed for the example shown in Fig. 11 but works for many other graphs.But also consistent methods may have biases for finite samples as studied in Runge et al. (2018).3. Model dimensionality: As studied in Fig. 16, a method may perform well only for a small number of variables 075310-14 J. Runge Chaos 28, 075310 (2018) and the performance for high-dimensional settings (large networks) can quickly degrade.4. Sample sizes: The comparative performance of different models may vary widely for different sample sizes which needs to be studied if no uniform consistency results are available.

B. Metrics
The performance of a causal method on a single realization of a model does not allow for reliable conclusions.Therefore, each model needs to be evaluated from many realizations.Then the most straightforward evaluation metric is to measure false positive rates and true positive rates for a given α as shown in the comparison studies in Sec.VII.The number of realizations should be chosen high enough since the error of a false or true positive rate r for B realizations is given by σ r = √ r(1 − r)/B.Alternative evaluation metrics that do not depend on a particular significance level but directly on the p-values are the Kullback-Leibler divergence to evaluate whether the p-values are uniformly distributed (to measure how well-calibrated a test is) and the Area Under the Power Curve (AUPC) to evaluate true positives.
Next to the true and false positives of a causal method for finite samples, another performance criterion is computational runtime, though this may strongly depend on a given implementation.

VII. COMPARISON STUDIES
In this section, we compare several common causal discovery methods in three numerical comparison studies highlighting the effect of dynamical noise in deterministic chaotic systems, autocorrelation, and high dimensionality.

A. Dynamical noise in deterministic chaotic systems
In Example 6, we studied a static example of determinism.Here, we evaluate the effect of dynamical noise in a system of coupled chaotic logistic maps: with uniformly distributed independent noise η and r = 4 leading to chaotic dynamics.Here, σ controls the amount of dynamical noise in the system.To evaluate true positive rates (correctly detecting Z t−1 → X t and Z t−1 → Y t ) and false positive rates (incorrect detections for any other variable pair, direction, or lag), 200 realizations with time series length n = 150 were generated.We compare three methods from an information-theoretic framework (FullCI, OCE, PCMCI) with convergent-cross mapping (CCM, Sugihara et al., 2012) (see also Arnhold et al., 1999;Hirata et al., 2016) as a nonlinear dynamics-inspired approach.The significance of CMIs in FullCI and OCE is tested with a nearest-neighbor CMI estimator (Kraskov et al., 2004;Frenzel and Pompe, 2007;Vejmelka and Palus, 2008) in FIG.14.Comparison of CCM, FullCI, OCE, and two versions of PCMCI on common driver system of three coupled chaotic logistic maps.The top panel shows the true graph structure.In the left and right graphs for noise levels σ = 0 and σ = 0.4, respectively, the width of arrows denotes the detection rate, gray edges depict false links (only false positive rates 0.08 shown).The center panels depict average true (black, left axis) and false positive rates (red, right axis) for different strengths σ of dynamical noise.The gray line marks the 5% significance threshold.combination with a permutation test where X i t−τ is randomly shuffled.PCMCI is implemented with the CMIknn independence test (Runge, 2018) as discussed in Example 12.We also evaluate a variant (PCMCI 0 ) where the condition on the parents P(X i t−τ ) is dropped and the only difference to OCE is the condition-selection step.CCM reconstructs the variable's state-spaces using lagged coordinate embedding and concludes on X → Y if points on X can be well predicted using nearest neighbors in the state space of Y .Note that CCM and related works only use the time series of X and Y with the underlying assumption that the dynamics of Z can be reconstructed using delay embedding.All methods were evaluated at a significance level of 0.05.For implementation details, see Appendix B 1. In the top panel of Fig. 14, we depict the true causal graph.
Figure 14 shows that in the purely deterministic regime with σ = 0, PCMCI has almost no power, while PCMCI 0 features a detection rate of 0.8 and FullCI, OCE, and CCM almost always detect the couplings.On the other hand, CCM also has the highest number of false positives around 0.2 which exceeds the significance level indicating an illcalibrated test.FullCI, OCE, and PCMCI 0 also do not control false positives well, while with PCMCI they are around the expected level.For higher dynamical noise levels interesting behavior emerges: FullCI and CCM have continuously decreasing power (and slightly decreasing false positives) dropping to almost zero at σ = 0.4, while the power of OCE and both PCMCI versions steadily increases up to σ = 0.2 after which power decreases with a more pronounced decrease for PCMCI and PCMCI 0 having the highest power.False positives are above the 0.05 threshold for FullCI and OCE for a wide range of noise levels, while for PCMCI false positives are always well-controlled at the expected 0.05.
How can these results be understood?CCM attempts to reconstruct the attractor manifolds underlying X and Y .With more dynamical noise, this reconstruction becomes more difficult and CCM looses power.The fact that CCM does not control false positives well, especially in the deterministic regime deserved further study.FullCI suffers from the curse of dimensionality especially for higher noise levels.
The MCI test statistic for the link Z t−1 → X t estimates , in theory we have I MCI = 0 for the same reasons as in the simple deterministic model ( 19) above (and analogously for Z t−1 → Y t ).In practice, we can only measure the entropies at some coarse-grained level (here determined by the CMI nearest-neighbor parameter) and the deterministic dependency is never exactly recovered leading to a non-zero MCI.In OCE and PCMCI 0 , only the parents of X t are included in the condition.The fact that FullCI, despite conditioning on the whole past, also detects the link deserves further study.Possible explanations are an ill-calibrated test or dynamical properties of the logistic-map system.In summary, purely deterministic dynamics here seem to generate too little momentary information which is necessary for informationtheoretic coupling detection with MCI (see also Pompe and Runge, 2011).
Given at least some dynamical noise to suffice the Faithfulness condition and together with the other assumptions discussed in this paper, OCE and PCMCI provably converge to the true causal graph in the limit of infinite sample size (Runge et al., 2018;Sun et al., 2015, see also Sec.V B), while no such theoretical results is available for CCM.In the infinite sample limit also the inflated false positives of OCE due to time-dependent samples vanish.For finite samples, on the other hand, among other factors, consistency depends on how well-calibrated the significance test is.PCMCI here always yields expected levels of false positives.The inflated false positives for FullCI and OCE for a wide range of noise levels and for PCMCI 0 for very low dynamical noise is related to the way significance testing is implemented.In theory, OCE should have less false positives than the expected 0.05 due to the sequential testing problem (Sec.V C), but in our examples autocorrelation and the global permutation scheme leads to illcalibrated tests with inflated false positives in each individual test, see Examples 12 and 13, leading to an overall higher false positive rate.The effect of autocorrelation is evaluated further in the next example.

B. Autocorrelation
In Fig. 15, we evaluate the previously introduced causal algorithms (Sec.V A) FullCI, OCE, and PCMCI on autocorrelated data which is an ubiquitous feature in real world time series data.The full model setup is described in Appendix B 2. In short, here we only evaluate the false positive rates (for c = 0) and true positive rates (for c = 0) of the link X t−1 → Y t (top panel of Fig. 15) where the autocorrelation a is varied for different numbers of common drivers D Z and the coefficients b and σ Z are chosen such that the unconditional dependence stays the same, and we only investigate the effect of autocorrelation.The time series length is n = 150 and 1000 realizations were run for each model setup to evaluate false positive rates and true positive rates at an α = 0.05 significance level.
We compare partial correlation implementations (test statistic ρ) of the following tests: (1) FullCI directly tests Definition 1, For OCE and PCMCI, we assume that the condition-selection steps already picked the correct parent sets and only test the link X t−1 → Y t in the second stages of OCE and PCMCI, respectively: (2) OCE [Eq.( 26), equivalent to PCMCI 0 ] here tests X t−1 ⊥ ⊥ Y t | P Y t , where P Y t is given by the gray and blue boxes in the top panel of Fig. 15.(3) OCEpw with conditioning set as for OCE, but where all variables are pre-whitened beforehand as described in Appendix B 2. (4) OCEbs with conditioning set as for OCE, but where a blockshuffle test was used as described in Appendix B 2. (5) PCMCI tests X t−1 ⊥ ⊥ Y t | P Y t , P X t−1 [Eq.( 29)] as given by the gray, blue, and red boxes.For all approaches, the analytical null distribution of the partial correlation test statistic was used (t-test), except for the block-shuffle permutation test.
In the bottom four panels of Fig. 15, we depict results for D Z = 0, that is, the bivariate case, and D Z = 4, both for varying the autocorrelation strength a.In the bivariate case, all approaches well control the false positive rates except for OCE and (slightly better) OCEbs, which feature inflated false positive rates for very high autocorrelation.The reason is that when testing ρ OCE = ρ(X t−1 ; Y t |Y t−1 ) with the t-test, we assume i.i.d.-data, but since X is autocorrelated, this is not the case.This false positive inflation is also seen in Fig. 14.Here, pre-whitening removes this autocorrelation and block-shuffling remedies it to some extent.PCMCI and FullCI both condition out autocorrelation and well-control false positives with constant true positive levels, independent of a, while the true positive level depends on a for OCE and its modifications, even though the coupling coefficient c is constant.
For D Z = 4 false positive inflation becomes even more severe for OCE and here also pre-whitening does not help but leads to strongly increased false positive rates since univariate pre-whitening is not suitable for multivariate conditional independence testing.The PCMCI approach conditions on the parents of the lagged variable which helps to exclude autocorrelation as shown in Runge et al. (2018) and allows us to utilize analytical null distributions that assume i.i.d.data.

C. Curse of dimensionality
In Fig. 16, we evaluate the causal algorithms (Sec.V A) for high-dimensional data using the same model as in Fig. 15 described in Appendix B 2. Here only FullCI, OCE, and PCMCI are compared, all of them again based on partial correlation.
As shown in Fig. 16, FullCI severely suffers from the curse of dimensionality and the OLS-solver even becomes ill-conditioned for D Z = 32 since then the estimation dimension exceeds the sample size.For OCE and PCMCI, we again assume that the condition-selection algorithms selected the correct set of parents.Then the dimensionality of OCE and PCMCI increases only slightly and the power stays at higher levels.
The problem becomes even more severe for nonparametric tests such as multivariate transfer entropy.Another alternative to OLS partial correlation estimation are regularization techniques such as ridge regression (Hoerl et al., 1970;Tibshirani, 1996;Tikhonov, 1963), but these come with other difficulties, for example regarding significance testing.These issues are further analyzed in Runge et al. (2018).

VIII. DISCUSSION AND CONCLUSIONS
The long preceding list of theoretical assumptions in Sec.IV may make causal discovery seem a daunting task.In most real data application, we will not have all common drivers measured, hence violating the causal sufficiency assumption.Then also the Markov assumption may be violated in many cases due to time-aggregation.A conclusion on the existence of a causal link, thus, rests on a number of partially strong assumptions.So what can we learn from an estimated causal graph?Let us consider what the assumptions on the absence of a causal link are.
Remark 1.Let X be measurements of a stochastic process X.Assuming Faithfulness (Definition 5) and that all variables in X are measured without error we have that for that is, if independence is measured given any subset of conditions, then there is no direct causal link between X t−τ and Y t in G.
Note that for the practical estimation from finite data, we also need to assume that all dependencies among lagged variables in X can be modeled with the conditional independence test statistic, that is, no false negative errors occur.Nevertheless, Remark 1 rests on much weaker assumptions than the existence of a causal link since it does not require Causal Sufficiency or Causal the Markov Condition.The proof follows almost directly from the Faithfulness assumption.
Proof.Firstly, since we assume an error-free measurement process, we have that X = X.
The last relation is the Faithfulness assumption.Separation implies in particular that no direct link X t−τ → Y t exists in G.
The second set of assumptions important for causal discovery are the assumptions underlying significance testing (Sec.V C).Failing to properly take into account autocorrelation or too simple permutation schemes imply illcalibrated significance tests leading to inflated false positives beyond those expected by the significance level (see the comparison studies in Sec.VII).Next to the theoretical causal assumptions, statistical reliability of reconstructed networks is an important aspect for drawing causal conclusions.
This paper is intended to recapitulate the main concepts of time-lagged causal discovery from observational time series data and accessibly illustrate important challenges.But many more challenges exist, for example, we have not considered selection bias or issues with the definition of variables as elaborated on in Spirtes et al. (2000).We also have not discussed the topic of determining causal effects (Pearl, 2000) (causal quantification) or mediation (VanderWeele, 2015;Runge et al., 2015aRunge et al., , 2015b) ) as opposed the pure existence or absence of causal links presented here.
Our focus was on time series which make the causal discovery problem easier in some aspects (e.g., time order can be exploited), but more difficult in other aspects, especially regarding statistical testing.We have briefly mentioned the recent works based on different sets of assumptions in the framework of structural causal modeling (Peters et al., 2017), which do not require time-order.Also, many more techniques and insights from the conditional independence framework (Spirtes et al., 2000) can be utilized in the time series case.An important conclusion is that causal discovery is a very active area of research in many fields, from mathematics, computer science, and physics to applied sciences, and methodological progress can greatly benefit from more interdisciplinary exchange.

FIG. 7 .
FIG. 7. Schematic view of dependency network with pairwise dependencies and a synergistic dependency of Y on (X 1 , X 2 ) with X 1 ⊥ ⊥ Y and X 2 ⊥ ⊥ Y , but FIG. 9. Illustration of applicability of different conditional independence methods (linear and non-parametric regression-based and model-free) on different types of linear and nonlinear common driver models.Black arrows denote correctly identified causal links and dashed gray arrows indicate spurious links.The gray scatterplots with red fit line illustrate regressions of X and Y on Z and the black scatterplot the dependency between the residuals r X , r Y .The threedimensional scatterplot with red cubes in the right column depicts the CMIknn test(Runge, 2018) which is based on data-adaptive nearest-neighbor estimation (the cubes are smaller for denser regions).
partial correlation cannot fit the dependencies of the pairs (X , Z) and (Y , Z).As a result, the residuals are still correlated (spurious gray dashed link) and the causal graph is completely wrong: We overlook the links X → Z and Y → Z and get a false positive X → Y .Since here the dependencies are still additive functions, nonparametric regressions and model-free tests yield a correct causal graph.
, we have I(X ; Y | Z) = I(X ; Y | Z + Z ) > 0 even though I(X ; Y | Z) = 0.The 10. Effect of observational noise on causal network reconstruction.Shown left is the linear example from Fig.9.Very strong observational noise on Z (right panel) here leads to a vanishing correlation between X and Z as well as Y and Z.Since then the effect of Z cannot be regressed out anymore, we also get a spurious link X → Y .
FIG.12.Permutation approach to conditional independence testing.(Top left) Example sample drawn from a common driver scheme X ← Z → Y .(Top right) Permuted sample with randomly shuffled data points x, which destroys the associations between x and y, but also between x and z leading to ill-calibrated tests.(Bottom) Schematic of local permutation scheme.Each sample point i's x-value is mapped randomly to one of its k perm -nearest neighbors in subspace Z (seeRunge, 2018) to preserve dependencies between x and z.
I n (x; y), we can simply generate a large number of test statistics I n (x * ; y) where x * is a permuted version of x.But how to permute for conditional independence testing of X ⊥ ⊥ Y | Z?In the top left panel of Fig. 12, we illustrate an example scatterplot of a sample drawn from a common driver scheme X ← Z → Y where we have X ⊥ ⊥ Y | Z.If we now permute x, we get the sample shown in the top right panel.Here, any association between x and y is indeed destroyed, but we also destroyed the association between x and z.That is, for the permuted sample we have I(x * ; y) ≈ 0 and I(x * ; z) ≈ 0 (33) but what we actually want to achieve is I(x * ; y | z) ≈ 0 (34) with I(x * ; z) ≈ I(x; z) (35) in order to test the correct null hypothesis.The above global permutation scheme results in inflated false positives as, for example, shown in Runge (2018) and for FullCI and OCE in the numerical comparison studies in Fig. 14.
FIG. 15.Comparison of FullCI, three versions of OCE, and PCMCI under strong autocorrelation.(Top panel) Model time series graph with labels denoting linear coefficients.The full model setup is described in Appendix B 2. FullCI has the conditioning set X − t = (X t−1 , . . ., X t−5 ), OCE has conditioning set depicted by gray and blue boxes, and PCMCI has conditioning set depicted by gray, blue, and red boxes.(Bottom panel) False positive rates and true positive rates are shown for two different dimensions D Z and various autocorrelation strengths a.

FIG. 16 .
FIG. 16.Comparison of FullCI, OCE, and PCMCI under high dimensionality.The model setup is the same as in Fig. 15.Shown are false positive rates and true positive rates for different dimensions D Z and no autocorrelation (a = 0) in the model detailed in Appendix B 2.
, in the above defined time series graph, a path between two single nodes A and B is called open if it contains only the motifs → • → or ← • →.For notational convenience, we will sometimes use left-pointing arrows, while still in the time series graph all directed links are forward in time.For example, • ← • → • → • is an open path.On the other hand, if any motif on a path is → • ←, the path is blocked.Nodes in such motifs are also called colliders.If we now consider a separating or conditioning set S, openness and blockedness of motifs reverse, i.e., denoting a conditioned node by , the motifs → → and ← →, are blocked and the motif → ← becomes open.For example, the path• ← → • → • is blocked, while • ← • → ← • is open.Note that paths can also traverse links repeatedly, e.g., forward and backward.
Conversely, two nodes are connected given a set S if at least one path between the two is open.