Causality indices for bivariate time series data: a comparative review of performance

Inferring nonlinear and asymmetric causal relationships between multivariate longitudinal data is a challenging task with wide-ranging application areas including clinical medicine, mathematical biology, economics and environmental research. A number of methods for inferring causal relationships within complex dynamic and stochastic systems have been proposed but there is not a unified consistent definition of causality in this context. We evaluate the performance of ten prominent bivariate causality indices for time series data, across four simulated model systems that have different coupling schemes and characteristics. In further experiments, we show that these methods may not always be invariant to real-world relevant transformations (data availability, standardisation and scaling, rounding error, missing data and noisy data). We recommend transfer entropy and nonlinear Granger causality as likely to be particularly robust indices for estimating bivariate causal relationships in real-world applications. Finally, we provide flexible open-access Python code for computation of these methods and for the model simulations.

Quantifying causal relationships between longitudinal observations of a complex system is essential to an understanding of the interactions between sub-components of the system and is subsequently key to building better and more parsimonious models 1,2 . In many real-world applications, we are rarely able to access or describe an underlying graphical network of these interactions a priori, and we are typically limited to observing simultaneously recorded variables from each subsystem as a multivariate time series. Two key properties that are widely regarded as crucial in defining causal relationships are: that the effect is temporally preceded by the cause, and that external changes to values of the causal variable propagate to values of the effect variable and do not break the causal structure 3 . Correlation or synchronisation in these multivariate time series does not necessarily imply a causal relationship between variables, and counter-examples are easy to find 4 . Further, a lack of correlation does not imply a lack of causality, and a reliance on correlation-based measures may result in nonlinear causal relationships being obscured, e.g. Ref 5. In recent decades, various mathematical frameworks 1,6,7 have been described to allow identification of nonlinear (and asymmetric) causal structure within complex systems, primarily driven by domain-specific applications in diverse areas including statistical economics 8,9 , climate science 10-12 and computational neuroscience 13,14 .

I. INTRODUCTION
No general method exists to identify causal structure within complex systems, and there is no single consistent and unifying notion of quantitative causality estimation for time series data. Published methods can be broadly categorised into the following groups:
1. regression-based indices that use 'recent history' vectors as predictors in a model (e.g. Granger causality);
2. information-theoretic indices that build upon ideas of conditional mutual information (e.g. transfer entropy);
3. indices based on state space dynamics, such as local neighbourhoods and trajectories (e.g. convergent cross mapping);
4. graphical models that scale causal inference to high-dimensional multivariate systems.
There exist common themes between these methods, and membership within these groups is sometimes somewhat blurred. Figure 1 sets out key properties and similarities between methods from groups 1-3. Previous reviews of the literature 15-18 typically focus on a subset of methods from one of these groups. The suitability, interchangeability and performance of published methods, particularly where they span more than one of these groups, has received relatively little attention. In this work, we identified and assessed a widely used subset of indices for directed bivariate causality inference, concentrating on methods involving univariate embeddings to describe the recent history of the system (Section II). A review of such methods has been published previously by Lungarella et al. 19 . We reproduce these results for the original and newer methods. We also extend this work by proposing a set of modifications that can be made to simulated data prior to causal estimation, in order to investigate sensitivity of each method to data availability, scaling, missing data, rounding and Gaussian noise (Section III). Each of these reproduces phenomena that often occur in real-world data, such as when instruments have a fixed measurement precision and data is reported with rounding error. We believe these tests should provide in-depth benchmarking criteria for newly proposed methodologies. Finally, we summarise the strengths and weaknesses of these approaches, and identify key areas of further research (Section IV).

FIG. 1. Causality indices described in this paper, which represent a widely-used but non-exhaustive subset of this field of research. The indices are as follows (where GC is Granger causality): extended GC (EGC), nonlinear GC (NLGC), predictability improvement (PI), transfer entropy (TE), effective transfer entropy (ETE), coarse-grained transinformation rate (CTIR), similarity indices (SI), convergent cross mapping (CCM). We classify these indices into three categories, and highlight commonalities between the approaches and their estimation (state space dynamics, nearest neighbour computation, kernel estimation).

II. METHODS
We observe a complex system as a set of variables within a multivariate time series. The time series X = (x_1, . . . , x_T) and Y = (y_1, . . . , y_T) describe a bivariate system with state s_t = (x_t, y_t) at time t. A critical implicit assumption is that the series has unit time, or equivalently that the data is observed at a fixed constant frequency. Underpinning all methods is the key assumption that cause directly precedes effect. As a result, a preliminary step is the construction of time-delay embedding vectors

    x_t^(m,τ) = (x_t, x_{t−τ}, . . . , x_{t−(m−1)τ}),

with embedding dimension m and time delay τ. An important distinction between the methods is whether their theoretical basis is stochastic or deterministic. Both employ the same time-delay vectors, though only in deterministic systems are the vectors x_t^(m,τ) considered elements within an m-dimensional state space 𝒳 ≅ R^m. We construct equivalent embedding vectors y_t^(m,τ) (and state space 𝒴 ≅ R^m) for Y, and joint embedding vectors z_t = (x_t^(m,τ), y_t^(m,τ)) (and state space 𝒵 ≅ R^{2m}).

In practice, many real-world systems are stochastic, with some level of noise or randomness in at least part of the system. A further assumption for stochastic causality estimation is that of separability, which states that there is unique information about the effect variable that is contained only within the causal variable. The standard approach is to describe or model the current value x_t of X as conditional upon the 'recent history' joint embedding vector z_{t−1} (full model). Separability means that removing the causal variable Y eliminates the information it contains about the effect X, which we observe either by identifying non-zero coefficients in the full model or by constructing a reduced model, conditioned only upon x_{t−1}. These methods are generally described with the time index shifted t → t + 1, though the interpretation ('current' and 'recent history') remains the same.
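As a concrete sketch (our own illustration, not the paper's released code), the time-delay embedding step can be written with NumPy as follows; the joint embedding z_t is simply the concatenation of the two univariate embeddings:

```python
import numpy as np

def embed(x, m=2, tau=1):
    """Time-delay embedding: row t is (x_t, x_{t-tau}, ..., x_{t-(m-1)tau})."""
    start = (m - 1) * tau
    return np.column_stack([x[start - j * tau: len(x) - j * tau] for j in range(m)])

x = np.arange(6.0)                 # x_0, ..., x_5
X_emb = embed(x, m=2, tau=1)       # rows (x_t, x_{t-1}) for t = 1, ..., 5
# joint embedding z_t for a bivariate series (x, y):
#   Z = np.hstack([embed(x, m, tau), embed(y, m, tau)])
```

For a series of length T, the embedding yields T − (m − 1)τ vectors, as the first (m − 1)τ time points lack a complete history.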
Granger causality 1 (GC), a notable and popular method for causality estimation in time series, fits autoregressive models on the time series to this end. Extensions of GC to nonlinear systems include a locally linear version called extended Granger causality 20 (EGC) and nonlinear Granger causality 21 (NLGC), which performs a 'global' nonlinear autoregression using radial basis functions (RBFs). Predictability improvement 22 (PI) is another local regression on 'recent history' embeddings, which measures the reduction in mean squared error when z_t is used for predicting a 'horizon' value x_{t+h} instead of x_t alone.
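A minimal sketch of the vanilla GC index for the direction Y → X, using ordinary least squares and the log ratio of reduced-to-full residual variances (the embedding dimension and intercept handling here are our own illustrative choices, not necessarily those of the paper):

```python
import numpy as np

def granger_causality(x, y, m=1):
    """Linear Granger causality index Y -> X: log ratio of residual variances
    from the reduced model (past of X only) and full model (past of X and Y)."""
    def resid_var(target, predictors):
        A = np.column_stack(predictors + [np.ones(len(target))])  # add intercept
        beta, *_ = np.linalg.lstsq(A, target, rcond=None)
        return np.var(target - A @ beta)

    target = x[m:]
    past_x = [x[m - 1 - j: len(x) - 1 - j] for j in range(m)]  # lags 1..m of X
    past_y = [y[m - 1 - j: len(y) - 1 - j] for j in range(m)]  # lags 1..m of Y
    return np.log(resid_var(target, past_x) / resid_var(target, past_x + past_y))
```

A value near zero indicates that the history of Y adds no predictive information about X; larger positive values indicate stronger evidence of Y → X influence.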
Information theory is a natural framework for describing causal relationships. Transfer entropy 7 (TE) measures deviation from the generalised Markov property p(x_{t+1} | x_t) = p(x_{t+1} | x_t, y_t) as a conditional mutual information. With weak coupling and limited data, transfer entropy can suffer from finite sample effects; effective transfer entropy 23 (ETE) corrects for this using shuffled versions of the causal variable. TE reduces to vanilla GC under the assumption of Gaussian variables 24 (⋆, Figure 1), and non-zero GC implies violation of the generalised Markov property and hence non-zero TE 25 . Coarse-grained transinformation rate 26 (CTIR) is based upon 'coarse-grained entropy rates', and measures the rate of net information flow, averaged over different lags τ. A common difficulty in information theoretic methods (described in depth in Ref 15) is the robust estimation of the joint probabilities or entropy values that form the building blocks of these methods. We use a histogram binning partition (H) and the (hypercube) Kraskov-Stögbauer-Grassberger (KSG) estimate 27 , a technique involving k-nearest neighbour statistics. All information theoretic computation here is in 'nats' (logarithm base e).

Fully deterministic dynamical systems, which evolve according to a differential or difference equation, do not necessarily satisfy the separability condition. In these systems, x_t can often be reformulated as a function of only past values of X, which makes the potential causal role of Y in the coupled system less clear, as highlighted by Granger 1 . Causal relationships in a coupled deterministic system are instead observed via the fact that each variable belongs to a shared attractor manifold 𝒜 ⊂ 𝒵. A consequence of Takens' embedding theorem 28 is that the 'library of historical behaviour' of X preserves the topology of 𝒜 and, by transitivity, local neighbourhoods in 𝒳 map to those in 𝒴 and vice versa 5 .
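As an illustrative sketch of the histogram (H) approach, transfer entropy Y → X with embedding dimension 1 and equal-width binning can be computed as below; the binning scheme and bin count are our own choices, and the paper's estimator details may differ:

```python
import numpy as np

def transfer_entropy_hist(x, y, bins=8):
    """Histogram (H) estimate of transfer entropy Y -> X in nats,
    using embedding dimension m = 1 and equal-width bins."""
    triples = np.column_stack([x[1:], x[:-1], y[:-1]])   # (x_{t+1}, x_t, y_t)
    counts, _ = np.histogramdd(triples, bins=bins)
    p = counts / counts.sum()                            # joint p(x_{t+1}, x_t, y_t)
    p_xy = p.sum(axis=0)                                 # p(x_t, y_t)
    p_x1x = p.sum(axis=2)                                # p(x_{t+1}, x_t)
    p_x = p.sum(axis=(0, 2))                             # p(x_t)
    te = 0.0
    for i, j, k in zip(*np.nonzero(p)):                  # sum over occupied bins
        te += p[i, j, k] * np.log(p[i, j, k] * p_x[j] / (p_xy[j, k] * p_x1x[i, j]))
    return te
```

The positive finite-sample bias of this estimator under weak coupling is precisely what motivates the shuffle correction in ETE.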
It is possible to detect unidirectional causal influence, where only the dynamics of a causal variable propagate to the response variable in this way. Sugihara et al. 5 argue that the inferred direction of unidirectional causal influence is counter-intuitively reversed (i.e. cross mapping from X to Y reveals causal influence from Y to X).
The key assumption of cross mapped indices is that causal relationships are observed in the similarity between the sets of time indices of nearest neighbours for each set of embedding vectors, which can be mapped from one variable to the other to reveal interdependency. This is the idea behind the similarity indices: the two similarity indices we test here, denoted SI (1) Y →X and SI (2) Y →X , are H(X |Y ) in Ref 29 and E(X |Y ) in Ref 30 respectively. Convergent cross mapping 5 (CCM) computes the correlation ρ between the cross mapped estimate and the true value, with convergence in ρ as T increases "a key property that distinguishes causation from simple correlation" 5 .
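A compact sketch of the cross mapping computation, using simplex-style exponential weighting over m + 1 nearest neighbours (parameter choices are illustrative, and the full CCM index additionally checks convergence of ρ as the library length T grows):

```python
import numpy as np

def ccm_corr(x, y, m=2, tau=1):
    """Cross map from the shadow manifold of X to estimate Y; returns the
    correlation rho between the estimates and the true values of Y."""
    k = m + 1                                    # neighbours in simplex projection
    start = (m - 1) * tau
    Mx = np.column_stack([x[start - j * tau: len(x) - j * tau] for j in range(m)])
    target = y[start:]
    est = np.empty(len(Mx))
    for t in range(len(Mx)):
        d = np.linalg.norm(Mx - Mx[t], axis=1)
        d[t] = np.inf                            # exclude the point itself
        nn = np.argsort(d)[:k]
        w = np.exp(-d[nn] / max(d[nn][0], 1e-12))  # exponential distance weights
        est[t] = np.dot(w, target[nn]) / w.sum()
    return np.corrcoef(est, target)[0, 1]
```

Consistent with the reversed direction noted above: under unidirectional coupling Y → X, cross mapping from X (i.e. ccm_corr(x, y)) should yield the higher ρ, because the shadow manifold of X encodes the dynamics of Y.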

III. RESULTS
Our results are split into two parts. First, we reproduce the results from Ref 19, evaluating the performance of all methods including the additional CTIR and CCM, plus ETE using histogram binning and TE using KSG. In these simulations, we choose the same simulation model parameters and causality index parameters as in Ref 19 (Table I). In the second part, we investigate sensitivity to common issues relevant to real-world data, using the Ulam lattice system to illustrate these.

A. Numerical simulations
We investigate performance on four simulated model systems (Table II). In each simulation, we assess the causality estimates of each method by varying the coupling strength λ. These simulated systems are widely studied in chaos theory, e.g. Ref 38, and also appear elsewhere in the causality literature, e.g. the Ulam lattice in Ref 7. Linear process:

    x_{t+1} = b_x x_t + λ y_t + ε_{x,t},    y_{t+1} = b_y y_t + ε_{y,t},    (1)

where ε_{x,t} ∼ N(0, σ_x²) and ε_{y,t} ∼ N(0, σ_y²). Ulam lattice:

    x_{t+1}^(l) = f(λ x_t^(l−1) + (1 − λ) x_t^(l)),  f(x) = 2 − x²,  l = 1, . . . , N_L.    (2)

The Hénon unidirectional map (3) and Hénon bidirectional map (4) complete the set of models; their characteristics and parameters are summarised in Table II.

In general, all measures exhibit a small standard deviation relative to the absolute value of the index, indicating that random initial conditions during data simulation have at most a very small influence on the causal structure, when initial transients are discarded. Though we are able to replicate the findings in Ref 19 in most cases, we occasionally find minor differences between their results and ours. In particular, we sometimes find results of a similar profile but different magnitude as λ varies. We observe this for: EGC and linear processes; PI and all simulations; SI (1) and the Ulam lattice; TE and Hénon unidirectional maps. Though we handle numerical outliers differently in our visualisation of results for Hénon bidirectional maps, our results for these simulations appear largely comparable in magnitude and profile. There is no mention of a data standardisation step in Ref 19; the results we report do not involve any pre-processing, and adding such a step did not appear to rectify these differences. In one notable difference between identical Hénon bidirectional map results, Lungarella et al. 19 find a region in λ-space (namely {(λ_xy , λ_yx ) : λ_xy > 0.1, λ_yx > 0.1, λ_xy + λ_yx < 0.35}) in which they identify general synchronisation between X and Y and have difficulty estimating indices due to numerical instabilities, yet we do not observe this. We have followed the implementation in Ref 19 as closely as possible and it is unclear why these differences exist.
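A minimal simulation of the linear process (1) can be sketched as below; the parameter values here are illustrative defaults, not those of Table II:

```python
import numpy as np

def simulate_lp(T, lam, bx=0.8, by=0.8, sx=1.0, sy=1.0, seed=0, transients=1000):
    """Simulate the linear process (1) with unidirectional coupling Y -> X.
    Parameter values are illustrative, not those of Table II."""
    rng = np.random.default_rng(seed)
    n = T + transients
    x = np.zeros(n)
    y = np.zeros(n)
    for t in range(n - 1):
        y[t + 1] = by * y[t] + rng.normal(0.0, sy)
        x[t + 1] = bx * x[t] + lam * y[t] + rng.normal(0.0, sx)
    return x[transients:], y[transients:]        # discard initial transients
```

Discarding an initial transient block and seeding the generator mirrors the simulation protocol described in Table II.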
We knowingly deviated from the implementations in Ref 19 only in the case of NLGC, in which we preferred to use k-means rather than fuzzy c-means clustering to determine RBF centres, after finding similar or improved results at a much reduced computational cost. Lungarella et al. 19 note that NLGC is numerically unstable for 'small' T and computationally expensive for 'large' T, which we suggest may be partly due to their use of fuzzy c-means. Little detail is provided about their implementation of this, but it may be that an early stopping criterion sometimes forces a 'poor quality' clustering. Further, we found that the performance of NLGC in Hénon bidirectional map simulations improved significantly with a different set of NLGC parameters (e.g. P = 50 instead of P = 10), though we do not present these alternate results.

TABLE II. Brief summary of the characteristics of each numerical simulation model system and parameters (equations 1-4). The difference between identical and non-identical Hénon bidirectional maps is the value of b_y (b_y = b_x for identical maps and b_y < b_x for non-identical maps). In each simulation, the first 10^5 iterations were discarded as transients (10^4 for the linear process). Each simulation is initialised randomly but seeded for reproducibility. The coupling parameters were incremented by 0.01 in all cases, for each of 10 independent runs and all indices. We use the following abbreviations in this table: I - identical maps; NI - non-identical maps; L & S - linear and stochastic; NL & D & C - non-linear, deterministic and chaotic.
For the linear process (LP), the simplest simulation model, all indices show very strong positive correlation in the Y → X direction ( Figure 3). In the reverse X → Y direction, TE and CTIR both decrease with increasing λ , the cross mapped indices all show a marked increase and the remaining indices are approximately zero for all λ . This gives rise to patterns of positive and negative correlation between pairs of methods. As each x t or y t is a sum of Gaussian variables, we can derive theoretical values for Shannon entropy and, consequently, for the information theory methods (see Supplementary Materials). Figure 3 shows that TE (KSG) reliably estimates the 'true' transfer entropy but TE (H) significantly underestimates the theoretical values. This is a fundamental flaw that undermines any other advantageous properties of TE (H). Though computed CTIR values match the theoretical values, it is clearly negative in X → Y and as such does not reflect the causal structure of the system. Increasing the size of the data T alters the value of TE (H) here, but TE (KSG) remains accurate as T increases. However, in all other simulations, TE (H) is more robust to increasing data size.
The Ulam lattice (UL) chains together unidirectionally coupled chaotic Ulam maps. For large N_L, the causal influence from Y to X is negligible. UL exhibits synchronisation for λ ≈ 0.18, 0.82, where cause and effect variables are indistinguishable from each other, e.g. the system converges to a two-state attractor. As a result, most indices either have values approximately equal to zero or suffer from high-variance numerical instabilities. Outside of these regions of synchronisation, the information theoretic methods and regression based indices show reasonable consistency (Figures 2 and 4). The exception is CTIR, which slowly decreases as λ increases for T = 10^5, albeit still correctly identifying the direction of information flow. ETE (H) successfully corrects for the small sample effects that give rise to spurious positive TE (H) results in the Y → X direction when T = 10^3. Both similarity indices fail to identify any causal structure in the UL. For CCM, whilst the net directed index D X→Y increases with λ, it is negative for λ < 0.5, and so misidentifies the direction of causality. The positive correlations between methods in i Y →X for T = 10^5 occur due to a very slight peak in value at λ ≈ 0.5 for nearly all methods (except CTIR and EGC).

FIG. 3. Linear Gaussian processes with T = 10^4 data points and unidirectional (Y → X) coupling. Error bars report ±1 empirical standard deviation from mean values, after 10 independent simulations from the LP system. Simulation parameters are given in Table II and parameters for each causality index are given in Table S.I.

In the Hénon unidirectional (HU) map results, the exceptions are NLGC and CCM, though we found that setting a larger number of RBFs resulted in better performance for NLGC. For CCM, the direction of causality is sometimes incorrect and the reason for this is unclear, though this may also be a result of poor parameter choices.
For identical Hénon bidirectional maps, we observe the expected symmetry in the values of λ, and synchronisation in the region approximately equal to {(λ_xy , λ_yx ) : λ_xy + λ_yx > 0.28}. There are a small number of points in which numerical instabilities are present in all indices, but the consistency across all indices suggests that these are isolated points in which the system converges to some limit cycle or attractor. There are more differences between methods in HB (NI) results. Significant numerical instabilities occur in EGC, particularly when the system is in a state of synchrony: the region approximately equal to {(λ_xy , λ_yx ) : 0.05 < λ_xy < 0.15, 0.1 < λ_yx < 0.28}. Outside of this region, EGC is broadly similar to the information theoretic indices, which are highly correlated, and to a slightly lesser degree with SI (1) and PI. In contrast, NLGC, SI (2) and CCM have unusual results. The first of these is negative almost everywhere (even in a repeat analysis with more RBF kernels), the latter two are mostly non-negative, and moreover the regions with the most extreme values occur in quite different places in all three.

Computational burden
An important consideration in selecting a suitable method is any trade-off between performance and computational efficiency. The most significant factor in this is often how the algorithmic cost of each method scales with increasing data size T. Table S.II shows the mean and standard deviation of the time taken to compute each index and simulation. TE (H)/ETE (H) is the fastest in almost all cases, even though this calculation also includes 10 reshuffles and recomputations for ETE (H). CCM, EGC and TE (KSG) are similarly among the methods with smaller computational cost. Several methods have extreme values in UL simulations with T = 10^5, particularly CTIR and PI, but this is distorted by difficulties in computation when the system is in a synchronised state. We observed a marked difference in computational cost for NLGC when using k-means for clustering instead of fuzzy c-means, and we suggest that k-means is more suitable here. Our computation was done on a high-performance CPU computing cluster using Skylake 6140 18-core 2.3 GHz processors and 384 GB of RAM. Although the computational times we report may be slightly faster than on a laptop computer with less processing power, we did not observe any substantial difference when internally comparing run times.

FIG. Simulation parameters are given in Table II and parameters for each causality index are given in Table S.I. Due to extreme results in the EGC index when the system exhibits synchrony (λ > 0.7 for HU), we set these values to NA and repeat the plot (EGC *). We set the plot limits as ± max(|p_1|, |p_99|) for identical maps and ± max(|p_5|, |p_95|) for non-identical maps, where p_i is the i-th percentile of results for that index.

Real-world relevant data issues
Next, we investigated the sensitivity of all causality indices to a number of modifications mirroring issues that often arise in real-world data. We choose UL with T = 10^3 to illustrate the effects of these transformations, as LP is too simplistic a model to give sufficient insight. We keep all simulation parameters and causality index parameters the same. In Table III, we summarise the means μ̂ and standard deviations σ̂ of the directed indices D X→Y for each λ, which are normalised by their deviation from the 'base' UL results and averaged over all λ. This normalisation allows us to compare across methods that take values in different ranges.

Data availability
Many of the indices, with the exception of TE (KSG) and CTIR, remain consistent with increasing data size T, whilst at the same time exhibiting decreasing variance. These results reinforce similar observations in HU maps. The large increases in the value of the two methods mentioned are concerning and represent a drawback that should be acknowledged in applications of these approaches. It is unclear whether there is convergence to some 'correct' value as the amount of data increases or whether both are unbounded as T → ∞, but initial computations do not support the former (not shown). Though the D X→Y values from the two transfer entropy methods are highly correlated, they are both estimates of the same quantity and it is difficult to reconcile their different magnitudes, particularly as we have already seen significant underestimation in TE (H) for LP simulations.

Standardisation and scaling
In the second set of experiments, we perform three tests: in the first, standardising both series by their sample mean and standard deviation; in the other two, scaling each unstandardised time series in turn by a factor of 10 (Figure S2). For the Ulam lattice system, sample means for both X and Y are typically between 0.4 and 0.7 and standard deviations are both approximately equal to 1.2 (except when the system is in synchrony). Several methods are invariant under linear scaling or shifting of the original series X and Y, including the cross mapping approaches. Information theoretic measures are also invariant in theory, but the KSG algorithm, based on k-nearest neighbours, does not retain this property. Similarly, EGC relies on a neighbourhood size parameter, and scaling the data without changing this parameter accordingly can result in insufficient points being available for the locally linear regressions, as is observed when either X or Y is scaled by 10. The directed index for both NLGC and PI has vastly inflated magnitude when Y is scaled by 10. With this in mind, we recommend standardisation or normalisation of the data before employing these methods.

Rounding error and missing data
We perform three experiments to investigate rounding error, first rounding each time series in turn to 1 decimal place and then rounding both to 2 decimal places (Figure S3). TE and both GC extensions perform similarly to the baseline in all cases, whilst CCM suffers the most. In two further experiments with 10% and 20% of the data missing, all methods appear robust.
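The rounding and missing-data modifications can be sketched as simple transformations applied to the simulated series before re-estimating each index (the amplitudes match the experiments above; the function names are our own):

```python
import numpy as np

def round_series(x, decimals=1):
    """Imitate a fixed measurement precision by introducing rounding error."""
    return np.round(x, decimals)

def remove_at_random(x, frac=0.1, seed=0):
    """Set a random fraction of observations to NaN to mimic missing data."""
    out = np.asarray(x, dtype=float).copy()
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(out), size=int(frac * len(out)), replace=False)
    out[idx] = np.nan
    return out
```

Because each transformation leaves the original simulation untouched, the resulting indices can be compared directly against the baseline results via the f and g deviations of Table III.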

Noisy data
In the earlier LP simulations, Gaussian noise forms an integral component of the system itself, and the theoretical expression for TE shows that this depends only on the ratio of the variances σ_x/σ_y (see Supplementary Material). However, this noise is inherent in the simulation process (i.e. it does not arise in observation of the system). In our UL experiments, we added additional Gaussian noise after simulation. The inclusion of this 'observation' noise does not alter the state of the system or the information flow between variables, but it does obscure the causal structure. In the first of these experiments (Figure S3), in which we added small variance Gaussian noise (σ_G = 0.1), the amplitude of this noise is an order of magnitude less than the original UL values and its inclusion has a small effect for all indices. In the latter experiments, we added Gaussian noise (with σ_G = 1) to each variable individually and the effect is more pronounced. NLGC performs best in general and appears very resilient to noise added to Y (the effect variable), though it drops slightly in value when Gaussian noise is added to X (the cause variable). It is interesting that the two similarity indices have quite different results, with SI (1) more robust to noise, though neither method is in general able to successfully identify the direction of causality.

IV. DISCUSSION
In-depth comparative studies of this kind are relatively rare in the mathematical literature (examples include Refs. 17 and 39), particularly in evaluating the performance of methods for estimating a concept, such as causality, that does not have a consistent, fundamental mathematical definition. Even without this, causal inference has a huge importance in how we can model, predict and exploit real-world applications from many scientific disciplines. Asymmetric bivariate causal inference is the first key step to providing this insight into interactions between components in complex networks. In this context, our work follows Lungarella et al. 19 in reviewing methods drawn from diverse mathematical foundations, but we extend this review with additional methods and crucially we investigate the impact of common issues that are relevant to real-world data. In reproducing and updating their work, we are also able to resolve some computational stability issues and comment on the computational costs of each method, whilst we also make our code publicly available for other researchers to develop further.

TABLE III. Summary of all results from experiments into the effects of data size, scaling, rounding, missing data and Gaussian noise. Taking the Ulam lattice (T = 10^3) as a baseline, we recompute the causality indices across 10 independent experimental runs for each λ ∈ [0, 1] as before. For each λ, we compute the mean, μ̂, and standard deviation, σ̂, of the directed indices D X→Y. We subsequently compute the deviation from the baseline μ and σ, reporting the average of these over the λ values (excluding the few λ values where the system exhibits general synchronisation). We normalise deviations between μ and μ̂ by the absolute value of μ, with f(μ, μ̂) = ⟨μ − μ̂⟩ / ⟨|μ|⟩ and g(σ, σ̂) = ⟨σ̂⟩ / ⟨σ⟩, where ⟨·⟩ is the mean over λ. If the modified simulation returns the same values as the baseline, then f = 0 and g = 1. All entries except in the column for the baseline T = 10^3 report these f and g values.

Further work
A primary concern in causality inference is the difficulty of model misspecification, specifically causal identification in multivariate systems. Omission of confounding variables can create spurious false-positive causal relationships. There may also be redundancy across multiple variables that provide similar information to the effect variable, or sets of variables that interact synergistically such that their combined causal influence is greater than the 'sum of their parts'. These are key concerns outlined in Ref 16 and, consequently, results from bivariate indices cannot be definitively interpreted as the existence of a fundamental direct causal relationship between two variables 29 . A key avenue for further work is to advance this analysis beyond a bivariate setting by including possible confounding variables, in line with conditional extensions to Granger causality 9,20,40,41 and transfer entropy 42 . Recent work with graphical models of multivariate systems 2 is an important step towards high-dimensional causal identification.
Separate univariate embedding is not without limitations and is not necessarily the optimal multivariate embedding. Aside from Ref 43, mixed embeddings are as yet uncommon in causality estimation. There is not yet a theoretical framework for longitudinal data that is recorded non-simultaneously and irregularly. A typical workflow for such data often involves pre-processing to transform the data into a multivariate series with constant time intervals. However, many imputation methods result in significant and poorly quantified biases in information content and flow, which inevitably propagate through to estimates of causality, and more work is needed to explicitly factor this into a causal inference framework.

Summary and recommendations
Each causality index has strengths and weaknesses, and there is no single method whose all-round performance exceeds all others. Transfer entropy and Granger causality have long been regarded as the leading methods for systems that contain a small number of variables, and both have been widely applied 8,9,14 . Transfer entropy has the distinct advantage that it is built upon the principles of Shannon entropy in a well-established and universal information theoretic framework. It performs solidly throughout, though there is some tension between algorithms for TE, with the estimates rarely in complete agreement. We have shown that a histogram fixed partition approach is biased even in the simplest model, despite the general consistency and computational efficiency of TE (H). We therefore recommend the KSG algorithm for transfer entropy computation, unless perhaps data is extremely scarce (T < 10^3). However, there are some unanswered concerns about TE (KSG), particularly that it appears to increase in magnitude as more data becomes available. TE (KSG) also suffers in performance when data is unequally scaled, due to the resulting difficulties with identifying unique nearest neighbours. CTIR, whilst sometimes not wholly dissimilar in value from TE, did not seem to offer any obvious advantage to compensate for its much higher computational cost or occasional unusual behaviour. Vanilla GC is widely favoured but has restrictive assumptions and is ill-suited to complex nonlinear problems. Of the two nonlinear extensions to Granger causality, Lungarella et al. 19 appear to prefer EGC. Some of the computational challenges and numerical instability that they experienced with NLGC may have been a result of their choice of fuzzy c-means for determining RBF kernels, and alternative parameter choices appear to resolve some of their concerns. We find that NLGC is one of the most robust methods to rounding error, missing data and Gaussian noise.
They rightly note that "If the rank of the data is small, kernel based methods tend to overfit" 19 , but we did not observe any issues with this in our simulation experiments. Predictability improvement (PI) likewise performed solidly, and has a slight advantage amongst the regression based indices in that it is perhaps less reliant on parameter choices. Finally, dynamical systems theory offers a different insight into causal inference that should not be readily dismissed despite our mixed results here, even though the deterministic simulation models appeared to be well-suited to the underlying theory. Convergent cross mapping is a more recent and popular method, and offered a broad improvement on the similarity indices (SI), which did not consistently identify the strength or direction of causality. However, CCM also did not always manage to determine the correct causal flow in our simulations.
We have highlighted the value of a standardisation preprocessing step in avoiding algorithmic issues; standardisation is also important when comparing results from different data for each method. Rounding error gives rise to practical issues within the implementation of several of the algorithms. For instance, k-nearest-neighbour approaches typically assume that the distances between pairs of points are unique and not discrete. The resulting edge cases can be treated by adding low-amplitude random noise to the data before estimating the causal relationships 27 , though the propagation of this noise to the final estimates should be analysed. Likewise, many existing implementations of these methods are not equipped to handle missing data (e.g. Refs. 42 and 44). We believe this is broadly straightforward to implement across all indices, as it can be handled exclusively within the time-delay embedding step, by performing an embedding and then removing any embedding vectors that are missing at least one component. As Mönster et al. 37 put it, "Noise in real-world data is ubiquitous, the inclusion of noise in model investigations has been largely ignored". Added Gaussian noise leads to the biggest changes in value for most methods, particularly observational noise in the causal variable. However, provided the magnitude of the noise is small compared to the values themselves, all methods perform adequately.
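The preprocessing steps described above can be sketched as follows. This is an illustrative outline (function names and defaults are ours, not our repository's API), covering standardisation, time-delay embedding with removal of incomplete embedding vectors, and optional low-amplitude jitter to break distance ties introduced by rounding.

```python
import numpy as np

def standardise(x):
    """Zero-mean, unit-variance scaling, ignoring missing (NaN) values.

    Avoids the scale-dependence issues noted for nearest-neighbour methods.
    """
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x)

def delay_embed(x, m, tau, jitter=0.0, rng=None):
    """Time-delay embedding with rows [x_t, x_{t-tau}, ..., x_{t-(m-1)tau}].

    Rows containing NaN (i.e. embedding vectors missing at least one
    component) are dropped, so missing data is handled entirely at the
    embedding step. Optional low-amplitude jitter breaks ties between
    rounded values before nearest-neighbour searches.
    """
    x = np.asarray(x, dtype=float)
    n = len(x) - (m - 1) * tau
    emb = np.column_stack(
        [x[(m - 1 - i) * tau : (m - 1 - i) * tau + n] for i in range(m)]
    )
    emb = emb[~np.isnan(emb).any(axis=1)]    # drop incomplete vectors
    if jitter > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        emb = emb + rng.normal(scale=jitter, size=emb.shape)
    return emb
```

Any propagation of the jitter into the final causality estimates should still be checked, as noted above.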
On the basis of this work, we conclude that the strongest choice for identifying and quantifying bivariate causal relationships is, in our view, either transfer entropy (KSG) or nonlinear Granger causality. Predictability improvement is a reasonable alternative and perhaps the next best candidate. A more cautious approach may involve using more than one method, drawn from different theoretical backgrounds. Where possible, it is advantageous to identify a base case for the system, against which subsequent results can be reliably compared. For new methodologies, we recommend investigation of the real-world issues we have discussed.

CODE AND DATA AVAILABILITY
Our code is openly available at the GitHub repository https://github.com/tedinburgh/causality-review and Ref 45. The data that support the findings of this study are openly available at the same repository. A CODECHECK certificate, confirming that the computations underlying this article could be independently executed, is available at doi.org/10.5281/zenodo.4720843.
Existing open-access code for some of the indices includes repositories for information theory and transfer entropy (IDTxl 42 v1.1, PyIF 46 ) and for convergent cross mapping (pyEDM 44 v1.7.4). We also adapted fuzzy c-means code based on Ref 47. We checked our results for transfer entropy and convergent cross mapping against those from IDTxl and pyEDM respectively. All code in our repository, and in these other packages, is written in Python.

SUPPLEMENTARY MATERIALS
Our supplementary materials contain additional tables and figures. Tables S.I and S.II show the full parameter choices and the computational time requirements of each method, respectively. Figures S1-S3 show the results of the real-world-relevant transformation experiments. The supplementary materials also contain theoretical results for the information-theoretic measures in the linear process simulation.

ACKNOWLEDGEMENTS

TE is funded by the Engineering and Physical Sciences Research Council (EPSRC) National Productivity Investment Fund (NPIF), EP/S515334/1, reference 2089662. A CC BY or equivalent licence is applied to the AAM arising from this submission: this manuscript is the AAM.
We would like to acknowledge Marcel Stimberg and Daniel Nüst for their work on the CODECHECK. We found several existing open-source code repositories, listed above in Code and data availability. It was insightful to view and test these packages, though we still decided to develop our own code for these methods. In addition, we appreciate advice from George Sugihara on the use of CCM (pyEDM) in email correspondence. We also acknowledge the Python community for core packages that this work depends upon, including ipython 48 v7.16.

Table S.I shows the causality index parameters for each method and simulated model system. Table S.II shows the mean and standard deviation of the computational time for each method across all simulated model systems. Transfer entropy has a closed-form solution for the autoregressive linear process with Gaussian noise. We show the algebraic calculations for m = 1, τ = 1, using the stationary covariances: c(y_t, y_t) = c(y_{t+1}, y_{t+1}) = c(b_y y_t + ε_{y,t}, b_y y_t + ε_{y,t}) = b_y^2 c(y_t, y_t) + c(ε_{y,t}, ε_{y,t}), with c(ε_{y,t}, ε_{y,t}) = σ^2; c(y_t, y_{t+1}) = c(y_t, b_y y_t + ε_{y,t}) = b_y c(y_t, y_t); and c(x_t, y_{t+1}) = c(x_t, b_y y_t + ε_{y,t}) = b_y c(x_t, y_t).

From the figure captions: Simulation parameters are given in Table 2 and parameters for each causality index are given in Table S.I. Due to extreme results in the EGC index (for T = 10^5 only) when the system exhibits synchrony (λ ≈ 0.18, 0.82), we set these values to NA. σ^2_G is the variance of Gaussian noise added to both variables, and σ^2_{G,X} is the same added only to X (and similarly for Y). Error bars report ±1 standard deviation from mean values, after 10 independent simulations of the Ulam lattice.

TABLE S.I. Causality index parameter values for each simulation, which are the same as those in Ref 1.
The indices are as follows (where GC denotes Granger causality): extended GC (EGC), nonlinear GC (NLGC), predictability improvement (PI), transfer entropy (TE), effective transfer entropy (ETE), coarse-grained transinformation rate (CTIR), similarity indices (SI) and convergent cross mapping (CCM). TE (H) denotes transfer entropy estimation using a histogram partition, and TE (KSG) denotes transfer entropy using the Kraskov-Stögbauer-Grassberger estimator. The common parameters m and τ (*) apply to all methods, except when otherwise specified in the row corresponding to a given method. CTIR is the only method that does not use these two parameters.