Recent advances in applying deep reinforcement learning for flow control: perspectives and future directions

Deep reinforcement learning (DRL) has been applied to a variety of problems during the past decade, and it has provided effective control strategies in high-dimensional, non-linear situations that are challenging for traditional methods. Applications are now flourishing in the field of fluid dynamics, and specifically in active flow control (AFC). In the AFC community, the encouraging results obtained in two-dimensional and chaotic conditions have raised interest in studying increasingly complex flows. In this review, we first provide a general overview of the reinforcement-learning (RL) and DRL frameworks, as well as of their recent advances. We then focus on the application of DRL to AFC, highlighting the current limitations of the DRL algorithms in this field, suggesting some of the potential upcoming milestones to reach, and pointing out open questions that are likely to attract the attention of the fluid-mechanics community.


II. REINFORCEMENT LEARNING

A. Introduction
Reinforcement learning (RL) is a branch of data-driven methods, considered as a subsection of machine learning, which is itself a subset of artificial intelligence. The first study developing the modern form of reinforcement learning was provided by Sutton 55 at the end of the 1980s. Sutton 55 merged two distinct approaches: learning by trial and error, notably studied for its natural occurrence in the learning process of animals, and optimal control. By combining the two, the aim of modern RL is to optimize the decisions taken by an agent in a surrounding environment during a sequence of actions. In other words, the idea is to train and improve a control law (the agent) able to optimize a target system (the environment) by taking sequences of actions.
Reinforcement learning has already been applied to many fields with successful results. During the past decade, RL algorithms have been enhanced thanks to neural networks, giving rise to DRL methods built on the basis of RL. One may refer to Ref. 1 for RL and Ref. 3 for DRL applied to robotics, to Refs. 8 and 9 for DRL applied to Atari games with superhuman performance, to Ref. 10 for video games and Ref. 6 for the game of Go, to name a few.
In reinforcement learning, an agent interacts with an environment in a three-step sequence:

1. The environment gives the agent a partial observation o of its state s. In a continuous environment space, the observation is necessarily incomplete, so o may not provide enough information to describe s. In order to overcome this issue, the state $s_t$ obtained at time t is often described by the succession of actions and observations that led to it: $s_t = (s_0, a_1, o_1, a_2, \dots, o_{t-1}, a_t)$, or similarly: $s_t = (s_0, a_1, s_1, a_2, \dots, s_{t-1}, a_t)$.

2. According to the observation o of state s, the agent takes an action a, following a certain policy π: a = π(s). This action impacts the environment and modifies its state; the new state is denoted s′ (a priori different from s).

3. Eventually, a reward r (more precisely $r_t$ at time t) is returned to the agent. The reward defines the feature that the user wants to improve: through it, the RL algorithm gathers information and progressively enhances its policy. Instead of the instantaneous reward $r_t$, RL algorithms usually optimize a cumulative reward, denoted by R and defined as:

$R = \sum_{t=0}^{T} \gamma^{t} \, r_t$ ,    (1)

where γ is the discount factor (usually, γ ≈ 0.99), and T is the duration of an episode.

The sequence of the algorithm is as follows:
• First, at t = 0, the state of the environment is (randomly) initialized (state $s_0$).
• At each time-step t, the three-step process described above (see Fig. 1) is applied, returning a reward $r_t$ after having changed the state of the environment from $s_t$ to $s_{t+1}$ by enforcing the action $a_t$ according to the policy π.
• When t reaches T, the process stops. The sequence of actions running during a time T is called an episode. At the end of an episode, the cumulative reward R is computed and the policy is modified with the long-term objective of optimizing R. During an episode, T actions are taken; in the cumulative reward, actions with a low instantaneous reward can be compensated by others, and no distinction is made between them apart from the discount factor $\gamma^t$. This factor slightly favours short-term over long-term strategies, while enabling R to converge when episodes with numerous actions are considered. A full sequence $(s_0, a_1, s_1, a_2, \dots, s_T)$ is called a trajectory, further denoted by τ. A minimal sketch of this interaction loop is given below.
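As a concrete illustration of this loop, the following is a minimal Python sketch of one episode; the toy environment, the random placeholder policy and all dimensions are illustrative assumptions, not a specific AFC setup:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_policy(observation, n_actions=2):
    """Placeholder policy pi: samples an action uniformly (illustrative only)."""
    return int(rng.integers(n_actions))

def toy_env_step(state, action):
    """Toy stochastic environment: returns (next_state, reward)."""
    next_state = state + action + 0.1 * rng.standard_normal()
    reward = -abs(next_state)                 # instantaneous reward r_t
    return next_state, reward

def run_episode(T=100, gamma=0.99):
    """Observe -> act -> reward loop, then the cumulative reward of Eq. (1)."""
    state = rng.standard_normal()             # random initialization of s_0
    rewards = []
    for t in range(T):
        observation = state                   # step 1: here s is fully observed
        action = random_policy(observation)   # step 2: a_t = pi(s_t)
        state, r = toy_env_step(state, action)  # step 3: reward r_t returned
        rewards.append(r)
    # cumulative discounted reward of Eq. (1): R = sum_t gamma^t r_t
    return sum(gamma**t * r for t, r in enumerate(rewards))

print(run_episode())
```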
The cumulative reward is commonly not computed over the whole trajectory, but from a certain time-step $t_0$. Equation (1) is then replaced by:

$R_{t_0} = \sum_{t=t_0}^{T} \gamma^{t-t_0} \, r_t$ .    (2)

Depending on the representation chosen for the environment, RL algorithms are ordinarily separated into two broad categories. If the agent interacts with an artificial model of the environment, the algorithm is considered to be model-based; otherwise, the algorithm is model-free.
In section II B, examples of model-based methods are presented, but since the majority of DRL algorithms are model-free, attention will mainly be given to the model-free category in this review.
Additionally, the reader can refer to the works of Jaderberg et al. 56 and Ha and Schmidhuber 57, or to the recent review of Rabault and Kuhnle 58, for examples of model-free algorithms incorporating a model-learning process. By trying to build a representation of the environment during the learning process and to benefit from it, these initially model-free methods lie at the crossroads of model-free and model-based algorithms.
In the following subsections we will only discuss the main concepts and features of RL algorithms, without a focus on DRL. How NNs can be used together with RL algorithms to develop DRL will be discussed in section III.

B. Model-based methods
Model-based methods require an artificial representation of the environment. Historically, linear models first spread across the field of flow control, as they are relatively simple and proved to be efficient on multiple problems, e.g., the stabilization of convectively unstable flows [59][60][61]. But flows are also commonly characterized by strongly nonlinear behaviors. Although linear models are justifiable in turbulent situations where the flow behaves like a linear amplifier 62, their application to highly turbulent regimes is ambiguous and has long been debated in the community (see, e.g., Ref. 63).

The computational gain obtained with model-based methods results from the simplification of the environment. This gain is especially relevant in the situation of flow control, but it necessarily leads to sub-optimal solutions when it comes to high-Reynolds-number regimes, as the models only partially approximate the turbulence. Hence, models can still be configured for these ranges of regime, but one is then confronted with a trade-off between sub-optimal model-based controllers, which rely on necessarily approximate models of the flow, and computationally-expensive model-free strategies, which interact directly with the full system.

C. Model-free methods
Model-free methods are increasingly popular in the community because of their relative simplicity of application. Indeed, the agent directly interacts with the environment: no assumption has to be made about the environment modelling. Thanks to regular partial observations of the environment state, the agent finds a control sequence $(a_0, \dots, a_t, \dots, a_T)$ that maximizes the cumulative reward.
The environment is usually considered as stochastic (i.e., when the action a is applied in the state s, the new state s′ cannot be predicted exactly) and is commonly modelled by a Markov decision process (MDP) 20,69,70, even though it could be represented by other stochastic processes 38, or considered to be deterministic 71.

Markov Decision Process
Markov decision processes (MDPs) 69,70 are at the core of modern reinforcement learning 20 .
A MDP is a tuple $(S, A, P_a, R)$, with S and A the sets of (respectively) all the possible states and actions, and R the set of all the possible rewards. One can define $P_a(s, s')$ as the probability that taking action a in state s will lead to state s′. Reusing the standard nomenclature proposed in Ref. 22, let us define the operator P as:

$P : (s, a, s') \mapsto P(s'|s, a) = P_a(s, s')$ .    (3)

Hence, P represents the probability of obtaining state s′ after having taken action a in state s.

The policy is defined by:

$\pi_\theta : S \to \mathcal{P}(A), \ s \mapsto \pi_\theta(s)$ ,    (4)

where $\pi_\theta(s)$ returns the probability distribution of the actions when the current state is s. The action is then chosen by the agent according to $\pi_\theta(s)$, and θ represents the set of parameters that characterizes π. The aim is to find the policy π that maximizes the cumulative reward R.

Modifications of the policy are made by choosing different sets of parameters θ. Furthermore, P and $\pi_\theta$ completely describe the fundamental process of the RL algorithm: given a state s, the agent takes the action a according to $\pi_\theta(s)$, and P then returns the probability of obtaining a new state s′.

State value and state-action value functions
In addition, three functions are usually defined in order to describe and classify the RL algorithms. The state-action value function, also called the Q-function, is defined by:

$Q^{\pi}(s, a) = \mathbb{E}_{\tau \sim \pi} \left[ R(\tau) \,|\, s_0 = s, \ a_0 = a \right]$ ,    (5)

and the state value function is defined by:

$V^{\pi}(s) = \mathbb{E}_{\tau \sim \pi} \left[ R(\tau) \,|\, s_0 = s \right]$ .    (6)

Hence, $Q^{\pi}(s, a)$ corresponds to the expected cumulative reward when the action a is taken in the initial state s and the trajectory τ then follows the policy π, whereas $V^{\pi}(s)$ corresponds to the expected cumulative reward when the initial state is s and the trajectory τ then follows the policy π. Thus:

$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(s)} \left[ Q^{\pi}(s, a) \right]$ .    (7)

Finally, the advantage function $A^{\pi}$ is defined by:

$A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$ .    (8)

$A^{\pi}$ characterizes the advantage (in terms of cumulative reward) of taking action a in state s rather than following the policy π.
Among the model-free RL algorithms, two main families are developed for their promising performance: value-based 72 and policy-based 73 methods. During the RL process, the agent takes actions according to the state of the environment by following a policy. With policy-based methods, the aim is to directly optimize the policy π (see section II C 6). In contrast, value-based methods rely on learning the Q-function, and then deriving the optimal policy from it (see section II C 5).

Closed-loop and open-loop methods
In addition to the value/policy criterion, RL methods are also commonly distinguished according to the frequency of the interactions between the agent and the environment. In closed-loop control, the agent obtains a regular feedback on the current state of the environment thanks to sensors. The RL algorithm uses these measurements to adapt its policy during the episode, in order to tend towards a more desired state (see Ref. 21 for a review on closed-loop turbulence control).
On the other hand, open-loop methods assume that the decision policy does not rely on the state of the environment, or at least that a steady or periodic actuation can modify the principal dynamical processes of the environment 74 . One may refer to the recent works of Shahrabi 75 and Ghraieb et al. 35 for the use of open-loop methods in flow control. By not considering the surrounding environment, these methods are generally less efficient than the closed-loop solutions, but easier to implement. The upcoming sections will mainly focus on closed-loop algorithms, as they are the most promising and challenging methods for the future.

On-policy and off-policy, online and offline methods
Another distinction among RL algorithms is made between on-policy and off-policy methods.
Within the former, the agent solely learns about the policy it applies to the environment, contrary to the latter, where the agent can learn from other policies. For instance, Degris, White, and Sutton 76 execute a second (behavior) policy aimed at choosing the trajectories while the first one decides the actions. Another example of an off-policy method is the Q-learning process 77 mentioned in the next section. Furthermore, with on-policy methods, the data collected before the last update of the (single) policy cannot be used for forthcoming updates, as these data had been generated by a policy that differs from the last version. Therefore, off-policy settings are of great interest, as they enable the algorithm to reuse previous experiences, stored from either earlier in the learning process or from completely separate learning runs, and to learn multiple tasks in parallel thanks to the different policies.
Offline RL algorithms learn and update their policy(-ies) only at the end of an episode, or after N episodes, while online algorithms make updates 'in real time'. Online strategies present the advantage of learning from the current state of the environment instead of past situations, but they require demanding computational capabilities 78, which are not always available nor achievable. The reader may refer to the work of Degris, Pilarski, and Sutton 79 and the references therein for additional information on the differences between offline and online methods.

Value-based methods
Value-based methods rely on the Q-function. The optimal Q-function can be defined as $Q^* = \max_\pi Q^\pi$. In reinforcement learning, the objective of value-based methods is to have a Q-table filled with the values of $Q^*(s, a)$ for any (s, a) in S × A. In other words, the aim is to be able to know the best policy to apply when starting from any tuple (s, a) in S × A, in order to maximize the expected cumulative reward. The Q-table can be thought of as an array or a transition-reward matrix in the discrete-state case, or it can be approximated by a NN or another function approximator in general. For simplicity, in this section Q is considered as an array that can be filled with values, while an ANN trained by gradient descent will be considered in deep Q-learning (see section III B 1).
It is not possible to directly compute $Q^*$, contrarily to $Q^\pi$. Thus, the Q-table is first filled with values of $Q^\pi$, estimated by exploring the space S × A. Additionally, $Q^*$ respects the Bellman optimality condition 80:

$Q^{*}(s, a) = \mathbb{E}_{s'} \left[ r(s, a) + \gamma \max_{a'} Q^{*}(s', a') \right]$ .    (9)

Here, r(s, a) represents the instantaneous reward returned when action a is taken in state s. This relationship on $Q^*$ means that, when the action a is first applied to the initial state s, leading to the state s′ (as the process is stochastic, s′ is not known precisely, hence the expectation over s′), if the optimal Q-function is known at the next time-step, then the best policy is to choose the next action a′ that maximizes $Q^*(s', a')$.
To update the Q-table, Eq. (9) inspires the recursive update relationship given below:

$Q^{\pi}(s, a) \leftarrow Q^{\pi}(s, a) + \alpha \left[ r(s, a) + \gamma \max_{a'} Q^{\pi}(s', a') - Q^{\pi}(s, a) \right]$ ,    (10)

with 0 < α ≤ 1 a learning rate. It has been proven that updates of $Q^\pi$ following Eq. (10) make $Q^\pi$ converge to $Q^*$ 80.
Eventually, when the Q-table is filled with the estimated values of $Q^*$ (thanks to the recursive process given by Eq. (10)), and given a state s, one can optimize the action that the agent will take by choosing the action $a^*$ according to the 'greedy' policy:

$a^{*} = \arg\max_{a} Q^{*}(s, a)$ .    (11)

These value-based methods are usually called Q-learning 77. One can refer to the original work of Watkins 72, which was already an evolution of temporal-difference learning 55.
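As an illustration of Eqs. (9)-(11), the following is a minimal tabular Q-learning sketch; the toy transition and reward functions, the epsilon-greedy exploration scheme and all dimensions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))        # the Q-table
alpha, gamma, eps = 0.1, 0.99, 0.1         # learning rate, discount, exploration

def toy_step(s, a):
    """Illustrative stochastic transition and reward (stand-ins for a real MDP)."""
    s_next = (s + a + int(rng.integers(-1, 2))) % n_states
    r = 1.0 if s_next == 0 else 0.0
    return s_next, r

for episode in range(500):
    s = int(rng.integers(n_states))
    for t in range(50):
        # epsilon-greedy exploration around the greedy policy of Eq. (11)
        a = int(rng.integers(n_actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_next, r = toy_step(s, a)
        # recursive update of Eq. (10): move Q(s,a) toward r + gamma * max_a' Q(s',a')
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

greedy_action = int(np.argmax(Q[3]))       # a* = argmax_a Q(3, a), Eq. (11)
```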

Policy-based methods
The second main category of RL methods directly relies on an estimation and an optimization of the policy function $\pi_\theta$, and does not consider an optimization of the Q-function. These policy-based methods have been developed since (at least) the work of Williams 73. The objective function J is introduced in order to evaluate the quality of the policy. As the long-term aim still remains to maximize the expected cumulative reward, J is commonly defined by:

$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right]$ .    (12)

J can vary thanks to modifications of θ, the set of parameters characterizing the policy. Thus, the objective is to find:

$\theta^{*} = \arg\max_{\theta} J(\theta)$ .    (13)

To that purpose, the gradient of J is introduced, so that θ can be recursively modified by following the direction of this gradient:

$\theta \leftarrow \theta + \lambda \nabla_{\theta} J(\theta)$ ,    (14)

with λ being a positive parameter. Two distinct options are discussed for the evaluation of $\nabla_\theta J$, whether the policy is considered to be stochastic or deterministic, i.e., whether π(s) returns a probability distribution of actions a or an exact and unique value a. In the former case, the gradient is computed according to the relationship 22,73,81:

$\nabla_{\theta} J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \nabla_{\theta} \log \pi_{\theta}(a_t | s_t) \, R(\tau) \right]$ .    (15)

In this stochastic approach, the ascent of J consists in first sampling the policy on different trajectories, then computing an average of the cumulative rewards obtained on these trajectories (the average thus replaces the expectation in Eq. (15)), and finally adjusting θ with Eq. (14), in order to progressively enhance the cumulative reward.
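As a minimal illustration of Eqs. (12)-(15), the sketch below performs one gradient-ascent step on J for a stochastic policy (PyTorch is assumed; the dummy environment with random states and rewards, the network size and the number of sampled trajectories are illustrative):

```python
import torch

torch.manual_seed(0)
obs_dim, n_actions, lam = 4, 2, 1e-2
policy = torch.nn.Sequential(torch.nn.Linear(obs_dim, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, n_actions))   # pi_theta
optimizer = torch.optim.SGD(policy.parameters(), lr=lam)

def sample_trajectory(T=20):
    """Illustrative rollout in a dummy environment (random states and rewards)."""
    log_probs, rewards = [], []
    for t in range(T):
        s = torch.randn(obs_dim)                 # stand-in for the observed state
        dist = torch.distributions.Categorical(logits=policy(s))
        a = dist.sample()                        # a ~ pi_theta(s), stochastic policy
        log_probs.append(dist.log_prob(a))
        rewards.append(torch.randn(()))          # stand-in for r_t
    return torch.stack(log_probs), torch.stack(rewards)

# Monte-Carlo estimate of Eq. (15): average over trajectories of
# R(tau) * sum_t grad log pi_theta(a_t | s_t); the optimizer step realizes Eq. (14).
loss = 0.0
for _ in range(8):                               # 8 sampled trajectories
    log_probs, rewards = sample_trajectory()
    loss = loss - rewards.sum().detach() * log_probs.sum() / 8
optimizer.zero_grad()
loss.backward()
optimizer.step()                                 # theta <- theta + lam * grad J
```

Minimizing the negated return with a standard optimizer is the usual way of performing the gradient ascent of Eq. (14) in practice.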
In the deterministic approach (deterministic policy-gradient algorithm, or DPG algorithm 71), the environment is still represented by a stochastic process (MDP), but the policy is considered to be deterministic. To keep the difference in mind, the deterministic policy is now denoted $\mu_\theta$. Replacing $\pi_\theta$ with $\mu_\theta$, the objective of the deterministic policy-gradient algorithms is to apply the same principle as in the stochastic case. The mathematical formulation of the deterministic policy gradient is beyond the scope of the present review. One may refer to the original work of Silver et al. 71, and bear in mind that, with a deterministic approach, there is no need to sample over a range of actions and states, but only over a range of states, as the actions are deterministically deduced.

D. Limitations of the classical reinforcement-learning methods
Classical RL methods, i.e., algorithms that use simple function approximators such as matrices, polynomials or other simple functions, may encounter limitations. Historically, applications of value-based methods to real-world problems 82,83 were rapidly overtaken by faster policy-based methods (see, e.g., the policy-gradient method 84), which quickly proved their efficiency and applicability to various real-world tasks [14][15][16].
Through the idea of experience replay (see the original work of Lin 85), value-based methods have nonetheless regained interest, paving the way for the deep Q-learning algorithms discussed in section III B.

III. DEEP REINFORCEMENT LEARNING

In this section, a short introduction to artificial neural networks is first provided. An overview of the main DRL methods, presented in light of their RL precursors, is then given. Within these methods, the most recent and encouraging developments are presented, as well as the challenges they face and try to overcome.

A. Artificial neural networks
ANNs can be used in RL algorithms for their ability to estimate complex non-linear functions 12. The parameters of an ANN (its weights and biases) are commonly tuned step by step, by following, e.g., the back-propagation algorithm 87.
The number of NNs and the functions they approximate within the RL algorithm define several different DRL methods, which we analyze and compare in the following paragraphs.
B. Value-based methods

Deep Q-learning
Artificial neural networks are commonly known for their ability to deal with high dimensionality and non-linearity, owing to their universal-approximator properties, as previously discussed. Therefore, instead of filling a Q-table with the values of $Q^\pi$ (as was done in value-based RL), ANNs may be appropriate to directly approximate $Q^*$.
The approximation of $Q^*$ by the neural network is represented by $Q_\theta$, with all the weights and biases of the neural network gathered in the set θ. The objective of the training phase of the neural network is to tune θ in order to achieve the best approximation of $Q^*$. To do so, a loss function is defined as the mean-squared error of the Bellman equation (Eq. (9)):

$L(\theta) = \mathbb{E}_{s,a} \left[ \left( \mathbb{E}_{s'} \left[ r(s,a) + \gamma \max_{a'} Q_{\theta}(s',a') \right] - Q_{\theta}(s,a) \right)^{2} + \mathrm{Var}_{s'} \left( r(s,a) + \gamma \max_{a'} Q_{\theta}(s',a') \right) \right]$ .    (16)

The variance of $r(s,a) + \gamma \max_{a'} Q_\theta(s',a')$ being independent of the weights θ, the second term is usually ignored 9, and the loss function is limited to:

$L(\theta) = \mathbb{E}_{s,a} \left[ \left( \mathbb{E}_{s'} \left[ r(s,a) + \gamma \max_{a'} Q_{\theta}(s',a') \right] - Q_{\theta}(s,a) \right)^{2} \right]$ .    (17)

More precisely, at each time-step i:

$L_{i}(\theta_{i}) = \mathbb{E}_{s,a} \left[ \left( \mathbb{E}_{s'} \left[ r(s,a) + \gamma \max_{a'} Q_{\theta_{i-1}}(s',a') \right] - Q_{\theta_i}(s,a) \right)^{2} \right]$ .    (18)

The objective of the neural network is to minimize the loss function, thus to modify $Q_{\theta_i}$ to bring it closer to the target $r(s,a) + \gamma \max_{a'} Q_{\theta_{i-1}}(s',a')$, computed with the weights $\theta_{i-1}$ of the previous update. To this end, the update:

$\theta_{i+1} = \theta_{i} - \alpha \nabla_{\theta_i} L_{i}(\theta_{i})$     (19)

is applied (with α a positive parameter) and should make $Q_\theta$ converge to $Q^*$.
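A minimal sketch of one deep Q-learning update along the lines of Eqs. (17)-(19) is given below (PyTorch is assumed; the network size and the random batch of transitions are illustrative stand-ins). Note that the target is computed with the same network, which is precisely the 'moving target' issue discussed below:

```python
import torch

obs_dim, n_actions, gamma, alpha = 4, 2, 0.99, 1e-3
q_net = torch.nn.Sequential(torch.nn.Linear(obs_dim, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, n_actions))    # Q_theta
optimizer = torch.optim.SGD(q_net.parameters(), lr=alpha)

def dql_update(s, a, r, s_next):
    """One gradient step on the loss of Eqs. (17)-(18), with mini-batch tensors."""
    with torch.no_grad():
        # target r + gamma * max_a' Q_theta(s', a'), frozen during this update
        target = r + gamma * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)        # Q_theta(s, a)
    loss = torch.nn.functional.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                            # Eq. (19)
    return loss.item()

# illustrative batch of transitions (s, a, r, s')
batch = 32
s = torch.randn(batch, obs_dim); a = torch.randint(n_actions, (batch,))
r = torch.randn(batch); s_next = torch.randn(batch, obs_dim)
dql_update(s, a, r, s_next)
```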

However, a naive coupling between an ANN and a RL algorithm, such as the one mentioned above, would be compromised by two issues that may lead to divergence. First, the process of learning on-policy from sequences of actions and observations induces correlations. Indeed, the algorithm is potentially biased if it solely learns from the successions of tuples $(s_t, a_t, s_{t+1}, r_t)$ that occur during the running episode (on-policy), because these tuples are generally not independent within an episode, which tends to make the NN training fail. Furthermore, a minor modification of the value of Q may significantly change the data distribution, and thus may lead to an unstable algorithm.
In order to mitigate these two biases, Mnih et al. 8 suggest the integration of a replay buffer in their algorithm. Filled with a certain amount of tuples $(s_t, a_t, s_{t+1}, r_t)$ that occurred in past episodes, the replay buffer provides the ANN with less correlated input data and, simultaneously, a more stable data distribution, which is required for a correct learning process. The sampling of the replay buffer can also be adapted so as to favour rare or significant experiences instead of a uniform process, as the learning phase could benefit from these unusual events.
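A minimal sketch of such a buffer is given below; the capacity, the stored fields and the dummy transitions are illustrative assumptions:

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of past tuples (s_t, a_t, s_{t+1}, r_t); sampling from it
    decorrelates the training data, as suggested by Mnih et al."""
    def __init__(self, capacity=100_000):
        self.storage = deque(maxlen=capacity)     # oldest tuples evicted first

    def push(self, s, a, s_next, r):
        self.storage.append((s, a, s_next, r))

    def sample(self, batch_size):
        # uniform sampling; a prioritized scheme would instead draw rare or
        # surprising transitions more often
        return random.sample(self.storage, batch_size)

buffer = ReplayBuffer()
for t in range(1000):
    buffer.push(s=t, a=0, s_next=t + 1, r=0.0)    # illustrative transitions
batch = buffer.sample(32)
```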
Even with the DQL algorithm, a major restriction remains, due to the fact that the neural network is chasing a moving target. As developed above, the ANN tries to reach the Bellman optimality condition (Eq. (9)) by minimizing the loss function L. To do so, it modifies $Q_{\theta_i}$ to bring it closer to the target $r(s,a) + \gamma \max_{a'} Q_{\theta_{i-1}}(s',a')$, but the target itself changes at each time-step i.
This issue induces instability and divergence in many situations, notably when the estimation of Q is provided by a non-linear function approximator 89, such as a fully-connected ANN (FCANN). This instability thus strongly reduces the robustness of the algorithm.

Deep Q-networks
Developed by Mnih et al. 9 , Deep Q-networks (DQNs) are an evolution of DQL. The DQN method mainly improves the ability of the neural network to treat high-dimensional environment spaces and suppresses the issue of divergence detailed above, by providing an innovation to the DQL algorithm.
The problem of divergence is solved by exploiting a second ANN, a 'slow-reacting clone' of the original one. Let us define the loss function from Eq. (17). In DQL, the algorithm tries to equalize the target $r(s,a) + \gamma \max_{a'} Q_\theta(s',a')$ with $Q_\theta(s,a)$ by using a single neural network. In DQN, an ANN is applied for the computation of $Q_\theta$ and the 'slow-reacting' copy is applied for the estimation of the target. Denoting θ′ the set of weights and biases of this second network, the loss function is then defined by:

$L(\theta) = \mathbb{E}_{s,a,s'} \left[ \left( r(s,a) + \gamma \max_{a'} Q_{\theta'}(s',a') - Q_{\theta}(s,a) \right)^{2} \right]$ .    (20)

The strategy suggested by Mnih et al. 9 is to regularly equalize θ′ and θ. If such an equalization is made at each update of θ, then the method becomes identical to DQL. Different strategies have been developed to correlate θ and θ′. In Ref. 90, the authors develop a different method (called DDPG and detailed below) but use the same idea, and propose the iterative process:

$\theta' \leftarrow \tau \theta + (1 - \tau) \theta' , \quad \tau \ll 1$ .    (21)

In the original DQN algorithm 9, θ′ is updated once every N updates of θ (N ≫ 1). These strategies with two ANNs thus eliminate the instability by slowing down the movements of the target.
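The two strategies for keeping the slow-reacting clone close to the main network can be sketched as follows (PyTorch is assumed; the small linear network and the values of N and τ are illustrative stand-ins):

```python
import copy
import torch

q_net = torch.nn.Linear(4, 2)            # Q_theta (stand-in for a deeper ANN)
target_net = copy.deepcopy(q_net)        # slow-reacting clone Q_theta'

def hard_update(step, N=1000):
    """DQN-style update: theta' <- theta once every N updates of theta."""
    if step % N == 0:
        target_net.load_state_dict(q_net.state_dict())

def soft_update(tau=0.005):
    """DDPG-style iterative process of Eq. (21):
    theta' <- tau * theta + (1 - tau) * theta'."""
    with torch.no_grad():
        for p, p_target in zip(q_net.parameters(), target_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p)
```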
Dueling DQN 91 is an evolution of DQN in which the neural network is split into two streams, one aimed at evaluating the advantage function, while the other one computes the state value function $V^\pi$. The Q-function is then estimated by combining the approximations of $A^\pi$ and $V^\pi$ (in a non-trivial manner). The advantage function depends on the actions chosen, whereas the state value function does not. By taking advantage of these two functions and their different dependencies on the actions, the dueling DQN algorithm outperforms the DQN method, particularly in situations where many actions have similar values.

Double deep Q-learning and double deep Q-networks
Double DQL and double DQN 92,93 have been developed in order to solve the issue of overestimation occurring in DQL and DQN, a problem partially unknown or ignored until then. The overestimation in these algorithms is due to the fact that, in the computation of the target, the same neural network is used both for the selection of the action and for its evaluation. Indeed, in DQL, one can define the target Y as:

$Y(s,a) = r(s,a) + \gamma \max_{a'} Q_{\theta}(s',a') = r(s,a) + \gamma \, Q_{\theta}\left(s', \arg\max_{a'} Q_{\theta}(s',a')\right)$ ,    (22)

and likewise in DQN:

$Y(s,a) = r(s,a) + \gamma \max_{a'} Q_{\theta'}(s',a') = r(s,a) + \gamma \, Q_{\theta'}\left(s', \arg\max_{a'} Q_{\theta'}(s',a')\right)$ .    (23)
In both cases, the neural network that estimates $Q^*$ is used to choose a′ and then to evaluate $Q^*(s', a')$, which can lead to an overestimation of the expected cumulative reward. Not only is overestimation quite common, but it can also adversely affect the performance of the algorithm 93.
For the DQL algorithm, a new method with a second network is presented in Ref. 93, inspired by previous works 92. In double DQL, two networks are exploited so that one focuses on the choice of the action, and the other one on its evaluation. Hence, the target becomes:

$Y(s,a) = r(s,a) + \gamma \, Q_{\theta'}\left(s', \arg\max_{a'} Q_{\theta}(s',a')\right)$ .    (24)

According to van Hasselt, Guez, and Silver 93, the performance of the value-based DRL methods is thereby enhanced, even the already robust performance of the DQN algorithm in Atari games 9.
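The contrast between the single-network target of Eq. (22) and the decoupled target of Eq. (24) can be sketched as follows (PyTorch is assumed; the two small networks and the random batch are illustrative):

```python
import torch

def dql_target(q_net, r, s_next, gamma=0.99):
    """Single-network target of Eq. (22): the same Q both selects and evaluates
    the action, which biases the estimate upward (overestimation)."""
    with torch.no_grad():
        return r + gamma * q_net(s_next).max(dim=1).values

def double_target(q_net, q_net2, r, s_next, gamma=0.99):
    """Decoupled target of Eq. (24): one network selects a', the other evaluates it."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)           # selection
        return r + gamma * q_net2(s_next).gather(1, a_star).squeeze(1)  # evaluation

q1, q2 = torch.nn.Linear(4, 2), torch.nn.Linear(4, 2)   # illustrative networks
y = double_target(q1, q2, torch.randn(8), torch.randn(8, 4))
```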

Challenges and openings
The DQN algorithm was first developed to stabilize the DQL method, and it notably enhanced its ability to deal with large neural networks. Thanks to that, deep Q-networks have shown great results with high-dimensional observation spaces 9. But DQN and even double DQN algorithms cannot easily deal with high-dimensional action spaces. Indeed, as they seek $\arg\max_{a} Q_\theta(\cdot, a)$ at each time-step, increasing the dimensionality of the action space strongly reduces their performance. Some methods have been developed to overcome this issue, making continuous action spaces tractable. Based on a combination of policy gradient and Q-function, they also offer larger possibilities, as they may consider either off-policy or on-policy strategies, whereas the methods mentioned above are limited to off-policy strategies 94, due to the replay buffer. These promising policy-based methods are presented in the next section.

C. Policy-gradient and actor-critic methods

Deep policy gradient
Amidst the most promising current DRL methods, the second main contender is the family of actor-critic methods. These are obtained by combining elements of Q-learning and of policy-gradient methods, which we discuss now.
The deep policy gradient algorithm is a policy-based method, i.e., it directly optimizes the policy (see section II C 6). Explicitly, an ANN approximates the policy $\pi_\theta$, with θ representing its set of weights and biases. The loss function is defined by:

$L(\theta) = -\mathbb{E}_{\tau \sim \pi_\theta} \left[ R(\tau) \right]$ ,    (25)

so that $\nabla_\theta L(\theta) = -\nabla_\theta J(\theta)$ (see Eq. (15) for $\nabla_\theta J$). This expected value is approximated by an average over a set of trajectories. R(τ) is commonly replaced with the advantage function $A^{\pi_\theta}$, as this reduces the variance of the expectation 95 and thus the number of trajectories necessary to average it well. Hence, in the deep policy gradient algorithm, the loss function:

$L(\theta) = -\mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_{t=0}^{T} \log \pi_{\theta}(a_t|s_t) \, A^{\pi_\theta}(s_t, a_t) \right]$     (26)

is usually considered. Nonetheless, as the expectation is averaged over a set of trajectories at the end of an episode, the algorithm does not differentiate between the most accurate actions and the worst, but only computes the reward obtained from the whole trajectory. Actor-critic methods adapt the deep policy gradient algorithm to deal with this issue.

Advantage actor-critic methods
In these methods, the action space is explored thanks to a perturbative method, with $N_1$ and $N_2$ two different noisy processes, θ the set of weights and biases of the actor, and θ′ the corresponding set for the critic.
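To complement the description above, the following is a minimal, generic sketch of an advantage actor-critic update, assuming a discrete action space and a one-step estimate of the advantage function; the network sizes and the random batch of transitions are illustrative assumptions:

```python
import torch

obs_dim, n_actions, gamma = 4, 2, 0.99
actor = torch.nn.Sequential(torch.nn.Linear(obs_dim, 32), torch.nn.Tanh(),
                            torch.nn.Linear(32, n_actions))   # pi_theta (actor)
critic = torch.nn.Sequential(torch.nn.Linear(obs_dim, 32), torch.nn.Tanh(),
                             torch.nn.Linear(32, 1))          # V estimate (critic)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def actor_critic_update(s, a, r, s_next):
    """One update: the critic learns V, the actor follows the advantage-weighted
    policy gradient of Eq. (26)."""
    v, v_next = critic(s).squeeze(1), critic(s_next).squeeze(1)
    advantage = (r + gamma * v_next - v).detach()    # one-step estimate of A_pi
    dist = torch.distributions.Categorical(logits=actor(s))
    actor_loss = -(dist.log_prob(a) * advantage).mean()
    critic_loss = torch.nn.functional.mse_loss(v, (r + gamma * v_next).detach())
    opt.zero_grad(); (actor_loss + critic_loss).backward(); opt.step()

# illustrative batch of transitions
s = torch.randn(8, obs_dim); a = torch.randint(n_actions, (8,))
r = torch.randn(8); s_next = torch.randn(8, obs_dim)
actor_critic_update(s, a, r, s_next)
```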

Proximal policy optimization
The proximal policy optimization (PPO) algorithm is an on-policy actor-critic method, developed by Schulman et al. 97 in order to deal with the lack of robustness of the DQN 9 and 'vanilla' policy-gradient 94 methods, and to simplify the efficient-but-complex trust-region policy optimization (TRPO) algorithm 95, which is not discussed in the present review.
Apart from the particular DDPG algorithm, the policy-gradient and actor-critic methods usually aim at optimizing the loss function defined in Eq. (26), now denoted by $L^{PG}(\theta)$. To do so, these algorithms proceed over several epochs of gradient ascent, i.e., they alternate a first phase in which they fill a batch of samples (a sample being a sequence $(s_t, a_t, s_{t+1}, r_t)$, for instance) and a second phase in which the expectation of the gradient of $L^{PG}(\theta)$ is averaged over the batch, the neural-network parameters θ being then updated following the direction of this estimation. The idea behind the TRPO and PPO algorithms is identical, but $L^{PG}$ is replaced with other functions. For instance, the surrogate loss:

$L^{CLIP}(\theta) = -\mathbb{E}_{t} \left[ \min\left( r_t(\tau) \, A_t \, , \ \mathrm{clip}\left(r_t(\tau), 1-\varepsilon, 1+\varepsilon\right) A_t \right) \right]$     (27)

is suggested in Ref. 97 for the PPO algorithm, with:

$r_t(\tau) = \pi_{\theta}(a_t|s_t) \, / \, \pi_{\theta_{old}}(a_t|s_t)$     (28)

being a factor that penalizes an update of θ that would induce an important modification of the policy (comparatively to the situation before the update, with the previous set of NN parameters $\theta_{old}$). Here $A_t$ is an estimate of the advantage function at time t, ε is a parameter chosen between 0 and 1, and $\mathrm{clip}(r_t(\tau), 1-\varepsilon, 1+\varepsilon)$ returns $r_t(\tau)$ if its value is in $[1-\varepsilon; 1+\varepsilon]$, $1+\varepsilon$ if it is greater than this limit, and $1-\varepsilon$ otherwise. In the TRPO algorithm, the objective function is similarly penalized according to the size of the update 95.
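As an illustration of Eqs. (27) and (28), the following is a minimal sketch of the clipped surrogate loss; computing the probability ratio from stored log-probabilities is a common implementation choice, and the batch quantities here are random stand-ins:

```python
import torch

def ppo_clip_loss(log_prob_new, log_prob_old, advantage, eps=0.2):
    """Clipped surrogate of Eq. (27): the probability ratio of Eq. (28) is kept
    within [1 - eps, 1 + eps], penalizing overly large policy updates."""
    ratio = torch.exp(log_prob_new - log_prob_old)     # pi_theta / pi_theta_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return -torch.min(unclipped, clipped).mean()       # loss to minimize

# illustrative batch quantities
log_prob_new = torch.randn(64, requires_grad=True)     # from the current policy
log_prob_old = torch.randn(64)                         # stored at sampling time
advantage = torch.randn(64)                            # advantage estimates
loss = ppo_clip_loss(log_prob_new, log_prob_old, advantage)
loss.backward()
```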
D. Parallelization

The training can also be accelerated by running several environments in parallel. Such a process has notably been suggested by Ong, Chavez, and Hong 112, whereas Mnih et al. 94 implemented the same principle with multiple CPU threads on a single machine, thereby outperforming the previous methods while offering the possibility of stable value-based and policy-based, on-policy and off-policy efficient strategies. Indeed, an important improvement brought by the parallel agents is that they stabilize value-based methods even without any replay buffer 94, thus allowing more efficient on-policy methods.

Finally, Belus et al. 113 exploited for the first time the translational invariance and the locality of the control problem in a DRL algorithm, in order to enhance its efficiency. Their work is not a parallelization process per se, but it likewise enables one to reach more complex problems, and it will be discussed more extensively in the last section.

IV. FLOW CONTROL

A. Introduction: passive and active control
Flow control strategies are usually separated into two categories 53: passive and active methods.
The former include, e.g., riblets, vortex generators 54 and other devices such as LEBUs (large-eddy break-up devices) 114. They need neither a supply of energy nor a feedback from the environment. Their robustness, low manufacturing costs and overall simplicity have notably enabled them to be applied in real-world systems, contrary to the active methods, which are often complex to implement in practice. But the active methods present the advantage of being generally more efficient, and over a wider range of operating conditions. For instance, riblets used for drag reduction in turbulent boundary layers achieve a 5 to 9% drag reduction 115, whereas active techniques can reach substantially higher rates.

B. Active flow control
Active flow control has grown at the frontier of various traditional fields, including fluid mechanics, control theory and machine learning 121. Thus, it has inherently been confronted with terminology conflicts [121][122][123][124][125], but it has developed fruitful hybrid methods during the past two decades.
This interdisciplinary effort has indeed been carried out with the prospect of many promising applications 21,24, such as delaying the transition past airfoils/aircraft wings 36,126, reducing the drag coefficient past bluff bodies (see, e.g., Ref. 102) or controlling convective heat transport 127, to cite a few.
Contrary to the passive methods, the active control of a flow needs a supply of energy. In addition, the data used for the control can be provided by experimental results or CFD simulations.
These data can be used during the control sequence (reactive control) or only before it (predetermined control). As feed-forward methods are not considered in the present review, active reactive feedback methods will simply be called feedback control hereafter (see Fig. 5).

Predetermined control
Predetermined control of a flow is independent of its instantaneous state. The control sequence is determined in advance, and then applied without any feedback. This facilitates the control, since no sensors are needed and no feedback loop has to be implemented 43.
In a numerical approach, Jung, Mangiavacchi, and Akhavan 128 suppressed turbulence in wall-bounded flows by imposing high-frequency spanwise oscillations, a typical example of predetermined actuation.

Feedback control

Opposition control, in which the actuation is set to be proportional and opposite to a velocity detected by sensors, provides relatively efficient results, but it becomes unusable in more complex flows. Indeed, in this standard situation, the relationship between the detected speed and the actuation is considered to be linear, and the optimal coefficient (α) is found by trial and error.
But this trial-and-error process is quite inefficient, even in the linear approximation 43. Additionally, the drag-reduction rate obtained by opposition control in turbulent boundary layers decreases with increasing Reynolds number 146.

Optimal control: this approach requires one to have a good understanding of the dynamics of the system (for a further optimization of the parameters at stake). Additionally, the time horizon cannot be extended endlessly in optimal control, the adjoint equations being unstable beyond a certain threshold 150. Consequently, in the prospect of (actively) controlling flows at high Reynolds numbers, optimal-control theory may present non-negligible shortcomings.
Data-driven methods: these are promising, since they could overcome the flaws encountered with the opposition-control and optimal-control methods. They take various shapes, but all share the same reliance on substantial amounts of data.
• The Bayesian optimization (BO) and Lipschitz global optimization (LIPO) processes are 'black-box' global optimization techniques 24. Compared to the upcoming genetic-programming and RL methods, they present the advantage of being less complex, as they are (respectively) linear and quadratic function approximators, and of requiring less computational time. In the classical BO method, the function to optimize (e.g., the cumulative reward) is modeled by a Gaussian process. Given a certain data set, one can thus obtain a probability distribution of the values taken by the function at stake. In the BO iterative process, a function suggesting where to sample next is also defined. The LIPO method relies on the same basis.
This two-dimensional showcase, the control of the wake past a circular cylinder inspired by the original setup of Schäfer et al. 172, is now a generic benchmark used to test the recent developments of DRL in the field, and has therefore received great and recurrent interest [39][40][41]110. Other studies also consider AFC problems using actuators to control flow separation 36,173 or to limit the skin friction in wall-bounded channels 43,44, to cite a few.
The community now redoubles its efforts to increase the complexity of the AFC problems handled with DRL. The first upcoming milestone consists in proving the efficiency of the existing algorithms in highly turbulent conditions. Recent works have tackled instabilities in turbulent boundary layers 174, and others have confirmed the efficiency of DRL (PPO) in controlling chaotic systems such as the Lorenz attractor.
These two results are therefore encouraging with regard to the control of increasingly turbulent flows by DRL.
A second milestone to reach is the enlargement of the range of situations/flows considered 40. Exploiting the locality and translational invariance of the control problem, as proposed by Belus et al. 113, may allow one to deal with an arbitrarily large number of actuators while strongly limiting the increase of the training cost, and thereby give access to larger AFC problems (see the sketch below).

More generally, the study of symmetries, equivariances and invariances may enable one to consider larger and more turbulent flow conditions, but it will require a non-trivial implementation. Indeed, if the environment holds symmetries, or if the governing equations contain invariances or equivariances, the ANN will not naturally exploit them. For instance, with a 'naive' ANN, one cannot be sure that the agent will preserve theoretically-equivariant actions in practice 177. Nonetheless, with sufficient training data, the ANN may learn invariant and equivariant properties by setting some constraints between the weights of the network, but this reduces its degrees of freedom 178,179. Hence, if one wants to keep a constant number of degrees of freedom while taking equivariant and invariant properties into account, one may require a larger ANN, and thus more training data. To prevent such an increase, one has to help the ANN take advantage of the invariances, equivariances and symmetries when they exist, by encoding a state or state-action reduced space in which the agent can work equivalently to the real environment.

Zeng and Graham 177 encoded a 'symmetry reductor' applied to a 1D Kuramoto-Sivashinsky environment controlled by DDPG. They found robust strategies stabilizing the chaotic system in the symmetry-reduced state-action subspace, and proved the enhanced efficiency of DDPG in this subspace in comparison with the 'fully-developed' space. Their work is an encouraging milestone illustrating how important it will be to consider equivariances, invariances and symmetries when they exist in chaotic/turbulent flow-control problems, and how to do so.
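In the spirit of the invariance-exploiting approach of Belus et al. 113, the sketch below shares a single policy network across all actuators, so that the parameter count does not grow with their number; the local-observation size, the number of actuators and the architecture are illustrative assumptions:

```python
import torch

obs_per_jet, n_actuators = 8, 10         # local observation size, number of jets
shared_policy = torch.nn.Sequential(     # one network reused for every actuator
    torch.nn.Linear(obs_per_jet, 32), torch.nn.Tanh(),
    torch.nn.Linear(32, 1), torch.nn.Tanh())

def act(local_obs):
    """local_obs: (n_actuators, obs_per_jet) tensor of per-actuator neighbourhoods.
    The same weights produce every action, exploiting the translational
    invariance and locality of the control problem."""
    return shared_policy(local_obs).squeeze(1)   # one action per actuator

actions = act(torch.randn(n_actuators, obs_per_jet))
```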
Finally, by defining strategies of cooperation between multiple parallel agents, multi-agent reinforcement learning (MARL) has recently proved its efficiency in AFC 27,44, and it may help enhance the complexity of the flows considered.
Encouraging avenues for scaling up the complexity of the controlled flows are thus already under study.
In addition, the aforementioned parallelization methods (see section III D) may further accelerate the training. The ability of the agent to transfer its learning to a wider range of flow conditions proves, once more, the robustness of DRL algorithms and their advantageous properties compared to other AFC methods. Moreover, a recent work 106 has shown an interest in reducing the number of actuators while limiting the deterioration of the control sequence, exploiting the malleability of NNs.
It could thus enable one to deal with larger problems without increasing the number of actuators.
Finally, DRL, as well as machine learning in general, may also enhance our knowledge of physical processes where classical-mechanics predictions encounter limitations 152, such as the behavior of chaotic systems. Taking the example of AlphaGo 6, the algorithm not only impressed the community with its performance, but it also helped to reveal underlying structures in the game, for instance by choosing strategies commonly thought to be inefficient until then. Therefore, when applied to chaotic systems, DRL methods may highlight unknown structures and discover new physical properties. The non-opposition-control solution found by PPO in Ref. 41 goes in this direction. In other words, DRL methods may help enhance our knowledge of physical systems 58, while our current knowledge needs to be exploited to encode better algorithms, taking into account symmetries and invariances for instance.
Hence, data-driven methods, and more particularly DRL algorithms, have already demonstrated unrivalled performance in active-flow-control problems. They now have to handle larger and more complex situations, which would open up even more engineering applications and scientific knowledge.
As a last comment, we would like to emphasize the importance of open-sourcing the codes of the DRL algorithms applied to flow-control problems. As developed in this review, fruitful methods may emerge from combinations of existing algorithms. For reproducibility purposes, and because technical implementation choices may have a significant influence on the results, sharing the training data alongside the code and the theoretical details seems essential 58.

V. CONCLUDING REMARKS
In the present review, a general overview of the main RL concepts and of the DRL framework has first been given. The potential of DRL for the active control of a flow has then been highlighted, in comparison with other reactive or predetermined methods. Not only may the effectiveness of DRL be useful in multiple engineering applications, but it could also help science in general. Indeed, this exhaustive exploration tool may provide knowledge in theoretical-physics fields similarly to what it gave to the Go community: the discovery of previously unknown underlying structures in the studied system.
For 200 years, science has focused on the application of analytical methods. These often imply local linearization, as well as stability and modal analyses. However, in many instances this is not meaningful, because there is no clearly defined base flow similar enough to instantaneous snapshots for the linearization to be performed. Full-scale applications may thus be far enough from a well-defined configuration that linearization is not representative. Consequently, when handling non-linear high-dimensional systems, there is in general no choice but to study the full system directly. Most analytical tools then cannot be used anymore, and this is one of the reasons why methods such as DRL are attractive: they can perform very well in much more general conditions than traditional methods, at the cost of being very data-intensive. Hence, DRL represents a methodological shift. It will take time before we as a community understand the full value and impact of these methods in AFC, and we are currently in the middle of a massive effort to investigate how far DRL can be taken for AFC applications, and what new insights can be gained from it.