Machine learning for optical fiber communication systems: An introduction and overview

Optical networks generate a vast amount of diagnostic, control


I. INTRODUCTION
Machine learning (ML) is the study of computer algorithms that can learn to achieve a given task via experience and data without being explicitly programmed [1]. ML has been a topic of research within statistics and computer science since at least the 1950s, with early iterations of many algorithms used today invented in the last 30 years [2]. However, as a result of the increase in the availability of data and computing power over time, the use of ML has recently become ubiquitous across all disciplines of science and engineering. Optical fiber communications is no exception: a large body of work now applies a range of ML techniques to a range of problems within the domain, and several reviews of this work exist [4-8]. However, given the rapid acceleration of the usage of ML within optical networks, many works that leverage ML have been published since these reviews were conducted. Moreover, certain ML applications have recently begun to increase in popularity for optical network problems, which we address in this Tutorial. Thus, in this Tutorial, we introduce the reader to ML, highlight the key ML techniques currently being deployed within optical fiber communication systems, and outline recent impactful works within each application sub-domain.
Optical fiber communication systems form the backbone of communications, having been deployed across the globe since the early 1980s [9]. At a basic level, the edges of optical fiber networks are composed of optical fibers carrying modulated laser light, with optical amplifiers to combat the loss of laser signal power incurred during propagation. The nodes of optical networks comprise transmitters, receivers, and switches. Loosely, the job of network operators is to carry messages between these nodes such that the quality of service agreed with customers is met. Different modulated laser signals, known as channels, are assigned different individual wavelengths and can then be transmitted through the same fiber link simultaneously; this is known as wavelength division multiplexing (WDM). Telecommunication systems are split into conceptual layers defined by the open systems interconnection model [10], and in this Tutorial, we reference applications of ML in layers one and three, which we refer to as the physical layer and the network layer, respectively. In short, the physical layer concerns how raw bits are transmitted across a link between two nodes, also known as a light path. In contrast, network layer applications concern how to transfer data across the physical layer between given nodes. As an example, one can control aspects such as the route taken through the network, meaning the sequence of edges and nodes traversed, and the wavelength channel that is used to carry the information between two nodes.

Additionally, optical networks research is commonly carried out on specific network types, which are primarily defined by their scale. In ascending order of transmission length, these network types are access networks, which connect individual users to other users and data centers; metro networks, at city scale; backbone networks, at the scale of large countries and continents; and submarine systems, for connecting continents. Each of these network types has different constraints. For example, access networks have stringent monetary cost and complexity limits, whereas submarine systems have very strong power constraints. There are also data center networks, which are significantly different from all these network types due to their highly configurable topologies and extremely short reach links. In this Tutorial, we discuss works considering backbone, metro, and access networks.

Optical fiber communication systems facilitate the transfer of information at high data rates, currently tens to hundreds (and in some cases, more than 1000) of Mb/s [11], enabling many data-hungry applications. In fact, Cisco predicts that there will be 5.3 × 10^9 internet users by 2023, an increase from 3.9 × 10^9 in 2018 [11]. Moreover, the average connection speed is expected to rise from 45.9 Mb/s in 2018 to 110.4 Mb/s by 2023 [11]. The optical fiber communication domain faces a number of key challenges that must be overcome to bring about this growth. First, optical fibers exhibit nonlinear behavior, governed by the optical Kerr effect [12,13]. This means that the refractive index seen by a given wavelength of laser light propagating through the fiber depends on the electric field strength in the fiber. As a result, channels interfere in a nonlinear way both with other channels on the same fiber and with themselves. These nonlinear noise-like distortions due to channel interference are power-dependent, meaning that there exists a trade-off between the optical power of the signal and the strength of these nonlinear interactions [14]. This introduces a level of complexity that makes physics-based modeling challenging in practical systems, which makes ML approaches look promising. Estimating the strength of nonlinear interaction and mitigating its effects form the basis of much of the research in optical fiber communication systems, including a large number of works in which ML is applied.

Furthermore, attempts to extend the range of wavelengths used to carry information beyond the traditional C-band, known as wide-band systems, require one to deal with some extra physical effects. Among them are the wavelength dependencies of fiber parameters, such as fiber loss (mainly, the elastic Rayleigh scattering [15]), higher-order fiber dispersion effects, and the influence of the frequency-dependent fiber effective mode area [16]. In addition, higher-order Kerr-type nonlinearities manifesting themselves as stimulated inelastic light scattering effects, i.e., stimulated Raman scattering (for very short optical pulses) [17-19] and stimulated Brillouin scattering (for very large launch powers) [20,21], should also be taken into consideration. ML approaches have shown potential in helping to deal with such effects, which may facilitate the use of wide-band systems in future networks.
Another critical problem in optical fiber communications is the high complexity of optical networks, which poses a significant operational challenge [22]. As networks have evolved over time to carry a higher information throughput, the modeling of the optical communication channel has become more difficult due to the increased number of adjustable design and operational parameters [3]. Perhaps the biggest driving force behind this has been the introduction of coherent technologies [23], which increased the complexity of transmitters and receivers significantly. Moreover, the configurability of the network layer has increased due to advances such as software defined networking (SDN) [24]. In addition, future optical networks will be more dynamic, requiring automation as requests must be satisfied on shorter time scales [25]. As a result, investigating the extent to which ML can help with modeling and network control has been the subject of a large volume of research. In this Tutorial, we focus on introducing the ML techniques that appear in the works we outline. Furthermore, we introduce a classification of algorithms in order to clarify the relationship between these techniques and to outline trends within optical communications, such as which algorithm classes are used within each optical communication sub-domain.
The rest of this Tutorial is organized as follows. In Sec. II A, we introduce the general concept and nomenclature of ML, followed by a description of the specific techniques utilized by the works discussed in this Tutorial in Sec. II B. We then outline key research problems and selected interesting work within the physical layer in Sec. III, followed by an equivalent survey for network layer problems in Sec. IV. Selected opportunities for future research across both physical and network layer problems are highlighted in Sec. V, and concluding remarks are included in Sec. VI.

II. INTRODUCTION TO SELECTED MACHINE LEARNING TECHNIQUES

A. Categorization of machine learning
First, algorithms can be categorized based on the type of problem that is being solved, i.e., whether it is a regression or classification problem [26]. Regression algorithms make continuous predictions, such as the signal-to-noise ratio (SNR) of a light path in an optical network, and may have continuous or discrete inputs, also known as features. Classifier algorithms, instead, predict the class associated with a given set of inputs, for example, whether a request to connect two nodes in a network can be satisfied or must be rejected. A second distinction can be made based on whether the data are labeled or unlabeled [26]. Algorithms requiring labeled data are known as supervised. An example of labeled data is a dataset of SNR as a function of the signal power for an optical channel: each datum in this set has a label, the measured SNR, which the algorithm can use as a target when learning. In contrast, unsupervised algorithms learn from unlabeled data. This can be done by attempting to group these data based on similarity, known as clustering, or by compressing the data by finding the features that are most important for distinguishing between examples and removing the remaining features, a form of dimensionality reduction for which principal component analysis is a widely used technique [26]. An example of unlabeled data might be traffic flows in a network, which can be grouped into classes that are not pre-determined, but rather determined by the algorithm based on similarities in various features. There also exists another formulation of ML that is distinct from supervised and unsupervised learning, known as reinforcement learning (RL) [27]. In RL, the goal is to learn a policy for achieving a given task by interacting with the environment. Every action taken affects the environment and returns a reward, the value of which quantifies how successful the given action was in the context of the overall goal. Formulations of RL and various algorithms are discussed in Sec. II B 4.
An example of an application of RL in optical fiber networking might be an agent that learns an optimal routing policy, which maximizes the total throughput of the network, given a series of requests. Here, the environment may consist of the current network state and outstanding requests, and the action space (the set of allowed agent actions) may consist of a set of candidate routes and channel wavelengths to choose from. A categorization of the different ML techniques discussed in this Tutorial is outlined in Fig. 1. This diagram reflects the fact that supervised learning is more commonly used within optical communications than RL and unsupervised learning and that unsupervised learning is the least-used class of algorithms. Moreover, for physical layer applications, regression is more popular than classification, as we are often interested in predicting continuous target signals. Classifier algorithms are predominantly used in network layer applications, where we are often interested in distinguishing candidate light paths that are of suitably high quality from those that are not and in predicting source and destination nodes of network traffic. Similarly, the majority of works applying RL in optical communications address network layer applications, which are often formulated as dynamic control problems. However, these are general trends seen in the literature by the authors rather than absolute rules, and there are exceptions. For example, generative adversarial networks (GANs) and graph neural networks (GNNs) are commonly used in a regression formulation to tackle problems in network traffic prediction and generation.
In the broader applied ML community, the types of data used can be categorized as structured tabular data, text data for natural language processing, image data consisting of sets of pixels, and time series data. The structure within tabular data may include spatial information, such as a graph, which can be represented as a matrix of edges and weights. Within optical fiber communications, the most common data types used are tabular and time series data. A further distinction can be drawn between batch and online learning. The more traditional batch learning approach involves learning from the whole training dataset before deploying the resulting model on new examples. Alternatively, online learning involves learning as data become available, updating the current model with information obtained from new examples [28]. In the case of a NN model, for instance, online learning would involve adapting the weights of a trained model based on a small volume of data. One could therefore train the NN initially on a large historical dataset before fine-tuning the weights using new data from monitors via online learning. In the supervised case, the new data will be labeled; an example of a label is the SNR for a given set of operating parameters. Unsupervised online learning is also possible, and online algorithms for principal component analysis and clustering using neural networks are available [29]. Here, the basic idea is to begin with a dataset that has been compressed, in the case of principal component analysis, or grouped, in the case of clustering, and to modify the compression or grouping based on each new datum as it becomes available, rather than for all the data at once. Thus, the new datum is also compressed or grouped, which may, in turn, change how the other data are compressed or grouped.

A related approach to online learning is transfer learning, where we utilize information obtained from training a model for one task in order to reduce the computational effort required in training a model to perform another similar task [30]. In other words, transfer learning involves starting with a trained model for an old task and adapting it for the new target task, rather than starting from scratch. For example, one can modify the weights of a NN that has been trained for another task, rather than starting with untrained weights, reducing the computational requirements of training.
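The contrast between batch and online learning described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the one-parameter model `y = w * x`, the made-up data, the function names `batch_fit` and `online_update`, and the learning rate. The model is first fitted on a whole historical dataset and then adapted one datum at a time after the underlying relationship has drifted.

```python
# Minimal sketch: batch training followed by online updates.
# Hypothetical one-parameter model y = w * x with made-up data.

def batch_fit(xs, ys):
    """Least-squares slope through the origin, using the whole dataset at once."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def online_update(w, x, y, lr=0.05):
    """One stochastic-gradient step on the squared error of a single new example."""
    error = w * x - y
    return w - lr * error * x  # gradient of 0.5 * error**2 with respect to w

# Batch phase: historical data generated with slope 2.0.
history_x = [1.0, 2.0, 3.0, 4.0]
history_y = [2.0 * x for x in history_x]
w_batch = batch_fit(history_x, history_y)

# Online phase: the true slope has drifted to 2.5; adapt one datum at a time,
# starting from the batch-trained weight rather than from scratch.
w_online = w_batch
for step in range(200):
    x = history_x[step % len(history_x)]
    y = 2.5 * x
    w_online = online_update(w_online, x, y)

print(w_batch, w_online)
```

Starting the online phase from `w_batch` rather than from an arbitrary value mirrors the fine-tuning idea that online and transfer learning share: an already trained model is adapted instead of being retrained from scratch.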

FIG. 1.
Categorization of ML techniques discussed in this Tutorial. In general, supervised regression algorithms are more common in physical layer applications, whereas supervised classifiers and RL are more popular for network layer problems. Some techniques appear more than once as they can be formulated for different problem types (ANN: artificial neural network, ELM: extreme learning machine, CNN: convolutional neural network, GNN: graph neural network, RNN: recurrent neural network, LSTM: long short-term memory, GRU: gated recurrent unit, GAN: generative adversarial network, MPNN: message-passing neural network, GP: Gaussian process, CBR: case-based reasoning, GCN: graph convolutional network, SVM: support vector machine, DT: decision tree, RF: random forest, KNN: K-nearest neighbor, PW: Parzen window, LDA: linear discriminant analysis, DQN: deep Q-network, and DDPG: deep deterministic policy gradient).

TUTORIAL scitation.org/journal/app
Finally, explainable ML is a growing field of ML that is crucial for ML applications, as explainability increases confidence in ML systems [31]. In this work, we follow the definition of explainability given by Roscher et al. [32]. Specifically, explainable ML is transparent and interpretable and leverages domain knowledge. In this context, transparency means that the design of the ML model can be justified beyond empirical performance on the testing dataset; interpretability means that the ML model output is human understandable, in that we can reason as to why the model makes a given prediction for a given input; and domain knowledge broadly encompasses all the knowledge of the problem we possess before we have seen the data. A black box is a model for which the decision processes are not interpretable by humans and the design cannot be easily justified [33]. There are two main approaches to explainability. First, there are techniques that attempt to explain the behavior of an existing black box model [34-36]. Alternatively, there are those that try to replace the black box with a simpler or more mathematically principled model that is inherently more understandable. The former are commonly known as post-hoc techniques. Thus, a black box method can be made more explainable using extra add-on techniques, or one can design the method from the ground up to be explainable.

B. Machine learning techniques

1. Neural networks
Neural networks (NNs) are universal function approximators, meaning that a sufficiently large NN structure can approximate any continuous function on a bounded domain to arbitrary accuracy [37]. The structure of NNs is analogous to that of animal brains, consisting of a network of units, called neurons, connected via edges with associated weights. The neurons can send signals to one another along these weighted edges and process these signals.
The most commonly used type of NN in ML applications is the feedforward NN (FFNN). The mathematical structure of such networks is given by an input layer, followed by a series of layers of neurons, each representing a function that is applied to the output of the previous layer, so that the layers are composed in a chain [38]. The final layer yields the model output, and the layers in between the input and output layers are known as hidden layers. As an example, consider a supervised NN model with a single hidden layer $f^{(1)}$ and an output layer $f^{(2)}$,

$$f(\mathbf{x} \mid W^{(1)}, \mathbf{b}^{(1)}, W^{(2)}, \mathbf{b}^{(2)}) = f^{(2)}\left[f^{(1)}(\mathbf{x})\right], \tag{1}$$

$$f^{(i)}(\mathbf{x}^{(i)}) = g^{(i)}\left(W^{(i)T}\mathbf{x}^{(i)} + \mathbf{b}^{(i)}\right), \tag{2}$$

where $\mathbf{x}^{(i)}$ is the input vector for layer $i$ such that $\mathbf{x}^{(1)} = \mathbf{x}$, $W^{(i)}$ is the matrix of weights in layer $i$, $\mathbf{b}^{(i)}$ is a vector of additive constants known as biases in layer $i$, $g^{(i)}$ is the activation function applied element-wise to yield a vector output for layer $i$, and $(\cdot)^{T}$ denotes the transpose of a given matrix. A pictorial representation of this NN, adapted from the work of Bishop [26], is given in Fig. 2.
For this example network, nonlinear and linear activation functions may be applied to the hidden layer and output layer, respectively. If both $g^{(1)}$ and $g^{(2)}$ are linear, the entire NN model is itself simply a linear function of $\mathbf{x}$. Therefore, nonlinear activation functions are crucial for approximating interesting functions.
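The claim that linear activations collapse the network to a linear function can be checked numerically. The sketch below assumes the layer form "activation applied to a weighted sum plus bias" for the two-layer network described above; the weights, inputs, and function names (`affine`, `layer`, `network`, `residual`) are hypothetical and chosen only for illustration. It evaluates how far the network is from being additive in its input, once with linear activations and once with a tanh hidden layer.

```python
import math

def affine(W, b, x):
    """Compute W^T x + b, with W stored as a list of rows (inputs x outputs)."""
    return [sum(W[i][j] * x[i] for i in range(len(x))) + b[j] for j in range(len(b))]

def layer(W, b, x, g):
    """One layer: activation g applied element-wise to W^T x + b."""
    return [g(a) for a in affine(W, b, x)]

def network(x, W1, b1, W2, b2, g1, g2):
    """Two-layer network: output layer applied to the hidden layer output."""
    return layer(W2, b2, layer(W1, b1, x, g1), g2)

def identity(a):
    return a

# Hypothetical weights: 2 inputs, 3 hidden units, 1 output.
W1 = [[0.5, -1.0, 0.3], [0.8, 0.2, -0.6]]
b1 = [0.1, 0.0, -0.2]
W2 = [[1.0], [-0.5], [2.0]]
b2 = [0.05]

x, y = [1.0, 2.0], [-0.5, 0.7]
x_plus_y = [a + c for a, c in zip(x, y)]

def residual(g1):
    """Deviation from additivity, after removing the constant offset f(0)."""
    f = lambda v: network(v, W1, b1, W2, b2, g1, identity)[0]
    f0 = f([0.0, 0.0])
    return abs((f(x_plus_y) - f0) - ((f(x) - f0) + (f(y) - f0)))

print(residual(identity))   # linear hidden layer
print(residual(math.tanh))  # nonlinear hidden layer
```

With the identity activation the residual vanishes up to rounding error, whereas with tanh it does not, which is the sense in which the nonlinearity is what gives the network its expressive power.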

FIG. 2.
Pictorial representation of the NN described in Eq. (1), with one hidden layer. The nodes depict the input variables $x_i$ and hidden variables $z_i$. The edges represent the matrices of weights $W^{(1)}$ and $W^{(2)}$, whereas the biases are represented by the weights from the additional variables $x_0$ and $z_0$.
The term deep learning (DL) refers to NNs with at least one hidden layer; often, networks with multiple hidden layers are used. Choosing the structure of the NN, including the activation functions, is often done in an ad hoc, trial-and-error fashion. As a result, NNs are often viewed as opaque black box models, which are difficult for humans to interpret. At the same time, the highly nonlinear layered structure of NNs is what makes them so flexible and powerful. Training NNs, that is, obtaining the set of weights that solves a given problem and generalizes well to data not seen during training, can be achieved in multiple ways, the most commonly used of which is backpropagation combined with gradient descent [39]. To train NNs, we first have to define a loss function that, for supervised learning, measures how far the predictions of the network are from the measured data; a commonly used loss function is the mean squared error (MSE). In backpropagation, the gradient of the loss function with respect to the weights is computed efficiently for a given training input-output pair, allowing NNs to be trained using gradient descent: the weights are updated in the opposite direction to that of the gradient, in order to move toward a local minimum of the loss [40].

There are many extensions to the simple NNs described above, designed to solve a range of specific problems. However, the basic structure and methodology for learning remain the same. One such example is the autoencoder, which can be either supervised or unsupervised. An unsupervised autoencoder learns an efficient encoding of unlabeled data, whereas a supervised autoencoder can be used to obtain the set of inputs that yields a desired output. An autoencoder consists of a FFNN with two parts: the encoder, which learns to map the input data to an optimal representation, and the decoder, which learns to decode this representation and recover the initial data [38]. This structure is outlined in Fig. 3.
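Returning to the training procedure described above, the following is a minimal sketch of backpropagation with full-batch gradient descent on an MSE loss. The network size, the target function (a parabola), the learning rate, and the number of iterations are all arbitrary assumptions for illustration, and the code is not tuned for accuracy.

```python
import math
import random

rng = random.Random(0)   # fixed seed so the run is repeatable
H = 4                    # number of hidden units (arbitrary choice)
w1 = [rng.uniform(-1, 1) for _ in range(H)]   # input-to-hidden weights
b1 = [0.0] * H                                # hidden biases
w2 = [rng.uniform(-1, 1) for _ in range(H)]   # hidden-to-output weights
b2 = [0.0]                                    # output bias (list so it can be updated in place)

# Made-up training set: scalar inputs in [-1, 1] with target y = x^2.
xs = [i / 10 for i in range(-10, 11)]
ys = [x * x for x in xs]

def forward(x):
    """Hidden activations z (tanh) and the linear network output."""
    z = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
    return z, sum(w2[j] * z[j] for j in range(H)) + b2[0]

def mse():
    return sum((forward(x)[1] - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def gradient_step(lr):
    """Accumulate the MSE gradient by backpropagation, then update all weights."""
    gw1, gb1, gw2, gb2 = [0.0] * H, [0.0] * H, [0.0] * H, 0.0
    n = len(xs)
    for x, y in zip(xs, ys):
        z, pred = forward(x)
        d_out = 2.0 * (pred - y) / n                 # dLoss / dpred
        gb2 += d_out
        for j in range(H):
            gw2[j] += d_out * z[j]
            d_hidden = d_out * w2[j] * (1.0 - z[j] ** 2)   # tanh'(a) = 1 - tanh(a)^2
            gw1[j] += d_hidden * x
            gb1[j] += d_hidden
    for j in range(H):   # move against the gradient
        w1[j] -= lr * gw1[j]
        b1[j] -= lr * gb1[j]
        w2[j] -= lr * gw2[j]
    b2[0] -= lr * gb2

loss_before = mse()
for _ in range(3000):
    gradient_step(lr=0.02)
loss_after = mse()
print(loss_before, loss_after)
```

In practice, automatic differentiation libraries compute these gradients, and stochastic or mini-batch variants of gradient descent are typically used instead of the full-batch update shown here.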
FIG. 3. Autoencoder structure, adapted from Ref. [96] and Goodfellow et al. [38]. The input is fed into a NN, known as the encoder, that learns an internal representation or code. A second NN, the decoder, learns to map this code to the output.

Recurrent neural networks (RNNs) are NNs designed to capture the relationships found in time series data [41]. This is achieved by considering the previous state of the network and the current input when determining the current state of the network. A schematic outlining the basic structure of a RNN is shown in Fig. 4. RNN models can maintain state information, allowing them to perform tasks such as traffic sequence prediction that are beyond the ability of a standard FFNN. However, RNNs are affected by exploding and vanishing gradient problems [42] that prevent complete learning of the time series. Due to this issue, variants of RNNs such as gated recurrent units (GRUs) [43] and long short-term memory (LSTM) networks [44] have been proposed that are capable of adaptively capturing dependencies on different time scales.
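The role of the previous state can be made concrete with a minimal sketch of a recurrent update. The scalar unit below, its weights `w_x` and `w_h`, and the tanh activation are assumptions chosen for illustration and do not correspond to any specific architecture from the cited works. Two sequences that end with the same input produce different final states because their histories differ.

```python
import math

def rnn_step(x_t, h_prev, w_x=0.8, w_h=0.5, b=0.0):
    """Hypothetical scalar recurrent unit: the new state depends on the
    current input x_t and on the previous state h_prev."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)

def run(sequence, h0=0.0):
    """Feed a sequence through the unit and record the state at each time step."""
    h, states = h0, []
    for x_t in sequence:
        h = rnn_step(x_t, h)
        states.append(h)
    return states

states_a = run([1.0, 0.0, 1.0])
states_b = run([0.0, 0.0, 1.0])

# Same final input (1.0), different histories, hence different final states.
print(states_a[-1], states_b[-1])
```

A feedforward network applied to the final input alone would give identical outputs for both sequences, which is why state information is needed for sequence prediction tasks.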
FIG. 4. An example RNN model architecture including context nodes $u_1, u_2, \ldots, u_n$ associated with each node in the hidden layer vector $\mathbf{z}_t$ with fixed weights of one. Similarly to FFNNs [Eq. (2)], at each time step $t$, the input $\mathbf{x}_t$ is fed forward and a learning rule is applied. Additionally, the fixed back-connections save a copy of the previous values of the hidden nodes in the context nodes [171].

Furthermore, as optical networks have a topological structure that can be represented by a graph, it is natural to utilize graph-based machine learning techniques, such as graph NNs (GNNs), that leverage the network structure [45]. GNNs combine graph theory with NNs in a way that draws parallels with RNNs. There are two key sequential steps involved in updating a GNN for a given node: aggregation of the states of neighboring nodes, including the target node itself, followed by an update to the state of the node, depending on the specific analysis goal of the GNN [46]. Based on the variations of the aggregation and update functions, several models of GNNs have been proposed in the recent literature, such as message-passing NNs [47], graph convolutional networks (GCNs) [48], graph attention networks [49], and gated graph NNs [50]. Examples of applications include classification and regression on nodes or edges, i.e., predicting classes or continuous values for these elements of a given graph. GNNs can also be supervised or unsupervised, providing some flexibility with regard to the application domain.
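The aggregation and update steps described above can be sketched for a toy graph. The four-node topology, the scalar node states, the mean aggregation, and the fixed update weights `w_self` and `w_agg` are all hypothetical; in a real GNN, the update function has trainable parameters and the states are vectors.

```python
# Hypothetical 4-node topology as an adjacency list.
neighbors = {"A": ["B", "C"], "B": ["A", "C"], "C": ["A", "B", "D"], "D": ["C"]}
# Hypothetical scalar node states (for example, a normalized load per node).
state = {"A": 1.0, "B": 0.0, "C": 0.5, "D": 2.0}

def aggregate(node, state):
    """Mean of the states in the node's neighborhood, including the node itself."""
    group = neighbors[node] + [node]
    return sum(state[n] for n in group) / len(group)

def update(own, aggregated, w_self=0.5, w_agg=0.5):
    """Assumed update rule: ReLU of a weighted sum of own and aggregated state."""
    return max(0.0, w_self * own + w_agg * aggregated)

def gnn_layer(state):
    """Apply aggregation then update to every node, using the previous states."""
    return {n: update(state[n], aggregate(n, state)) for n in neighbors}

h1 = gnn_layer(state)   # layer 1: each node sees its one-hop neighborhood
h2 = gnn_layer(h1)      # layer 2: information from two hops away arrives

print(h1)
print(h2)
```

Stacking two layers means that the state of node A after the second layer depends on node D, which is two hops away, even though A and D are not directly connected.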
Another NN that has been used in network layer applications is the generative adversarial network (GAN) [51]. GANs achieve their unique capabilities owing to a design based on a zero-sum game. At a high level, they are composed of two NNs, the discriminator and the generator, which compete against each other. A schematic showing the structure of a GAN is shown in Fig. 6. GANs are designed for realistic data generation and have been successfully used for both image and video data generation in the recent literature. Thus, GANs show potential for traffic data generation in optical networks.

2. Gaussian processes
Gaussian processes (GPs) are a probabilistic ML approach in which the uncertainty associated with predictions is well quantified [52]. This makes them attractive for optical fiber communication systems, in which the accepted failure rate is low and thus knowledge of the limitations of ML models is desirable. GPs can be used for regression or classification and are non-parametric methods [53], meaning that no specific parametric form is assumed for the model but rather Bayes' theorem is used to search the space of functions directly. In the context of GPs, Bayes' theorem can be stated as [52]

$$\text{posterior} \triangleq \frac{\text{prior} \times \text{likelihood}}{\text{marginal likelihood}}, \tag{3}$$

where the posterior is the predictive distribution we wish to obtain, the prior contains the information we know about the target function before we have seen the data, and the likelihood includes information from the measured data. In general, we wish to condition our prior on the measured data in order to obtain the predictive posterior distribution. Figure 7, adapted from the work of Rasmussen and Williams [52], demonstrates a function drawn from an uninformative GP prior, which is then conditioned on data to produce an accurate model.
FIG. 5. An example architecture of a GNN for node-based predictions. The computational graph for target node A is shown on the right, where $N(A)$ represents the neighborhood of node A, $h^{(1)}$ and $h^{(2)}$ represent hidden layers 1 and 2, respectively, and $\Gamma$ and $U$ represent the aggregation and update functions [46], respectively. The complete GNN may comprise computational graphs for multiple nodes of interest.

In general Bayesian inference, this conditioning involves numerical integration to calculate the required posterior. However, in GP regression, we assume that the likelihood function is a Gaussian, which means that these integrals become analytical and thus much less computationally expensive. This assumption is not valid for GP classification, however, making it more computationally demanding than GP regression. GPs are a kernel-based ML method, in which the kernel trick, the fact that it is more computationally efficient to work in the space of inner products than in fixed coordinates, is leveraged [26]. As a result, the user must specify the kernel function at the design stage, which means making an assumption about the features we expect to see in the data. For instance, a commonly used kernel function is a squared exponential kernel plus a white Gaussian noise (GN) kernel, giving

$$k(\mathbf{x}_i, \mathbf{x}_j) = \nu^{2}\exp\left(-\frac{\lVert \mathbf{x}_i - \mathbf{x}_j \rVert^{2}}{2\mu^{2}}\right) + \Xi(\sigma^{2}), \tag{4}$$

where $\nu$ and $\mu$ are scalar hyperparameters controlling the absolute scale and the length scale of the target function, $\mathbf{x}_i$ and $\mathbf{x}_j$ are data points, $\lVert \cdot \rVert$ denotes the Euclidean distance operator, and $\Xi(\sigma^{2}) \sim \mathcal{N}(0, \sigma^{2})$ if $i = j$ and 0 otherwise, where $\mathcal{N}(0, \sigma^{2})$ denotes a zero-mean Gaussian distribution with variance $\sigma^{2}$. Choosing this kernel means assuming a priori, meaning before we have seen the data, that the function we are trying to learn has one length scale and white Gaussian noise. More complex kernels exist to describe features such as periodicity and decay, and one can design a kernel by noting that the sum of any two valid kernel functions is itself a valid kernel function.
GPs are trained by finding the optimal kernel hyperparameters via maximizing the log marginal likelihood in order to find the most likely interpretation of the data [52]. Once optimal hyperparameters are found, the predictive distribution of the GP can be calculated using Algorithm 2.1 of Rasmussen and Williams [52]. The predictive mean function and predictive variance of the GP can then be used to make probabilistic inferences about the data. One of the major issues associated with using GPs is the computational complexity, which is $O(n^{3})$, where $n$ is the number of training examples. It is possible to use sparse approximations that reduce this computational burden [54], at the cost of some accuracy.
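As an illustration of how a predictive mean and variance are obtained once the hyperparameters are fixed, the following sketch follows the Cholesky-based steps of Algorithm 2.1 of Rasmussen and Williams for scalar inputs. Several choices are assumptions made for this example: the squared exponential kernel is parameterized as `nu**2 * exp(-d**2 / (2 * mu**2))`, the noise enters as a variance `sigma2` added to the diagonal, the hyperparameters are fixed by hand rather than optimized, and the data are made up.

```python
import math

def kernel(xi, xj, nu=1.0, mu=1.0):
    """Squared exponential kernel for scalar inputs (noise is added separately)."""
    return nu ** 2 * math.exp(-((xi - xj) ** 2) / (2.0 * mu ** 2))

def cholesky(A):
    """Lower-triangular L with L L^T = A, for symmetric positive-definite A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Forward substitution: solve L y = b."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_lower_transposed(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_predict(X, y, x_star, sigma2=1e-4):
    """Predictive mean and variance of the latent function at x_star."""
    n = len(X)
    K = [[kernel(X[i], X[j]) + (sigma2 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    alpha = solve_lower_transposed(L, solve_lower(L, y))   # (K + sigma2 I)^-1 y
    k_star = [kernel(x, x_star) for x in X]
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = kernel(x_star, x_star) - sum(vi * vi for vi in v)
    return mean, var

# Made-up training data: noiseless samples of sin(x).
X = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [math.sin(x) for x in X]

mean_near, var_near = gp_predict(X, y, 1.0)    # at a training input
mean_far, var_far = gp_predict(X, y, 10.0)     # far from all training data
print(mean_near, var_near)
print(mean_far, var_far)
```

Near the training data, the predictive variance is small and the mean follows the observations; far from the data, the prediction reverts to the zero-mean prior with variance close to `nu**2`. The Cholesky factorization is the step responsible for the cubic scaling with the number of training examples.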

3. Support vector machines
Another kernel-based ML method is the support vector machine (SVM), which can be used for supervised regression and classification [26] and for unsupervised learning [55]. However, the vast majority of SVM use within optical networking is for classification, and therefore, we focus on SVM classifiers here. Unlike standard GPs, SVMs are sparse kernel methods, meaning that the model predictions do not require evaluation of the kernel function for all training examples; rather, we only need to evaluate the kernel for a subset of the training data.
SVM classifiers work by constructing a decision boundary that separates the labeled data into distinct classes such that the margin, defined as the perpendicular distance between the closest data points in each of the classes and the decision boundary, is maximized. These points that are closest to the boundary are known as the support vectors, so-called because they directly specify the position of the boundary. Being the closest to the optimal boundary, these points are also the most difficult to classify. Figure 8 shows an example of a binary SVM classifier, with the decision boundary and support vectors highlighted.

FIG. 8.
Diagram showing the support vectors for a binary SVM classifier, where the data are labeled 1 or −1, adapted from Chap. 7 of Bishop [26]. The margin is also shown, which we maximize in order to find the most general decision boundary.
As a demonstrative example to provide intuition for SVMs, we follow Bishop [26] and consider the simple case of a binary classifier, with data labeled as one of two classes, $t_n \in \{-1, 1\}$, modeled by a linear decision boundary model of the form

$$y(\mathbf{x}) = \mathbf{w}^{T}\phi(\mathbf{x}) + b, \tag{5}$$

where $\mathbf{w}$ is a vector of weights, $\mathbf{x}$ is the vector of inputs, $\phi$ represents a fixed transformation of the input space, and $b$ is a constant. A data point is classified depending on the sign of $y(\mathbf{x})$. It can be shown that, as the distance of the points $\mathbf{x}_n$ to the decision boundary is invariant under a rescaling of $\mathbf{w}$ and $b$, the parameters can be scaled such that all data points satisfy the constraints

$$t_n\left(\mathbf{w}^{T}\phi(\mathbf{x}_n) + b\right) \geq 1 \quad \text{for all } n, \tag{6}$$

and the distance from point $\mathbf{x}_n$ to the decision boundary is given by

$$\frac{t_n\, y(\mathbf{x}_n)}{\lVert \mathbf{w} \rVert}. \tag{7}$$

Thus, we find the decision boundary by solving the constrained optimization

$$\underset{\mathbf{w},\, b}{\arg\min}\ \frac{1}{2}\lVert \mathbf{w} \rVert^{2} \quad \text{subject to Eq. (6)}. \tag{8}$$

It can be shown that this is a quadratic programming problem, which can be tackled using Lagrange multipliers. Once the optimal decision boundary is found, new examples can be classified by their position in the input space relative to the boundary. This is an oversimplification of the SVMs used in practice but should give the reader some intuition for how an optimal decision boundary can be found.
In practice, SVMs are formulated in terms of kernel space, as this allows us to keep the computational load reasonable by working in terms of inner products between the input variables. The kernel is defined in terms of the fixed transformation in Eq. (5) as

$$k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^{T}\phi(\mathbf{x}'). \tag{9}$$

Moreover, the method described above finds a hard decision boundary, which only exists for linearly separable data. In general, SVMs are formulated to find a soft boundary, allowing for some degree of misclassification. Finally, SVMs are not limited to binary classification and can be constructed to facilitate multiple-output classes.
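The geometry of the hard-margin picture above can be checked on a toy dataset. The sketch below does not solve the quadratic programming problem; it takes a hand-picked linear boundary and verifies it, assuming the canonical scaling used by Bishop in which every point satisfies t_n y(x_n) >= 1 and the distance of a point to the boundary is t_n y(x_n) / ||w||. The data, the weights `w` and `b`, and the choice of the identity map for the fixed transformation are all hypothetical.

```python
import math

# Hypothetical linearly separable 2-D data with labels t_n in {-1, +1}.
points = [(2.0, 2.0), (3.0, 3.0), (3.0, 1.0), (0.0, 0.0), (-1.0, 0.0), (0.0, -1.0)]
labels = [1, 1, 1, -1, -1, -1]

# Hand-picked candidate boundary y(x) = w^T x + b (phi is the identity map here).
w = (0.5, 0.5)
b = -1.0

def decision(x):
    """Value of the linear decision function at point x."""
    return w[0] * x[0] + w[1] * x[1] + b

norm_w = math.hypot(w[0], w[1])

# Canonical constraints: t_n * y(x_n) must be at least 1 for every point.
constraints = [t * decision(x) for x, t in zip(points, labels)]

# Perpendicular distance of each point to the boundary.
distances = [c / norm_w for c in constraints]
margin = min(distances)

# Points that meet the constraint with equality lie on the margin.
support_vectors = [x for x, c in zip(points, constraints) if abs(c - 1.0) < 1e-9]

# New or existing points are classified by the sign of the decision function.
predictions = [1 if decision(x) >= 0 else -1 for x in points]

print(margin)
print(support_vectors)
```

Only the points on the margin determine the boundary: moving any of the other points, as long as they stay outside the margin, leaves `w` and `b` unchanged, which is the sparsity property that distinguishes SVMs from standard GPs.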

4. Reinforcement learning
RL is a discipline of ML that involves a learner, known as the agent, that learns interactively by taking actions in its environment, where the environment consists of everything outside of the agent [56]. The environment can be simulated or experimental; a simple example for the case of optical fiber communication networks could be the currently established light paths, current requests, and the SNR of these light paths.
Here, we outline the key concepts of RL, following Chap. 3 of Sutton and Barto [27]. The agent interacts with its environment at a series of time steps $t, t + 1, t + 2, \ldots$. At each time step $t$, the agent takes as input some representation of the state of the environment $S_t \in \mathcal{S}$ and chooses an action $A_t \in \mathcal{A}$, where $\mathcal{S}$ is the set of all possible states and $\mathcal{A}$ is the set of all actions that are possible given a state $S_t$. In the following time step, the agent reaches a new state $S_{t+1}$ and receives a reward $R_{t+1} \in \mathcal{R} \subset \mathbb{R}$. The method by which the agent selects the action $A_t$ given a state $S_t$ is called the policy, denoted as $\Pi_t$, a mapping from states to probabilities of selecting each possible action. Informally, the goal of the agent is to maximize the cumulative reward received over time. A schematic showing how the agent interacts with its environment, adapted from the work of Sutton and Barto [27], is shown in Fig. 9.
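The interaction loop described above can be sketched with a toy environment. The sketch contains no learning: it only shows how states, actions, and rewards are exchanged between an agent following a fixed policy and its environment. The environment `ToyLinkEnv` (a single link with a few wavelength channels), its reward values, and the hand-written `policy` are hypothetical and not taken from any cited work.

```python
import random

class ToyLinkEnv:
    """Hypothetical environment: a single link with a fixed number of channels."""

    def __init__(self, n_channels=3, release_prob=0.3, seed=0):
        self.n_channels = n_channels
        self.release_prob = release_prob
        self.rng = random.Random(seed)
        self.free = n_channels              # state: number of free channels

    def step(self, action):
        """Apply an action (1 = accept request, 0 = reject); return next state and reward."""
        reward = 0.0
        if action == 1:
            if self.free > 0:
                self.free -= 1
                reward = 1.0                # request served
            else:
                reward = -1.0               # accepted without capacity
        # Each occupied channel may be released before the next request arrives.
        occupied = self.n_channels - self.free
        released = sum(self.rng.random() < self.release_prob for _ in range(occupied))
        self.free += released
        return self.free, reward

def policy(state):
    """Mapping from a state to action probabilities {action: probability}."""
    return {1: 1.0, 0: 0.0} if state > 0 else {1: 0.0, 0: 1.0}

def run_episode(env, policy, n_steps, seed=1):
    """Run the agent-environment loop and record (state, action, reward) triples."""
    rng = random.Random(seed)
    state, total = env.free, 0.0
    trajectory = []
    for _ in range(n_steps):
        probs = policy(state)
        action = rng.choices(list(probs), weights=list(probs.values()))[0]
        next_state, reward = env.step(action)
        trajectory.append((state, action, reward))
        state, total = next_state, total + reward
    return trajectory, total

trajectory, total_reward = run_episode(ToyLinkEnv(), policy, n_steps=20)
print(total_reward)
```

An RL algorithm would replace the fixed `policy` with one whose action probabilities are adjusted from the observed rewards.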
We also make a distinction between two types of agent-environment interaction: continuous tasks, in which the number of time steps is infinite, and episodic tasks, for which the interaction consists of a series of episodes, each with a terminal time step. We denote this terminal step as $T$, and as RL has been applied to both continuous and episodic tasks in optical networks, we introduce a general notation that is valid for both types, in which a continuous task is represented by $T = \infty$. Thus, the agent aims to maximize the expected value of the discounted return,
$$G_t = \sum_{k=0}^{T-t-1} \kappa^{k} R_{t+k+1}, \tag{10}$$
where $\kappa \in [0, 1)$ is a parameter called the discount factor, which controls the value of future rewards at the present time step. If $\kappa = 0$, the agent will learn to maximize the immediate reward, whereas as $\kappa$ approaches 1, the agent will strongly weight future rewards when choosing a policy.
FIG. 9. Diagram showing the interaction between the RL agent and the environment. At time step $t$, the agent receives a state $S_t$ from the environment and chooses to take an action $A_t$. In the subsequent time step, this yields a state $S_{t+1}$ and returns a reward $R_{t+1}$. By iterating through this process, the agent learns to maximize the long-term reward.
An important element of the RL framework is that we desire to have a state representation that conveys to the agent all relevant information about the environment such that the probability of entering a specific new state at $t + 1$ can be defined only in terms of the state and action representations at $t$. In other words, we do not need the entire set of previous states and actions to find an optimal policy, but only the state and action at the previous time step. State representations that satisfy this are said to have the Markov property, and tasks that involve learning with a Markov state are called Markov decision processes (MDPs). For a finite MDP, meaning an MDP for which the state and action spaces are finite, we can completely determine the dynamics by the probability distribution
$$p(s', r \mid s, a) = \Pr\{S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a\}, \tag{11}$$
where $s$ and $a$ are a given state and action, $s'$ is the new state, and $r$ is the reward received. Here, it is assumed that $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $r \in \mathcal{R}$.
Using Eq. (11), we can compute all other quantities needed by the RL agent.
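As a small numerical illustration of how the discount factor weights future rewards, the sketch below evaluates the discounted return for a finite reward sequence. The reward values and the three discount factors are hypothetical and chosen only to show the limiting behavior described above.

```python
def discounted_return(rewards, kappa):
    """Compute sum_k kappa**k * rewards[k], where rewards[k] plays the role of R_{t+k+1}."""
    g = 0.0
    # Accumulate backwards so that each step applies one more factor of kappa.
    for reward in reversed(rewards):
        g = reward + kappa * g
    return g

rewards = [1.0, 1.0, 1.0]                        # hypothetical episode of three rewards
myopic = discounted_return(rewards, 0.0)         # only the immediate reward counts
balanced = discounted_return(rewards, 0.5)       # 1 + 0.5 + 0.25
far_sighted = discounted_return(rewards, 0.99)   # close to the undiscounted sum of 3
```

With a discount factor of 0 the return collapses to the first reward, while a value near 1 makes it approach the plain sum of rewards.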
In order to learn an optimal policy, RL algorithms attempt to estimate the value function, defined as the expected value of the discounted cumulative reward obtained by starting in a state $s$ and following policy $\Pi$,
$$v_{\Pi}(s) = \mathbb{E}_{\Pi}\left[\sum_{k=0}^{T-t-1} \kappa^{k} R_{t+k+1} \,\middle|\, S_t = s\right]. \tag{12}$$
Crucially, it can be shown that Eq. (12) follows a recursive relationship that has the form of a Bellman equation, 27,57
$$v_{\Pi}(s) = \sum_{a} \Pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[r + \kappa\, v_{\Pi}(s')\right]. \tag{13}$$
This relationship allows the agent to compute an approximation to $v_{\Pi}$. The agent's goal of maximizing the long-term cumulative reward can be stated as finding the policy that has an optimal value function, and we can write a Bellman equation
$$v_{*}(s) = \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[r + \kappa\, v_{*}(s')\right]. \tag{14}$$
Here, $v_{*}$ denotes the optimal value function, which may be achieved by more than one policy but will always exist for a finite MDP. In practice, the computational cost of computing $v_{*}$ exactly is too high, and thus, we learn a suitably good approximation.
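For a finite MDP that is small enough, the Bellman equation for the optimal value function can be solved by repeatedly applying it as an update rule, a procedure usually called value iteration. The sketch below is an illustration only: the two-state MDP, its rewards, and the dictionary layout used to store the dynamics are hypothetical, and realistic optical networking problems are far too large for this exhaustive approach.

```python
def value_iteration(dynamics, kappa, tol=1e-10):
    """Iterate v(s) <- max_a sum_{s', r} p(s', r | s, a) * (r + kappa * v(s')).

    dynamics[s][a] is a list of (next_state, reward, probability) tuples,
    a hypothetical encoding of the dynamics distribution p(s', r | s, a).
    """
    values = {s: 0.0 for s in dynamics}
    while True:
        delta = 0.0
        for s, actions in dynamics.items():
            best = max(
                sum(p * (r + kappa * values[s2]) for s2, r, p in outcomes)
                for outcomes in actions.values()
            )
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:  # stop once a full sweep changes no value noticeably
            return values

# Hypothetical deterministic two-state MDP.
dynamics = {
    "A": {"stay": [("A", 0.0, 1.0)], "go": [("B", 1.0, 1.0)]},
    "B": {"stay": [("B", 2.0, 1.0)], "go": [("A", 0.0, 1.0)]},
}
optimal_values = value_iteration(dynamics, kappa=0.5)
```

Here staying in state B yields 2 + 0.5 v(B), so v(B) = 4, and moving from A to B yields 1 + 0.5 × 4, so v(A) = 3.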
There are a number of different algorithms for finding $\Pi_{*}$, and these algorithms can be either model-based or model-free. 58 Model-based RL algorithms are concerned with computing an optimal policy for an MDP, assuming that a perfect model of the environment is available. Contrastingly, model-free algorithms do not rely on the assumption that such a model exists, but rather sample the MDP to obtain statistical knowledge about the unknown model. Such algorithms do not attempt to construct a model of the environment. Moreover, RL algorithms can be further categorized: for on-policy approaches, the agent will update its action-value function using the action determined by the current policy, whereas for off-policy approaches, a different policy is used to select the action. 27 Commonly, off-policy algorithms will utilize the ε-greedy policy, in which a threshold $\varepsilon \in [0, 1] \subset \mathbb{R}$ is selected, and at each time step, a random real number is generated between 0 and 1. If the value of this number is greater than ε, the agent will perform the action that maximizes the expected cumulative reward; otherwise, it will perform a random action. This demonstrates the trade-off between exploration and exploitation that is crucial within RL: only exploiting current knowledge leads to short-sighted policies, but we need to refine successful policies to achieve high performance. Therefore, it is important to allow some degree of continuous exploration of the environment to achieve a policy that is optimal in the long term. 59
One final distinction that will be encountered in the RL literature is that of value-based algorithms, in which the value function is parameterized in order to find an approximation to the optimal policy, 60 and policy-based algorithms, where the policy is parameterized instead. 60 Finally, it is possible to combine these approaches by utilizing two learners, known as the actor and the critic. The actor learns the optimal action to take for a given state, and the critic learns to compute the value function of a given action. 27
Below, we summarize the specific RL algorithms used by works referenced in this Tutorial, highlighting useful references for the reader. The algorithms included in this section are within the scope of deep reinforcement learning (DRL), a sub-field of RL that has become of great interest in the recent literature owing to its successful adaptations in several application domains. 61 DRL relies on the intersection of reinforcement learning (RL) and deep learning (DL). In general, DRL algorithms incorporate DL to solve MDPs, often representing the policy or other learned functions as a NN.
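The ε-greedy rule described above can be sketched in a few lines. This is a minimal illustration: the action names and value estimates are hypothetical (loosely modeled on choosing among candidate routes), and the random generator is seeded only to make the example repeatable.

```python
import random

def epsilon_greedy(action_values, epsilon, rng):
    """Select an action from a dict mapping actions to estimated values.

    A random number in [0, 1) is drawn; if it exceeds epsilon the agent
    exploits (greedy action), otherwise it explores (uniformly random action).
    """
    if rng.random() > epsilon:
        return max(action_values, key=action_values.get)
    return rng.choice(list(action_values))

# Hypothetical value estimates for three candidate routes.
action_values = {"route_1": 0.2, "route_2": 0.9, "route_3": 0.5}
rng = random.Random(0)

greedy_choices = [epsilon_greedy(action_values, 0.0, rng) for _ in range(100)]
random_choices = [epsilon_greedy(action_values, 1.0, rng) for _ in range(300)]
```

Setting ε to 0 gives purely exploitative behavior, while ε equal to 1 makes every action random; intermediate values trade off the two.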
Deep Q learning is a model-free value-based DRL algorithm that involves trying to find an optimal action-value function for a policy $\Pi$. 62 The key idea is to use a deep NN (DNN) to estimate the optimal action-value function,
$$q_{*}(s, a) = \max_{\Pi}\, \mathbb{E}\left[\sum_{k=0}^{T-t-1} \kappa^{k} R_{t+k+1} \,\middle|\, S_t = s, A_t = a, \Pi\right]. \tag{15}$$
This method is best suited to solving RL problems with discrete low-dimensional action spaces. 63
Asynchronous advantage actor-critic, or A3C, is another model-free DRL algorithm. In contrast to value-based deep Q learning, A3C is policy-based, and the policy is parameterized by a NN in order to learn an approximation to the optimal policy. 60 The asynchronous aspect of A3C comes from the fact that multiple agents are trained in parallel on copies of the environment, providing asynchronous updates to the model weights. Crucially, this results in greater exploration of the space and hence improved performance over other algorithms such as deep Q learning for a number of tasks. 60
Another commonly used RL algorithm is deep deterministic policy gradient (DDPG), 63 an extension of the deterministic policy gradient (DPG) algorithm 64 inspired by deep Q learning. The key idea behind DPG is to assume a deterministic policy, the gradient of which can be shown to follow the gradient of the action-value function $q(s, a)$. In DDPG, this is extended by using DNNs to parameterize the actor function and by employing some innovative techniques from deep Q learning and DL. 64 The resulting algorithm is effective for exploring continuous action spaces, addressing a shortcoming of deep Q learning.
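To give some intuition for value-based learning of the action-value function, the sketch below shows the classical tabular Q-learning update. This is a simplified stand-in and not deep Q learning itself: in deep Q learning the lookup table `q` would be replaced by a DNN and the update by a gradient step. The states, actions, reward, and step size are hypothetical.

```python
def q_learning_update(q, state, action, reward, next_state, kappa, step_size):
    """Move q[state][action] toward the target reward + kappa * max_a' q[next_state][a']."""
    target = reward + kappa * max(q[next_state].values())
    q[state][action] += step_size * (target - q[state][action])
    return q[state][action]

# Hypothetical action-value table with two states and two actions.
q = {
    "s0": {"left": 0.0, "right": 0.0},
    "s1": {"left": 1.0, "right": 2.0},
}
# One observed transition: in s0 the agent moved right, received reward 1, and reached s1.
updated = q_learning_update(q, "s0", "right", reward=1.0, next_state="s1",
                            kappa=0.5, step_size=0.1)
```

The target is 1 + 0.5 × 2 = 2, so a step size of 0.1 moves the stored value from 0 to 0.2.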

III. PHYSICAL LAYER APPLICATIONS
In this section, we outline several key research problems within the physical layer and highlight selected applications of ML to these problems from the literature. Specifically, we discuss quality of transmission (QoT) estimation, digital twins, equalization in short reach applications, and fiber nonlinear noise mitigation in long-haul transmission systems. A summary of the works discussed, detailing the physical layer applications tackled and the different ML techniques proposed, is given in Table I.

A. Quality of transmission estimation
One of the most widely researched applications of ML in optical fiber communications is QoT estimation, evidenced by a recent survey focusing on this application alone. 6 QoT is an umbrella term for a number of metrics of the quality of a transmitted optical communication signal, including SNR, bit error rate (BER), and Q-factor. 65 ML techniques are a logical approach to QoT estimation because of the numerous sources of uncertainty that make the estimation and prediction of QoT challenging 6 and the necessity of QoT estimation for performing network level control, such as for the routing and spectrum assignment of new light paths. A number of models of QoT exist that are based on the physics of transmission within the fiber, which have varying degrees of accuracy. Two commonly used examples are the Gaussian noise (GN) model 66 and the split-step Fourier transform method (SSFTM). 67 However, their applicability is limited by restricted accuracy and high computational requirements, respectively. Moreover, both are limited by uncertainty in the physical layer inputs, with the magnitude of these uncertainties varying between deployed networks. For example, installed fibers can be accidentally damaged and then spliced back together, resulting in variations in the fiber attenuation. Moreover, other components such as amplifiers and filters can suffer degradation in performance as they age, 68 which can change physical layer parameters such as the noise figure of erbium-doped fiber amplifiers (EDFAs). Additionally, parameters such as the fiber type and fiber chromatic dispersion (CD) may not be known to the operator in deployed networks. 69 ML can be used either as a replacement for physics-based models or alongside them in order to combat the input uncertainty and to reduce the computational burden.
The QoT estimation sub-domain can be further divided into three main problems. First, ML can be used to predict the QoT from physical layer inputs, such as the number of channels, operating wavelength, modulation format, and number of spans. This can be formulated as a regression problem, where the QoT itself is the target, or as a classification problem, where the goal is to predict whether or not a given light path will have sufficient QoT. Second, ML can be deployed to aid with QoT monitoring, commonly to learn the mapping between the variables that are measured by using monitors and the QoT, often for the purpose of prediction of failures. 3,6 Finally, the modeling of the optical amplifiers used in optical fiber communications presents a challenge due to the nonlinear dependence of amplifier gain on wavelength, channel launch power, and the number of channels. As amplifiers can have a significant effect on the QoT, there have been a number of works in which ML has been applied to modeling amplifiers. 3,5,6
Here, we outline selected examples of ML applied to QoT estimation from the literature that demonstrate what is typical in the field. Interesting works using regression include using a simple learning process based on gradient descent 40 to reduce the uncertainty in the inputs to a physics-based QoT model. 70 This represents a hybrid approach where ML is used in concert with physical models of the QoT, rather than relying solely on the data. A similar approach was also demonstrated experimentally: a learning process was used to update the parameters of a physical model based on measurements of the Q-factor of an experimental system. 71 Additionally, ML based on the Levenberg-Marquardt algorithm (LMA) 72 was recently utilized for online optimization of the inputs to the GN model, specifically the launch powers, for a simulated network. 73 Interestingly, the number of iterations used for the optimization is adaptive, which reduces the time and measurement resources required to perform the optimization. Again, the role of ML here is to configure the inputs to the physical model, rather than replacing it. There are also approaches in which the goal is to replace the physical model. For example, a GP regression model has been used to learn the functional relationship between the BER and system transmission parameters, specifically the launch power, length of fiber over which the signal is transmitted, symbol rate, and channel spacing. 74 This model was trained on both simulated and experimental data, and it was shown that the model could make accurate predictions on a system with a different configuration to that upon which it was trained. As many of the QoT estimation works utilize NNs, this Tutorial highlights that more principled approaches such as GPs with well-quantified predictive uncertainty can also be used successfully for QoT estimation. Moreover, an experimental network has been operated at a reduced margin via a case-based reasoning (CBR) approach, 75 where margin means the difference between the current signal QoT and the minimum acceptable QoT.
In CBR, the QoT for established light paths is stored and used as a lookup table to estimate the QoT of new light paths that take a similar route through the network. This work is particularly interesting as it demonstrates that ML, albeit a simple version of it, can be useful for controlling an experimental optical fiber network; in this case, it allows us to reduce the required margin. Another recent experimental demonstration of the efficacy of ML-based QoT estimation utilized NNs trained on synthetic QoT data to estimate the SNR on a live network operated by Tele2 Estonia. 76 Crucially, these models demonstrated a maximum SNR error of 0.5 dB and were able to compute the SNR estimate on a microsecond scale, indicating that such models could feasibly be deployed in real networks.
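The lookup-table idea behind CBR can be sketched as a nearest-case search. The example below is illustrative only and does not reproduce the cited work: the Jaccard similarity between sets of traversed links, the route encoding, and all QoT values are assumptions made for this sketch, and routes are assumed to be non-empty.

```python
def route_similarity(route_a, route_b):
    """Jaccard similarity between the sets of links traversed by two routes (assumed metric)."""
    links_a, links_b = set(route_a), set(route_b)
    return len(links_a & links_b) / len(links_a | links_b)

def estimate_qot(case_base, new_route):
    """Return the stored QoT of the established light path most similar to new_route."""
    _, best_qot = max(case_base,
                      key=lambda case: route_similarity(case[0], new_route))
    return best_qot

def margin_db(current_qot_db, minimum_qot_db):
    """Margin: the current signal QoT minus the minimum acceptable QoT, in dB."""
    return current_qot_db - minimum_qot_db

# Hypothetical case base of (route, measured SNR in dB) for established light paths.
case_base = [
    (("A-B", "B-C"), 18.0),
    (("A-D", "D-C"), 15.5),
    (("A-B", "B-E", "E-C"), 16.2),
]
new_route = ("A-B", "B-C", "C-F")

estimated_snr = estimate_qot(case_base, new_route)  # most similar case shares two links
margin = margin_db(estimated_snr, 15.0)             # 15.0 dB threshold is hypothetical
```

A practical system would likely weight the stored cases by further attributes, such as modulation format or path length, rather than relying on link overlap alone.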
DNNs have also been used recently to estimate the SNR based on historical telemetry of the optical amplifiers in an experimental system, focusing on the effect of the amplifiers rather than the nonlinear noise generated by transmission in a fiber, which the authors assume can be estimated using a physical model. 77 Moreover, a NN-driven nonlinear SNR estimator was presented, for which the optimal combination of input features was found. 78 In this work, knowledge of the physics of fiber transmission is used to aid with feature engineering, in order to obtain the set of input features with the highest efficacy.
Classifiers have also been leveraged for QoT estimation, such as a binary NN classifier trained on historical network data, which was used to determine whether or not a given request will have sufficient QoT to be established. 79 The performance of this classifier was compared to that of an analytical QoT model, 80 and the classifier was found to be an efficient replacement for this model, while providing the key benefit of self-adaptivity to changes in the network conditions. Another work 81 utilized an SVM classifier model, again as a binary classifier designed to label light paths as having sufficiently high QoT to be established or not. Simulated data are used for training, as is common in network-scale research due to the lack of availability of detailed datasets from deployed networks.
Furthermore, an interesting research avenue within QoT estimation is the use of physics-based models in concert with ML. This can be done in a number of ways; for instance, physical models can be embedded into the ML model directly. For example, a methodology for the training of NNs that obey physical laws defined by partial differential equations was recently presented. 82 The first steps toward using this in optical fiber communications have been taken, where a physics-informed NN was used to solve the nonlinear Schrödinger equation (NLSE) in an optical fiber and model pulse evolution. 83 An alternative approach is the physics-informed GP regression method, in which a physical model, in this case the SSFTM, is embedded within the GP. 84 This allows one to train GPs with fewer measurements of the system and represents an explainable ML approach with a well-quantified prediction uncertainty. Additionally, there are works such as those described above, 70,71 which focus on learning more accurate inputs to a physics-based QoT model. A similar approach has been applied to nonlinearity estimation. 85 Specifically, ML is utilized to reduce physical model errors and to combine modeling and monitoring schemes for nonlinearity estimation. Moreover, it is possible to use our knowledge of system physics to improve ML in other ways, such as to engineer higher performing input features. 78
Finally, a recent paper 86 highlights the remaining roadblocks that stand in the way of effective deployment of ML in QoT estimation. Specifically, due to competition-related concerns, telecommunication companies are not willing to give external researchers access to real network datasets, resulting in a reliance on simulated data or data that are produced using a lab setup. Due to the limitations of physics-based models outlined above and the fact that a lab-based network is always going to be more idealized than a deployed network, such data may not be fully representative of deployed network data. As a result, the true efficacy of ML approaches for deployed networks is unknown. Moreover, many of the applications of ML in optical networks utilize error metrics that are standard in ML but may not be suited to optical networks. For example, it has been found that for optical network applications of ML, using only the mean squared error may result in an inflated measure of model efficacy, and novel error metrics have been recently proposed to address this. 87 Thus, although the first problem is tricky to address and is largely up to network operators, the second problem provides an interesting avenue for further research.

B. Digital twins
Digital twins are models that act as a virtual copy or "twin" of a real system. They are inherently data-driven, 88 taking as input measurements from the real system to build up a model of its governing physical laws, states, and behavior. Information drawn from the digital twins can then be passed to the real system in the form of changes to its operational configuration. This framework is outlined in Fig. 10. As we move toward higher levels of automation in optical communication network design and operation, digital twins are gaining increasing popularity within the research community. 89 It is hoped that digital twins can help bridge the gap between the ideal physical layer that is commonly assumed in optical communications and physical layer behavior in deployed networks, which is far from ideal. Although ML is not a required component of digital twins, due to their data-driven nature, it is natural that ML approaches can be useful for creating digital twin models. ML can be used as the basis for the digital twin itself: we can take measurements from the real network and train a sufficiently complex ML algorithm to emulate the behavior of the network. Alternatively, we can build the digital twin from physics-based models and utilize ML to reduce the gap between these models and reality. For example, we can use ML to reduce the uncertainty in the model inputs, as discussed in Sec. III A. Additionally, ML can also be used in order to extract more information from network monitors, which may allow for the development of more detailed digital twins.
A framework for applying digital twins in optical networks has recently been proposed, 89 focusing on three crucial applications: fault prediction, hardware configuration, and simulation of transmission. Different ML approaches from the literature are proposed for each of these applications. For fault prediction and diagnosis, two models are proposed: a RNN to extract the operating state from time series data taken from monitors 90 and an XGBoost 91 model to map information from network monitors to new features to aid with fault diagnosis. 92 Moreover, DRL is proposed to learn an optimal strategy for hardware optimization. 93 Specifically, the agent learns to control the configuration of the programmable optical transceiver in order to maximize the QoT for varying operating conditions. Finally, a RNN-based approach is proposed to learn a model of the physical layer transmission in the network as a function of time series monitoring data. 94 Thus, the digital twin is created by combining these models, continually updating them with new data and using them to control the network. 89 Another recent work demonstrated a digital twin model based on an autoencoder, which is trained on an open-source dataset of power spectral density (PSD) profiles before and after transmission through an experimental optical network. 95,96 Specifically, this model is used to find the input PSD that produces a desired output PSD. Thus, this model can be used as part of a digital twin to optimize the network. It should also be noted that PSD may be a less widely understandable QoT metric and that methods to obtain the optical signal-to-noise ratio (OSNR) from PSD data have been proposed that could be used to convert these data, either before or after training the autoencoder. Other works have successfully utilized autoencoders for end-to-end learning of an intensity modulation direct-detection (IM/DD) optical communication system, outperforming conventional signal processing techniques. 97 This has been recently extended to include optimization of the symbol distribution for coherent systems. 98 Such techniques may be of use in the development of digital twins as they constitute an end-to-end virtual model of the system with inherent mapping and feedback between the virtual model and the physical system.
FIG. 10. Schematic showing the digital twin framework, adapted from the work of Wang et al. 89 Monitoring data from the physical network are stored in a database, and information is extracted from these data, from which a virtual model is built. This model is used to provide feedback to the physical network, while any changes to the network state are mapped back to the virtual model.

C. Equalization in short reach applications
Optical short reach systems, defined as having a length less than 100 km, are applied in server-to-server, intra-data center, inter-data center, access, and metro links. Due to stringent requirements of low complexity and cost, minimal power consumption, and small carbon footprint, IM combined with DD with a simple on-off keying (OOK) or pulse amplitude modulation (PAM)-4 modulation format is still a preferable transceiver technology compared to coherent systems. 99 Increasing demand for high data rate short reach applications such as IM/DD-based systems causes several performance limiting factors that need to be addressed. A schematic of a short reach link with possible sources of linear and nonlinear impairments is shown in Fig. 11. First, chromatic dispersion (CD) severely limits the link power budget margin. With a high symbol rate and several kilometers of transmission, the interaction of CD and DD causes a power-fading effect, and the detected signal may contain frequency notches. DD is based on square law detection, which complicates the CD equalization, as we cannot simply multiply the received signal spectrum with the inverse of the CD transfer function as in coherent systems. Another common impairment in short reach systems is considerable low-pass filtering effects due to the insufficient bandwidth of various components, which can cause severe inter-symbol interference. Furthermore, as short reach systems often have constrained financial budgets, low-cost components produce non-idealities, resulting in performance degradation. Similarly, low-cost devices such as lasers, modulators, photodiodes, and trans-impedance amplifiers also produce nonlinear distortions, such as level-dependent skew and level-dependent noise. 100
For equalization of linear impairments, a feed-forward equalizer (FFE), usually based on a finite impulse response filter, is commonly used. The effect of frequency notches cannot be mitigated by using a FFE, although a decision feedback equalizer (DFE) can be added after a FFE to combat such an effect. However, DFEs may suffer from error propagation and instability due to the decision feedback scheme. Moreover, FFEs/DFEs cannot mitigate the nonlinear effects. Volterra nonlinear equalizers are an effective way to mitigate both fiber nonlinearity and component nonlinearities. 101 However, the major drawback of this equalizer is the large implementation and computational complexity.
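The structure of the conventional FFE and DFE described above can be sketched as follows. This is a toy illustration under stated assumptions: a two-level (OOK-like) alphabet of ±1, a hypothetical noiseless channel with a single strong post-cursor tap, and fixed, hand-chosen filter taps in place of adaptively trained ones.

```python
def feed_forward_equalizer(samples, taps):
    """FIR filter: out[n] = sum_k taps[k] * samples[n - k], with zeros before the start."""
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, tap in enumerate(taps):
            if n - k >= 0:
                acc += tap * samples[n - k]
        out.append(acc)
    return out

def slicer(value):
    """Hard decision for a two-level signal with levels -1 and +1."""
    return 1 if value >= 0 else -1

def decision_feedback_equalizer(samples, feedback_taps):
    """Subtract the interference predicted from past decisions, then slice."""
    decisions = []
    for n, sample in enumerate(samples):
        interference = 0.0
        for j, tap in enumerate(feedback_taps):
            if n - 1 - j >= 0:
                interference += tap * decisions[n - 1 - j]
        decisions.append(slicer(sample - interference))
    return decisions

# Hypothetical transmitted symbols and a channel r[n] = s[n] + 1.2 * s[n - 1].
symbols = [1, -1, -1, 1, 1, -1]
received = feed_forward_equalizer(symbols, [1.0, 1.2])  # FIR used here as the channel model

ffe_output = feed_forward_equalizer(received, [1.0])    # trivial single-tap FFE
naive_decisions = [slicer(v) for v in ffe_output]       # slicing without feedback
dfe_decisions = decision_feedback_equalizer(ffe_output, [1.2])
```

In this noiseless toy case, the feedback path cancels the post-cursor interference exactly; with noise, a wrong decision would be fed back, which is the error propagation issue noted above.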
Recently, ML techniques have attracted significant attention for equalization of short reach systems. Among different ML-based techniques, NN-based equalization is at the center of this interest. A sufficiently large NN having at least one hidden layer can approximate any function and thus can be used as an equalizer of both linear and nonlinear impairments. Usually, the input vector of the equalizer corresponds to a set of consecutive sampled symbols. This vector should be long enough to cover the channel memory. The NN can be structured with a single hidden layer and a large number of nodes or with multiple hidden layers (i.e., a DNN) and relatively fewer nodes. The choice of nonlinear activation function in each hidden layer is important as it enables approximation of nonlinear functions to deal with the distortion of short reach systems. The commonly used hidden layer activation functions are the sigmoid function, the rectified linear unit (ReLU), and the hyperbolic tangent (tanh) function. On the other hand, the Softmax activation function is usually chosen for the output layer, as this function facilitates making symbol decisions for any PAM-N signal in addition to the equalization. 102 In several experimental demonstrations, it has been shown that NN-based equalizers outperform conventional equalizers, such as FFE and Volterra nonlinear equalizers. 102,103 In addition, a field programmable gate array (FPGA) implementation of a fixed point DNN-based equalizer was demonstrated for high-speed passive optical networks. 104
The CNN-based equalizer was also investigated by Li et al. 105 As the convolution layer acts as a multi-channel nonlinear learned local pattern detector, it allows the equalizer to overcome the inter-symbol interference and device nonlinearity. In CNN-based nonlinear compensation, the time series input signal is converted to a 1D input array with N elements, comprising (N − 1)/2 past symbols and (N − 1)/2 post symbols around the current symbol, followed by multiple convolutional layers and fully connected layers with a nonlinear activation function. Experimental demonstrations showed that the CNN-based approach yields a considerable improvement as compared to a DNN-based approach. 105,106
RNNs also exhibit powerful equalization capabilities compared to feed-forward NNs (FFNNs), such as multilayer perceptrons (MLPs) and CNNs, as they can use the feedback of past output values as an additional input while calculating the present output value. 107,108 With such additional feedback information, RNNs perform better than FFNNs, which is analogous to the performance improvement given by the combination of FFE and DFE compared to FFE only. Auto-regressive RNN and layer RNN are two commonly used types of RNN, and the former has better equalization performance. 109 An RNN-based equalizer with parallel outputs was investigated using an FPGA implementation for a 100 Gb/s passive optical network application. 110 As a variant of RNNs, LSTMs were also demonstrated for the equalization of a 50 Gb/s PAM-4 transmission system. 111 In addition to various NN-based equalizers, SVM-based approaches have been demonstrated as an effective tool for mitigation of nonlinear impairments in a short reach application scenario. 112,113
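The input windowing and the Softmax symbol decision used by NN-based equalizers can be sketched as a single forward pass. This is a structural illustration only: the weights below are arbitrary, untrained, hypothetical values, so the resulting decisions carry no meaning, and a practical equalizer would learn its weights from training symbols.

```python
import math

def sliding_windows(samples, window_len):
    """Build one input vector per symbol: (N - 1)/2 past and post samples, zero-padded."""
    half = (window_len - 1) // 2
    padded = [0.0] * half + list(samples) + [0.0] * half
    return [padded[i:i + window_len] for i in range(len(samples))]

def dense(inputs, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output node."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def softmax(logits):
    """Numerically stable Softmax giving one probability per PAM level."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def nn_equalizer_decision(window, hidden_w, hidden_b, out_w, out_b):
    """One hidden tanh layer followed by a Softmax output over the PAM levels."""
    hidden = [math.tanh(v) for v in dense(window, hidden_w, hidden_b)]
    probs = softmax(dense(hidden, out_w, out_b))
    return probs.index(max(probs)), probs

# Hypothetical received PAM-4 samples and untrained illustrative weights.
samples = [-3.0, -1.0, 1.0, 3.0]
windows = sliding_windows(samples, window_len=3)
hidden_w = [[0.0, 0.5, 0.0], [0.1, 0.3, 0.1]]   # 2 hidden nodes, 3 inputs each
hidden_b = [0.0, 0.0]
out_w = [[-2.0, -1.0], [-0.5, -0.5], [0.5, 0.5], [2.0, 1.0]]  # 4 PAM-4 classes
out_b = [0.0, 0.0, 0.0, 0.0]

level_index, probs = nn_equalizer_decision(windows[1], hidden_w, hidden_b,
                                           out_w, out_b)
```

The window length sets how much channel memory the equalizer can see, which is why it must be chosen with the impairments of the link in mind.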
The computational complexity of the nonlinear equalizer is a critical issue for short reach optical communications because the equalizer needs to be implemented in real time, operating at an extremely high symbol rate. It has been shown that a NN-based equalizer with a single hidden layer can provide better performance with lower computational complexity compared to a Volterra equalizer. 114,115 However, a comprehensive analysis of computational complexity and performance for various advanced ML-based equalization approaches is required. In addition, techniques for reducing complexity need to be explored. Given that there is significant potential for practical NN-based equalizers to be implemented on digital signal processing (DSP) ASICs, ML-based equalization may become the mainstream technology for next generation short reach IM/DD-based systems.

D. Fiber nonlinear noise mitigation in long-haul transmission systems
In long-haul fiber transmission systems, the optical signal suffers from noise-like nonlinear distortions due to the optical Kerr effect. Generally, a system of coupled NLSEs is used to describe the evolution of the complex-valued envelopes of the electrical field in the optical fiber, 116 in which $\mathbf{E}(z, t) = [E_x(z, t), E_y(z, t)]^{T}$ is the Jones vector, $\alpha$ is the fiber loss coefficient, $\beta_2$ is the fiber group velocity dispersion coefficient, and $\gamma$ denotes the fiber nonlinear coefficient.
Although the SSFTM can be used to numerically solve the NLSE, the accuracy is low when the interplay among signal, noise, nonlinearity, and dispersion effects is considered. Therefore, the performance improvement of the conventional digital backpropagation (DBP) method based on the NLSE is limited. 117 Since the performance improvement is related to the modeling accuracy, ML techniques can be applied to describe the evolution of the optical signal after long-haul transmission. Specifically, ML techniques are applied to find a nonlinear function $f$ that can map the received symbol to the transmitted symbol under certain criteria.
Unlike in short reach transmission scenarios, the nonlinear function f has to be obtained by separating the I and Q branches of the complex-valued signal.In the early works, 118 an artificial neural network (ANN) has been used in a coherent receiver after CD compensation with extreme learning machine (ELM)based training techniques.The simulation results for 27.59 GBd/s return-to-zero (RZ) quadrature phase shift keying (QPSK) show that the ELM-based technique can provide similar performance to conventional DBP with much lower computational complexity after 2000 km standard single-mode fiber (SSMF) transmission.Recently, LSTMs have been proposed to mitigate the fiber nonlinear impairments in dual polarization WDM transmission systems.It was shown in simulation that LSTMs can provide better performance than conventional DBP techniques with six steps per span. 119t is known that the nonlinear noise is non-Gaussian distributed.Therefore, conventional linear boundaries are not effective in the nonlinear fiber channels.One general idea of ML-based coherent receivers is to design nonlinear decision boundaries.These are assumed to be more suitable for the nonlinear fiber channel because the nonlinear noise generated in the fiber channel need not be a Gaussian distribution.A few techniques have been applied to design such nonlinear classifiers.An M-ary SVM has been introduced to mitigate the nonlinear phase noise in the single-channel single-polarization (SCSP) 16-QAM coherent optical systems.Compared with the linear channel equalization case, the simulation results show that M-ary SVMs can increase the optimal launch power by around 4 dB and extend the transmission distance by around 1200 km. 
120The K-nearest neighbor (KNN) algorithm has also been utilized to mitigate the channel impairments, including the laser phase noise and nonlinear fiber noise.The simulation results show that the optimal launch power can be enhanced by ∼0.4 dB in the SCSP 16-QAM coherent transmission system. 121Another work using K-means clustering 122 experimentally investigated the requirements of the length of the training symbols for the fiber nonlinear mitigation in the SCSP 64-QAM 80-km transmission scheme.It was observed that a 10% training overhead is sufficient to obtain the optimal performance.Another recent publication utilizing nonlinear classification is based on the Parzen window (PW) classifier technique, which is inherently a multi-class technique and can be implemented in online learning mode. 123Considering the DBP technique as a benchmark, simulation results prove that a PW classifier can further improve the performance by ∼0.35 and ∼0.2 dB for 16-QAM after 1600 km and 64-QAM after 480-km fiber transmission.A density-based spatial clustering of applications with noise algorithm was employed for blind fiber nonlinearity compensation. 124The experimental result showed that this algorithm can provide up to 0.83 and 8.84 dB enhancement in the Qfactor when compared to conventional k-means clustering and linear equalization, respectively, in a 40 Gb/s 16-QAM system after 50-km SSMF transmission.A histogram-based clustering TUTORIAL scitation.org/journal/appalgorithm was also demonstrated in a coherent optical long passive optical network, which achieves a Q-factor 0.57 dB higher than that achieved using maximum likelihood and dB higher than that obtained using k-means clustering. 
125 In another recent work, an FPGA-based real-time fiber nonlinearity compensator using the sparse K-means++ clustering algorithm was experimentally demonstrated in a 40 Gb/s 16-QAM self-coherent optical system. This resulted in a 3 dB improvement in the Q-factor compared to linear equalization at 50-km transmission. 126 More recently, a DNN-based nonlinear classifier with a cross-entropy cost function was used as a soft-demapper for soft-decision forward error correction (FEC). 127 In optical coherent 92 GBd dual polarization 64-QAM 950 Gb/s back-to-back measurements, the DNN-based nonlinear classifier is shown to outperform pruned Volterra nonlinear equalizers by 0.35 dB in OSNR with equal complexity, or to achieve similar performance with 65% less computational complexity.
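The idea behind clustering-based decisions can be illustrated with a small sketch. The channel below is a toy stand-in chosen for illustration (a phase rotation proportional to symbol power plus Gaussian noise), not a fiber model, and the values of `phase_per_power`, `sigma`, and the number of symbols are hypothetical. The sketch compares fixed minimum-distance decisions against the ideal 16-QAM points with decisions against centroids found by plain k-means initialized at those ideal points.

```python
import cmath
import math
import random

# Ideal 16-QAM constellation, normalized to unit average power.
LEVELS = (-3, -1, 1, 3)
IDEAL = [complex(i, q) / math.sqrt(10) for i in LEVELS for q in LEVELS]


def toy_channel(symbols, phase_per_power=0.1, sigma=0.08, seed=1):
    """Toy nonlinear channel (an assumption, not a fiber model): rotate each
    symbol by a phase proportional to its power, then add Gaussian noise."""
    rng = random.Random(seed)
    received = []
    for s in symbols:
        rotated = s * cmath.exp(1j * phase_per_power * abs(s) ** 2)
        received.append(rotated + complex(rng.gauss(0, sigma), rng.gauss(0, sigma)))
    return received


def nearest(point, centroids):
    """Index of the centroid closest to the received point."""
    return min(range(len(centroids)), key=lambda k: abs(point - centroids[k]))


def kmeans_centroids(received, init, iterations=5):
    """Plain k-means, initialized at the ideal constellation points."""
    centroids = list(init)
    for _ in range(iterations):
        sums = [0j] * len(centroids)
        counts = [0] * len(centroids)
        for r in received:
            k = nearest(r, centroids)
            sums[k] += r
            counts[k] += 1
        # Keep the old centroid if a cluster received no points.
        centroids = [sums[k] / counts[k] if counts[k] else centroids[k]
                     for k in range(len(centroids))]
    return centroids


def symbol_error_rate(received, labels, centroids):
    errors = sum(nearest(r, centroids) != lab for r, lab in zip(received, labels))
    return errors / len(labels)


rng = random.Random(0)
labels = [rng.randrange(16) for _ in range(4000)]
received = toy_channel([IDEAL[k] for k in labels])

ser_linear = symbol_error_rate(received, labels, IDEAL)
ser_kmeans = symbol_error_rate(received, labels, kmeans_centroids(received, IDEAL))
print(f"SER, fixed boundaries: {ser_linear:.4f}; SER, k-means: {ser_kmeans:.4f}")
```

Because the outer constellation points are rotated more than the inner ones, the learned centroids follow the distorted clusters while the fixed boundaries do not. This is the effect that the clustering approaches cited above exploit, although real fiber distortions are considerably more complex than this toy rotation.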
The above ML techniques in optical communications operate as a black box to obtain data-driven models with unparalleled performance. Therefore, some works have tried to contribute more insights into how the nonlinear fiber noise is mitigated by the ML techniques. Recently, a NN structure designed to mirror the DBP structure, called the learned DBP algorithm, has been proposed. 128 It is known that the conventional DBP algorithm is a cascade of linear filters $D^{-1}$ for CD compensation and nonlinear operations $N^{-1}$ for nonlinear phase derotation, as shown in Fig. 12(a). Each linear filter $D^{-1}$ is given by a frequency-domain transfer function that depends on $L_k$, the length of the kth span. The nonlinear operation $N^{-1}$ for the kth span is a phase derotation parameterized by $\xi_k$. In learned DBP, $D^{-1}$ is realized based on a time-domain finite impulse response filter and the filter coefficients are adjusted during training of the NN. Therefore, the interleaving linear and nonlinear processing in DBP can be regarded as the linear and nonlinear operations in the multilayer NN, as shown in Fig. 12(b), where the input is the received samples and the output is the estimated symbol sequence. In this case, the parameters $\xi_k$ and the filter coefficients of $D^{-1}$ can all be optimized via ML techniques. An experimental demonstration was also conducted to evaluate the effectiveness of the learned DBP algorithm in a DP five-channel WDM transmission system, considering other channel impairments in a coherent transmission system, including frequency offset and laser phase noise.
129 The experimental results show that 1-step-per-span and 2-steps-per-span learned DBP provide an additional gain of 0.25 and 0.45 dB over conventional 50-steps-per-span DBP and a total gain of 0.85 and 1 dB over linear equalization, respectively. It is also shown that learned DBP can give an insight into how and what the NN learns, which may guide people to analyze the interplay between CD, nonlinearity, and noise more closely. As for complexity, it is shown that the performance of learned DBP based on 1 step per span is better than that of conventional DBP with 50 steps per span. 130 Note that the performance improvements of learned DBP originate from optimizing the parameters in DBP, and this incurs no additional computational complexity.
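For orientation, one step of conventional DBP for the kth span can be sketched as follows. This is written in a common textbook convention and is not the notation of Ref. 128: the symbols $\beta_2$ (group-velocity dispersion), $\gamma$ (nonlinear coefficient), and $L_{\mathrm{eff},k}$ (effective length of the kth span) are assumptions introduced here, and the signs assume that forward propagation applies the same phases with a positive sign, which depends on the Fourier-transform convention in use.

```latex
\begin{align}
  D^{-1}:\quad & \tilde{X}(\omega) \;\mapsto\; \tilde{X}(\omega)\,
      \exp\!\left(-\,j\,\frac{\beta_2}{2}\,\omega^{2} L_k\right), \\
  N^{-1}:\quad & x(t) \;\mapsto\; x(t)\,
      \exp\!\left(-\,j\,\xi_k\,\gamma\,L_{\mathrm{eff},k}\,\lvert x(t)\rvert^{2}\right).
\end{align}
```

In this picture, the linear step is a fixed all-pass filter and the nonlinear step is a memoryless, power-dependent phase rotation, which is why the cascade maps naturally onto alternating linear layers and pointwise nonlinearities in a NN.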
In another method, perturbation terms are used to analyze the fiber nonlinear terms, which can be expressed 130 in terms of $P_0$, $H_m$, $V_m$, and $C_{m,n}$, which are the optical power, the sample sequences for the x and y polarizations, and the perturbation coefficients, respectively. In the conventional method, 131 the perturbation coefficients $C_{m,n}$ can be analytically computed, given the link parameters and signal pulse duration/shaping factors. Alternatively, the perturbation coefficients $C_{m,n}$ can be obtained via a two-layer NN, which can describe the model with higher accuracy by taking into account higher-order nonlinearities. In a single-channel 32 GBd DP-16QAM transmission system, ∼0.6 dB Q-factor improvement is observed after 2800-km SSMF transmission when the transmitted symbols are pre-distorted based on the perturbation coefficients estimated via the NN. ML-based compensation for multicarrier modulation formats has also been investigated. For the orthogonal frequency-division multiplexing (OFDM) format, an ANN was proposed, which provides 2 dB Q-factor improvement for the 40 Gb/s 16-QAM signal after a 2000 km fiber link. 132 This improvement increased to 4 dB at the data rate of 70 Gb/s. A multiple-input and multiple-output DNN-based nonlinear dispersion compensator was also demonstrated for the 40 Gb/s coherent OFDM system, achieving significant power margin improvement over both a conventional linear equalizer and a single-input single-output DNN. 133 Considering the same experimental setup, support vector regression shows 1 dB Q-factor improvement over the full-field DBP method for 40 Gb/s 16-QAM OFDM over 2000 km SSMF transmission. 134 In a further work, a Newton-based SVM method that requires significantly less computational load than a conventional SVM was proposed to extend the launched optical power by 2 dB compared to the Volterra-based nonlinear equalizer.
135 Finally, we consider the issue of flexibility in nonlinear channel equalizers. A general question concerning flexibility is whether we need to repeat the training process when the channel conditions (modulation format, launch power, transmission distance, etc.) are changed. In order to solve this issue, transfer learning has been proposed recently to reuse some parameters from the NN model trained for the previous system to build a new NN model that fits the modified system with a smaller amount of training resources. 136 The simulation results indicate that the number of epochs or size of the training dataset can be reduced by up to 99% when transfer learning is used. Therefore, a fast re-configurable nonlinear equalizer is possible for the practical implementation of optical networks.
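The mechanism behind this saving can be sketched with a deliberately simple model. In the example below, a two-parameter linear model trained by gradient descent stands in for the NN equalizer, and the "previous" and "modified" systems are hypothetical linear maps with slightly different coefficients; none of these values come from Ref. 136. The point is only that starting from previously trained parameters can require fewer epochs than starting from scratch.

```python
def train(xs, ys, a, b, lr=0.5, tol=1e-6, max_epochs=10_000):
    """Full-batch gradient descent on y = a*x + b with mean-squared-error loss.
    Returns the fitted (a, b) and the number of epochs needed to reach `tol`."""
    n = len(xs)
    for epoch in range(max_epochs):
        residuals = [a * x + b - y for x, y in zip(xs, ys)]
        loss = sum(r * r for r in residuals) / n
        if loss < tol:
            return a, b, epoch
        grad_a = 2 * sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = 2 * sum(residuals) / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b, max_epochs


xs = [i / 10 for i in range(-10, 11)]

# "Previous system": hypothetical source task used for pre-training.
source_ys = [2.0 * x + 1.0 for x in xs]
a_src, b_src, _ = train(xs, source_ys, a=0.0, b=0.0)

# "Modified system": hypothetical target task with slightly shifted parameters.
target_ys = [2.2 * x + 0.9 for x in xs]
_, _, cold_epochs = train(xs, target_ys, a=0.0, b=0.0)      # train from scratch
_, _, warm_epochs = train(xs, target_ys, a=a_src, b=b_src)  # reuse source weights

print(f"epochs from scratch: {cold_epochs}, epochs with reused weights: {warm_epochs}")
```

The saving in this toy case is modest because the model is tiny and trains quickly either way. The much larger reductions reported in the literature concern deep models, where most of the learned structure carries over between channel conditions.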

IV. NETWORK LAYER APPLICATIONS
In this section, we describe crucial research domains within the network layer and highlight selected ML approaches to tackling the problems in these domains from the literature. Namely, these domains are network traffic prediction and generation and core network parameter optimization. As detailed below, we find that supervised learning approaches have been successfully deployed in the former domain, whereas RL approaches have shown great potential in the latter. Table II summarizes these applications, highlighting the advantages of the particular ML methods employed.

A. Network traffic prediction and generation
The optical network operates on a time scale that can be divided into time steps or iterations. In particular, in each time step/iteration, a number of demands arrive at the network, some of which are established. Every demand can be described by the time step in which it appears, a source node that represents the demand initial node, a destination node that represents the demand final node, the demand volume, and the holding time. 137 In a real-time flexible networking scenario such as elastic optical networks (EONs), where the network can adapt to accommodate the incoming traffic, 140 ML techniques coupled with dynamic routing algorithms can improve the overall network performance significantly. 141 One of the key challenges in increasing the efficiency of network operation is to predict the bandwidth requirement in the next time step based on the measurement of traffic characteristics in real time. When using ML methods, the goal is to forecast future traffic rate variations as precisely as possible based on the measured history.
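The demand description above, and the step from a demand list to a supervised forecasting dataset, can be sketched as follows. The class and function names, the Gb/s unit, and the example demands are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Demand:
    time_step: int      # iteration in which the demand appears
    source: str         # demand initial node
    destination: str    # demand final node
    volume_gbps: float  # demand volume (the unit is an assumption)
    holding_time: int   # number of time steps the demand stays active


def traffic_per_step(demands: Sequence[Demand], n_steps: int) -> List[float]:
    """Total active volume in each time step."""
    load = [0.0] * n_steps
    for d in demands:
        for t in range(d.time_step, min(d.time_step + d.holding_time, n_steps)):
            load[t] += d.volume_gbps
    return load


def sliding_windows(series: Sequence[float],
                    history: int) -> List[Tuple[List[float], float]]:
    """Pairs (last `history` values, next value) for one-step-ahead forecasting."""
    return [(list(series[i:i + history]), series[i + history])
            for i in range(len(series) - history)]


demands = [
    Demand(0, "A", "B", 100.0, 3),
    Demand(1, "B", "C", 50.0, 2),
    Demand(3, "A", "C", 200.0, 1),
]
load = traffic_per_step(demands, n_steps=5)
samples = sliding_windows(load, history=2)
print(load)     # aggregate traffic per time step
print(samples)  # (measured history, value to predict) pairs
```

Each `(history, target)` pair is one training example for a predictor of the next-step traffic. A real dataset would typically keep per-node-pair traffic matrices rather than a single aggregate series.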
NN-based approaches are the most commonly used ML technique in the traffic prediction literature, [141][142][143][144][145] with early research utilizing standard ANNs. 141 Moreover, others employed NNs with an improved optimizer, such as Zhan et al., 146 who utilized a NN model optimized by the adaptive artificial fish swarm algorithm to predict tidal traffic.
Variations of NN approaches appearing in the state of the art of traffic prediction include RNNs, such as Gated Recurrent Units (GRU) and LSTM, owing to their capability of adaptively capturing dependencies on different time scales (see Sec. II). GRU has been studied for predicting traffic matrices in a fixed-grid WDM network 142 and in a backbone EON. 143 LSTM has been studied for traffic prediction in passive optical networks 144 and in core networks. 137 Figure 13 describes an example of a traffic prediction model based on GRU.
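To make the gating mechanism concrete, a minimal sketch of a single GRU cell with scalar input and scalar hidden state is given below. It uses one common convention for the state update, h_t = (1 - z_t) h_{t-1} + z_t h̃_t (some references swap the roles of z_t and 1 - z_t). The weight values and the input series are hypothetical, and this is not the model of Fig. 13.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x, h, p):
    """One GRU update for scalar input x and scalar hidden state h."""
    z = sigmoid(p["wz"] * x + p["uz"] * h + p["bz"])                 # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h + p["br"])                 # reset gate
    h_cand = math.tanh(p["wh"] * x + p["uh"] * (r * h) + p["bh"])    # candidate state
    return (1.0 - z) * h + z * h_cand


# Hypothetical, untrained weights chosen only to exercise the cell.
params = {"wz": 0.5, "uz": -0.3, "bz": 0.0,
          "wr": 0.8, "ur": 0.2, "br": 0.0,
          "wh": 1.0, "uh": 0.7, "bh": 0.0}

traffic = [0.2, 0.4, 0.9, 0.3, 0.1]  # normalized traffic history (made up)
h = 0.0
states = []
for x in traffic:
    h = gru_step(x, h, params)
    states.append(h)
print(states)

# A strongly negative update-gate bias keeps z close to 0, so the state is
# carried over almost unchanged: the cell then remembers on long time scales.
slow = dict(params, bz=-50.0)
h_kept = gru_step(1.0, 0.3, slow)
```

The update gate therefore sets, per time step, how much of the past state is kept, which is the mechanism behind capturing dependencies on different time scales. In a trained predictor, the final state would be passed to an output layer to produce the forecast.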
Another recent type of NN studied in the traffic prediction literature is GNNs. In the context of network topology based traffic data, the ability of GNNs to leverage a graphical representation to learn inter-node dependencies of the network graph shows strong potential for applicability in this domain. Gui et al. 143 studied the pair-wise spatial correlations between optical network nodes using a graph. The nodes of this graph represent switch traffic, and the weights of edges denote connections among network nodes. A GCN was then employed to leverage these spatial correlations. Vinchoff et al. 145 employed GCNs and GANs for prediction of traffic bursts in the optical network. Three types of burst events were modeled, namely, plateau, single-burst, and double-burst, representing steady traffic, a rapid traffic spike followed by a steady decrease, and a rapid traffic spike followed by an unexpected greater traffic spike, respectively. Another ML approach that has been successfully applied to traffic prediction is GPs. The ability of GPs to capture temporal aspects of traffic flows allows both the short-term and long-term prediction of input traffic. Studies have shown agile management of resources in a core optical network using GP-based traffic prediction. 139,147
Recent comparative studies 137,138,143 on traffic prediction highlight the relative strengths of different ML methods used in the state of the art. Szostak and Walkowiak 137 compared the efficacy of different ML methods, including FFNN, SVM, DT, random forest (RF), and linear discriminant analysis (LDA), for the problem of predicting source and destination for demands in a dynamic optical network setting. Furthermore, this was extended by including the prediction of traffic volume and holding time. 138 They observed that the best classifier for such tasks was LDA. 137 Additionally, Gui et al. 143 benchmarked their GCN-GRU based traffic prediction against several approaches, including LSTM, CNN, and GRU, and the results suggested that GCN-GRU has a greater prediction quality than these other approaches.
As introduced in Sec. II, GANs are designed for realistic data generation and thus show potential in simulated traffic generation for optical networks. In a GAN-based traffic data generation scenario, 148 the objective of the generator is to transform random noise into generated traffic data whose characteristics are close to those of real world traffic data. In contrast, the discriminator attempts to correctly determine whether the data are from the actual traffic dataset or the generated traffic dataset. Through this competition, the discriminator and the generator improve each other, and the generated traffic data become increasingly similar to the actual real world traffic data.

B. Core network parameter optimization
In this section, we discuss core optical network parameter optimization formulated in the framework of RL. Core optical networks play the most substantial role in the national and international communication infrastructure. They typically consist of flexible devices, such as re-configurable optical add/drop multiplexers (ROADMs) and bandwidth variable transponders (BVTs). ROADMs are commonly used to transmit optical signals between different nodes, whereas BVTs are used to adapt a large set of core optical network parameters, such as signal modulation format, coding scheme, forward error correction overhead, and symbol rate, based on the current optical link requirements. Adapting the core optical network parameters is especially vital when attempting to maximize the ultimate network information throughput. However, this procedure requires optimization over a large parameter space.
In addition, finding much more efficient use of core optical network spectral resources is essential to cope with ever-growing bandwidth demand.
Conventionally, in the case of fixed-grid WDM optical networks with a static traffic request assumption, the network parameter adjustment can be realized by adapting the launch power per channel and the signal modulation format with regard to stochastic system impairments in the physical layer. 149 The typical physical layer impairments occurring between the nodes of a core optical network are the amplified spontaneous emission noise arising from the optical amplifiers and the nonlinear interference noise-like distortions induced by the four-wave mixing process in Kerr-type nonlinear media, i.e., in the optical fiber. In essence, the exact behavior of optical data signals between two nodes can be obtained numerically by solving the NLSE/Manakov equation via the SSFTM, when the step-size tends toward zero. However, the numerical solution is a comparatively time-consuming process, especially for wideband transmission systems. Currently, the most widely used physical layer impairment models are the family of the so-called Gaussian noise (GN) models, which commonly rely on first-order perturbation theory. 66 Moreover, under fairly reasonable assumptions, these models admit analytical closed-form approximations that significantly speed up the evaluation of the physical layer impairments. The resource allocation problem in the case of a single flexible-grid fiber link via the GN model closed-form approximation was considered in Ref. 150. Here, it is also worth mentioning that the possibility of quickly performing physical layer impairment estimations is essential regardless of the type of optimization framework.
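As a small worked example of why closed-form models are convenient for launch power adaptation, consider the widely used summary form in which the nonlinear interference power grows with the cube of the channel power, SNR(P) = P / (P_ASE + η P³). Setting the derivative to zero gives the optimum launch power P_opt = (P_ASE / (2η))^(1/3), at which the nonlinear interference power equals half the ASE power. The sketch below checks this against a brute-force search; the values of `p_ase` and `eta` are hypothetical and not taken from any cited work.

```python
def snr(p_ch, p_ase, eta):
    """SNR for channel power p_ch [W], ASE noise power p_ase [W], and
    nonlinear interference coefficient eta [1/W^2] (cubic NLI scaling)."""
    return p_ch / (p_ase + eta * p_ch ** 3)


def optimum_launch_power(p_ase, eta):
    """Closed-form maximizer of snr(): d(SNR)/dP = 0 gives P^3 = p_ase / (2 eta)."""
    return (p_ase / (2.0 * eta)) ** (1.0 / 3.0)


# Hypothetical link values, chosen so that the optimum is 1 mW.
p_ase = 2e-6   # W
eta = 1e3      # 1/W^2

p_opt = optimum_launch_power(p_ase, eta)
snr_opt = snr(p_opt, p_ase, eta)

# Brute-force check on a logarithmic power grid from 0.1 mW upward.
grid = [1e-4 * 1.05 ** k for k in range(100)]
best_on_grid = max(snr(p, p_ase, eta) for p in grid)

print(f"optimum launch power: {p_opt * 1e3:.3f} mW")
```

The closed form replaces a search over launch powers with a single evaluation. This matters when the calculation has to be repeated for many light paths inside an optimization loop.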
RL has recently appeared as an alternative to conventional approaches, such as integer linear programming (ILP), 151,152 and heuristics, such as simulated annealing, k-shortest path routing and first-fit, 153 and the genetic algorithm (GA). 154 Generally speaking, RL is capable of efficiently overcoming a wide class of complex optimization problems. 155 However, in the context of core optical networks, RL cannot be applied directly, as it must be generalized to learn over arbitrary network topologies with dynamically changing scenarios, such as network topology, traffic, routing, and link failures. Over the last few years, some initial works have suggested deep RL for solving various resource allocation and dynamic routing problems in core optical networks, [156][157][158][159] in which the advantages of using RL-enabled methods over traditional heuristic optimization algorithms were emphasized.
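For reference, the kind of heuristic baseline against which RL methods are compared can be sketched in a few lines. The example below combines k-shortest path routing (by hop count, found by exhaustive search on a toy topology) with first-fit wavelength assignment under the wavelength continuity constraint. The four-node ring, the number of wavelengths, and all function names are hypothetical.

```python
def k_shortest_paths(adj, src, dst, k):
    """Up to k loop-free paths with the fewest hops, found by exhaustive
    depth-first search (fine for a toy topology; not meant to scale)."""
    paths = []

    def dfs(node, path):
        if node == dst:
            paths.append(list(path))
            return
        for nxt in adj[node]:
            if nxt not in path:
                path.append(nxt)
                dfs(nxt, path)
                path.pop()

    dfs(src, [src])
    return sorted(paths, key=len)[:k]


def links_of(path):
    """Undirected links traversed by a path."""
    return [tuple(sorted(pair)) for pair in zip(path, path[1:])]


def first_fit(adj, used, src, dst, n_wavelengths, k=3):
    """Try the k shortest paths in order; on each, take the lowest-indexed
    wavelength that is free on every link of the path."""
    for path in k_shortest_paths(adj, src, dst, k):
        links = links_of(path)
        for w in range(n_wavelengths):
            if all(w not in used.setdefault(link, set()) for link in links):
                for link in links:
                    used[link].add(w)
                return path, w
    return None  # the demand is blocked


# Toy four-node ring A-B-C-D-A with two wavelengths per link.
adj = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["A", "C"]}
used = {}
results = [first_fit(adj, used, "A", "C", n_wavelengths=2) for _ in range(5)]
for res in results:
    print(res)
```

The heuristic fills the first path wavelength by wavelength before moving to the second path, and it blocks the fifth request once both paths are full. Such fixed rules are fast but cannot adapt to traffic patterns, which is the gap that learned policies aim to close.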
Yet, more interesting examples of using an RL framework for maximizing the point-to-point link capacity by means of adjusting controllable parameters in core optical networks have been recently reported in Refs. 160 and 161, where the heuristic GA based results were used as a performance benchmark. The predicted performance of these two approaches remains very similar. However, after an initial training phase, the computation time of BVT parameter optimization to maximize the overall network throughput based on the RL approach is up to 1 second on average, while traditional heuristic algorithms may take on the order of minutes to hours. Additionally, preliminary investigations into network routing and parameter optimization show promising potential in leveraging the ability of GNNs to learn and model graph-structured information. 162,163 Such models are able to generalize over arbitrary network topologies, routing schemes, and traffic intensity.

V. FUTURE DIRECTIONS
A number of ML-driven future research directions are emerging within optical networks across both the physical layer and the network layer. In this section, we outline selected future directions within the physical layer, the network layer, and those spanning both layers.

A. Physical layer
An emerging theme within applied ML that is interesting in the context of optical networks is explainable ML, 32 a subset of explainable artificial intelligence 164 that aims to make the processes by which ML algorithms make decisions more understandable to humans. Optical networks are operated with high availability, meaning that light paths must stay within the accepted QoT ranges often for at least 99.999% of the time, which translates to just over 5 minutes of downtime per year. 165 This is enforced by service level agreements, which mean that operators must deliver the quality of service that customers have paid for. As a result, ML approaches deployed on optical networks must meet the stringent reliability requirements that are already satisfied by conventional techniques. Thus, understanding how ML algorithms work is crucial for adoption. Both post-hoc explainability methods and inherently explainable ML approaches have potential to yield substantial benefits for ML applied within optical communication networks. There are now open-source libraries that provide implementations of post-hoc techniques, 166 making their application convenient. It may be better to have a more easily interpretable model with slightly worse performance in some situations if operators can understand how it makes decisions and therefore can be more confident in its reliability. Additionally, probabilistic ML methods such as GPs provide well-quantified predictive uncertainties that can aid with the interpretation of ML model predictions, which would be greatly beneficial for many applications of ML within optical networks.
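The availability figure quoted above follows from simple arithmetic, sketched here for a non-leap year. The function name is hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_percent):
    """Allowed downtime per year for a given availability target."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


for target in (99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% availability allows {minutes:.2f} minutes of downtime per year")

five_nines = downtime_minutes_per_year(99.999)
```

At 99.999% availability, the budget is about 5.26 minutes per year, so any ML-driven control action that risks even a brief outage consumes a large share of it.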
Another interesting avenue for future research is the combination of physical models with ML, so as to embed our knowledge of system physics into models such as NNs and GPs, as discussed in Sec. III A. For example, physics-informed ML approaches to QoT estimation can allow us to train models with fewer measurements of the system and enhance model explainability. Additionally, we can use our knowledge of the physics to design more effective model architectures. For example, NNs can be designed using the DBP structure for nonlinear noise mitigation. Certainly, the concept of utilizing both the information available before we have seen the data and the data itself, rather than discarding the former and relying solely on the data, presents an interesting research direction.
A further promising future research direction is digital twins. Having been shown to be effective in other research areas, such as healthcare technology, manufacturing, and smart cities, 88 digital twins raise many open research questions when developed for optical networks. The realization of true digital twins for optical networks, meaning a high-fidelity virtual copy of a deployed network, will require the amalgamation of models of all aspects of optical networks discussed in this Tutorial. It will also require access to high-quality datasets that are representative of deployed networks, as described above. Additionally, there is an important question regarding how fast digital twins will operate and whether a truly real-time digital twin is realizable. This depends on two factors: how dynamic installed networks become in the future and how operator confidence in ML approaches evolves over time. As networks become increasingly dynamic, meaning that light paths are established and torn down with greater frequency, the time required to accurately measure the network may begin to form a bottleneck for how fast a digital twin can respond to a change in the network. Moreover, the time taken to retrain models may also limit this responsivity, meaning that online and transfer learning will likely be needed to ensure that ML models remain accurate as the network changes and to support rapid modeling of new light paths. Operator confidence in ML is also crucial, as a true digital twin framework requires automatic control of the network based on data. As a result, explainability techniques are important for the development of digital twins, as they will increase confidence in the ML models upon which the digital twins are built.
Furthermore, work is required to reduce the complexity of ML algorithms, in order for them to be successfully deployed with a reasonable use of computational resources. For example, in short reach equalizer applications, lower complexity ML is desirable due to the requirements for real-time equalization at high symbol rates. In general, ML techniques will need to have sufficiently low complexity in order to adapt to increasingly dynamic networks. One solution to this may be online learning, where ML models can be trained offline before deployment and adapt to monitoring data once deployed without completely re-training the model. An additional related challenge is the flexibility of ML algorithms: to what extent can the deployed models generalize to cover different network scenarios? One potential solution to this issue is transfer learning, which has been proposed as a method for increasing the flexibility of NNs for fiber nonlinear noise mitigation by re-using some of the initial trained network weights to adapt to a new situation.
An additional future direction is provided by hardware-driven ML approaches to equalization and nonlinearity compensation problems in optical networks. Due to the challenging requirements to operate at real-time data rates, the use of specialist hardware such as FPGAs is crucial for these applications. Low complexity implementations of ML architectures, such as DNN and RNN equalizers 104,110 and real-time nonlinearity compensation 126 discussed in Sec. III, present an interesting future direction for performing such signal processing tasks in next generation optical networks.

B. Network layer
As in the physical layer, explainable ML is a promising field of research within network layer applications. Similarly, reduction in ML algorithm complexity is also an interesting future direction for network layer applications, particularly for any methods that are required to work in an online scenario.
Obtaining sufficiently detailed datasets from deployed networks remains a significant challenge for ML research in optical networks. Such data may often be difficult to find, as network operators may not be able to grant researchers access to detailed network data without a non-disclosure agreement, due to competition-related concerns. In the cases where such data are provided, 167,168 they could still be insufficiently detailed to be of use. As discussed in Sec. IV, GANs show potential to address this issue to some extent with their ability to generate larger datasets from a small amount of input data. To this end, GANs have been successful in generating data that are indistinguishable from real world input data in optical network traffic generation applications 148 and numerous other applications in the computer vision domain. 169
An additional promising research direction in network routing and parameter optimization is leveraging the ability of GNNs to learn and model graph-structured information to create models that are able to generalize over arbitrary network topologies, routing schemes, and traffic intensity. 162,163 Furthermore, the preliminary works applying RL techniques in dynamic parameter optimization have shown strong potential, with faster response time and similar quality of solutions compared to conventional optimization approaches. 156 To this end, it would be interesting to investigate the means of bringing the strengths of RL and GNNs together in a data-driven network routing and parameter optimization scenario.
Moreover, within traffic prediction and generation, future work includes extending the proposed methodologies to networks of different scales, such as core and access networks. Another potential direction is the introduction of novel methods that have been used successfully in time series forecasting problems in other domains, such as echo state networks, 170 and combining existing ML approaches to develop more effective hybrid methods. For example, hybrid models of GNNs and LSTMs could be investigated, as these harness both the knowledge of the network structure and the temporal aspects of the traffic, respectively. Finally, integrating traffic prediction and simulation modules with other modules in an SDN setting will aid in achieving high performance in increasingly dynamic and flexible networks.

VI. CONCLUSIONS
In this Tutorial, we have outlined the key research challenges in optical networks that exist today, the ML techniques that have been proposed to solve these problems, and interesting works from the literature that have applied ML. We have introduced the crucial concepts required to navigate ML literature and highlighted techniques that are commonly used in optical networks: various forms of NNs, Bayesian approaches such as GPs, classifiers such as SVMs, and RL techniques such as deep Q-learning. In the physical layer, we have surveyed the literature applying ML to QoT estimation, digital twins, equalization for short reach networks, and nonlinear noise mitigation for long-haul systems. In the network layer, we have presented exemplary work tackling network traffic prediction and generation and the optimization of core network parameters. Thus, there has been significant progress in ML applied to optical networks, with a vast range of methods utilized, each yielding benefits over previous approaches. There remain a number of interesting avenues for future research as discussed above, which will be crucial in delivering the next generation of optical networks and meeting the service requirements of the future.

NOMENCLATURE
Figure 5 describes an example GNN model for node-based prediction tasks.
FIG. 12. (a) Classical DBP structure with interleaving operations of CD compensation and nonlinear-phase derotation. (b) DBP structure as an ANN with interleaving linear and nonlinear operations.

FIG. 3. Diagram outlining the structure of an autoencoder model, adapted from the work of Li et al.

TABLE I. ML approaches to physical layer applications.

TABLE II. ML approaches to network layer applications.