Machine learning approaches for the prediction of materials properties

We give here a brief overview of the use of machine learning (ML) in our field, for chemists and materials scientists with no experience with these techniques. We illustrate the workflow of ML for computational studies of materials, with a specific interest in the prediction of materials properties. We present concisely the fundamental ideas of ML, and for each stage of the workflow, we give examples of the possibilities and questions to be considered in implementing ML-based modeling.


I. INTRODUCTION
The pace of systematic materials discovery has quickened in the last decade. The number of studies systematically exploring various families of materials, with the goal of discovering existing materials with unsuspected properties, or designing novel materials with targeted properties, is growing at an astounding rate. Databases of experimental structures (in particular, crystalline structures) continue to grow at a steady pace and are complemented with larger and larger databases of physical and chemical properties. High-throughput experiments and combinatorial materials synthesis are aided by robotics and artificial intelligence, performing reactions and analysis faster. On the computational side, molecular simulations have expanded in scale, allowing scientists to predict the structure and properties of complex materials even before they are synthesized. The prediction of compound properties with high accuracy can be coupled with high-throughput screening techniques to help search for new materials. Yet, despite the advances in computational power, computational methods, whether at the quantum or classical level, are still relatively time consuming and can hardly explore the properties of all possible chemical compositions and crystal structures. In order to reach this goal of systematic exploration of chemical space and to help leverage the large-scale databases of structures and properties that are nowadays available, computational chemistry and materials science are turning more and more often to machine learning (ML), a subset of artificial intelligence (AI) that has seen tremendous developments in recent years and widespread application across all fields of research.
The main idea of artificial intelligence emerged in the 1950s when Turing wondered if a machine could "think." 1 The term "artificial intelligence" (AI) was first coined by John McCarthy in 1955 and is defined as the set of theories and techniques implemented in order to create machines capable of simulating intelligence. In other words, AI is the endeavor to replicate human intelligence in computers. In 1959, Samuel produced computer programs that played checkers (draughts) better than the average human and that could learn to improve from past games. 2 Since then, AI and data-intensive algorithms have seen such an important development that they are sometimes called the "fourth paradigm of science" 3 or the "fourth industrial revolution." AI is now routinely used in different fields: face recognition, image classification, information engineering, linguistics, psychology, and medicine, and it has had an impact on the fields of philosophy and ethics.
AI-powered machines are usually classified under two broad categories: general and narrow. Artificial general intelligence (AGI) is a machine that can learn to solve any problem that the human intellect can solve. Also referred to as "strong AI" or "full AI," it is currently hypothetical, the kind of artificial intelligence that we see in science fiction movies. The creation of AGI is an important goal for some AI researchers, but is an extremely difficult quest and generally considered too complex to be achieved in the near future. 4 In contrast, narrow AI (or "weak AI") is a kind of artificial intelligence focused on performing specific tasks, defined in advance. Narrow AI accounts for a very large number of the successful realizations of artificial intelligence to date, sometimes in applications where the machine seems intelligent (in the human way) and sometimes hidden under the hood. Many of these successes of narrow AI have been made possible by advances in machine learning (ML), in general, and in deep learning (DL), more specifically, over the past few years.

FIG. 1. Simplified overview of a machine learning workflow. The machine learning model is trained on input data gathered from multiple databases. Once it is trained, it can be applied to make predictions for other input data.
Machine learning aims at developing algorithms that can learn and create statistical models for data analysis and prediction. The ML algorithms should be able to learn by themselves, based on the data provided, and make accurate predictions, without having been specifically programmed for a given task. Beyond theoretical developments, recent years have seen rapid advances in the application of machine learning, not only by computer scientists and experts in the development of AI algorithms but also by other researchers in different fields who adopt these techniques for their own purposes. Among many other fields of research, chemical and materials sciences have been impacted by the application of machine learning to accelerate certain computational tasks or to solve problems for which traditional modeling methods were ill-suited. Deep learning, a subset of machine learning based on artificial neural networks (ANNs), promises to push the advances of AI even further.
In this paper, we set out to illustrate the workflow of machine learning in the computational materials context (schematized in Fig. 1) and give examples at each stage of the possibilities and questions to be considered in implementing ML-based modeling. In this very active and rapidly expanding field of research, we will try to highlight some, but not all, of the machine learning techniques that have been successfully applied in real applications for computational chemistry of materials, either as they are representative of what is done in the field or because they represent recent and exciting developments. We will also discuss the specific contributions made to our field by deep learning studies, although they are currently more limited. The goal of this paper is to provide a brief overview to chemists and materials scientists, but we do not try to be exhaustive in our discussion of the state of the art. For a full review on machine learning for molecular and materials science, we refer the reader to the excellent introductory yet thorough review of Butler et al. 5

A. Gathering data
As stated in the Introduction, machine learning algorithms are trained on existing datasets to learn and improve. In order to create accurate models, the size and quality of the datasets used for training play a crucial role. This identification, gathering, or creating (in some cases) of the training dataset is the first step of the machine learning workflow and will, of course, heavily depend on the goal of the model you want to train. For generic purposes, in order to "learn by doing" the various steps in implementing a ML workflow, one can find free datasets through platforms such as Kaggle (https://www.kaggle.com), the UC Irvine Machine Learning Repository (https://archive.ics.uci.edu/ml), or on governmental or agency websites that gather and promote open data (https://www.data.gouv.fr in France, https://www.data.gov/ in the USA).
When it comes to materials sciences, there are a number of available datasets that have been published and validated in the scientific literature. Many of them are open and publicly available to any potential user, but others have stricter licensing terms or require a paid subscription for access to some or all features. In practical terms, they mainly differ in the data they contain, with two main categories: databases that focus (exclusively or predominantly) on structural information and databases that focus on physical or chemical properties of materials. Table I presents a short list of selected databases in materials sciences.
The first category, i.e., databases of structures of materials, is probably the most well-known and historically the most
As should be apparent when reading the above list, these databases share an important bias in how they cover the field of materials science: they are all limited to crystallographic structures. This is obviously linked to the nature of the determination of the structure and its representation, but it is only one example, among many others, of how databases exhibit, by the very choice of their scope and of the representations used, a biased picture of the wide field of materials sciences. It is, therefore, necessary to be aware of the biases in the datasets one is using, both implicit and explicit.
Beyond structural databases, the past few years have seen the development of another category, with a rapid growth in the number of structure-property databases available (often, again, with a specific focus on a particular class of materials or specific properties). The existence of such databases with a large amount of data, most of which are open access and collaborative, presents a significant opportunity to train and to validate new machine learning models. We list some here, whose characteristics are summarized in Table I; the choice made is not to try and be exhaustive (which would necessarily fail, given the large and ever-growing number of existing databases) but to highlight those that appear commonly used, are easy to access, and are well documented for newcomers to the materials discovery field. Among the largest databases, we can cite the Materials Project 9 (https://materialsproject.org/) for inorganic materials, the AFLOWLIB 10 "Automatic Flow for Materials Discovery" (http://aflowlib.org/), and the Open Quantum Materials Database 11 (http://oqmd.org). In addition to these generic sources, there are also specific databases of computed properties for specific classes of materials, such as the Harvard Clean Energy Project 12 (previously at https://cepdb.molecularspace.org, currently being migrated) for organic solar materials, TE Design Lab 13 (http://tedesignlab.org) for thermoelectric materials, and NREL Materials Database 14 (https://materials.nrel.gov) for renewable energy materials. Finally, other online portals allow the sharing and exchange of computational data on materials from different origins, resulting in a more heterogeneous dataset, such as the Materials Cloud 15 (https://www.materialscloud.org/).
Most of these databases are accessible through both a web front-end, for simple exploration and visualization purposes, and an Application Programming Interface (API). An API is a web interface with well-documented behavior, whose queries and results are machine-readable in an agreed-upon format, making them well suited for automated exploitation. These APIs are typically accompanied by a software layer to facilitate integration into projects, such as the Python Materials Genomics (pymatgen) 16 Python package that integrates with the Materials Project RESTful API, or the Automated Interactive Infrastructure and Database (AiiDA). 17 Finally, we would be remiss if we did not note that it is also possible to generate data to form a machine learning training set "on the fly," by performing high-throughput calculations on materials of interest. When this is done, it is then expected that the dataset produced is published alongside the work as supplementary material or submitted to an online data repository.

B. Cleaning and preprocessing data
We have described above how to use existing datasets of materials structures and properties (or generate one's own), yet these data cannot be used directly in the original format and loaded "as is" in a machine learning workflow. When it comes to large datasets, four points are considered crucial in big data workflows, called the "four V's," and they are also relevant in machine learning methodologies: (i) volume, the amount of data available; (ii) variety, the heterogeneity of the data in both form and meaning; (iii) veracity, the knowledge of the uncertainties associated with each data point; and (iv) velocity, how fast the data are generated and have to be treated (not usually an issue in our workflows, which do not have to work in real time).
Therefore, data have to be homogenized and cleaned before they can be used. This means identifying possible erroneous, missing, or inconsistent data points (outliers), using criteria based on physical or chemical knowledge. This cleaning and homogenization of the data is a key step in order to build more accurate predictors through machine learning. The need for this may depend on the workflow followed: for example, some ML algorithms are more robust than others in the presence of outliers. Some algorithms (such as the Random Forest family) do not support null values in their input, while others can handle those.
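As a minimal sketch of such a cleaning step, the following Python snippet filters a small list of records on the basis of missing values and of a physical plausibility window. The records, the property (a bulk modulus in GPa), and the thresholds `K_MIN` and `K_MAX` are all hypothetical and chosen only for illustration.

```python
import math

# Hypothetical raw records: (material id, bulk modulus in GPa).
raw_records = [
    ("mat-1", 152.0),
    ("mat-2", None),          # missing value
    ("mat-3", -12.5),         # negative modulus: physically inconsistent
    ("mat-4", 98.3),
    ("mat-5", 7500.0),        # implausibly large value
    ("mat-6", float("nan")),  # not-a-number placeholder
]

# Assumed plausibility window in GPa (illustrative thresholds only).
K_MIN, K_MAX = 0.0, 1000.0


def is_valid(value, lower=K_MIN, upper=K_MAX):
    """Return True if the value is present, finite, and inside the window."""
    if value is None or math.isnan(value) or math.isinf(value):
        return False
    return lower < value <= upper


# Keep the valid records and remember which ones were rejected.
clean_records = [(name, k) for name, k in raw_records if is_valid(k)]
rejected = [name for name, k in raw_records if not is_valid(k)]
```

In a real workflow, the rejected entries would typically be inspected rather than silently discarded, since they may point to systematic problems in the source data.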
To give one example of the necessity of this curation of the training dataset, we recently performed a large-scale exploration of the elastic properties of inorganic crystalline materials available on the Materials Project database. By analyzing the elastic data present, we quantified that out of 13 621 crystals, only 11 764 structures were mechanically stable, while 1857 (around 14% of materials in the database) had elastic tensors that indicated mechanical instability (and were, therefore, unusable for further analysis). 18 Other materials had elastic moduli that were mathematically acceptable but unphysically large and those needed to be removed as well before using the dataset.
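The stability test itself is not detailed above. One commonly used criterion (the Born criterion) is that the 6 x 6 stiffness matrix in Voigt notation must be positive definite. The sketch below checks this through a Cholesky factorization, which succeeds only for symmetric positive-definite matrices. The helper names and the elastic constants of the two cubic examples are hypothetical.

```python
import math


def is_positive_definite(matrix, tol=1e-10):
    """Try a Cholesky factorization of a symmetric matrix.

    The factorization exists only if every pivot is strictly positive,
    which is equivalent to the matrix being positive definite.
    """
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - s
                if pivot <= tol:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return True


def cubic_stiffness(c11, c12, c44):
    """Build the 6x6 Voigt stiffness matrix of a cubic crystal (GPa)."""
    c = [[0.0] * 6 for _ in range(6)]
    for i in range(3):
        for j in range(3):
            c[i][j] = c11 if i == j else c12
    for i in range(3, 6):
        c[i][i] = c44
    return c


# Hypothetical elastic constants, for illustration only.
stable = cubic_stiffness(c11=250.0, c12=100.0, c44=80.0)
unstable = cubic_stiffness(c11=100.0, c12=150.0, c44=80.0)  # C11 < C12

stable_ok = is_positive_definite(stable)
unstable_ok = is_positive_definite(unstable)
```

For a cubic crystal, this general test reduces to the familiar conditions C11 - C12 > 0, C11 + 2 C12 > 0, and C44 > 0, which the second example violates.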

C. Representing data
Once data have been cleaned up and homogenized, the next step in the machine learning workflow is the encoding of these data into a set of specific variables, which will be manipulated directly by the ML algorithm. The data collected are often in a raw format and will need to be converted into a format suitable for the learning procedure, usually as a series of scalar or vector variables for each entry of the dataset. This step can include the transformation of existing data (such as physical properties) by rescaling, normalization, or binarization to bring it to such a state that the algorithm can easily parse it. Some basic preprocessing techniques are widely available in ML software, such as the MinMaxScaler or StandardScaler classes in scikit-learn. 19 In all cases, the effect of this preprocessing of the data needs to be studied carefully: algorithms can sometimes deliver better results without preprocessing, and with excessive preprocessing, it may not be possible to identify the crucial features that will give the best performance for the target variable.
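To make the two rescaling operations concrete, here is a minimal pure-Python sketch of min-max scaling and of standardization. It mirrors the transformations that MinMaxScaler and StandardScaler apply to a single feature, but it is not the scikit-learn implementation; the function names and the band gap values are hypothetical.

```python
import statistics


def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Linearly map the values onto the interval [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no information; map it to new_min.
        return [new_min for _ in values]
    span = hi - lo
    return [new_min + (v - lo) * (new_max - new_min) / span for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical band gaps in eV.
band_gaps = [0.0, 1.0, 2.0, 3.0, 4.0]
scaled = min_max_scale(band_gaps)
z_scores = standardize(band_gaps)
```

In a real workflow, the minimum, maximum, mean, and standard deviation should be computed on the training set only and then reused to transform the validation data, so that no information leaks from one to the other.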
When the input data consist of chemical structures, the choice of representation of the data is not always obvious: chemical compounds and materials are complex three-dimensional objects, whose direct representation as vectors of coordinates may not be efficient as input for the ML workflow. This question of the best representation of the data for the learning algorithm is called featurization or feature engineering. It is a very active area of research, in particular, when it comes to describing chemical structures. The two main goals of feature engineering are (i) preparing the input data in a form that the ML algorithms will work well with (and that can depend on the specific characteristics of the algorithm chosen); (ii) improving the performance of ML models, by using our knowledge of the materials and their important features (chemical intuition) in the building of the input data.
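As an illustration of featurization, the sketch below builds the Coulomb matrix of a molecule, one of the simplest structural representations used as ML input. It assumes the usual definition, with diagonal elements 0.5 Z_i^2.4 and off-diagonal elements Z_i Z_j / |R_i - R_j| in atomic units; the function name and the H2 geometry (bond length of 1.4 bohr) are chosen for illustration.

```python
import math


def coulomb_matrix(charges, positions):
    """Return the Coulomb matrix as a list of lists.

    charges   : nuclear charges Z_i
    positions : Cartesian coordinates in bohr, one tuple per atom
    """
    n = len(charges)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                # Diagonal term: fit to the energies of isolated atoms.
                matrix[i][j] = 0.5 * charges[i] ** 2.4
            else:
                # Off-diagonal term: nuclear Coulomb repulsion.
                distance = math.dist(positions[i], positions[j])
                matrix[i][j] = charges[i] * charges[j] / distance
    return matrix


# Illustrative geometry: an H2 molecule with a 1.4 bohr bond length.
h2_charges = [1, 1]
h2_positions = [(0.0, 0.0, 0.0), (0.0, 0.0, 1.4)]
h2_matrix = coulomb_matrix(h2_charges, h2_positions)
```

The resulting matrix is invariant under translation and rotation of the molecule, but not under a reordering of the atoms, which is why it is often sorted or reduced to its eigenvalues before being passed to a learner.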
The chemical information about a given system can then be transformed into a series of so-called descriptors, which encode the chemical information, creating a new input that should describe the key features of the dataset in a way that allows the ML algorithm to train efficiently. Many different mathematical representations are used as descriptors for chemical and materials structures (and their properties). We will cite here examples for molecular structures such as Coulomb matrix, 20 SMILES, 21 bag of bonds, 22 molecular graphs, 23 and BAML (bonds, angles, machine learning; using bonds, angles, and higher order terms). 24 Among the representations used for crystals, we find representations such as translation vectors, fractional coordinates of the atoms, radial distribution functions, 25 Voronoi tessellations of the atomic positions, 26 and property-labeled materials fragments. 27

D. Machine learning models

Supervised learning
Let us now move to the fourth step of the machine learning workflow: the "learning" part itself, i.e., the training of the ML algorithm. Using the curated and pre-processed dataset as input, there are, then, three main categories of ML models: supervised, unsupervised, and semi-supervised (see Fig. 2). In supervised learning, the training data contain both the input (e.g., the structures) and the corresponding known output (e.g., the properties). Then, the goal of the ML algorithm is to learn, from these training data, the mapping function from the input to the output. The main goal of the trained algorithm in a supervised learning approach is to be able to make a prediction for new data, with an acceptable level of accuracy. It is useful to know that supervised learning problems can be broadly categorized into two main types: regression and classification. Both of them have as a goal the construction of a model that can predict values from the available variables. The only difference between the two is the type of the output variable: in a regression-type prediction problem, the variable that is to be predicted is continuous, taking real values: melting point, bandgap, elastic modulus, etc. Regression learning algorithms include linear regression, lasso linear regression, ridge regression, elastic net regression, Gaussian process regression, and classification and regression trees. The simplest of these algorithms is linear regression (LR), which, similar to the common linear regression in 2D graphs, tries to create the best linear model based on the descriptors given. If the model is complex, a lasso linear regression algorithm can be used instead: it is a modification of LR in which a penalty on the absolute values of the coefficients is added to the loss function in order to limit the complexity of the model.
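The difference between linear regression and its lasso variant can be sketched for a single descriptor as follows. The first function is the closed-form least-squares fit of a straight line; the second evaluates a lasso-type objective, i.e., the mean squared error plus `alpha` times the absolute value of the slope. The function names, the normalization of the objective, and the toy data are assumptions made for illustration; library implementations may scale the two terms differently.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def lasso_objective(xs, ys, slope, intercept, alpha):
    """Mean squared error plus an L1 penalty on the slope."""
    mse = sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    return mse + alpha * abs(slope)


# Hypothetical descriptor values and target property (exactly y = 2x + 1).
descriptor = [1.0, 2.0, 3.0, 4.0]
target = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(descriptor, target)
penalized = lasso_objective(descriptor, target, slope, intercept, alpha=0.1)
```

With many descriptors, minimizing such a penalized objective drives the coefficients of uninformative descriptors to exactly zero, which is how lasso reduces model complexity.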
On the other hand, for classification problems, the algorithm has to predict a categorical variable, i.e., attribute a label to the input data. In the simplest case, these categories are reduced to two in a binary variable, such as: is the material conducting or insulating? and is it porous or not? (or in another context: is this email a spam or not?). A large number of different classification algorithms can be used: logistic regression, linear discriminant analysis, k-nearest neighbor, naïve Bayes, classification and regression trees, support vector machine, and kernel ridge regression (KRR). To give an example, Ghiringhelli et al. used KRR with descriptors derived from the energy levels and radii of valence orbitals to predict crystalline arrangements between zinc blende and wurtzite-type crystal structures. 28

Unsupervised learning
Unsupervised learning works in a completely different way, trying to draw inferences about the input data without any corresponding output variables: it is used to look for previously undetected patterns in data with minimal human supervision or guidance. It can, for example, act on unlabeled data to discover the inherent groupings. The ML algorithm tries to identify trends that could be of interest to rationalize the dataset and present the available data in a novel way. One family of unsupervised learning algorithms is clustering, or cluster analysis: there, the ML workflow will split the data into several groups (or clusters) of records presenting similar features, with no prior assumption on the nature of these groups (unlike, for example, in supervised learning classification tasks). Classical methods used for clustering include Gaussian mixtures, k-means clustering, hierarchical clustering, and spectral clustering. Other types of unsupervised learning methods are association rule learning and principal component algorithms, which aim to establish relationships between multiple features in a large dataset, something that is difficult to do by hand. Popular algorithms for generating association rules include Apriori, Eclat, and FP-Growth (Frequent-Pattern).
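As an example of clustering, the following sketch implements the standard k-means iteration (Lloyd's algorithm): assign each point to the nearest centroid, then move each centroid to the mean of its cluster, and repeat until nothing changes. For reproducibility, the initial centroids are simply the first k points, which is a simplifying assumption; practical implementations use randomized initializations. The two-dimensional data points are hypothetical.

```python
import math


def k_means(points, k, n_iter=100):
    """Cluster points (tuples of floats) into k groups."""
    centroids = [tuple(p) for p in points[:k]]  # naive initialization
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for c, members in enumerate(clusters):
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    labels = [
        min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
    ]
    return centroids, labels


# Hypothetical two-descriptor data with two well-separated groups.
data = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (7.8, 8.2), (0.8, 1.2), (8.2, 7.8)]
centroids, labels = k_means(data, k=2)
```

Note that no output variable is used anywhere: the groups emerge only from the distances between the inputs.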
When it comes to applications in chemistry and materials science, supervised machine learning is much more common, but there are some examples of unsupervised learning, nonetheless; for a recent perspective on this specific topic, see Ref. 29. Saad et al. applied both supervised and unsupervised ML techniques to predict the structure and properties of crystals (such as the melting point) for binary compounds. 30 Supervised ML models have been trained to reproduce the LUMO and HOMO for organic solar cells 31 and to predict key thermodynamic parameters such as adsorption energy, 32 activation energy, 33 and active site 34 in catalytic processes. Isayev et al. developed predictive Quantitative Materials Structure-Property Relationship (QMSPR) models through machine learning, in order to predict the critical temperatures of known superconductors, 35 and Woo and co-workers established a QMSPR model to achieve high-throughput screening of metal-organic frameworks (MOFs) to capture CO2 by adsorption, relying on machine learning to explore the large dimensionality spanned by their exceptional structural tunability.
One disadvantage of supervised ML algorithms lies in the acquisition of labeled data, an expensive process, especially when dealing with large amounts of data. Unlabeled data, on the other hand, are relatively inexpensive and easy to collect and store, but applications of unsupervised ML in materials sciences have been relatively limited. An alternative exists in the form of a third type of ML model, named semi-supervised learning, which is halfway between supervised and unsupervised learning, as its name implies. In such a workflow, we have a large amount of input data and only a limited amount of corresponding output data. Semi-supervised ML aims at learning from the labeled part of the dataset, training an accurate model. First, the workflow will use an unsupervised learning algorithm to identify and cluster similar data. Then, it will use supervised learning techniques to train and make predictions for the rest of the unlabeled data. Semi-supervised learning algorithms make three assumptions about the data, which somewhat restrict their applicability: (i) continuity assumption: points that are close to each other are more likely to share a label; (ii) cluster assumption: data can form discrete clusters and points in the same cluster are more likely to share an output label; and (iii) manifold assumption: the data lie approximately on a manifold of much lower dimension than the input space. Semi-supervised algorithms are widely used in text and speech analysis, internet content classification, and protein sequence classification applications.
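The continuity assumption can be illustrated with a deliberately simple label-spreading scheme: at each step, the unlabeled point closest to an already labeled point inherits that point's label. This is only a toy sketch of the semi-supervised idea, not one of the published algorithms cited in this section; the function name, the one-dimensional descriptor, and the labels are hypothetical.

```python
import math


def propagate_labels(points, labels):
    """Fill in missing labels (None) from the nearest labeled point.

    Labels spread outward one point at a time, so that each new label
    comes from the closest point that is already labeled.
    """
    labels = list(labels)
    while None in labels:
        best = None  # (distance, unlabeled index, labeled index)
        for i, label_i in enumerate(labels):
            if label_i is not None:
                continue
            for j, label_j in enumerate(labels):
                if label_j is None:
                    continue
                d = math.dist(points[i], points[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None:
            raise ValueError("at least one labeled point is required")
        _, i, j = best
        labels[i] = labels[j]
    return labels


# Hypothetical one-descriptor dataset with only two labeled entries.
samples = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
known = ["porous", None, None, None, None, "dense"]
completed = propagate_labels(samples, known)
```

The scheme works here because the data form two well-separated clusters; when the cluster assumption fails, labels can spread across the true class boundary.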
Semi-supervised learning was used by Court et al. to create a materials database of 39 822 Curie and Néel temperatures by text mining a corpus of 68 078 chemistry and physics articles, applying natural language processing and a semi-supervised relationship extraction algorithm to obtain the values (and units) of the properties from the texts. 36 Similarly, Huo et al. demonstrated the efficiency and accuracy of semi-supervised machine learning to classify inorganic materials synthesis procedures from written natural language. 37 Recently, Kunselman et al. used semi-supervised learning methods to analyze and classify microstructure images, training their model on a dataset where only a fraction of the microstructures was labeled initially. 38

Applications
Most of the applications of ML in chemical and materials sciences, as we have said, feature supervised learning algorithms. The goal there is to supplement or replace traditional modeling methods, at the quantum chemical or classical level, in order to predict the properties of molecules or materials directly from their structure or their chemical composition. To give some recent examples in an area our group has worked in, ML has been used in providing a better understanding and prediction of the mechanical properties of crystalline materials. In 2016, de Jong et al. used supervised ML with gradient boosting regression to develop a model that predicts the elastic properties (such as the bulk modulus K and shear modulus G) for a large set of inorganic compounds. 39 These authors used 187 descriptors for the materials, including a diverse set of composition and structural descriptors, and training data from quantum chemistry calculations at the Density Functional Theory (DFT) level. The trained predictor was then used to provide predictions of K and G for a large range of inorganic materials outside of the training dataset, and these predicted values are now available (as estimates) for materials in the Materials Project, 9 both through the API and on the website.
Our research group applied the same idea to a narrower range of materials, trying to confirm that for a given chemical composition, geometrical descriptors of a material's structure could lead to accurate predictions of its mechanical features: we used local, structural, and porosity-related descriptors. Evans and Coudert 40 trained a gradient boosting regressor algorithm on data for 121 pure silica zeolites 41 (SiO2 polymorphs) and also used it to predict K and G elastic moduli of 590 448 hypothetical frameworks. The results highlighted several important correlations and trends in terms of stability for zeolitic structures. Later, in Gaillac et al., we expanded this ML study to look into anisotropic mechanical properties, which are typically more difficult to model. We obtained an algorithm for the prediction of auxeticity and Poisson's ratio of more than 1000 zeolites. 42
Beyond the applications described above for the prediction of the properties of molecules or materials, machine learning has been used at another level, in order to improve existing computational methods. One of the areas where this has been done is the improvement of the exchange-correlation functional in density functional theory (DFT) calculations, for example, in order to provide a better description of weak chemical interactions and highly correlated systems. In this area, much effort has been spent trying to leverage machine learning to produce a universal density functional, 43,44 to solve the Kohn-Sham equations, 45 to optimize DFT exchange-correlation functionals, 46,47 and to create adaptive basis sets. 48 In the realm of classical molecular simulation, machine learning can be used to optimize interatomic potentials (a.k.a. force fields) with a predetermined analytical form 49,50 or to create de novo force fields, 51-53 for both molecules and materials.
We should note here that machine learning can also be applied to discover and design entirely new compounds, creating novel opportunities for computationally assisted discovery of materials. Designing materials with targeted physical and chemical properties is recognized as an outstanding challenge in materials research. Using kernel regression, Calfa and Kitchin predicted the electronic properties of 746 binary metal oxides and elastic properties of 1173 crystals. 54 Then, they used the special features to design a new crystal with an exhaustive enumeration (EE) algorithm that evaluated all the possible combinations of crystals from the dataset. Based on the electronic properties of binary metal oxides, the authors obtained 1 153 504 combinations that should be iterated. Faber and co-workers identified 128 novel structures through the development of a ML model that was trained to reproduce the formation energies of two million combinations of elements presenting the ABC2D6 formula. The 128 new structures were later added to the Materials Project database. 55 Other applications in materials exploration include the design of novel catalysts 56 and novel cathode materials to improve the performance of lithium ion batteries. 57

E. Learners
In this section, we discuss in a bit more detail an important (but more technical) part of the machine learning workflow: the choice of learning algorithms or learners. We have already mentioned in passing above the names of a few of these, which can be applied depending on the type of the data and the underlying problem to be solved. This choice is a crucial step in any ML workflow as the selection of algorithm plays a key role in the accuracy of the prediction. 58 Many algorithms are readily accessible for non-expert users, with packages written in Python (scikit-learn, Keras, PyTorch), C++ (mlpack, TensorFlow), R (caret), and others. We highlight here some of the possible choices of learners, in an overview that is not remotely exhaustive but aims to give the reader a glimpse of the diversity of the methods available.
The family of k-nearest neighbor (k-NN) algorithms can be used for classification and regression tasks in supervised ML. The k-NN algorithm assumes that similar objects in the data are near each other. For a given observation X that we want to predict, the k-NN algorithm will look for the k points closest in the dataset; then, it will use the output variable associated with these nearest neighbors to predict the value of X. The upsides of k-NN are that the model is simple and easy to implement and that it is non-parametric, with no need for tuning several hyperparameters. On the downside, it gets significantly slower as the volume of the data increases.
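The k-NN procedure described above can be sketched in a few lines: sort the training points by distance to the query, keep the k closest, and either average their outputs (regression) or take a majority vote (classification). The function names and the toy data are hypothetical; the distance is the plain Euclidean distance, which assumes that the descriptors have been rescaled to comparable ranges.

```python
import math
from collections import Counter


def nearest_neighbors(train_x, query, k):
    """Return the indices of the k training points closest to the query."""
    order = sorted(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return order[:k]


def knn_regress(train_x, train_y, query, k=3):
    """Predict a continuous value as the mean over the k nearest neighbors."""
    idx = nearest_neighbors(train_x, query, k)
    return sum(train_y[i] for i in idx) / len(idx)


def knn_classify(train_x, train_y, query, k=3):
    """Predict a label by majority vote among the k nearest neighbors."""
    idx = nearest_neighbors(train_x, query, k)
    return Counter(train_y[i] for i in idx).most_common(1)[0][0]


# Hypothetical training set with a single descriptor.
train_x = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,)]
train_values = [10.0, 20.0, 30.0, 100.0, 110.0]
train_labels = ["insulator", "insulator", "insulator", "metal", "metal"]

predicted_value = knn_regress(train_x, train_values, (2.1,), k=3)
predicted_label = knn_classify(train_x, train_labels, (10.4,), k=3)
```

Every prediction scans the whole training set, which is the origin of the slowdown with data volume mentioned above; tree-based spatial indexes are commonly used to mitigate it.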
Decision trees (DTs) are another family of learners that can handle both classification and regression supervised ML (forming classification trees and regression trees, respectively). The use of decision trees in machine learning represents an option that is simple to understand and interpret, as the trees can be explicitly visualized. In the tree structure, the root of the tree is the input data, and each branch represents a possible decision. The input data are broken down into smaller and smaller subsets in the tree, with splitting rules implemented in each internal node of the tree based on the data. The leaves of the tree represent the output of the algorithm. The choice of the DT model is important to avoid overfitting (with unnecessarily complex trees); this can be achieved by mechanisms such as setting the maximum depth of the tree and the minimum number of samples required at a leaf node. Different types of decision tree-based learners exist, such as Random Forest (RF) and Gradient Boosting Decision Tree (GBDT). RF averages over many trees to reduce overfitting and improve the predictive accuracy, while GBDT builds shallow trees sequentially, each new tree trying to correct the errors of the previous ones.
Another family is that of the naïve Bayes classifiers, supervised ML algorithms for classification. They are based, as their name indicates, on Bayes' theorem with an assumption of independence between the features. They have been studied and developed for a long time, are easy to employ, and are extremely scalable. Different kinds exist, depending on the nature of the data. Their main downside is the assumption of independence of the predictors, which rarely holds in complex use cases.
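Returning to decision trees, the role of the maximum depth and of the minimum number of samples per leaf can be made concrete with a small regression tree. The sketch below chooses, at each node, the threshold that minimizes the summed squared error of the two children, and stops splitting when either limit is reached. It is a bare-bones illustration with hypothetical names and data, not the implementation found in any particular library.

```python
def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def build_tree(rows, targets, max_depth=2, min_samples_leaf=1, depth=0):
    """Return a nested dict: a leaf {'value': ...} or a split node."""
    best = None  # (error, feature index, threshold)
    if depth < max_depth:
        for f in range(len(rows[0])):
            # Candidate thresholds: all distinct values except the largest.
            for threshold in sorted({r[f] for r in rows})[:-1]:
                left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
                right = [t for r, t in zip(rows, targets) if r[f] > threshold]
                if min(len(left), len(right)) < min_samples_leaf:
                    continue
                error = sse(left) + sse(right)
                if best is None or error < best[0]:
                    best = (error, f, threshold)
    if best is None:
        return {"value": mean(targets)}  # leaf: predict the mean target
    _, f, threshold = best
    left_idx = [i for i, r in enumerate(rows) if r[f] <= threshold]
    right_idx = [i for i, r in enumerate(rows) if r[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left_idx],
                           [targets[i] for i in left_idx],
                           max_depth, min_samples_leaf, depth + 1),
        "right": build_tree([rows[i] for i in right_idx],
                            [targets[i] for i in right_idx],
                            max_depth, min_samples_leaf, depth + 1),
    }


def predict(tree, row):
    """Follow the splitting rules from the root down to a leaf."""
    while "value" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["value"]


# Hypothetical data: one descriptor, a step-like target property.
rows = [(1.0,), (2.0,), (3.0,), (10.0,), (11.0,), (12.0,)]
targets = [5.0, 5.0, 5.0, 20.0, 20.0, 20.0]
stump = build_tree(rows, targets, max_depth=1)
low, high = predict(stump, (2.5,)), predict(stump, (10.5,))
```

Ensemble methods such as RF and GBDT combine many such trees, each kept simple through exactly these two controls.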
As the last family we will cite here, kernel methods are a class of algorithms for pattern analysis. At their core, they rely on the use of kernel functions, which are applied to the input data to map the original nonlinear observations into a higher-dimensional space in which they become separable. That feature space is implicit because the coordinates of the data in this space are never directly computed, but only the inner product between pairs of data, which is computationally less expensive. Perhaps the best known and most widely used algorithms in this family are the support vector machine (SVM) and kernel ridge regression (KRR).
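The statement that only inner products are needed (the "kernel trick") can be checked numerically on a small case. For two-dimensional inputs, the polynomial kernel k(x, y) = (x · y)^2 equals the inner product of the explicit three-dimensional features (x1^2, sqrt(2) x1 x2, x2^2), so the kernel gives access to the feature space without ever building it. The function names and input vectors below are chosen for illustration; a Gaussian (RBF) kernel, whose feature space is infinite-dimensional, is included for comparison.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def poly_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel on 2D inputs."""
    return dot(x, y) ** 2


def feature_map(x):
    """Explicit feature map associated with poly_kernel."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel; its feature map cannot be written out explicitly."""
    return math.exp(-gamma * math.dist(x, y) ** 2)


x, y = (1.0, 2.0), (3.0, 0.5)
kernel_value = poly_kernel(x, y)                       # computed in 2D
explicit_value = dot(feature_map(x), feature_map(y))   # computed in 3D
```

Both routes give the same number, but the kernel route never leaves the original input space, which is what makes high-dimensional (or infinite-dimensional) feature spaces tractable.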

F. Model evaluation
As we explained above, the main goal of machine learning is to train and generate an efficient computational model, whose predictions will be accurate. This accuracy should be confirmed, in order to check that the model correctly captures the underlying patterns in the data, but it cannot be validated solely on the results obtained from the training dataset. The best way to check the accuracy is to assess the performance of the trained ML model on data that were not included in its training dataset. However, starting from a dataset of a given size, there are statistical techniques, called cross-validation techniques, that are better than simply splitting the data into two sets (training and validation). One of the most used cross-validation methods is k-fold cross-validation. In this method, the dataset is divided into k subsets (or folds) of similar size. The model is then trained on k - 1 of these subsets, and the remaining subset is held out for testing. The process of training and evaluation is repeated k times, each time with a different subset used for validation, and the k scores are averaged to estimate the performance of the model.
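The k-fold procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the model is a one-variable least-squares line, the data are synthetic and noise-free, and the helper names (`k_fold_splits`, `fit_line`, `cross_validate`) are hypothetical.

```python
import random

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)      # shuffle once, reproducibly
    folds = [indices[i::k] for i in range(k)] # k disjoint subsets
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def cross_validate(xs, ys, k=5):
    """Return the mean squared error measured on each held-out fold."""
    scores = []
    for train, test in k_fold_splits(len(xs), k):
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test) / len(test)
        scores.append(mse)
    return scores

# Synthetic, noise-free data (illustrative only): y = 2x + 1
xs = [float(i) for i in range(20)]
ys = [2.0 * x + 1.0 for x in xs]

scores = cross_validate(xs, ys, k=5)
print(sum(scores) / len(scores))  # average validation error over the 5 folds
```

Each data point is used for validation exactly once and for training k - 1 times, which is what makes the averaged score a better use of a small dataset than a single training/validation split.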
Among the problems that have to be checked in the evaluation of the ML model is the agreement between the level of complexity of the model and that of the data. In the case where the data are not sufficiently detailed or the model is too simple, the resulting model can have a bias, a situation named underfitting. On the contrary, if the model is too complex, with a large number of parameters, overfitting can occur: the model reproduces the training data closely but generalizes poorly to new data. To produce an optimal model, it is crucial to strike a balance between underfitting and overfitting by adjusting the hyperparameters. This necessary step of tuning the hyperparameters to select their optimal values is not always easy, as it requires systematic searches and patience.
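A systematic search over a hyperparameter can be as simple as a loop over candidate values, each scored on validation data. The sketch below uses a k-nearest-neighbor regressor as a stand-in model (it is not one of the learners discussed in this section, and was chosen only because it fits in a few lines); the noisy sine data, the candidate grid, and all names are assumptions made for illustration.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def validation_mse(train, valid, k):
    """Mean squared error on the validation set for a given k."""
    return sum((y - knn_predict(train, x, k)) ** 2 for x, y in valid) / len(valid)

# Synthetic data (illustrative only): a sine curve with Gaussian noise
rng = random.Random(42)
data = [(i / 10, math.sin(i / 10) + rng.gauss(0, 0.2)) for i in range(100)]
rng.shuffle(data)
train, valid = data[:70], data[70:]

# Grid search: small k gives a very flexible model, large k a very rigid one
grid = [1, 3, 5, 9, 15, 31, 70]
results = {k: validation_mse(train, valid, k) for k in grid}
best_k = min(results, key=results.get)

for k in grid:
    print(k, round(results[k], 4))
print("selected k:", best_k)
```

At k = 70, every prediction is the mean of the whole training set, an extreme case of underfitting; at k = 1, the model follows the noise of individual points. The grid search selects a value between these extremes on the basis of validation error alone.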

III. DEEP LEARNING
As stated before, machine learning techniques are a form of weak AI: they are not fully autonomous and require some guidance, e.g., in the adjustment of hyperparameters. To go beyond the traditional ML approaches, deep learning (DL) methods were developed that try to mimic more closely some aspects of human cognition, with less need for manual intervention from the programmer. DL is a subtype of ML that runs its inputs through a biologically inspired artificial neural network (NN) architecture. Over time, it has been established that NNs outperform many other algorithms in accuracy and speed through their strong ability to capture the relevant information from a large amount of data. Deep learning is, in particular, capable of modeling and processing very complex nonlinear relationships.
There are many variants of deep learning methods available, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), artificial neural networks (ANNs), and deep neural networks (DNNs). The neural networks they rely on contain artificial neurons arranged in multiple layers such that each layer communicates only with the layers immediately preceding and following (see Fig. 3). Information travels in the NN from the first layer, or input layer (which receives the external data), to the last layer, or output layer (which produces the final result), through several hidden layers in between. The number of hidden layers depends on the complexity of the problem to be solved.
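The flow of information from the input layer to the output layer can be sketched as follows. This is a minimal illustration assuming a fully connected network with sigmoid activations; the layer sizes and weight values are arbitrary placeholders, not parameters from any trained model.

```python
import math

def sigmoid(z):
    """Logistic activation function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def forward(layers, inputs):
    """Propagate `inputs` through successive layers.

    Each layer is a (weights, biases) pair, where weights[j] holds the
    connection weights feeding neuron j from the previous layer.
    """
    activations = inputs
    for weights, biases in layers:
        activations = [
            sigmoid(sum(w * a for w, a in zip(neuron_weights, activations)) + b)
            for neuron_weights, b in zip(weights, biases)
        ]
    return activations

# Hypothetical network: 3 inputs -> 2 hidden neurons -> 1 output neuron
hidden_layer = ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1])
output_layer = ([[1.0, -1.0]], [0.2])
network = [hidden_layer, output_layer]

result = forward(network, [1.0, 0.5, -1.5])
print(result)
```

Each layer only reads the activations of the layer immediately before it, which is why the forward pass reduces to a simple loop over layers. Adding hidden layers amounts to appending more (weights, biases) pairs to the list.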
In each of the hidden layers, the neurons receive the input signal from other neurons, process it by combining the input with their internal state, and produce an output signal. The neural network links these neurons through connections, providing the output of one neuron as an input to another neuron. Each neuron can have multiple input and output connections, to which weights are assigned, forming the overall layer of the NN. The learning process involves the adaptation of the network, i.e., of the weights of the connections, by minimizing the observed errors in the output of the neural network. Because of this very generic and self-adapting architecture of the NN, deep learning reduces the need for feature engineering and can identify and work around defects that would be difficult to spot with other techniques. However, artificial neural networks require a very large amount of data to be trained accurately, and their training is computationally expensive due to the large number (millions or more) of parameters to optimize.
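The adaptation of weights by error minimization can be illustrated on the smallest possible case. The sketch below trains a single sigmoid neuron, rather than a full multilayer network, by stochastic gradient descent on the logical OR function; the learning rate, number of epochs, and function names are assumptions made for illustration.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, bias, inputs):
    """Weighted sum of the inputs plus bias, passed through the activation."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def train_neuron(samples, epochs=2000, learning_rate=0.5):
    """Adjust the weights and bias to reduce the error on each sample in turn."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            # Gradient of the cross-entropy loss with respect to the weighted sum
            error = neuron_output(weights, bias, inputs) - target
            weights = [w - learning_rate * error * x for w, x in zip(weights, inputs)]
            bias -= learning_rate * error
    return weights, bias

# Truth table of the logical OR function, used as toy training data
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]

weights, bias = train_neuron(samples)
predictions = [round(neuron_output(weights, bias, x)) for x, _ in samples]
print(predictions)
```

In a deep network the same principle applies to every connection, with the gradients propagated backward layer by layer. This is where the large number of parameters, and hence the computational cost mentioned above, comes from.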
Several research groups have proposed applications of deep learning to problems in chemical and materials sciences. 59 For example, Willighagen et al. used supervised self-organizing maps (a kind of ANN) to explore large numbers of experimental and simulated crystal structures, in order to visualize structure-property relationships. 60 Using an ANN implemented in the open-source PyBrain code, 61 Ma et al. trained a model with a set of ab initio adsorption energies and electronic fingerprints of idealized bimetallic surfaces (spatial extent of metal d-orbitals, atomic radius, ionization potential, electron affinity, and Pauling electronegativity). 62 This model was able to capture complex nonlinear adsorbate-substrate interactions when applied to the electrochemical reduction of carbon dioxide on metal electrodes.
Rule-based expert systems (in which rules are applied to the reactants to obtain the product in reaction prediction, or to the product for retrosynthesis) cannot predict outside of their knowledge and often fail because they ignore the molecular context, which leads to reactivity conflicts. To overcome this problem, deep learning techniques have been used to predict chemical synthesis routes by combining NNs with rule-based expert systems. Using this combination, Segler and Waller have ranked the candidate synthetic pathways by the computation of mean reciprocal rank (MRR). 63 This model was trained on 3.5 × 10^6 reactions, with a success rate of 95% in retrosynthesis and 97% for reaction prediction. In another work, Cole et al. have determined the probability of the predicted product using more than 800 000 organic and organometallic crystal structures in the CSD. 64 Deep learning is also used in the study and discovery of drug-like molecules, e.g., Gómez-Bombarelli et al. have estimated the chemical properties from the latent continuous vector representation of the molecule using the RNN method. 65 The model they developed allows the generation of new molecules for efficient exploration and optimization.
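For readers unfamiliar with the mean reciprocal rank mentioned above, it averages, over a set of queries, the inverse of the position at which the correct answer appears in the ranked list of candidates. The short sketch below shows the arithmetic only; it is not the code of the cited work, and the ranks are made-up values.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank, where ranks[i] is the 1-based position of the
    correct answer in the ranked candidate list for query i."""
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks of the correct answer for three queries
ranks = [1, 2, 4]
mrr = mean_reciprocal_rank(ranks)
print(mrr)
```

An MRR of 1 means that the correct answer is always ranked first, and the value decreases toward 0 as correct answers appear further down the list.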

IV. ARTIFICIAL INTELLIGENCE IN THE LAB
While this paper is focused on providing an introductory description of machine learning approaches for the prediction of chemical systems, in general, and materials properties, more specifically, we want to end it by noting that the use of artificial intelligence techniques in chemistry and materials science is much broader than machine learning and its computational applications, and it provides a lot of exciting avenues for research in the near future. 66 We refer the reader to Ref. 67 for an in-depth and very insightful review of the multiple avenues of research opened by artificial intelligence in the field of synthetic organic chemistry.
One particular area of recent achievements for artificial intelligence in chemistry is its integration into chemistry labs, achieved through robotics. 68 Robotic synthesis based on flow chemistry takes high-throughput discovery to an entirely new scale, where chemical syntheses can be described through standardized method descriptions, i.e., "source code for chemistry," which is then compiled for the specific hardware of a synthesis robot. 69 Furthermore, this allows high-throughput synthesis and characterization to be tightly coupled with computational screening procedures. 70 In this approach, it is, thus, possible to leverage an artificial intelligence algorithm to propose synthetic routes, coupled with a robotic microfluidic platform to realize the synthesis and characterize its results. 71

ACKNOWLEDGMENTS
A large part of the work discussed in this article requires access of scientists to large supercomputer centers. Although no original calculations were performed in the writing of this paper, we acknowledge GENCI for high-performance computing CPU time allocations (Grant No. A0070807069). This work was funded by the Agence Nationale de la Recherche under the project "MATAREB" (Grant No. ANR-18-CE29-0009-01).

DATA AVAILABILITY
Data sharing is not applicable to this article as no new data were created or analyzed in this study.