Perspective: Materials informatics and big data: Realization of the "fourth paradigm" of science in materials science

Our ability to collect "big data" has greatly surpassed our capability to analyze it, underscoring the emergence of the fourth paradigm of science, which is data-driven discovery. The need for data informatics is also emphasized by the Materials Genome Initiative (MGI), further boosting the emerging field of materials informatics. In this article, we look at how data-driven techniques are playing a big role in deciphering processing-structure-property-performance relationships in materials, with illustrative examples of both forward models (property prediction) and inverse models (materials discovery). Such analytics can significantly reduce time-to-insight and accelerate cost-effective materials discovery, which is the goal of MGI. © 2016 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). [http://dx.doi.org/10.1063/1.4946894]


INTRODUCTION
The field of materials science relies on experiments and simulation-based models to understand the "physics" of different materials, in order to better characterize them and to discover new materials with improved properties for use throughout society. Lately, the "big data" generated by such experiments and simulations has offered unprecedented opportunities for the application of data-driven techniques in this field, thereby opening up new avenues for accelerated materials discovery and design. The need for such data analytics has also been emphasized by the Materials Genome Initiative (MGI), 1 which envisions the discovery, development, manufacturing, and deployment of advanced materials twice as fast and at a fraction of the cost.

Four paradigms of science
In fact, these developments in the field of materials science are along the lines of how science and technology overall have evolved over the centuries. For thousands of years, science was purely empirical, which here corresponds to metallurgical observations over the "ages" (stone, bronze, iron, steel). Then came the paradigm of theoretical models and generalizations a few centuries ago, characterized by the formulation of various "laws" in the form of mathematical equations; in materials science, the laws of thermodynamics are a good example. But for many scientific problems, the theoretical models became too complex over time, and an analytical solution was no longer feasible. With the advent of computers a few decades ago, a third paradigm of computational science became very popular. This has allowed simulations of complex real-world phenomena based on the theoretical models of the second paradigm; excellent examples in materials science are density functional theory (DFT) and molecular dynamics (MD) simulations. Each new paradigm has in turn helped advance the previous ones, and today they are well established as the branches of theory, experiment, and computation in almost all scientific domains. The amount of data being generated by these experiments and simulations has, over the last few years, given rise to the fourth paradigm of science, 2 (big) data-driven science, which unifies the first three paradigms of theory, experiment, and computation/simulation. It is becoming increasingly popular in the field of materials science as well and has, in fact, led to the emergence of the new field of materials informatics. Figure 1 shows these four paradigms of science.

Big data
Before going further, it would be useful to expand on the concept of big data and what it means specifically in the context of materials science. The "bigness" (amount) of data is certainly the primary feature and challenge, but several other characteristics can make the collection, storage, retrieval, analysis, and visualization of such data even more challenging. For example, the data may come from heterogeneous sources, may be of different types, may have unknown dependencies and inconsistencies within it, may have parts that are missing or unreliable, may be generated at a rate much greater than what traditional systems can handle, may have privacy issues, and so on. These issues can be summarized by the various Vs associated with big data: volume, velocity, variety, variability, veracity, value, and visualization. Of these, the first three (volume, velocity, and variety) are specific to big data, while the others are features of any data, including big data. Further, each application domain can also introduce its own nuances to the process of big data management and analytics. It is worth noting that in many areas of materials science, until recently, there was more of a "no data" than a big data problem, in the sense that open, accessible data have been rather limited; however, recent MGI-supported efforts 3,4 and other similar efforts around the world are promoting the availability and accessibility of digital data in materials science. These efforts include combining experimental and simulation data into a searchable materials data infrastructure and encouraging researchers to make their data available to the community. Thanks to such efforts, it is fair to say that the sheer complexity and variety of the materials science data becoming available nowadays require the development of new big data approaches in materials informatics.
For example, there are numerous kinds of experimental and simulation-based materials property data (e.g., physical, chemical, electronic, thermodynamic, mechanical, structural), engineering/processing data (e.g., heat treatment), image data (e.g., electron backscatter diffraction), spatio-temporal data (e.g., tomography, structure evolution), unstructured textual data (materials science literature), and so on. Of course, several of these types of data are often coupled together. Several excellent review articles on big data in materials science are available in the literature. 5-7

Processing-structure-property-performance (PSPP) relationships
So what can materials informatics and big data do for a real-world materials science application? It is well known that the key to almost everything in materials science relates to PSPP relationships, 8 which are far from being well understood. Figure 2 shows these PSPP relationships, where the deductive science relationships of cause and effect flow from left to right, and the inductive engineering relationships of goals and means flow from right to left.

A. Agrawal and A. Choudhary, APL Mater. 4, 053208 (2016)
FIG. 2. The processing-structure-property-performance relationships of materials science and engineering, and how materials informatics approaches can help decipher these relationships via forward and inverse models.

Further, it is important to note that each relationship from left to right is many-to-one, and consequently the ones from right to left are one-to-many. Thus, many processing routes can potentially result in the same structure of a material, and the same property of a material could potentially be achieved by multiple structures. Each experimental observation or simulation can be thought of as a data point for a forward model (e.g., a measurement or calculation of a property given the processing, composition, and structure parameters). A database of such data points can be used with a materials informatics approach such as predictive analytics to build data-driven forward models that run in a very small fraction of the time it takes to do the experiment or simulation. This acceleration of forward models not only helps guide future simulations and experiments, but also makes it possible to realize the inverse models, which are much more challenging and critical for materials discovery and design. The construction of inverse models is typically formulated as an optimization problem wherein a property or performance metric of interest is to be maximized or minimized, subject to various constraints on the representation of the material, which is typically a composition- and/or structure-based function. The optimization process usually involves multiple invocations of the forward model, so having a fast forward model is extremely valuable. Further, since these inverse relationships are one-to-many, a good inverse model should be able to identify multiple optimal solutions (if they exist), so as to have the flexibility to select the material structure that can be attained in the easiest and most cost-effective manner.
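As a minimal illustration of this forward/inverse pairing, the sketch below (Python/NumPy, with an invented one-dimensional "structure" descriptor and a made-up property function, not any real material model) fits a cheap surrogate forward model to a handful of expensive evaluations and then scans candidates with it, keeping all near-optimal solutions since the inverse map is one-to-many:

```python
import numpy as np

# Stand-in for an expensive forward evaluation (an experiment or simulation):
# property as a function of a 1-D structure descriptor x. Purely illustrative.
def expensive_forward(x):
    return (x**2 - 1) ** 2  # two distinct "structures" (x = ±1) minimize it

rng = np.random.default_rng(0)

# 1. A small database of (structure, property) data points.
x_train = rng.uniform(-2, 2, 50)
y_train = expensive_forward(x_train)

# 2. A cheap data-driven forward model: polynomial least-squares surrogate.
surrogate = np.poly1d(np.polyfit(x_train, y_train, deg=4))

# 3. Inverse model via exhaustive scan with the fast surrogate, keeping
#    *all* near-optimal candidates (the inverse relationship is one-to-many).
candidates = np.linspace(-2, 2, 4001)
pred = surrogate(candidates)
optima = candidates[np.abs(pred - pred.min()) < 1e-3]
```

Here the scan recovers two separate families of near-optimal candidates (around x = -1 and x = +1), mirroring the flexibility to pick whichever solution is easiest to attain in practice.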
In the rest of this article, we shall first discuss a generic workflow for conducting materials informatics and then illustrate it with examples of some recent advances in this field, in terms of the development and application of big data analytics approaches for building both forward and inverse models for PSPP relationships. In particular, we will take the examples of steel fatigue prediction using a set of experimental data (forward models), 9 prediction of the stability of a compound using DFT simulation data (forward models) and subsequent discovery of stable ternary compounds (inverse models), 10 and structure-property optimization of a magnetoelastic material (inverse models). 11

Figure 3 depicts a typical end-to-end workflow for materials informatics. A variety of raw materials data as discussed earlier could be stored in heterogeneous materials databases in potentially different data formats. Given a task at hand (say, developing property prediction models), the first step is to understand the data format and representation and do any necessary preprocessing to ensure the quality of the data before any modeling, removing or appropriately dealing with noise, outliers, missing values, duplicate data instances, etc. Usually the instances and/or attributes responsible for these are removed if they are easily identifiable and sufficient data are available, but optimally utilizing incomplete data remains an active area of research. Examples of data preprocessing steps include discretization, sampling, normalization, attribute-type conversion, feature extraction, feature selection, etc. Such data preprocessing can be either supervised or unsupervised, based on whether the process depends on the target attributes (here the property of the material to be predicted), and the two are thus usually considered separate stages in the workflow.
Once appropriate data preprocessing has been performed and the data are ready for modeling, one can employ supervised data mining techniques for predictive modeling. Caution needs to be exercised here to appropriately split the data into training and testing sets (or use cross-validation); otherwise the model may be prone to over-fitting and show over-optimistic accuracy. If the target attribute is numeric (e.g., fatigue strength, formation energy), regression techniques can be used for predictive modeling, and if it is categorical (e.g., whether a compound is metallic or not), classification techniques can be used. Some techniques are capable of doing both classification and regression. There also exist several ensemble learning techniques that combine the results from base learners in different ways and have been shown to improve the accuracy and robustness of the model in some cases. Table I lists some of the popular predictive modeling techniques. Apart from predictive modeling, one can also use other data mining techniques such as clustering and relationship mining depending on the goal of the project, for instance, to group similar materials or to discover hidden patterns and associations in the data.
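The split-then-model discipline described above can be sketched as follows (Python/NumPy, with a synthetic dataset whose attributes and property values are entirely invented): normalization statistics are computed on the training split only, a linear least-squares model is fit, and R^2 is estimated on the held-out test split:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in for a materials dataset: three invented composition/
# processing attributes X and a numeric property y (all values made up).
n = 200
X = rng.uniform(size=(n, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(0, 0.05, n)

# Split into training and testing sets before any supervised steps.
idx = rng.permutation(n)
train, test = idx[:150], idx[150:]

# Preprocessing: normalize attributes using *training* statistics only,
# so no information leaks from the test set into the model.
mu, sigma = X[train].mean(axis=0), X[train].std(axis=0)
Xn = (X - mu) / sigma

# Predictive modeling: ordinary least-squares linear regression.
A = np.column_stack([Xn[train], np.ones(len(train))])   # add an intercept
w, *_ = np.linalg.lstsq(A, y[train], rcond=None)

# Evaluation on the held-out test split (R^2).
pred = np.column_stack([Xn[test], np.ones(len(test))]) @ w
r2 = 1 - np.sum((y[test] - pred) ** 2) / np.sum((y[test] - y[test].mean()) ** 2)
```

Fitting the normalization on the full dataset instead would leak test-set information into the model and inflate the apparent accuracy.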

KNOWLEDGE DISCOVERY WORKFLOW FOR MATERIALS INFORMATICS
Proper evaluation of data-driven models is crucial. A data-driven model can, in principle, "memorize" every single instance of the dataset and thus achieve 100% accuracy on the same data, but will most likely not work well on unseen data. For this reason, advanced data-driven techniques that usually result in black-box models need to be evaluated on data that the model has not seen during training. A simple way to do this is to build the model on only part of the data and use the remainder for evaluation. This can be generalized to k-fold cross-validation, where the dataset is randomly split into k parts: k - 1 parts are used to build the model and the remaining part is used for testing, and the process is repeated k times with different test splits. Cross-validation is a standard evaluation setting to reduce the chances of over-fitting. Of course, k-fold cross-validation necessitates building k models, which may take a long time on large datasets. It is also important to note that cross-validation should cover the entire workflow and not just the predictive model. Hence, any supervised data preprocessing should also be included along with predictive modeling while performing cross-validation, in order to get unbiased estimates of accuracy.

TABLE I. Some popular predictive modeling techniques.

Technique | Type | Description
Bayesian network 13 | Classification | A graphical model that encodes probabilistic conditional relationships among variables
Logistic regression 14 | Classification | Fits data to a sigmoidal S-shaped logistic curve
Linear regression 15 | Regression | A linear least-squares fit of the data w.r.t. input features
Nearest-neighbor 16 | Both | Uses the most similar instance in the training data for making predictions
Artificial neural networks 17,18 | Both | Uses hidden layer(s) of neurons to connect inputs and outputs, edge weights learnt using back propagation
Support vector machines 19 | Both | Based on structural risk minimization, constructs hyperplanes in a multidimensional feature space
Decision table 20 | Both | Constructs rules involving different combinations of attributes
Decision stump 21 | Both | A weak tree-based machine learning model consisting of a single-level decision tree
J48 (C4.5) decision tree 22 | Classification | A decision tree model that identifies the splitting attribute based on information gain/Gini impurity
Alternating decision tree 23 | Classification | Tree consists of alternating prediction nodes and decision nodes; an instance traverses all applicable paths
Logistic model tree 24,25 | Classification | A classification tree with logistic regression functions at the leaves
M5 model tree 26,27 | Regression | A tree with linear regression functions at the leaves
Random tree | Both | Considers a randomly chosen subset of attributes
Reduced error pruning tree 21 | Both | Builds a tree using information gain/variance and prunes it using reduced-error pruning to avoid over-fitting
AdaBoost 28 | Ensembling | Boosting can significantly reduce the error rate of a weak learning algorithm
Bagging 29 | Ensembling | Builds multiple models on bootstrapped training data subsets to improve model stability by reducing variance
Random subspace 30 | Ensembling | Constructs multiple trees systematically by pseudo-randomly selecting subsets of features
Random forest 31 | Ensembling | An ensemble of multiple random trees
Rotation forest 32 | Ensembling | Generates model ensembles based on feature extraction followed by axis rotations

Quantitative assessments of the model's predictive accuracy can be done with several classification/regression performance metrics such as accuracy, precision, recall/sensitivity, specificity, area under the receiver operating characteristic (ROC) curve, coefficient of correlation (R), explained variance (R^2), mean absolute error (MAE), root mean squared error (RMSE), standard deviation of error (SDE), etc. The resulting knowledge from this workflow can be represented in the form of invertible PSPP relationships, thereby facilitating materials discovery and design. This workflow can be used and leveraged at different stages by various stakeholders such as experimentalists, computational scientists, and materials informaticians. For example, one can add new data as they are generated, get summary statistics, identify key factors influencing a material property, find materials similar to a given material in the database, develop forward predictive models, use them to predict properties of new materials, and finally develop inverse models to discover materials with a desired property or set of properties.
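The point that cross-validation must wrap the entire workflow can be made concrete with a hand-rolled k-fold loop (Python/NumPy, synthetic data): the supervised preprocessing step (here, normalization) is re-fit inside every fold rather than once up front:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 120, 5
X = rng.normal(size=(n, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(0, 0.1, n)

folds = np.array_split(rng.permutation(n), k)
maes, rmses = [], []
for i in range(k):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(k) if j != i])
    # Preprocessing is re-fit on each fold's training part, so the test
    # fold never influences the workflow -- avoiding over-optimistic accuracy.
    mu, sd = X[train].mean(axis=0), X[train].std(axis=0)
    A = np.column_stack([(X[train] - mu) / sd, np.ones(len(train))])
    w, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    pred = np.column_stack([(X[test] - mu) / sd, np.ones(len(test))]) @ w
    err = y[test] - pred
    maes.append(np.mean(np.abs(err)))
    rmses.append(np.sqrt(np.mean(err**2)))

mae, rmse = float(np.mean(maes)), float(np.mean(rmses))
```

Averaging MAE and RMSE over the k folds gives the unbiased accuracy estimates discussed above; moving the normalization outside the loop would quietly leak test statistics into every model.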

Steel fatigue prediction
Agrawal et al. 9 used data from the Japan National Institute for Materials Science (NIMS) MatNavi database 72 to make predictive models for the fatigue strength of steel. Accurate prediction of the fatigue strength of steels is important for several advanced technology applications due to the extremely high cost and time of fatigue testing and the potentially disastrous consequences of fatigue failures. In fact, fatigue is known to account for more than 90% of all mechanical failures of structural components. 73 The NIMS data included composition and processing attributes of 371 carbon and low-alloy steels, 48 carburizing steels, and 18 spring steels. The materials informatics approach used consisted of a series of steps that included data preprocessing for consistency using domain knowledge, ranking-based feature selection, predictive modeling, and model evaluation using leave-one-out cross-validation (a special case of cross-validation where k = N, the number of instances in the dataset; it is generally used with small datasets for a more robust evaluation) with respect to various metrics of prediction accuracy. Twelve regression-based predictive modeling techniques were evaluated, and many of them were able to achieve high predictive accuracy, with R^2 values of ∼0.98 and error rates <4%, outperforming the only prior study on fatigue strength prediction, which had reported R^2 values of <0.94. In particular, neural networks, decision trees, and multivariate polynomial regression achieved high R^2 values of >0.97. It is well known in the field of predictive data analytics that beyond a certain point it becomes increasingly difficult to improve prediction accuracy; in other words, an improvement from 0.94 to 0.97 should be considered more significant than an improvement from 0.64 to 0.67.
It was also observed from the scatter plots that the three grades of steel were well separated in most cases, and different techniques tended to perform better in different regions, which was also reflected in the distribution of errors. For example, multivariate polynomial regression gave the best R^2 of 0.9801 but also gave very poor predictions for some carbon and low-alloy steels. It is encouraging to see good accuracy despite the limited amount of available experimental data; however, it was also concluded that methods resulting in a bimodal distribution of errors, or ones with significant peaks in higher-error regions, need to be used with caution even if their reported R^2 is high. Nonetheless, this work successfully demonstrated the use of data science methodologies to build forward models on experimental data connecting processing and composition directly with performance, in the context of PSPP relationships.
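Leave-one-out cross-validation as used in that study is easy to sketch (Python/NumPy; the dataset below is synthetic and merely stands in for the NIMS data, with invented attribute and strength values): each of the N instances is predicted by a model trained on the remaining N - 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40                                   # a small dataset, where LOO shines
X = rng.uniform(size=(n, 3))             # invented composition/processing attrs
y = 300 + 200 * X[:, 0] - 50 * X[:, 1] + rng.normal(0, 5, n)  # invented strength

preds = np.empty(n)
for i in range(n):                       # leave-one-out: k = N folds
    train = np.delete(np.arange(n), i)
    A = np.column_stack([X[train], np.ones(n - 1)])
    w, *_ = np.linalg.lstsq(A, y[train], rcond=None)
    preds[i] = np.append(X[i], 1.0) @ w  # predict the single held-out instance

r2 = 1 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)
```

Because every prediction comes from a model that never saw that instance, the resulting R^2 is an honest accuracy estimate even for a dataset this small.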

Stable compound discovery
In another work, Meredig, Agrawal et al. 10 employed a similar predictive analytics framework on simulation data from quantum mechanical DFT calculations. Performing a DFT simulation on a material requires its composition and crystal structure as input. Traditionally, there has been much effort in computational materials science on crystal structure prediction (CSP), which takes the composition as input and performs a global optimization over all possible unit cells to find the one with the lowest formation energy. Both CSP and DFT are thus computationally expensive, and this work used a database of existing DFT calculations to build forward models predicting a material's property (in this case, formation energy) using a variety of attributes, all of which could be derived solely from the material's stoichiometric composition. Of course, the predictive models are trained on DFT data, which implicitly use structure information, but once the model is built, it can subsequently be used to predict the formation energy of new materials without crystal structure as input. It was found that the developed models were excellent at predicting DFT formation energies to which they were not fit, with an R^2 score greater than 0.9 and an MAE well within DFT's typical agreement with experiment. Thus, these forward models could predict formation energy without any structural input, about six orders of magnitude faster than DFT, and without sacrificing the accuracy of DFT compared to experiment. These models were subsequently used to conduct "virtual combinatorial chemistry" by scanning almost the entire unexplored ternary space of compounds of the form AxByCz to identify compounds that are predicted to be stable. This is tantamount to the inverse models described earlier in this article. Here A, B, C were selected from a list of 83 technologically relevant elements, and x, y, z were obtained using an enumeration procedure taking into account the statistically most common compositions in the Inorganic Crystal Structure Database (ICSD) and basic charge balance conditions. A total of approximately 1.6 × 10^6 ternary compositions were scanned to get predictions of formation energy in a matter of minutes, in contrast to the tens of thousands of years that would be required for DFT simulations to do the same. Many interesting insights were obtained as a result of this study, which identified about 4500 ternary compositions as predictions for new stable compounds. Of these, 9 were systematically selected for a full DFT crystal structure test, and 8 of them were explicitly confirmed to be stable. Interestingly, if the 4500 predictions reported in this study are experimentally confirmed, it would represent an increase in the total number of known stable ternary compounds of more than 10%. This realization of inverse models is depicted in Figure 4.

FIG. 4. A simple realization of the inverse models for PSPP relationships. The forward predictive model built using a supervised learning technique on a labeled materials dataset can be used to scan a combinatorial set of materials and thus convert this set to a ranked list, ordered by the predicted property. This can be followed by one or more screening steps to select and validate the predictions using simulation and/or experiments, thereby enabling data-driven materials discovery, which can in turn be fed back into the materials dataset to derive improved models, and so on. Blue arrows denote the forward model construction process, and green arrows denote the materials discovery process via inverse models.
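The ranked-list screening idea of Figure 4 can be sketched as follows (Python/NumPy; a toy 8-element "periodic table", invented per-element descriptors, and a fixed linear rule standing in for a trained formation-energy model; none of these numbers are real): enumerate candidate ternaries, predict each with the fast forward model, and sort into a ranked list for downstream screening:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(7)

# A toy 8-element "periodic table" with two invented per-element descriptors
# (e.g., electronegativity-like quantities); purely illustrative values.
elem_feat = rng.uniform(size=(8, 2))

def featurize(elems, stoich):
    # Composition-weighted average of element descriptors: note that no
    # crystal structure information is needed, only the stoichiometry.
    w = np.asarray(stoich, dtype=float)
    return (w / w.sum()) @ elem_feat[list(elems)]

# A fixed linear rule stands in for a forward model trained on existing
# DFT data to predict formation energy from composition features alone.
coef = np.array([-1.0, 0.4])
def predict_energy(f):
    return float(f @ coef)

# Virtual combinatorial screening: enumerate ternary A-B-C candidates and
# convert the set into a ranked list ordered by predicted formation energy.
candidates = []
for elems in product(range(8), repeat=3):
    if len(set(elems)) < 3:
        continue                          # require three distinct elements
    f = featurize(elems, (1, 1, 2))       # one example AxByCz stoichiometry
    candidates.append((elems, predict_energy(f)))

ranked = sorted(candidates, key=lambda t: t[1])  # most stable first
```

Each candidate costs microseconds instead of a DFT run; the top of the ranked list would then be passed to DFT and/or experiment for validation, closing the discovery loop of Figure 4.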

Data-driven microstructure optimization
The last example that we would like to discuss is recent work on the optimization of a magnetoelastic Fe-Ga alloy (Galfenol) microstructure for enhanced elastic, plastic, and magnetostrictive properties. 11 When a magnetic field is applied to this alloy, the boundaries between the magnetic domains shift and rotate, leading to a change in its dimensions. This behavior is called magnetostriction, and it has many applications in microscale sensors, actuators, and energy harvesting devices. Desirable properties for such a material include low Young's modulus, high magnetostrictive strain, and high yield strength. Theoretical forward models for computing these properties for a given microstructure are available, but inverse models to obtain microstructures with the desired properties are very challenging. Note that the approach used in the previous example (stable compound discovery), where the forward model could scan the entire ternary composition space to realize the inverse model, would be prohibitive here, since the microstructure space is far too large. In this example, the microstructure was represented by an orientation distribution function (ODF) with 76 independent nodes, leading to a 76-dimensional inverse problem (each value representing the positive volume density of a crystal orientation). For a conservative estimate of the number of possible combinations, even if we assume just two possible values for each dimension, on the order of 2^76 combinations would result, which would take more than 2 × 10^9 years to enumerate, even assuming it takes just 1 µs to compute a property with the forward models. Thus, the high dimensionality of the microstructure space, along with other challenges such as multi-objective design requirements and non-uniqueness of solutions, makes even traditional search-based optimization methods inadequate in terms of both search efficiency and result optimality.
In this work, a machine learning approach was developed to address these challenges, consisting of random data generation, feature selection, and classification algorithms. The key idea was to prune the search space as much as possible through search path refinement (ranking the ODF dimensions) and search region reduction (identifying promising regions within the range of each ODF dimension), so as to reach the optimal solution faster. More details of this approach are available elsewhere. 74 It was found that this data-driven approach for structure-property optimization could not only identify more optimal microstructures satisfying multiple linear and non-linear property objectives much faster than traditional optimization methods, but also discover multiple optimal solutions for some properties that were previously unknown for this problem.
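The two pruning ideas above, ranking the search dimensions and then shrinking the promising region of the top-ranked one, can be sketched as follows (Python/NumPy; an 8-dimensional toy search space and an invented property function stand in for the 76-dimensional ODF problem, and simple correlation ranking stands in for the feature selection actually used):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8                                     # stand-in for the 76 ODF dimensions

def forward(v):
    # Invented property: best when v[0] is near 0.7 and v[1] is near 0.2.
    return -((v[0] - 0.7) ** 2) - ((v[1] - 0.2) ** 2)

# 1. Random data generation over the full search space.
samples = rng.uniform(size=(2000, d))
props = np.array([forward(s) for s in samples])

# 2. Search path refinement: rank dimensions by (absolute) correlation
#    with the property, a simple stand-in for feature selection.
scores = [abs(np.corrcoef(samples[:, j], props)[0, 1]) for j in range(d)]
order = np.argsort(scores)[::-1]          # most influential dimensions first

# 3. Search region reduction: restrict the top-ranked dimension to the
#    value range occupied by the best 10% of samples, then resample there.
top = int(order[0])
best = samples[props >= np.quantile(props, 0.9)]
lo, hi = best[:, top].min(), best[:, top].max()

refined = rng.uniform(size=(2000, d))
refined[:, top] = rng.uniform(lo, hi, 2000)
best_refined = max(forward(s) for s in refined)
```

The refined search concentrates its samples where good solutions live; iterating the rank-and-reduce step over further dimensions shrinks the space far faster than brute-force enumeration could.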

CONCLUSION
To summarize and conclude, (big) data-driven analytics, the fourth paradigm of science, has given rise to the emergence and popularity of materials informatics, and it is of central importance in realizing the vision of the Materials Genome Initiative. We discussed a generic workflow for materials informatics along with three recent illustrative applications where data-driven analytics has been successfully used to learn invertible PSPP relationships. The field of materials informatics is still in its nascent stage, much like bioinformatics 20 years ago. Interdisciplinary collaborations bringing together expertise from materials science and computer science, and creating a workforce equipped with such interdisciplinary skills, are vital to optimally harness the wealth of opportunities becoming available in this arena and to enable timely discovery and deployment of advanced materials for the benefit of mankind.