Research Update: The materials genome initiative: Data sharing and the impact of collaborative ab initio databases

Materials innovations enable new technological capabilities and drive major societal advancements but have historically required long and costly development cycles. The Materials Genome Initiative (MGI) aims to greatly reduce this time and cost. In this paper, we focus on data reuse in the MGI and, in particular, discuss the impact of three different computational databases based on density functional theory methods on the research community. We also discuss and provide recommendations on technical aspects of data reuse, outline remaining fundamental challenges, and present an outlook on the future of MGI’s vision of data sharing. © 2016 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/). [http: // dx.doi.org 10.1063


I. INTRODUCTION
Materials innovations are critical to technological progress. Many authors have noted that advanced materials are so vital to society that they serve as eponyms for historical periods such as the Stone Age, the Bronze Age, the Steel Age, the Age of Plastic, and the Silicon Age. 1-3 A recent study by Magee suggests that materials innovation has driven two-thirds of today's advancements in computation (in terms of calculations per second per dollar) and has similarly transformed other technologies such as information transport and energy storage. 4 However, while the benefits of new materials and processes are well established, the difficulties in achieving these breakthroughs and translating them to the commercial market 5 are not widely appreciated. It takes approximately 20 years to commercialize new materials technologies 6 and often even longer to develop those technologies in the first place. The long time scale of materials innovation stifles investment in early stage research because payback is unlikely. 6 Thus, it is natural to wonder whether it is possible to catalyze materials design such that decades of research and development can occur in the span of years.
One contributing factor that stunts materials development is lack of information. For newly hypothesized compounds, the answers to several important questions must typically be guided only by intuition. Can the material be synthesized? What are its electronic properties? What defects are likely to form? The dearth of materials property information can cause researchers to focus on the wrong compounds or persist needlessly in optimizing materials that, even under ideal conditions, would not be able to meet performance specifications. LiFePO4, a successful lithium ion battery cathode material, serves as a good example of how overlooked opportunities stem from a lack of information. Its crystal structure was first reported 7 in 1938, but the application of LiFePO4 to lithium ion batteries was discovered only in 1997 by Padhi et al. 8 If the compound's lithium insertion voltage and its exceptional lithium mobility had been known earlier, it is likely that LiFePO4's application to batteries would have arrived one or two decades earlier. If its phase diagram had been previously charted, the synthesis conditions under which this material is made could have been more rapidly optimized.
a) Author to whom correspondence should be addressed. Electronic mail: ajain@lbl.gov
FIG. 1. Schematic to illustrate differences between standard, single-group research (left) and new opportunities afforded by large materials databases (right). The standard research cycle is considerably longer because testing and refining hypotheses typically involves several time-consuming data collection steps. In contrast, the availability of large data sets in the data-driven approach enables multiple hypotheses to be tested simultaneously through novel informatics-based approaches, reducing the burden of data collection.

II. EXAMPLES OF DATA REUSE
Materials scientists have been sharing data for several decades. For example, several longstanding crystal structure databases have recently been reviewed by Glasser 30 and their impact on materials science has been reported by Le Page. 31 Experimental researchers routinely rely on diffraction pattern databases such as the Powder Diffraction File from the International Center for Diffraction Data 32 for phase identification of their samples. Thermodynamic databases [33][34][35] and handbooks 36,37 have been used to build and refine highly successful models such as the Calphad method 27,38,39 for generating phase diagrams. Thus, data sharing in materials science certainly predates the MGI.
What is novel under the MGI and related programs is (i) the scale at which data sharing is emphasized and (ii) the rapid expansion of theory-driven databases. Data sharing is heavily encouraged and enforced, [40][41][42] and new experimental databases such as the Structural Materials Data Demonstration Project, Materials Data Facility, and Materials Atlas are being funded. Large Department of Energy research hubs are incorporating combinatorial screening, 43 high-throughput computational screening, 44,45 and data mining 46 into the discovery process. New resources are rapidly coming online: the authors knew of no database of ab initio calculations that existed a decade ago, yet a recent review by Lin 47 lists 13 such databases that exist today (many of which contain hundreds of thousands of data points).
Three databases that have received MGI support are the Materials Project (MP), 48 AFLOWlib, 49 and Open Quantum Materials Database (OQMD). 50 The MP is centered at Lawrence Berkeley National Laboratory. Distinguishing features include its large and growing user base (currently over 17 000 registered users), its emphasis on "apps" for exploring the data, and its database of elastic 51 and piezoelectric 52 tensor properties. The AFLOWlib consortium is centered at Duke University. Major features of this resource are its large number of total compounds and its applications towards alloys, spintronics, and scintillation. The OQMD is centered at Northwestern University. Distinguishing features include comprehensive calculations for popular structure types (e.g., Heusler and perovskite compounds) and a focus on new compound discovery.
Unsurprisingly, all three of these databases have been used extensively by their respective development teams. However, although these resources are each only a few years old, there is already an emerging body of literature demonstrating their usage. Next, we present some early examples of research using DFT database resources conducted without the involvement of the development team.

A. Applications of computational databases to theoretical studies
A first application of computational databases is to aid and inspire further computational studies. Indeed, database projects such as ESTEST 53 and NoMaD 54 are geared especially to aid in reproducibility, verification, and validation of computational results. Parameters documented and tested by one project [55][56][57] can, in some cases, develop into a semi-standard methodology for the community. For example, the Hubbard interaction term, pseudopotentials, and reference state energies employed by the MP for DFT calculations are often re-used by the community. [58][59][60][61][62][63][64] The atomistic modeling community has similarly established databases of interatomic potentials such as OpenKIM 1,65 and NIST Interatomic Potentials Repository Project 66 that can be recycled for new applications. Such resources can be complementary, and information from DFT databases can help parameterize interatomic potentials. For example, Pun et al. 67 used energies computed by the AFLOWlib and OQMD databases to develop a force field for the Cu-Ta system. Data from the ab initio scale can inform materials behavior at higher length scales in other ways as well. For example, Gibson and Schuh 68 employed formation energies from the MP to help establish energy scales for grain boundary cohesion. These kinds of approaches that bridge length scales form a key component of the MGI called integrated computational materials engineering (ICME). 22,23,69
A second use case of DFT databases is as an established reference against which to compare results. For example, a study by Miletic et al. used the AFLOWlib database to test their computations of the lattice parameters and magnetic moment of YNi5. 70 Romero et al. developed a new structure search algorithm and used all three database resources (MP, AFLOWlib, and OQMD) to verify that none of the newly predicted compounds had been previously recorded. 71 Sarmiento-Pérez et al., 72 74,75
One computational method that has particularly benefited from DFT materials databases is the estimation of the thermodynamic stability of new and hypothetical materials. Estimating this quantity requires calculating all known phases within the chemical space of the target phase and then applying a convex hull analysis to determine the lowest energy combination of phases at the target composition. This process, which was in the past tedious and time-consuming, can today be performed with minimal effort using the data and software tools 77 already provided by larger DFT data providers (Fig. 2). Examples include stability analysis of new photocatalysts 78,79 and other applications [80][81][82][83][84] (assisted by MP data), high entropy alloys 76 (assisted by AFLOWlib data), and dichalcogenides for hydrogen evolution 85 (assisted by OQMD data).
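The convex hull analysis described above can be illustrated for a binary A-B system. The following is a minimal sketch in plain Python, not the implementation used by any of the databases discussed here. The phase names and formation energies are invented for illustration, and the function names are hypothetical.

```python
"""Energy above the convex hull for a hypothetical binary A-B system."""


def _cross(o, a, b):
    """z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def lower_hull(entries):
    """Return the vertices of the lower convex hull, sorted by composition.

    `entries` is an iterable of (x_B, formation_energy_per_atom) pairs, where
    x_B is the atomic fraction of element B. The elemental end members
    (0, 0.0) and (1, 0.0) must be included.
    """
    # Keep only the lowest-energy entry at each composition.
    best = {}
    for x, e in entries:
        if x not in best or e < best[x]:
            best[x] = e
    hull = []
    for point in sorted(best.items()):
        # Drop earlier vertices that lie on or above the new edge.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], point) <= 0:
            hull.pop()
        hull.append(point)
    return hull


def energy_above_hull(x, energy, hull):
    """Energy of a phase relative to the hull at composition x (eV/atom)."""
    for (x0, e0), (x1, e1) in zip(hull, hull[1:]):
        if x0 <= x <= x1:
            hull_energy = e0 + (e1 - e0) * (x - x0) / (x1 - x0)
            return energy - hull_energy
    raise ValueError("composition lies outside the hull range")


# Invented formation energies (eV/atom), for illustration only.
phases = {
    "A": (0.0, 0.0),
    "A3B": (0.25, -0.10),
    "AB": (0.50, -0.40),
    "AB3": (0.75, -0.30),
    "B": (1.0, 0.0),
}
hull = lower_hull(phases.values())
for name, (x, e) in phases.items():
    print(f"{name}: {energy_above_hull(x, e, hull):.3f} eV/atom above hull")
```

In this invented data set, A3B lies above the line joining A and AB and is therefore predicted to decompose into those two phases, while the other phases sit on the hull. Systems with three or more components need a general convex hull routine rather than this one-dimensional version.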
Finally, computational materials databases can be used for property prediction and materials screening. For example, Sun et al. used the OQMD to help predict lithiation energies of MXene compounds, 86 and Gschwind et al. similarly used the MP to predict fluoride ion battery voltages. 87 Hong et al. employed MP data to screen new materials as oxygen evolution reaction catalysts. 88 Impressively, some groups have been able to download and use very large data sets in their studies. For example, Seko et al. 89 extracted 54 779 relaxed compounds as starting points from the MP to virtually screen for low thermal conductivity compounds using anharmonic lattice dynamics calculations. Tada et al. used 34 000 compounds from the MP as a basis for screening new materials as two-dimensional electrides. 90 Thus, the usage of computational databases can range from confirming a few data points to screening tens of thousands of compounds.

B. Applications of computational databases to experimental studies
Perhaps an even greater impact on materials science will come from the experimental community's usage of computational databases. A first example application is comparing theoretical and experimental data in order to verify both or fill gaps in information. For example, the MP is often used to look up materials properties such as lattice parameters, XRD peak positions, or even battery conversion voltages. [91][92][93] The comprehensive nature of these databases can make them very powerful tools when used in concert with experiment. For example, both the AFLOWlib and OQMD contain data on a large number of Heusler-type phases. This helped the research group of Nash perform multiple studies that map out the thermodynamics and phase equilibria of Heusler 94-97 and half-Heusler 98 phases.
Similar to the situation for theorists, one of the most popular uses of computational databases by experimental researchers has been to generate phase diagrams. Multiple studies 80,100-102 have employed the MP Phase Diagram App to establish whether their materials of interest are likely to form under certain conditions. For example, Martinolich and Neilson 100 note that NaFeS2 is reported (by MP) to be unstable with respect to decomposition into FeS2, FeS, and Na2S, which is consistent with their unsuccessful attempts to prepare the ternary phase. Computed phase diagrams from DFT are also useful for other purposes. For example, He et al. investigated the heterogeneous sodiation mechanism of iron fluoride (FeF2) nanoparticle electrodes by combining in situ and ex situ microscopy and spectroscopy techniques. 99 The MP Li-Fe-F phase diagram was used to interpret the observed reaction pathways and illustrate the difference between equilibrium reactions and the metastable phases that may form because of kinetic limitations (Figure 3). Another example is from the work of MacEachern et al., 103 who superimposed the results of sputtering experiments onto the MP phase diagram for Fe-Si-Zn. Similarly, Nagase et al. 104 employed the computed MP Co-Cu-Zr-B quaternary phase diagram to guide experimental exploration of amorphous phase formation. In another work, the MP phase diagram of Li-Ni-Si was used to highlight the existence of a ternary lithiated phase that could impact the performance of nickel silicides as Li-ion anode materials. 105
The calculated electronic structure (e.g., the band structure of compounds) is also frequently used in experimental studies despite the known underestimation of band gaps in standard DFT. 107   charge transport in the bulk phase and motivate the need for low-dimensional particles or thin films for solar hydrogen production (Fig. 4). 106
When considering that the role of density functional theory calculations was limited to the dominion of the theoretical physicist only two to three decades ago, these examples indicate an encouraging trend of DFT calculations becoming practical materials science design tools that are accessible to the broader research community. In alignment with the MGI vision, the examples above highlight a growing trend in which research data (whether computational or experimental) are routinely compared against past results and in which new studies are motivated or explained at least in part by data compiled by computational databases.

III. TECHNICAL ASPECTS OF DATA SHARING
With the benefits of data sharing established, we now concentrate on three technical issues in data sharing: data formats, data dissemination strategies, and data centralization. Our discussion is rooted in our experience in collaboratively building two tools: the Materials Project API (application programming interface) (MAPI), 108 which has been used by over 300 distinct users to download approximately 15 × 10^6 data points, and the MPContribs framework, 109,110 which allows researchers to contribute external data into the MP.

A. Data formats
Simulation software and experimental instruments typically output their own proprietary data file formats. Professional societies often develop custom data formats, such as the CIF (Crystallographic Information File) standard, 111 which define their own semantics and require building custom parsers to support that standard. Currently, the situation for data formats in materials science resembles a "wild west" scenario that makes it difficult for end users to work with data from multiple sources.
One strategy that can help promote standardization is to build materials-centric specifications on top of standard data formats that are fully open and cross-platform. For example, tabular data can conform to a comma-separated-values (CSV) format that is easily readable by almost all analysis packages and programming languages. Large, array-based data can be formatted as HDF5 112 or netCDF. 113 For more complex data, e.g., the representation of crystal structures or input parameters and output quantities such as band structures, options include Extensible Markup Language (XML) 114 and JavaScript Object Notation (JSON) 115 formats. Successful examples include ChemML 116 and MatML, 117 which build scientific specifications on top of the XML standard. Our preference is for the JSON format, which is smaller in size and simpler to construct. JSON documents can be directly stored in next-generation database technologies such as MongoDB, 118 which allow rich search capabilities over these documents. JSON can also be converted to and from the Yet Another Markup Language (YAML) format, which can be easily read and edited in text editors.
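As a minimal sketch of this approach, the snippet below stores a crystal structure record as a JSON document and flattens a few scalar fields to CSV using only the Python standard library. The field names (`formula`, `lattice`, `sites`, `band_gap_eV`) and all numerical values are illustrative assumptions, not a schema defined by the MP or by any standard.

```python
import csv
import io
import json

# A hypothetical record: nested data (lattice, sites) maps naturally to JSON.
record = {
    "formula": "AB",
    "lattice": {"a": 4.0, "b": 4.0, "c": 4.0,
                "alpha": 90.0, "beta": 90.0, "gamma": 90.0},
    "sites": [
        {"species": "A", "frac_coords": [0.0, 0.0, 0.0]},
        {"species": "B", "frac_coords": [0.5, 0.5, 0.5]},
    ],
    "band_gap_eV": 1.5,
}

# Serialize to JSON text and parse it back; the round trip is lossless here.
json_text = json.dumps(record, indent=2, sort_keys=True)
restored = json.loads(json_text)


def to_csv(records, fields):
    """Write the chosen top-level scalar fields of each record as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


csv_text = to_csv([restored], ["formula", "band_gap_eV"])
print(csv_text)
```

The nested lattice and site information survives in the JSON document, whereas the CSV export keeps only flat, tabular fields. This reflects the division of roles suggested above: CSV for simple tables and JSON for structured objects.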

B. Data dissemination strategies
Another technical aspect of data reuse concerns the most appropriate way to expose large data sets to the research community. Individual data providers can select from several options for exposing their data, including file download (e.g., as CSV or ZIP archive), a database dump (e.g., a MySQL dump file), or exposing an API over the web. The first two of these methods are relatively straightforward; the final method merits further clarification. An API is a protocol for interacting with a piece of software or data resource. For sharing data over the web, a common pattern is to use a REpresentational State Transfer (REST) API. 119 REST APIs are employed by several modern software companies, including Google, Microsoft, Facebook, Twitter, Dropbox, and others. In the most straightforward use case, one can imagine a REST API as mapping web URLs to data in the same way that a folder structure is used to organize files. For example, an API can be set up such that requesting the URL "https://www.materialsproject.org/rest/v2/materials/Fe2O3/vasp" returns a JSON representation of the data on all Fe2O3 compounds as computed by the VASP 120 software. However, RESTful APIs are more powerful than file folder trees. For example, the Hypertext Transfer Protocol (HTTP) request used to fetch the URL can additionally incorporate parameters. Such parameters might include an API key that manages user access control over different parts of the data or a constraint that helps in filtering the search. REST URLs can additionally point not only to data but also to backend functions. For example, the URL "https://www.materialsproject.org/rest/v1/materials/snl/submit" might trigger a backend function that registers a request to compute the desired structure embedded in an HTTP POST parameter. RESTful APIs can be intimidating to novices but can be made more user-friendly by making the URL scheme explorable and through intermediate software layers.
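The URL-to-data mapping described above can be sketched with the Python standard library. The example only builds the request and parses a made-up response body; it does not contact any server. The URL pattern is the one quoted in the text and may no longer be served. The `X-API-KEY` header name, the `limit` filter, and the shape of the response body are assumptions made for illustration.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request

BASE_URL = "https://www.materialsproject.org/rest/v2"


def build_request(formula, api_key, **filters):
    """Build (but do not send) a GET request for all entries of a formula."""
    url = f"{BASE_URL}/materials/{quote(formula)}/vasp"
    if filters:
        # Optional query parameters, e.g. constraints that filter the search.
        url += "?" + urlencode(filters)
    # The header name is an assumption; consult the provider's documentation.
    return Request(url, headers={"X-API-KEY": api_key}, method="GET")


def parse_response(body):
    """Decode a JSON response body and return its list of entries."""
    payload = json.loads(body)
    return payload["response"]


request = build_request("Fe2O3", api_key="YOUR_API_KEY")
print(request.get_method(), request.full_url)

# A made-up response body standing in for what a server might return.
fake_body = '{"response": [{"formula": "Fe2O3", "energy_per_atom": -6.7}]}'
for entry in parse_response(fake_body):
    print(entry["formula"], entry["energy_per_atom"])
```

Passing the resulting `Request` object to `urllib.request.urlopen` would send it over the network. Wrapping these two functions in a small client class is one form of the intermediate software layer mentioned above.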
We present a comparison between data dissemination methods in Table I.

C. Data centralization
A final technical aspect we discuss is the strategy for combining information from distinct data resources. In particular, data can be stored and managed by a small, large, or intermediate number of entities, which we broadly classify as "centralized," "decentralized," and "semi-centralized," respectively (Figure 5). One organization that has made significant progress in establishing a centralized data resource for materials scientists is Citrine Informatics, a company that specializes in applying data mining to materials discovery and optimization. Citrine's data sharing service benefits its core business because its proprietary data mining algorithms become more powerful with access to larger datasets. Data providers and research groups that contribute data to the Citrine database benefit from greater visibility of their work and straightforward compliance with data management requirements. Major benefits of such data centralization by industry include no cost to government funding agencies and the ability to leverage a professional software engineering team to build and support the infrastructure. A potential concern is longevity of the data, which is mitigated through an open API that allows making public copies of the data.
As depicted more generally in Figure 5(a), data contributed to Citrine's platform ("Citrination") 121 are reshaped into their custom Materials Information File (MIF) format, 122 a JSON-based materials data standard. The data are hosted on Citrine's servers and can be accessed by the public either through Citrination's web interface or through a RESTful API. The Citrination search engine today includes almost 250 data sets from various experimental and computational sources. Examples include data on thermoelectric materials, bulk metallic glasses, and combinatorial photocatalysis experiments.
A different organizational structure, depicted in Figure 5(b), is to decentralize data providers and to combine data as needed from multiple data APIs. Advantages of this technique include ease of supporting diverse data, a greater likelihood of maintaining up-to-date search results, and reducing the burden (e.g., storage, bandwidth) on any single data provider. A disadvantage is that each data provider must host and maintain its public API. The decentralized system, which is in some ways analogous to the way that Google indexes web pages hosted by many different servers, requires the development of search engines that are capable of working with multiple data resources. For example, in the chemistry community, the ChemSpider 123 search engine has had considerable success with this strategy.
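A decentralized search of this kind can be sketched as a thin aggregation layer that queries each provider and merges the results. In the snippet below, the provider functions are stand-ins that return invented values. Real providers would wrap each database's own web API, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def search_all(providers, formula):
    """Query every provider for `formula` and tag each hit with its source."""
    def query(item):
        name, fetch = item
        try:
            return [dict(hit, source=name) for hit in fetch(formula)]
        except Exception:
            # One unavailable provider should not break the whole search.
            return []

    # Providers are queried concurrently; map() preserves their order.
    with ThreadPoolExecutor() as pool:
        batches = list(pool.map(query, providers.items()))
    return [hit for batch in batches for hit in batch]


# Stand-in providers returning invented values.
def provider_a(formula):
    return [{"formula": formula, "formation_energy": -1.20}]


def provider_b(formula):
    return [{"formula": formula, "formation_energy": -1.25}]


def provider_down(formula):
    raise ConnectionError("service unavailable")


providers = {"db_a": provider_a, "db_b": provider_b, "db_c": provider_down}
results = search_all(providers, "Fe2O3")
for hit in results:
    print(hit["source"], hit["formation_energy"])
```

The sketch also shows the main cost of decentralization: the aggregator depends on every provider keeping its API available, and it must tolerate providers that are temporarily down.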
The most likely scenario is that the community will reach an equilibrium that balances both strategies, as in Figure 5(c). Larger data providers that are continuously generating new data and new features will perhaps maintain their own data service, whereas smaller data outfits that want to submit "one-time" data sets will likely contribute to a larger service.

IV. CHALLENGES AND OUTLOOK
Many of the issues specific to sharing data in materials have been outlined previously. 124,125 Here, we focus on three key challenges that require innovative solutions: parsing unstructured data, combining multiple data sources, and communicating the nuances of different data sets.
Most materials measurements are not available in a structured format. Instead, they are isolated in laboratory notebooks or embedded within a chart, table, or text of a written document. For example, routes for synthesizing materials are almost exclusively reported as text without a formal data structure. Extracting such data into structured knowledge, e.g., using natural language processing, represents a major challenge for materials research. 126
A second challenge is in combining information from multiple data sets. To illustrate this problem, consider the case of the experimental thermoelectrics data set from the work of Gaultois et al. 127 and a second computational data set from the work of Gorai et al., 128 both of which can be downloaded from each project's respective web site or from the Citrination platform. Ideally, one would like to make a comparison, e.g., to assess the accuracy of the computational data with respect to the experimental measurements. Such a study requires matching common entries between databases. As a first step, one can match the entries in each database that pertain to the same composition. However, ensuring that the crystal structures also match is usually much more challenging. In the example of the two thermoelectrics data sets, one can make a match based on the Inorganic Crystal Structure Database (ICSD) identification number. In other situations, little to no crystal structure information may be available in one or both data sets. Next, one must align the physical conditions of each entry; in this instance, the computational data are reported at 0 K, whereas the experimental measurements are reported at multiple temperatures between 300 K and 1000 K. Thus, a researcher must select the 300 K experimental data as the most relevant, keeping in mind that the remaining difference in temperature represents a potential source of error.
Finally, other material parameters must be considered: for example, the microstructure and defect concentrations of the experimental data can be very different from the pure, single-crystal model of many computations. Much of the time, such detailed information on the material is not reported or even known. The situation becomes more troublesome as data sets grow larger because it becomes increasingly time-consuming to employ manual analysis to aid in the process. Better strategies and additional metadata on each measurement are needed to confidently perform analyses across data sets.
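The matching procedure described above (pair entries by ICSD identification number, then select the experimental temperature closest to the computed conditions) can be sketched as follows. The records, identifiers, field names, and values are invented for illustration and do not come from the cited data sets.

```python
def match_entries(computed, measured, target_temperature=300.0):
    """Pair computed and measured entries that share an ICSD identifier.

    For each match, keep the measurement whose temperature is closest to
    `target_temperature`, since the computed values refer to 0 K.
    """
    by_icsd = {}
    for row in measured:
        by_icsd.setdefault(row["icsd_id"], []).append(row)

    pairs = []
    for calc in computed:
        candidates = by_icsd.get(calc["icsd_id"])
        if not candidates:
            continue  # no experimental counterpart with the same structure
        closest = min(
            candidates,
            key=lambda r: abs(r["temperature_K"] - target_temperature),
        )
        pairs.append((calc, closest))
    return pairs


# Invented records; identifiers and values are for illustration only.
computed = [
    {"icsd_id": 1001, "formula": "AB2", "value": 1.10},
    {"icsd_id": 1002, "formula": "A2B", "value": 0.40},
]
measured = [
    {"icsd_id": 1001, "formula": "AB2", "temperature_K": 300.0, "value": 0.95},
    {"icsd_id": 1001, "formula": "AB2", "temperature_K": 700.0, "value": 0.70},
    {"icsd_id": 2005, "formula": "A2B", "temperature_K": 300.0, "value": 0.55},
]
for calc, expt in match_entries(computed, measured):
    print(calc["formula"], calc["value"], expt["value"], expt["temperature_K"])
```

In this invented example, the A2B entries share a composition but carry different structure identifiers, so they are not paired. Matching on composition alone would have compared two different crystal structures.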
A third major challenge is in understanding and communicating the level of accuracy that can be expected from different characterization methods. Interpreting the nuances and evaluating the potential errors of a measurement or simulation technique is typically a subject reserved for domain experts. For example, we have observed that the error bar of "standard" DFT methods in predicting the reaction enthalpies between oxides is approximately 24 meV/atom. 129 However, the error bar is much higher (172 meV/atom) for reactions that involve both metals and oxides. More complicated still, by applying a correction strategy, one can reduce the latter error bar to approximately 45 meV/atom, 130 although different correction strategies have been proposed by different research groups. [131][132][133] Issues of this kind remain very difficult to communicate to non-specialists but are very important when advocating data reuse.
Looking forward, we expect that learning to work with data, and particularly data that are not self-generated, will become an important skill for materials scientists. The integration of computer scientists and statisticians into the domain of materials science will also be vital. The path paved by the biological sciences serves as a good model: bioinformatics and biostatistics are today well-recognized fields in their own right, and biology and health-related fields are recognized as being more multidisciplinary than materials science today. 134
One can also ask whether materials science will be impacted by the revolution in "big data." 135 While big data can be a nebulous term, reports often point to the "three V's," i.e., the volume, velocity, and variety of the data (sometimes extended to include veracity and value). 136 Other definitions of big data are that it occurs when there are significant problems in loading, transferring, and processing the data set using conventional hardware and software, or when the data size reaches a scale that enables qualitatively new types of analysis approaches. With respect to volume and velocity (rate at which data is generated), materials science has not hit the "big data" mark as compared with  137 For comparison, all the raw data stored in the MP amount to approximately 100 terabytes (a factor of 2000 less than CERN) and the final data sets exposed to users amount to approximately 1 terabyte. Data from large experimental facilities such as light sources may change this assessment in the future, 135 as might software that allows running several orders of magnitude more computations on ever-larger computing resources. [138][139][140] Although materials science data are not particularly large, they are challenging to work with in terms of the variety of data types and the complexity of objects such as periodically repeating crystals or mass spectroscopy data.
As for whether data sets will open new qualitative research avenues, the size of materials data sets is likely still too small to apply some machine learning techniques such as "deep learning." 141,142 However, more efficient machine learning algorithms for such kinds of learning are being developed 143 which hint that the coming wave in materials data will indeed open up new types of analysis methods.
Transitioning from data to insight will remain a major research topic. Much recent research has begun applying data mining techniques to materials data sets, but "materials informatics" 144 remains a nascent field of study. To propel this field forward, Materials Hackathons 145 organized by Citrine Informatics and a Materials Data Challenge hosted by NIST 146 have been announced to encourage advancement through competition. Further progress might be stimulated by incorporating data visualization and analysis tools directly into common materials database portals. For example, we envision that in the future, one will be able to log on to the MP, identify a target property, visualize the distribution of that property over known materials, extract relevant descriptors, and build and test a predictive model.
Many challenges lie ahead for uncovering the materials genome, i.e., for understanding what factors are responsible for the complex behavior of advanced materials through the use of data-driven methods. However, recent examples demonstrating that theorists and experimental researchers alike can apply online ab initio computation databases towards cutting-edge research problems are an encouraging harbinger of a new and exciting era in collaborative materials discovery.

ACKNOWLEDGMENTS
This work was funded and intellectually led by the Materials Project (DOE Basic Energy Sciences Grant No. EDCBEE). Work at the Lawrence Berkeley National Laboratory was supported by the U.S. Department of Energy Office of Science, Office of Basic Energy Sciences Department under Contract No. DE-AC02-05CH11231. We thank Bryce Meredig for discussions regarding the Citrination search platform.