Web-based machine learning models for real-time screening of thermoelectric materials properties

Michael W. Gaultois, a) Anton O. Oliynyk, Arthur Mar, Taylor D. Sparks, Gregory J. Mulholland, and Bryce Meredig b)
Department of Chemistry, University of Cambridge, Cambridge, CB2 1EW, United Kingdom
Department of Chemistry, University of Alberta, Edmonton, Alberta, T6G 2G2, Canada
Department of Materials Science and Engineering, University of Utah, Salt Lake City, Utah, 84112, USA
Citrine Informatics, Redwood City, California, 94063, USA


I. INTRODUCTION
For any materials problem, breaking out of "local optima" in composition space to discover entirely new chemistries remains a notoriously difficult challenge. 5 Many of the most notable materials classes under investigation today, from NaxCoO2-derived thermoelectrics 6 to iron arsenide superconductors, 7 were discovered fortuitously. As a result, experimental efforts often gravitate toward incrementally improving known chemistries (via doping, nanostructuring, etc.), as these efforts are more likely to bear fruit than high-risk searches through chemical whitespace for entirely new materials.
The consequence of research communities' focus on further exploitation of known chemistries rather than exploration of unknown chemistries is that much of composition space simply remains uncharacterized. We illustrate the remarkable chemical homogeneity of most thermoelectric materials investigated to date by plotting each material from the thermoelectric database of Gaultois et al. 8 on the periodic table based on the composition-weighted average of the positions of elements in the material (Fig. 1). The tight cluster of previously investigated chemistries is, as expected, dominated by chalcogenides and p-block elements such as Sn and Sb. In contrast, we also show the positions of Gd12Co5Bi and Er12Co5Bi, materials derived from our recommendation engine, which we characterize as a new class of thermoelectrics in this work. These materials are almost pure intermetallics, in sharp contrast to thermoelectric compounds investigated to date (Fig. 2). The objective of our recommendation engine is to directly enable experimental researchers to rapidly identify new materials, such as RE12Co5Bi, that are very distinct from known compound classes and worthy of further study.
a) Electronic mail: mg757@cam.ac.uk
b) Electronic mail: bryce@citrine.io
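The composition-weighted average position described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `POSITIONS` table, the choice to place lanthanides in group 3, and the helper names `parse_formula` and `weighted_position` are hypothetical and are not taken from the published workflow.

```python
import re

# (group, period) for a handful of elements. Placing the lanthanides in
# group 3 is an assumption made purely for illustration.
POSITIONS = {
    "Gd": (3, 6), "Er": (3, 6), "Co": (9, 4), "Bi": (15, 6),
    "Sn": (14, 5), "Sb": (15, 5), "Te": (16, 5), "Pb": (14, 6),
}

def parse_formula(formula):
    """Parse a simple formula such as 'Gd12Co5Bi' into {element: amount}."""
    counts = {}
    for symbol, amount in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(amount) if amount else 1.0)
    return counts

def weighted_position(formula):
    """Return the composition-weighted (group, period) of a formula."""
    counts = parse_formula(formula)
    total = sum(counts.values())
    group = sum(n * POSITIONS[el][0] for el, n in counts.items()) / total
    period = sum(n * POSITIONS[el][1] for el, n in counts.items()) / total
    return group, period

print(weighted_position("PbTe"))       # a chalcogenide, far right of the table
print(weighted_position("Gd12Co5Bi"))  # pulled toward the left by the rare earth
```

With this convention a rare-earth-rich intermetallic lands well to the left of the chalcogenides and p-block compounds, which is the qualitative separation that Fig. 1 is meant to convey.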

A materials recommendation engine
Our recommendation engine is a machine learning-based approach 9,10 for efficiently driving synthetic efforts toward promising new chemistries. We have trained a machine learning model to make a confidence level prediction of whether the (1) Seebeck coefficient, (2) electrical resistivity, (3) thermal conductivity, and (4) band gap of input materials are within acceptable ranges for thermoelectric applications. We define these ranges
FIG. 1. Most known thermoelectric materials lie in a tight cluster in composition space (black and blue dots; blue dots have chemical formulae explicitly labelled). The recommendation engine presented here allows the identification of new thermoelectric materials families that are well outside the existing composition space of common systems in the Gaultois et al. database. 8 In particular, we report the characterization of RE12Co5Bi (RE = Gd, Er; orange squares), which are chemically and structurally distinct from known thermoelectrics.
FIG. 2. The strongly intermetallic RE12Co5Bi compounds we report here lie far outside the norm for metal loading among collected thermoelectric compositions in the Gaultois et al. database. 8 The recommendation of these materials was neither the result of simple interpolation between known compounds nor obvious from a strict chemical intuition standpoint.
For each range of thermoelectric property, the engine gives a confidence score between 0% and 100% that a given material's measured value for that property at room temperature will fall within the targeted range. We would classify any material for which the answer to all these questions is likely "yes" as a potentially promising thermoelectric that may warrant further study. The purpose of our recommendation engine is thus neither to make quantitative predictions of these thermoelectric properties, nor to definitively identify record-setting compounds; these remain open challenges for future work. Rather, the engine is intended to greatly augment the chemical intuition of experimental researchers working on materials discovery. In particular, we have found that our model's ability to screen vast numbers of possible compositions and short-list interesting candidates can inspire materials syntheses that would not have been obvious a priori.
Machine learning models such as those developed here differ considerably from atomistic simulation approaches such as density functional theory (DFT). 4][15][16][17] Nevertheless, accurately predicting thermoelectric properties from first principles remains challenging. 4 Recent works, for example, use the BoltzTraP code 18 to estimate the Boltzmann transport properties of candidate materials based on DFT-predicted band structures. 19 The nascent field of materials informatics (algorithmically extracting new knowledge by mining large-scale materials databases) has emerged alongside these traditional physics-based simulations as a key means of predicting materials behavior. 20,21 The present machine learning-based recommendation engine looks for empirical, chemically meaningful patterns in experimentally reported data on known thermoelectric compounds to make statistical predictions for the performance of new materials. Further, while efforts such as the Materials Project are making the results of DFT calculations more accessible to the experimental materials community than ever before, 5 most experimentalists still are not able to run DFT calculations continually to inform their laboratory work in real time.
To make predictive computation more widely accessible, we make the results of the present work available as a web application (http://thermoelectrics.citrination.com) that any materials researcher can utilize to request real-time predictions and search for new thermoelectric candidates.

Modelling and informatics
Here we describe the approach used to construct the recommendation engine. Our engine is an example of materials informatics, 22,23 or the application of empirical machine learning methods to the prediction of materials behaviour. Any machine learning approach for materials relies on three key ingredients: training data, descriptors, and choice of algorithm. Training data are the example sets from which the machine learning approach should extract meaningful chemical trends. Descriptors are the low-level characteristics of materials (e.g., crystal structure, chemical formula, etc.) that might correlate with materials properties of interest. Specifically, descriptors are either numerical (e.g., average atomic number Z) or categorical (e.g., crystal structure = perovskite) variables that enable us to "vectorize" materials in such a way that they become amenable to machine learning techniques. Finally, learning algorithms interrogate descriptor-vectorized training data for relevant patterns.
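To make the idea of "vectorizing" a material concrete, the sketch below builds a small feature vector from one numerical descriptor (the composition-weighted average atomic number) and one categorical descriptor (a one-hot encoded structure label), mirroring the two examples given in the text. The category list and the function name `vectorize` are hypothetical, and this is not the descriptor set used by the engine.

```python
# Atomic numbers for the elements used in the examples below.
ATOMIC_NUMBER = {"Na": 11, "Cl": 17, "Sr": 38, "Ti": 22, "O": 8}

# Hypothetical list of structure categories, used only for one-hot encoding.
STRUCTURES = ["perovskite", "rock salt", "other"]

def vectorize(composition, structure):
    """Turn {element: amount} plus a structure label into a numeric vector."""
    total = sum(composition.values())
    # Numerical descriptor: composition-weighted average atomic number.
    mean_z = sum(ATOMIC_NUMBER[el] * n for el, n in composition.items()) / total
    # Categorical descriptor: 1.0 in the slot of the matching category.
    one_hot = [1.0 if structure == s else 0.0 for s in STRUCTURES]
    return [mean_z] + one_hot

print(vectorize({"Sr": 1, "Ti": 1, "O": 3}, "perovskite"))
print(vectorize({"Na": 1, "Cl": 1}, "rock salt"))
```

Each material becomes a fixed-length list of numbers, which is the form a learning algorithm needs regardless of which descriptors are ultimately chosen.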
In this work, the training set comprises a large body of experimental thermoelectric characterization data, 8 experimental materials property data from the NIMS MatNavi database, and first principles-derived electronic structure data. 5,24 These data are publicly available via the Citrination platform (http://www.citrination.com), the Materials Project API (http://www.materialsproject.org/open), and NIMS (http://mits.nims.go.jp/index_en.html).
These data consist of the Seebeck coefficients, thermal conductivities, electrical conductivities, and band gaps measured for thousands of materials as a function of temperature and a variety of other metadata conditions. Our model uses these input data to learn interesting chemical trends that could be exploited to design new materials. As large, high-quality training data sets are scarce in materials science relative to the biological sciences, where bioinformatics has become a standard tool, we urge the materials community to consider contributing to data infrastructures (Citrination, Materials Project, NIST's DSpace repository, EU's NoMaD, and others) that together will significantly expand open access to data for materials researchers.
Descriptors are the second key ingredient in materials informatics. The scientific literature around designing descriptors for materials has grown substantially in just the past several years. 25,26 Indeed, recent work has shown that the predictive power of machine learning models for materials is strongly dependent upon the selected descriptor set. 27 Our engine relies upon a tuned blend of descriptors designed in-house and drawn from a variety of sources. 4,9 By way of example, as materials scientists, we recognize that the periodic table contains a tremendous amount of information about how the elements behave and interact. We thus pre-bias our machine learning models with such knowledge (e.g., the d block of the periodic table is metallic; Li and Na are chemically very similar but not identical; and the lanthanides behave similarly in ionic compounds). This step allows us to create predictive models with data sets that have thousands (rather than tens or hundreds of thousands) of examples.
The ability of materials informatics techniques to extract signal from materials data is strongly dependent on effective descriptor design and access to large quantities of training data. With respect to the latter point, machine learning algorithms are only able to identify patterns that are (at least sparsely) sampled by the training data. An important manifestation of this requirement in the context of the present work is modeling doping. Doping represents a minute change in materials composition (on an atomic percentage basis), but may result in orders of magnitude changes in properties. As most of the training data used in this work correspond to undoped bulk compounds, we expect the recommendation engine to perform best in identifying new bulk systems of this kind, which could potentially be further optimized via doping. Given more training data, we could readily extend the current work to dilutely doped thermoelectric systems.
Finally, our recommendation engine is built using the so-called random forest algorithm. 28 This algorithm constructs a large number of decision trees, all trained on slightly different subsets of the training data. Random forest is an ensembling technique, which takes advantage of the fact that a collection of "weak" learners such as decision trees can, in concert, model extraordinarily complex nonlinear behaviour. An example rule that a single decision tree might learn is that if a material contains two elements with very different electronegativities (e.g., Na and Cl), that material is likely to have a large band gap. Of course, the thermoelectric phenomena we seek to model here are substantially more subtle, and thus a large random forest of decision trees is useful in untangling the underlying physics. We refer the reader elsewhere 4,9,29 for more detailed discussions and tutorials on how to apply random forests to materials data.
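The ensembling idea can be illustrated with a deliberately simplified, dependency-free sketch. It is not the engine's implementation: each "tree" here is a depth-1 decision stump trained on a bootstrap resample with one randomly chosen feature, whereas a full random forest grows deeper trees and samples features at every split. The toy data (electronegativity difference versus "large band gap") and all function names are hypothetical.

```python
import random

def fit_stump(rows, labels, feature):
    """Find the single-threshold rule on one feature with the fewest errors."""
    best = None
    for threshold in sorted({r[feature] for r in rows}):
        for sign in (1, -1):
            preds = [1 if sign * (r[feature] - threshold) > 0 else 0 for r in rows]
            errors = sum(p != y for p, y in zip(preds, labels))
            if best is None or errors < best[0]:
                best = (errors, threshold, sign)
    _, threshold, sign = best
    return feature, threshold, sign

def fit_forest(rows, labels, n_trees=200, seed=0):
    """Train many stumps, each on a different bootstrap resample of the data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]  # sample with replacement
        feature = rng.randrange(len(rows[0]))           # random feature per stump
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest

def confidence(forest, row):
    """Fraction of stumps voting that the property is in the target range."""
    votes = sum(1 if sign * (row[f] - t) > 0 else 0 for f, t, sign in forest)
    return votes / len(forest)

# Toy data: one feature (electronegativity difference), label 1 = large band gap.
rows = [[0.1], [0.3], [0.5], [1.8], [2.1], [2.3]]
labels = [0, 0, 0, 1, 1, 1]
forest = fit_forest(rows, labels)
print(confidence(forest, [2.2]), confidence(forest, [0.2]))
```

Because each stump sees a slightly different resample, the vote fraction behaves like a confidence score between 0 and 1 rather than a hard yes/no answer, which is the kind of output the engine reports for each property.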

Model validation
We visualize the accuracy of our recommendation engine's predictions in Fig. 3, which represents the results of leave-one-out cross-validation (LOOCV) on our training data (in the case of the band gap data, we performed LOOCV on a subset of the extremely large training set). For each material in our training set and each property, the recommendation engine gives a confidence score between 0 and 1 that the property value falls within the ideal windows we have defined for thermoelectric applications. Errors approaching +1 represent false negatives (our engine was extremely confident the material would be poor for that property, but the property is actually good); and an error of −1 is a false positive (our engine was extremely confident the material would be good for that property, but the property is actually poor). The peak around 0 for each property shows that the engine generally gives confidence values very close to unity for materials possessing properties in the desired ranges, or close to zero for materials whose property values fall outside the target range.
In the LOOCV procedure, if we have n total measurements of a particular property such as thermal conductivity, we train our machine learning model on n − 1 of these values and predict the nth (left out) value. We perform one training step and prediction for each property value, and present the error distribution for all n values in Fig. 3. The error distribution then provides us with a sense of how we may expect the model to perform on new materials of which we have no prior knowledge.
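The LOOCV loop can be written generically, as in the sketch below. The error is taken here as the actual label (1 if the property is in the target window, 0 otherwise) minus the predicted confidence, an assumption chosen so that +1 corresponds to a false negative and −1 to a false positive, as in the text. The 1-nearest-neighbour stand-in model and the toy data are hypothetical and serve only to make the example runnable.

```python
def loocv_errors(rows, labels, fit, predict):
    """Train on n - 1 samples, predict the left-out one, repeat n times."""
    errors = []
    for i in range(len(rows)):
        train_rows = rows[:i] + rows[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        model = fit(train_rows, train_labels)
        # actual minus predicted: +1 is a false negative, -1 a false positive
        errors.append(labels[i] - predict(model, rows[i]))
    return errors

# Stand-in model (an assumption, not the engine): 1-nearest-neighbour lookup.
def fit_nearest(rows, labels):
    return list(zip(rows, labels))

def predict_nearest(model, row):
    nearest = min(model, key=lambda pair: abs(pair[0][0] - row[0]))
    return nearest[1]

rows = [[0.1], [0.3], [0.5], [1.0], [1.8], [2.1], [2.3]]
labels = [0, 0, 0, 1, 1, 1, 1]
errors = loocv_errors(rows, labels, fit_nearest, predict_nearest)
print(errors)
```

In this toy set only the sample at 1.0 is misjudged: its nearest neighbour is a negative example, so it shows up as a single false negative, while all other errors are zero. A histogram of such errors over a real training set is what Fig. 3 displays.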
Fig. 3 indicates that our engine generally makes very reliable assessments of thermoelectric materials properties. The modes of the error distributions are in each case close to 0. For each property, the engine's errors skew toward false negatives (resistivity, band gap, thermal conductivity) or false positives (Seebeck), which reflects the fact that the underlying training data do not contain equal fractions of positive and negative examples. Seebeck coefficients prove most difficult to assess (i.e., the error distribution for that property has the largest standard deviation), likely because strikingly different mechanisms underpin the values in, for example, strongly correlated oxides as opposed to degenerate semiconductors. Owing to the difficulty in assessing the Seebeck coefficient, initial predictive models using only the electrical resistivity, thermal conductivity, and Seebeck coefficient produced too many candidates that were good metals with poor Seebeck coefficients. To remedy this shortcoming and provide more robust recommendations, the band gap was added as a secondary metric, where we determine the probability that a given composition will have a non-zero band gap.

Experimental details
RE12Co5Bi (RE = Gd, Er) samples were made by arc-melting freshly filed Er or Gd pieces (99.9%, Hefa), Co powder (99.8%, Cerac), and Bi powder (99.999%, Alfa Aesar). Stoichiometric mixtures (0.5 g total mass) with 5% to 7% excess Bi were pressed into pellets and melted twice in an arc-melting furnace under argon atmosphere (Edmund Bühler Compact Arc Melter MAM-1). The total mass loss after melting was < 1%. The samples were sealed in silica tubes and annealed at 1070 K for one week, then quenched in cold water. To produce enough material for physical property measurement, ∼70 samples of each compound were prepared, and pure samples were combined by melting into a single ingot of ∼5 g, which was sanded to yield the appropriate geometry (either a rectangular bar or a cylinder). Density was measured using Archimedes' method; the final pellets had densities 100% of the single crystal values (ρ(Gd12Co5Bi) = 8.6 g/cm³, ρ(Er12Co5Bi) = 9.9 g/cm³).
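As a worked illustration of the batch arithmetic, the sketch below computes the element masses for a 0.5 g stoichiometric Gd12Co5Bi charge with 5% excess Bi. It assumes that the 0.5 g refers to the stoichiometric mixture before the excess is added, which the text does not state explicitly; the function name `batch_masses` is hypothetical, and the atomic masses are standard values.

```python
# Standard atomic masses in g/mol.
ATOMIC_MASS = {"Gd": 157.25, "Er": 167.259, "Co": 58.933, "Bi": 208.980}

def batch_masses(formula_units, total_mass, excess=None):
    """Mass of each element (g) for a charge of the given stoichiometric mass.

    excess maps an element to a fractional surplus, e.g. {"Bi": 0.05}.
    Assumption: total_mass is the mass before any excess is added.
    """
    excess = excess or {}
    molar_mass = sum(ATOMIC_MASS[el] * n for el, n in formula_units.items())
    masses = {}
    for el, n in formula_units.items():
        stoichiometric = total_mass * ATOMIC_MASS[el] * n / molar_mass
        masses[el] = stoichiometric * (1.0 + excess.get(el, 0.0))
    return masses

plain = batch_masses({"Gd": 12, "Co": 5, "Bi": 1}, total_mass=0.5)
masses = batch_masses({"Gd": 12, "Co": 5, "Bi": 1}, total_mass=0.5,
                      excess={"Bi": 0.05})
for element, grams in masses.items():
    print(f"{element}: {grams:.4f} g")
```

The rare earth dominates the charge by mass, while the Bi portion is only a few tens of milligrams, so a 5% to 7% surplus corresponds to a very small absolute amount.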
Powder X-ray diffraction patterns were collected using an INEL CPS 120 diffractometer with Cu Kα1 radiation at room temperature, and Rietveld refinement was used to confirm the structure and phase purity (see Supporting Information). 30 Backscatter electron microscopy and elemental analysis via energy dispersive X-ray spectroscopy (EDX) were performed with a JEOL JSM-6010LA InTouchScope scanning electron microscope. Backscatter micrographs reveal the samples are largely compositionally homogeneous (see Supporting Information). 30 Quantitative elemental analysis on several polished pieces found an atomic composition of Gd69(2)Co26(2)Bi5(2), which is in good agreement with the expected RE12Co5Bi composition. Er12Co5Bi samples were not appropriate for quantitative analysis because of overlapping Co Kα (6.924 keV) and Er Lα (6.947 keV) lines.
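The agreement between the EDX result and the nominal composition can be checked with simple arithmetic, sketched below. The measured values and uncertainties are those quoted in the text; expressing the difference in units of the quoted uncertainty is one illustrative way to judge agreement, not necessarily the authors' criterion.

```python
def atomic_percent(formula_units):
    """Convert formula units such as {'Gd': 12, 'Co': 5, 'Bi': 1} to at.%."""
    total = sum(formula_units.values())
    return {el: 100.0 * n / total for el, n in formula_units.items()}

expected = atomic_percent({"Gd": 12, "Co": 5, "Bi": 1})

# (value, uncertainty) in at.% from the EDX result Gd69(2)Co26(2)Bi5(2).
measured = {"Gd": (69, 2), "Co": (26, 2), "Bi": (5, 2)}

deviation_in_sigma = {
    el: abs(value - expected[el]) / sigma
    for el, (value, sigma) in measured.items()
}

for el in measured:
    print(f"{el}: expected {expected[el]:.1f} at.%, "
          f"deviation {deviation_in_sigma[el]:.2f} sigma")
```

The nominal 12:5:1 stoichiometry corresponds to roughly 66.7, 27.8, and 5.6 at.% for Gd, Co, and Bi, and each measured value lies within two quoted uncertainties of its nominal value.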
High-temperature thermoelectric properties (electrical resistivity and Seebeck coefficient) were measured with an ULVAC Technologies ZEM-3. Sample bars had approximate dimensions of 9 mm × 4 mm × 4 mm. Measurements were performed with a helium under-pressure, and data were collected from 300 K to 800 K through three heating and cooling cycles over 18 hours to ensure sample stability and reproducibility.

III. DISCUSSION
In this work, we are interested not only in developing a model that gives accurate predictions of materials properties, but also in making it immediately accessible and useful for experimental researchers. To that end, we have published our recommendation engine as a web app at http://thermoelectrics.citrination.com, where researchers may explore a pre-computed list of around 25 000 known compounds (representing a sizable subset of the Inorganic Crystal Structure Database, or ICSD), and also use our model to evaluate their own materials candidates in real time. In this way, we hope that the app serves as a rapid triage tool for ideas for potential new thermoelectric materials.
This adds to a growing toolbox of computational tools designed to be a user-friendly aid to experimental workers, such as TEDesignLab and the Materials Project. 5,31 Our pre-computed list may be arranged according to the probabilities associated with any one of the four properties we are modelling, and is sorted by default according to a composite score that takes all four properties into account. Furthermore, the user may specify cutoff thresholds for any of the properties, and thereby greatly reduce the size of the list.
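The sorting and filtering behaviour described above can be sketched as follows. The formula for the composite score is not given in the text, so the product of the four confidences is used here purely as an assumption; the candidate labels "A", "B", "C" and their scores are hypothetical placeholders, not output of the engine.

```python
PROPERTIES = ("seebeck", "resistivity", "thermal_conductivity", "band_gap")

def composite(scores):
    """Assumed composite: product of the four per-property confidences."""
    result = 1.0
    for name in PROPERTIES:
        result *= scores[name]
    return result

def shortlist(candidates, cutoffs=None, sort_by=None):
    """Drop candidates below any cutoff, then sort in descending order."""
    cutoffs = cutoffs or {}
    kept = [c for c in candidates
            if all(c["scores"][p] >= v for p, v in cutoffs.items())]
    if sort_by:
        key = lambda c: c["scores"][sort_by]
    else:
        key = lambda c: composite(c["scores"])
    return sorted(kept, key=key, reverse=True)

# Hypothetical candidates with made-up confidence scores.
candidates = [
    {"name": "A", "scores": dict(zip(PROPERTIES, (0.90, 0.80, 0.70, 0.90)))},
    {"name": "B", "scores": dict(zip(PROPERTIES, (0.50, 0.95, 0.90, 0.60)))},
    {"name": "C", "scores": dict(zip(PROPERTIES, (0.95, 0.30, 0.80, 0.90)))},
]

print([c["name"] for c in shortlist(candidates)])
print([c["name"] for c in shortlist(candidates, cutoffs={"resistivity": 0.5})])
print([c["name"] for c in shortlist(candidates, sort_by="seebeck")])
```

A multiplicative composite penalizes any single weak property, which matches the idea that a promising candidate should answer "yes" for all four properties, but other combinations (for example a weighted sum) would be equally easy to substitute.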
As we believe our extensive pre-computed list contains some interesting and heretofore uncharacterized candidate thermoelectric materials, we now comment on a select set of high-ranking compounds. Several of these compounds are given in Table I.
TaVO5 and TaPO5 adopt a crystal structure analogous to that of the phosphate tungsten bronzes. 32,33 These materials can be expected to have good thermoelectric performance given the heavy atoms, the potential for low electrical resistivity provided by the repeating ReO3-type structural network that is highly connected in three dimensions, and the intrinsic crystallographic shear provided by the crystal structure. Although the phosphate tungsten bronzes themselves are not highly rated, their metallic electrical transport properties are encouraging for structural analogues. 34 Moreover, TaVO5 has a negative coefficient of thermal expansion and a structural transition at 600 °C. 35 This structural transition may lead to softening of phonon modes and anharmonic scattering, which may in turn lead to low thermal conductivity.
Other interesting suggestions to come from the recommendation engine are Tl9SbTe6, Ba2Pb, and FeAs2. Although none of these compounds were included in the thermoelectric database, they all scored highly within the recommendation engine. ][38] The suggestion of TaAlO4, SrCrO3, TaSbO4 and other oxides expected to be insulators can be understood because the training data come from references that primarily report stoichiometric formulas rather than doping details. 39,40 Nevertheless, with doping through substitution or reduction, these compounds may exhibit moderate electrical performance. Further, these materials all feature extended structures that are highly connected in three dimensions, an important feature for low electrical resistivity. Moreover, the large mass contrast on the cation sublattice in TaAlO4 (edge-shared TaO6 and AlO6 octahedra) could lead to low thermal conductivity, and previous reports have shown that SrCrO3 is metallic when synthesized under pressure. 41
Many of the high-ranking candidate materials are interesting because of their highly connected extended structures, even though the recommendation engine does not use features of crystal structure to make its suggestions. The chief disadvantage to training prediction algorithms using crystal structure is that structure then becomes a required input for making predictions, and yet structure is by definition not available for uncharacterized materials. However, the absence of crystal structure does cause our engine difficulty where changes in crystal structure with similar elemental compositions cause large changes in physical properties. For example, both DyPO4 and LaPO4 are predicted to have low thermal conductivity. However, LaPO4 is monazite, a corner edge-shared structure, whereas DyPO4 is xenotime, 42 an edge-shared structure leading to inherently higher thermal conductivity. 43

New materials and their properties
Our final and most important task in this work is to demonstrate that our recommendation engine can indeed guide researchers toward interesting experimental discoveries. Among the set of high-scoring candidate materials, we selected Er12Co5Bi and Gd12Co5Bi to characterize as thermoelectric materials because of their facile synthesis through arc melting, and because they are chemically quite distinct from known thermoelectrics (Fig. 1). While the RE12Co5Bi (RE = rare earth) family of compounds has only been sparsely studied in the literature, their crystal structure and initial low-temperature electrical and magnetic properties have been reported by Mar and coworkers. 44 The crystal structure of RE12Co5Bi is shown in Figure 4.
Interestingly, the crystal structure of our candidate thermoelectric exhibits notable similarity to the structures of known thermoelectrics, in spite of the fact that crystal structure was not an input feature for our recommendation engine. Ho12Co5Bi is the eponymous structure prototype (orthorhombic, space group Immm) adopted by a series of rare-earth intermetallics RE12Co5Bi (RE = Y, Gd, ..., Tm). In this structure, the Ho12Bi icosahedra play an analogous role to the LaP12 icosahedra in the filled skutterudite prototype LaFe4P12; rare-earth atoms "rattling" within their 12-fold coordinated cages is the idiosyncratic feature of filled skutterudites that imparts the low thermal conductivity so prized in thermoelectric materials. In fact, if the transition metal atoms, which occupy different sites in these structures, are disregarded, the Ho12Bi framework is an antitype to the LaP12 framework, with the roles of the rare-earth and group 15 elements reversed. We hypothesize that its crystallographic similarity to skutterudite could be partly responsible for the thermoelectric behaviour of RE12Co5Bi (RE = Gd, Er).
TABLE I. Several promising new thermoelectric compounds selected from our pre-computed list. The P values refer to the engine's confidence level that a given material will exhibit a room-temperature value for a particular property (e.g., S or ρ) within the target ranges specified above. The full compound list is available for exploration at http://thermoelectrics.citrination.com.

We give a full thermoelectric characterization of Er12Co5Bi and Gd12Co5Bi in Fig. 5. Based on these results, we report the discovery of a new thermoelectric class, which remains a completely unoptimized, pure bulk material and thus lends itself to further study. Notably, the material falls far outside the usual search space for thermoelectrics (Fig. 1 and Fig. 2), and was neither the result of simple interpolation between known compounds nor obvious from a strict chemical intuition standpoint. The electrical resistivity is commensurate with other high-performing materials such as chalcogenides, although the Seebeck coefficient is too low for the material to be competitive with the best-known thermoelectrics. Furthermore, the thermal conductivity is relatively high, but the filled cage structure lends itself to substitution that has successfully reduced thermal conductivity in the skutterudite systems. 3,45 In RE12Co5Bi (RE = Gd, Er), the thermal conductivity from 300 K to 800 K ranges from 4 W m⁻¹ K⁻¹ to 8 W m⁻¹ K⁻¹, comparable to the half-Heuslers. 46,47 Note that these results are consistent with the engine's predictions (Fig. 5); the models give a high probability of achieving the thresholds for electrical resistivity (a) and thermal conductivity (c) (see confidence bar insets), while also suggesting a low probability of observing a large Seebeck coefficient (b). The electrical performance figure of merit κzT is around 0.03 W m⁻¹ K⁻¹ at 400 K, which is actually higher than that of nearly 30% of the thermoelectrics in the Gaultois et al. thermoelectrics database; 8 of course, the database is a highly self-selected set of materials, consisting of literature-reported thermoelectrics, and would skew toward much higher κzT values than would a random subset of all crystalline materials. We note, of course, that the zT of several other thermoelectric materials can be significantly improved through carrier concentration tuning and microstructural engineering. For example, undoped polycrystalline Si has a 60-fold increase in performance after optimization, going from zT < 0.01 to 0.6 at 300 K. 48
Another observation from Fig. 5 illustrates the scientific boon of studying entirely new classes of materials. Unexpectedly, RE12Co5Bi (RE = Gd, Er) exhibits increasing thermal conductivity with temperature. (We note the recommendation engine successfully chose a material with a low thermal conductivity at room temperature, which would normally decrease with increasing temperature.) The increasing electrical resistivity with temperature indicates metallic electrical transport, so the electrical contribution to the total thermal conductivity should decrease with increasing temperature. Additionally, the phonon contribution to thermal conductivity should also decrease with increasing temperature due to more phonon-phonon (Umklapp) scattering. 49 Thermal conductivity is calculated from the following relation: κ = α ρ C_p, where α is thermal diffusivity, C_p is heat capacity, and ρ is density. Normally, thermal diffusivity has a negative temperature dependence whereas heat capacity and density both have positive temperature dependence. However, for this compound we observe a positive temperature dependence for the thermal diffusivity even after multiple measurements, the origin of which is not presently understood. Materials with increasing thermal conductivity with temperature are rare, though not unprecedented, 50,51 and further studies on this class of compounds to shed light on this anomaly could thus lead to new strategies for thermoelectric materials optimization.
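The two quantities discussed above can be computed as in the following sketch. The first function implements the relation κ = α ρ C_p quoted in the text. The second assumes the standard definition zT = S²T/(ρ_el κ), which is not written out in the text, so that κzT = S²T/ρ_el; the electrical resistivity is named `resistivity` to avoid confusion with the density ρ. The numerical inputs are round illustrative values, apart from the density of 8.6 g/cm³ quoted for Gd12Co5Bi, and are not measured data.

```python
def thermal_conductivity(diffusivity, density, heat_capacity):
    """kappa = alpha * rho * C_p.

    SI units: m^2/s * kg/m^3 * J/(kg K) gives W/(m K).
    """
    return diffusivity * density * heat_capacity

def kappa_zT(seebeck, resistivity, temperature):
    """kappa * zT = S^2 * T / rho_el, assuming zT = S^2 T / (rho_el * kappa).

    seebeck in V/K, resistivity in ohm m, temperature in K; result in W/(m K).
    """
    return seebeck ** 2 * temperature / resistivity

# Illustrative inputs only (hypothetical diffusivity and heat capacity).
kappa = thermal_conductivity(diffusivity=2.0e-6,   # m^2/s
                             density=8600.0,       # kg/m^3, i.e. 8.6 g/cm^3
                             heat_capacity=250.0)  # J/(kg K)
print(f"kappa = {kappa:.2f} W/(m K)")

# Hypothetical electrical inputs: S = 10 uV/K, rho_el = 1 uOhm m, T = 400 K.
print(f"kappa*zT = {kappa_zT(1.0e-5, 1.0e-6, 400.0):.3f} W/(m K)")
```

Because κzT depends only on the Seebeck coefficient, the electrical resistivity, and temperature, it isolates the electrical performance of a material from its thermal conductivity, which is why it is a convenient basis for comparing against database entries.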

IV. CONCLUSIONS
This initial experimental validation of our recommendation engine is encouraging. The present work represents the first time that machine learning has been used to suggest an experimentally viable new compound from true chemical white space, where no prior characterization had hinted at promising chemistries. The implication is that our approach, wherein a data-driven computational tool directly augments experimental capabilities and intuition, is a semi-rational way to discover new materials families that may have desirable properties. We suggest that such a paradigm could eventually replace trial-and-error and fortuity in the search for new materials across a wide variety of application areas.

FIG. 3. Leave-one-out cross-validation error histograms for the four key properties estimated by our recommendation engine: (a) Seebeck coefficient; (b) electrical resistivity; (c) thermal conductivity; and (d) band gap. For each material in our training set and each property, the recommendation engine gives a confidence score between 0 and 1 that the property value falls within the ideal windows we have defined for thermoelectric applications. Errors approaching +1 represent false negatives (our engine was extremely confident the material would be poor for that property, but the property is actually good); and an error of −1 is a false positive (our engine was extremely confident the material would be good for that property, but the property is actually poor). The peak around 0 for each property shows that the engine generally gives confidence values very close to unity for materials possessing properties in the desired ranges, or close to zero for materials whose property values fall outside the target range.
FIG. 4. (a) Crystal structure of RE12Co5Bi (prototype Ho12Co5Bi), of which Er12Co5Bi and Gd12Co5Bi are exemplars. (b) Crystal structure of the filled skutterudites, which have the generic chemical formula AM4X12. These two structure types share an icosahedral motif consisting of RE12Bi and AX12 units, respectively.

FIG. 5. Thermoelectric characterization of RE12Co5Bi (RE = Gd, Er). (a) Electrical resistivity, (b) Seebeck coefficient, (c) thermal conductivity, and (d) thermoelectric figure of merit zT as a function of temperature. We also include the recommendation engine's confidence levels for the first three properties; the lowest-probability property, the Seebeck coefficient, is indeed found to be below the 100 µV K⁻¹ threshold.