Assessing the frontier: Active learning, model accuracy, and multi-objective candidate discovery and optimization

Discovering novel chemicals and materials can be greatly accelerated by iterative machine learning-informed proposal of candidates—active learning. However, standard global error metrics for model quality are not predictive of discovery performance and can be misleading. We introduce the notion of Pareto shell error to help judge the suitability of a model for proposing candidates. Furthermore, through synthetic cases, an experimental thermoelectric dataset and a computational organic molecule dataset, we probe the relation between acquisition function fidelity and active learning performance. Results suggest novel diagnostic tools, as well as new insights for the acquisition function design.


I. INTRODUCTION
Accelerated design, optimization, and tuning of chemicals and materials via machine learning is receiving increasing interest in science and industry. A major driver of this interest is the potential to reduce the substantial cost and effort involved in manual development, synthesis, and characterization of large numbers of candidates. The primary aim is to reduce the number of both failed candidates and development cycles.
A data-driven approach to achieve this acceleration is active learning (AL), 1 an iterative procedure in which a machine-learning model suggests candidates, a selection of which are synthesized, characterized, and fed back into the model to complete a learning iteration. The objective of this procedure varies; in chem-and materials-informatics, the objective is often to identify promising candidates by optimizing properties of interest. Recent work has leveraged AL for candidates when there is a single objective. [2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17] However, new issues arise when optimizing multiple objectives simultaneously, as is frequently the case beyond proof-of-principle settings.
Furthermore, the aim to identify promising candidates is distinct from the aim to improve model accuracy, a frequent objective in AL. 18 In fact, model accuracy can be at odds with acquiring optimal candidates: Figure 1 demonstrates that the usual global notion of model accuracy is not necessarily associated with optimal chemicals and materials discovery. In this work, we introduce a notion of model accuracy more closely associated with rapidly discovering new candidates.
We offer two primary contributions: First, we introduce novel concepts to judge the performance of multi-objective AL, with FIG. 1. Best global error does not guarantee optimal candidate discovery. Shown are global prediction errors (MNDE, top) and number of optimal candidates found (bottom). The MPJE decision criterion (dashed line) has lower errors but leads to fewer optimal candidates than the MPND criterion (solid line). Simulated active learning on a thermoelectric dataset. Estimated mean performance over repetitions (Sec. IV). MNDE = Mean Non-Dimensional Error (7), MPJE = Maximum Probability of Joint Exceedance (11), and MPND = Maximum Probability Non-Dominated (13).
an aim toward the specific concerns of candidate discovery. In addition to the usual notion of multi-objective optimality 19 -nondominance-we use criteria informed by the concept of stratarecursive non-dominance-from the database literature. 20 We also introduce scoped error metrics, which emphasize regions of interest in performance (property) space, particularly bands about the Pareto frontier we call Pareto shells. We demonstrate that the usual global error provides less usable signal for candidate discovery performance, while Pareto shell error-under certain conditionscorrectly signals when AL will likely identify performant candidates. Second, we compare multi-objective acquisition functions (also known as improvement criteria) in terms of their quantitative and qualitative performance. Specifically, we compare a collection of exploitation/exploration-navigating acquisition functions: Probability of Joint Exceedance (PJE), Hyperplane Probability of Improvement (HPI), and Probability Non-Dominated (PND).
These acquisition functions differ in terms of the fidelity with which they represent the Pareto frontier, allowing us to study how this affects AL performance.
In this work, we review AL for single-objective candidate discovery and relevant concepts from multi-objective optimization and synthesize scientifically relevant concepts for judging AL performance. We introduce a family of acquisition functions (decisionmaking strategies) and compare their performance across several synthetic datasets and a real thermoelectric dataset. Our results illustrate how new concepts can help articulate the difference between AL strategies and how model accuracy does-and does not-relate to AL performance in discovering novel candidates.

II. ACTIVE LEARNING FOR SINGLE-OBJECTIVE CANDIDATE DISCOVERY
AL is a specialized problem setting in machine learning related to the optimal experimental design. 1,11,21 In the chemical and materials science context, consider a description x of a candidate with corresponding observed property of interest y, thought to be linked by an unknown function f : x ↦ y, which is expensive to evaluate, for example, synthesizing and characterizing a candidate. To systematically identify novel candidates x with desirable changes in y, a statistical (machine-learning) modelf is built that predicts the unknown function f. Trained on an initial set of characterizedX 0 = {x 1 , . . . , xn 0 } with measured propertiesỸ 0 = {y 1 , . . . , yn 0 }, the initial modelf 0 is used to select new candidates x n 0 +1 , . . . , xn 0 +n 1 . These are characterized and added to the dataset,X 1 = {x 1 , . . . , xn 0 +n 1 }, which is then used to train a new improved modelf 1 . This results in a sequence of datasetsX k ,Ỹ k for k = 0, . . ., K. While this cycle can, in principle, be fully automated, new candidates are usually selected by domain experts based on AL suggestions ("human-in-the-loop").
Various objectives can drive the design of AL strategies; for instance, a common objective in general machine learning is to improve the modelf . In informatics, the objective is often to identify an improved material x using as few physical experiments as possible. 6 We call the design space of experimentally accessible, that is, synthesizable and measurable, candidates the global scope and call the error computed on this scope global error.
For a single objective, measuring the relative performance of candidates is straightforward (Fig. 2) as they can be ranked unambiguously based on their scalar performance, for example, the strength-to-weight ratio for structural alloys. One way to measure the performance of AL is to retrospectively simulate AL on a known dataset and count the number of AL iterations necessary to reach a known top candidate. 6 A key decision in designing an AL method is the choice of acquisition function. We will discuss these in greater detail in Sec. III; briefly, an acquisition function is a decision rule used to rank potential candidates in an AL context. Such criteria navigate the "exploration-exploitation" trade-off: 22 The algorithm should seek improved candidates but should also try "risky" candidates to improve its model and enable later discoveries. 18 Many improvement criteria for the single-objective case have been proposed and tested in the literature. 2,6,[14][15][16]23 The multi-objective setting, however, requires a more nuanced understanding of candidate ranking.

FIG. 2.
Multi-objective optimization requires ranking concepts beyond "greater" and "lesser." Illustration of candidate ranking settings (top) and an example multi-objective frontier (bottom). In the single-objective setting, a relative ranking between candidates is always possible via the total ordering induced by the single objective. However, in the multi-objective setting, two candidates can be neither greater nor lesser than the other if they are alternately dominant along different objectives-two distinct candidates can be non-dominated. Recognizing this issue leads to the notion of a non-dominated set of mutually incomparable candidates-the Pareto frontier (see Fig. 12

A. Multi-objective optimization background
In multi-objective optimization, the categories of "greater" and "lesser" are insufficient (Fig. 2). Since two candidates can compete along multiple axes, it is possible for them to be mutually non-dominated. Multiple objectives occur naturally in chemical and materials science problems and are often unavoidable, e.g., the strength-toughness trade-off, which arises from fundamental, competing effects. 24 In lieu of more preference information, one must navigate the resulting multi-objective trade-off space. Having access to more candidates in this trade-off space enables more complete scientific understanding and more informed engineering decisions. In this section, we review concepts from multi-objective optimization and introduce acquisition functions for the multi-objective setting.

Dominance and strata
In the single-objective case, candidates y, y ′ ∈ R can be unambiguously ranked: A candidate y is either lesser y < y ′ , greater y > y ′ , or equal y = y ′ to another candidate y ′ . However, the introduction of multiple objectives y, y ′ ∈ R D introduces new complexities. Figure 2 illustrates the universe of possible comparisons; in addition to lesser, greater, and equal, in the multi-objective setting, there exist non-dominated points 19 (Chap. 12). Dominance (in the Pareto sense) is a pairwise relationship; the candidate y is said to dominate y ′ if (1) We use the notation y ′ ≺ y to denote that y dominates y ′ . Note that the definitions above pre-suppose that optimization is posed in terms of maximization of all objectives. This is not a limitation-if minimization is desired for a given axis y d , then one must simply reverse the relevant inequalities in definition 1. Furthermore, proximity to a desired value can be encoded as minimization of the absolute distance from the target value, e.g., bandgap as close as possible to 1.2 eV. In the multi-objective setting, the possibility of non-dominance implies that a single "best" multi-objective output value y may not exist. Instead, there may be a set of "best values." Given a set of candidates A ⊆ R D , the set of non-dominated points is called the Pareto frontier, defined by The Pareto frontier P(A ) represents the set of trade-offs one must navigate in choosing an optimization candidate. Further selection must be made through external considerations: possibly a prioritization of the objectives or through harder-to-quantify concerns such as the corrosion resistance of a material 25 (Chap. 5). While the Pareto frontier is an important set of candidates, points outside the frontier are not without utility, particularly if aforementioned external concerns exist. Candidates near the frontier can be useful as training data for a machine learning model, while measurement or model uncertainties may lead to the false classification of a point as dominated. To describe points near the Pareto frontier, we use the notion of strata.

ARTICLE scitation.org/journal/jcp
The strata are defined via a recursive relationship. 20 Let A be a set of candidates as above and define the s-th stratum Ss via Figure 3 illustrates a few strata on a thermoelectric dataset (introduced in Subsection IV A). Note that by definition, we have Si ∩ Sj = ∅ if i ≠ j. This allows us to define a stratum number for each point y ∈ A via

FIG. 3. A recursive generalization of non-dominance (strata) enables a definition of scope tailored for optimal material candidates.
Example of strata (left) and a Pareto 2-shell (right) for the thermoelectric dataset, considering only ρ, κ for preference ranking (both minimized). The first stratum corresponds to the Pareto frontier, while higher strata are the Pareto frontier of the previous frontier's remainder [Definition (5)]. Considering all strata up to a number s yields the Pareto s-shell. On the right, we visualize an example 2-shell, which consists of all the points in the first two strata. Below, we use the concept of Pareto shells to define an error scope relevant to AL for candidate discovery.
The candidates along the Pareto frontier then have s(y) = 1, while points with a larger stratum number lie further from it. We will use the stratum number to rank the performance of AL with greater resolution than counting frontier points alone.
We also define the Pareto s-shell via This definition allows one to select a "band" of points along and near the Pareto frontier. Below, we will use the Pareto shell Ps as a targeted scope for computing model accuracy. We use the nomenclature s-shell to denote Ps: Figure 3 illustrates a Pareto 2-shell for the thermoelectric dataset.
B. Proposed performance measures

Candidate improvement performance
Prior work assessing AL in informatics has focused primarily on the number of Pareto frontier points acquired during AL. 6,26 While relevant, this "all or nothing" measurement of success provides no accounting for candidates near the Pareto frontier, which may still be of scientific interest. This is particularly troublesome when studying datasets that have very few Pareto frontier points, such as the "sparse" frontier we will consider below. To provide a measure with more granularity, we consider the mean stratum number at each iteration of ALỸ k using the ground truth strata from the full dataset.

Model accuracy performance
In addition to the candidate improvement performance of AL, we also consider trajectories of model performance. We use the notion of non-dimensional error (NDE) across each of the output quantities; given a set of true =Ŷ response values, we compute the NDE for the output d via where y d = 1 n ∑ n i=1 y i,d is the sample mean. If we have D output quantities, then there are D NDE values to compute. Note that the NDE is closely related to the coefficient of determination R 2 via the expres- 27 Since the NDE is dimensionless, we may safely average NDE values across the D outputs to compute a mean non-dimensional error (MNDE), given by Given the retrospective nature of our test cases, the global error is naturally evaluated by (6) across the entire dataset. However, this global scope includes regions of output space that are not emphasized in our frontier-seeking context. To compute error metrics with targeted scope, we use the notion of a Pareto shell to compute error on the relevant portions of the output domain (Fig. 3). This notion of scope is closely related to the concept of a domain of applicability

C. Acquisition functions for benchmarking
To assess the performance measures above, we must perform active learning with some form of decision criteria that summarizes available multi-objective property predictions and automatically ranks candidates. To this end, we define an acquisition function fa(x; M ) as any function that is used to select an optimal candidate via whereX C is the complement of the training set and M is a trained machine learning model, which returns both a predictionŷ and a predictive distribution Y ∼D, both of which are available to the acquisition function f a.
Many generalizations of single-objective acquisition functions are available from the literature, including the probability of improvement and expected improvement criteria, 29 the max-min criterion, 30 and the expected hyper-volume improvement (EHVI) criterion. 26 Comparing these and other criteria is outside the scope of this work; instead, we are focused on (i) how the fidelity with which the Pareto frontier is represented relates to AL performance and (ii) how optimal candidate selection is (or is not) related to model accuracy. Below, we introduce the acquisition functions to be studied in this work after a remark on dimensional homogeneity.

Importance of dimensional homogeneity
Before introducing specific acquisition functions, we first make a remark on the importance of dimensional homogeneity. In order to measure the performance of multi-objective AL, we may wish to measure the "distance" of candidates to the Pareto frontier. However, we will see here that a naive notion of distance is problematic as it does not respect dimensional homogeneity-the ranking results are not independent of the analyst's choice of the unit system.
To illustrate, let y, y ′ ∈ R D , where the y d potentially have different physical units. The ordinary notion of distance is given by Note that for all p ∈ (0, +∞), this expression involves the addition of terms of potentially different physical units. This expression violates dimensional homogeneity, which introduces an artificial dependence on the chosen unit system. Thus, the ranks computed by distances are not necessarily stable to a change of the unit system-we provide an example of this pathology in the supplementary material. This illustrates that arbitrary choices can drastically affect decisions based on this sort of distance computation.
To overcome this issue, we only consider acquisition functions that respect dimensional homogeneity.

Uncertainty sampling
On the scale from exploitation to exploration, uncertainty sampling leans heavily toward the latter; one chooses candidates based on where the model is most uncertain. In the scalar case, this is easily accomplished by choosing the candidate with the greatest predictive varianceσ 2 i . 18 This approach does not immediately generalize to the multi-objective case as the component variancesσ 2 i,d =Σ i,dd do not necessarily have the same units. To respect dimensional homogeneity, we generalize uncertainty sampling by considering a sum of dimensionless quantities.
Sum of Coefficients of Variation (SCV): The SCV is defined via where the COVi, d = σi, d /|μi, d | are coefficients of variation and μ i,d ,σ i,d are the d = 1, . . ., D components of the mean and standard deviation of the predictive distributionDi. Note that this definition is problematic if any of the μi are exactly zero. However, the coefficients of variation COVi, d are dimensionless quantities with a normalizing scaleμ i,d set not by the user, but rather by the available data.

Frontier modeling strategies
Here, we introduce a family of acquisition functions that seek improvement over an existing Pareto frontier, in the order of increasing fidelity with which they model the Pareto frontier. Each of these strategies is a form of probability statement; by construction, these quantities respect dimensional homogeneity.
Probability of Joint Exceedance (PJE): The PJE is defined via where Yi ∼Di is the random variable, which follows the predictive distributionDi for candidate i. In words, definition (11) is the probability that candidate Yi will exceed the performance observed in our existing training dataỸ along every axis of comparison. This is a very "aggressive" acquisition function that ignores much of the structure of the Pareto frontier. Figure 4 illustrates the PJE, alongside the other frontier modeling criteria (Fig. 5).
Hyperplane Probability of Improvement (HPI): The HPI is defined in terms of a hyperplane fit of the Pareto frontier in the available training data. Modeling the Pareto frontier as a hyperplane requires fitting a normal directionŵ ∈ R D and an offsetb. Once these are fit, the HPI is defined in terms of the appropriate Z-score via In words, the HPI is the probability that a candidate Yi will present improvement over the hyperplane fit of the existing Pareto frontier. There are many ways to fit a hyperplane to a set of data. One option is to arbitrarily select one output Yi and perform linear regression using the remaining outputs as regression variables. We recommend against this approach as it suffers from the regression fallacy 31 (Chap. 9.1). In practice, we perform a principal component analysis of the available Pareto frontier data and use the leastvariance direction to define the hyperplane directionŵ.Together , and probability nondominated (PND, right) for the same candidate (blue, at origin) and frontier points (black). Both outputs Y1, Y2 are to be maximized. The blue curves depict equi-likelihood contours for a single candidate's predictive densityD. The shaded region depicts the region that is integrated for the respective acquisition function. Note that the PJE largely ignores the frontier geometry, the HPI crudely models the Pareto frontier, and the PND accurately considers the Pareto frontier.

The Journal of Chemical Physics
with the mean of the Pareto frontier data Y, we then define the offset viab =ŵ ⊺ Y. Figure 4 illustrates the HPI: Note that this definition assumes a hyperplane structure, which will not be appropriate for all Pareto frontiers. Furthermore, the HPI "fills in" the staircase structure posed by the true Pareto frontier-the final frontier modeling acquisition function captures this structure.
with dominance ≺ defined in (1). In words, the PND computes the probability that a given candidate Yi will be non-dominated with respect to the available data. This criterion is studied in an approximate form in Ref. 29. The PND fully considers the known Pareto frontier, introducing no modeling assumptions. Figure 4 illustrates this acquisition function against the aforementioned definitions. Note that while PND captures the true geometry of the Pareto frontier, since the frontier is allowed to have quite general structure, no simple analytic expression as for the PJE or HPI exists. Instead, one may approximate the PND via ordinary Monte Carlo, drawing samples from the predictive distribution Di and counting the proportion that are non-dominated. This leads to a greater computational expense to evaluate the PND, as compared with the PJE or HPI. The experiments below will demonstrate that this added expense can be valuable if higher AL performance is desired.

A. Test cases
To compare the acquisition functions introduced above, we simulate AL on a collection of synthetic and physical databases. The synthetic cases are constructed to present different Pareto frontier geometries, including a linear frontier, as well as examples of convex and concave frontiers. We also construct a "sparse" frontier containing relatively few low-stratum points. Figure 6 presents these test-cases' two-dimensional output spaces. The functional forms for these models are given in the Appendix.
In the motivating example above (Fig. 1), we considered a published dataset of experimental thermoelectric materials. 32 The performance (property) space consists of thermal conductivity κ, electrical resistivity ρ, and the Seebeck coefficient S. The inputs of Chemical Physics are computed using the Magpie featurization library: We use the matminer package to compute features (descriptors) including stoichiometry, valence orbital, and ion properties as inputs. 33,34 Full code to load this dataset and featurization is available in the supplementary material. We also consider a computational dataset of organic molecules: the QM9 dataset, which contains 134 k small organic molecules with ground-state geometry and several properties computed at the density-functional level of theory. 35 We featurize these data using the Chemistry Development Kit 36 and model the highest occupied molecular orbital (HOMO) and lowest unoccupied molecular orbital (LUMO) properties. This case study is inspired by, but not directly related to, the design of organic photovoltaics. 37 For optimization, we consider maximization of HOMO and minimization of LUMO, corresponding to a minimization of the HOMO-LUMO gap, while seeking a low LUMO value. To promote scalability for replications, we filter the dataset to contain only the first 25 strata, resulting in 2108 molecules. Figure 5 illustrates this frontier.

B. Retrospective active learning experiments
The primary evidence of this work is based on a battery of AL simulations, which support the comparison of different acquisition functions against a common setup. For all experiments, we use the random forest model implemented in the lolo package. 38 Chance phenomena such as initial data selection can affect AL results, implying that single runs of AL are insufficient for measuring performance. We consider an ensemble of runs against a randomized selection of initial data in order to provide a robust estimate of relative performance.
To perform a single run of AL, we carry out the following steps: 1. Choose an initial random subsetX 0 ⊆ X of size |X 0 | = C for the training data and reveal the paired labelsỸ 0 . Set the iteration counter to k = 0. 2. Fit a random forest model to the available paired dataX k ,Ỹ k . This returns predictionsŷ i and predictive densitiesDi. 3. Rank the candidates xi ∈ (X −X k ) remaining in the dataset according to the chosen acquisition function f a(xi). Select the top-performing candidate x * and add it to the databasẽ X k+1 =X k + {x * }. Reveal the label for x * . 4. Repeat from Step 2 for k = 1, . . ., K − 1 total iterations.
To perform an ensemble of AL runs, we select different random subsetsX 0,1 , . . . ,X 0,R for R repetitions and aggregate the results across these runs. We vary the initial candidate pool size C, the total number of iterations K, the considered test case, and the chosen acquisition function. We performed 150 AL runs for each choice of acquisition function and test case: Table I summarizes experimental details, while the supplementary material provides numerical details on C, K, and R.

C. Candidate improvement
Here, we report results on candidate discovery performance. We consider both the usual metric of the number of non-dominated points (NNDP) acquired (Fig. 7) and the proposed measure of the mean stratum number (Fig. 8). Generally, the acquisition functions that acquire the most NNDP are (1) MPND, (2) MHPI, and (3).

FIG. 8. Mean stratum number highlights trends not captured by the number of non-dominated points (NNDP, also acquired frontier points).
The three improvement-based criteria result in similar NNDP on the hyperbolic case (higher is better). However, studying the mean stratum number reveals strong differences: The MPJE and MHPI approaches tend to more quickly approach the frontier (lower mean stratum is better), while the MPND approach steadily acquires lower-stratum candidates. The MPJE approach saturates in performance around 50 iterations, while we do not observe this in the NNDP results. In cases where an analyst cares not just for Pareto candidates but also for near-optimality, the mean stratum number is a higher resolution metric to judge (multi-objective) the active learning performance.

ARTICLE scitation.org/journal/jcp
MPJE-this matches the descending order for the Pareto frontier representation fidelity. The ranking results for the same criteria are mixed for shorter-term performance. These results suggest that capturing the true geometry of the Pareto frontier is most important in "later stage" AL, where the space of candidates is limited. As seen in Fig. 1, MPND also out-performs other criteria on the thermoelectric dataset in terms of long-run performance. These results suggest that incorporating high-fidelity modeling of the Pareto frontier into the FIG. 9. Shell error is more associated with candidate improvement than global error. Different error scopes computed on the same set of active learning replications suggest different rankings of model accuracy, associated with the chosen acquisition function. As seen in Fig. 1 and here, for the linear case, global error can be misleading when considering a model for active learning performance. For the linear frontier (top row), the MPJE approach initially reduces the 3-shell error but saturates. This trend is reflected in the mean stratum number as the MPJE continues to acquire points distal to the frontier. Conversely, the MHPI and MPND approaches continue to decrease their 3-shell error and maintain a low mean stratum number. Note that this association is not predictive; on the linear case, the MHPI and MPND strategies do not overtake MPJE in 3-shell error until well after the latter has saturated. The association between acquisition performance and shell error is not perfect: The QM9 data (bottom row) show that 1-shell error erroneously favors MPJE but correctly favors MHPI. While an improvement over global error, is it clear that more factors are salient in candidate improvement than the frontier-based scope alone.

ARTICLE scitation.org/journal/jcp
acquisition function is particularly important in well-studied problems with few possible candidates. Finally, note that the hyperbolic test-case has only five non-dominated candidates total-this fundamentally limits the resolution of AL results for this case. Figure 8 analyzes the same AL results in terms of the NNDP and mean stratum number, demonstrating how the two metrics indicate different trends.
The difference in the "early stage" performance is better revealed by the mean stratum results (Fig. 8). In the early stages of AL, MHPI tends to give the best performance, delivering more candidates at or near the Pareto frontier, while MPND consistently achieves long-run performance. Note that MPJE performance tends to saturate earlier than the other criteria, rather than ranking second when the performance is measured with the NNDP metric-this suggests that MPJE begins selecting candidates far from the Pareto frontier in later-stage AL. The reasons for these differences are explained in Subsection IV E, which analyzes qualitative performance.

D. Accuracy and acquisitions
Here, we report results on how model accuracy relates to acquisitions during AL. Figure 1 illustrates that global scope error can be misleading for judging the relative performance of active learning techniques; Figure. 9 compares shell and global error with acquisition performance. The results on the analytically simple linear case indicate that shell scope error is more closely associated with the acquisition of higher-performing candidates.
However, in Fig. 9, we also see complicating factors: Results on the QM9 dataset show that shell error erroneously highlights MPJE as performant but correctly identifies MHPI as performant. Note that the shell scope is based solely on recursive non-dominance, a feature of output property space. Other factors, such as structureproperty relationships, are only accounted for indirectly through a shell error calculation. These results indicate that while shell error is more informative for AL performance, the shell scope is evidently (and intuitively) not the only salient factor.

FIG. 10.
Higher-fidelity acquisition functions explore the frontier more thoroughly. Density of selected points across all AL runs on the Circular and Bat cases, faceted by acquisition functions; higher density indicates that the acquisition strategy "visits" that region more frequently over independent replications. The selections made according to MPJE are scattered among the entire domain, as compared with MHPI and MPND. MPND tends to distribute its selections along the entire Pareto frontier, as opposed to MHPI, which concentrates on "hot spots" within the frontier. MPND also has wider support than MHPI, indicating that the former criteria tend to (occasionally) select more strongly dominated candidates than the latter. Here, we report how the acquisition functions of Sub section III C behave in terms of qualitative performance, where their selections tend to lie in output space. Figure 10 reports empirical densities for selected candidates in output space. Broadly, MPJE tends to select candidates throughout the output space, MHPI focuses on "hotspots" along the frontier, and MPND thoroughly explores the frontier but infrequently selects strongly dominated candidates. This qualitative analysis helps to explain the difference in NNDP (Fig. 7) and mean stratum (Fig. 8) performance: MPND is able to explore the frontier thoroughly, selecting many non-dominated points. However, MPND is also more liable than MHPI to select strongly non-dominated candidates, leading to a higher mean shell number. Conversely, MHPI tends to concentrate its selections closer to a limited region of the Pareto frontier, leading to fewer possible non-dominated selections but a lower mean stratum.

V. CONCLUSIONS
In this work, we explored the relationship between different aspects of machine learning accuracy and candidate acquisition in multi-objective candidate discovery. We showed that AL schemes which optimize for the usual notion of model accuracy-global error-do not guarantee optimal candidate discovery. To ameliorate the situation, we introduced Pareto shell error, which we found to be more closely associated with discovering improved candidates.
We also studied the relationship between the fidelity with which acquisition functions represent the Pareto frontier and AL performance. We demonstrated that dimensionally inhomogeneous acquisition functions can lead to non-robust decision making and so limited our attention to dimensionally homogeneous acquisition functions. We found that the long-run discovery of non-dominated candidates was improved by modeling the Pareto frontier with greater fidelity. However, an acquisition function, which rendered the Pareto frontier with lesser fidelity (MHPI), uncovered more candidates at or near the Pareto frontier in the early stages of AL, leading to a lower mean stratum number than the highest fidelity acquisition function (MPND). We found that these acquisition functions tended to select material candidates at varying locations along the Pareto frontier with different frequency, leading to preferred "hot-spots" in material property space.
These results have ramifications for scientists seeking to use AL for candidate discovery. Since global error is not always predictive of optimal candidate discovery, an analyst should check both global and shell error when deciding whether a model is sufficiently accurate to be used to rank candidates. Furthermore, an analyst can use these insights prescriptively, choosing hyperparameters to optimize shell error rather than global error.
The selection of an appropriate acquisition function is highly dependent on the analyst's goals. Based on our results, accurately modeling the Pareto frontier in an acquisition function is not critical for early stage AL, that is, if the problem has a very large set of uncharacterized material candidates. Accurately capturing the Pareto frontier in the acquisition function logic becomes important when the design space is more thoroughly explored. In the absence of more specific preference criteria than non-dominated, the maximum probability non-dominated (MPND) strategy is the most performant among the acquisition functions tested here.
There are a number of remaining questions related to the present topic. By the simulation nature of our experiments, we were able to evaluate the true Pareto shell error; in practice, one must devise an estimation technique based on the available data. Given the suitability of different acquisition functions for different phases of active learning, a homotopy approach that weights different objectives λf 1 + (1 − λ)f 2 with λ varying between iterations may be a useful framework. In this work, we fit independent models to all output quantities: Prior work suggests that modeling improvements can be made by accurately capturing the output dependence structure among output quantities. 39,40 It would be interesting to study whether similar improvements can be found in AL performance. A conceptually different way to treat the multi-objective setting is to turn some objectives into constraints; it would be interesting to compare the constrained approach against a multi-objective acquisition function in terms of the mean stratum number (cf. Fig. 8) and acquisition densities (cf. Fig. 10). Similarly, it would be interesting to compare performance across AL strategies using different scalarizations of the multi-objective space, e.g., the thermoelectric figure of merit zT.

SUPPLEMENTARY MATERIAL
See the supplementary material for full code to reproduce the datasets used in this study.

APPENDIX: TEST CASE DETAILS
The following are the underlying distributions and function used to generate the synthetic data cases. These test cases were designed to provide a variety of Pareto frontier geometries. For all four synthetic test cases, the output space is two dimensional Y ∈ R 2 , and both outputs are to be maximized. Code for reproducing these datasets is provided in the supplementary material.

Linear test case
A simple linear frontier geometry, generated by rotating (and stretching) a uniform distribution,

Circular test case
A circular frontier geometry, generated via trigonometric functions,

Hyperbolic test case
The hyperbolic responses lead to a far greater density of points near the origin. This results in a "sparse" Pareto frontier, which often has very few non-dominated candidates, Xi ∼ U [0, 10] 2 , Yi = [1/X i,1 , 1/X i,2 ] ⊺ . (A3)

DATA AVAILABILITY
The data that support the findings of this study are available within the article (and its supplementary material).