Research Update: Computational materials discovery in soft matter

Soft matter embodies a wide range of materials that share a common characteristic: weak interaction energies determine their supramolecular structure. This complicates structure-property predictions and hampers the direct application of data-driven approaches to their modeling. We present several aspects in which these methods play a role in designing soft-matter materials: drug design as well as information-driven computer simulations, e.g., histogram reweighting. We also discuss recent examples of rational design of soft-matter materials fostered by physical insight and assisted by data-driven approaches. We foresee the combination of data-driven and physical approaches as a promising strategy to move the field forward.


I. DATA-DRIVEN MATERIALS DESIGN
The fundamental laws of physics and chemistry provide us with constitutive laws and equations that can be used to predict material properties and to link them to the chemical composition and processing conditions. In materials design, predicting properties from chemical structures is termed the forward problem (see Fig. 1). We routinely rely on experimental measurements, computer simulations, analytical theory, and statistical modeling to solve the forward problem, i.e., to map structure to function.
The design of materials, on the other hand, amounts to identifying the adequate structure given properties of interest. 1,2 Unfortunately, we do not have physical laws to tackle this backward problem. One could of course attempt a brute-force solution and try to solve the forward problem for as many compounds as possible. An exhaustive account, however, bears no hope: the number of drug-like small (less than about 500 atomic mass units) organic molecules alone has been estimated to be on the order of $10^{60}$. 3,4 As a result, the empirical backward solution is in many cases driven by our chemical and physical understanding of the system.
More recently, researchers have started exploring an alternative strategy to tackle this many-compound problem. 6,7 The idea is to solve the time-consuming forward problem for a small subset of compounds and predict the properties of the others by effectively interpolating across the training subset 8 (see Fig. 1). Interpolation is performed by first identifying correlations between simple chemical features, or descriptors (e.g., molecular mass, size, or number of hydrogen bonds), and said properties, 9 resting on the assumption that simple descriptors can be extracted. The correlations drawn then call for a qualitative validation based on thorough physical and chemical understanding, thereby avoiding artifacts due to noisy or erroneous data. Ultimately, this yields an empirical backward optimization route, in which the desired property can be achieved by tuning the chemistry along the descriptor (see magenta in Fig. 1).
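The interpolation route described above can be sketched in a few lines. This is only an illustration under strong assumptions: a single descriptor, a linear structure-property relation, and synthetic numbers invented for the demonstration (they are not data from any study). The helper names `fit_line` and `invert_for_target` are hypothetical.

```python
def fit_line(descriptor, prop):
    """Least-squares fit: prop ~ slope * descriptor + intercept."""
    n = len(descriptor)
    mean_x = sum(descriptor) / n
    mean_y = sum(prop) / n
    sxx = sum((x - mean_x) ** 2 for x in descriptor)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(descriptor, prop))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def invert_for_target(slope, intercept, target):
    """Backward step: descriptor value expected to give the target property."""
    return (target - intercept) / slope


# Synthetic training set (made-up values): the forward problem has been
# "solved" for four compounds only.
train_descriptor = [1.0, 2.0, 3.0, 4.0]
train_property = [2.1, 3.9, 6.1, 7.9]

slope, intercept = fit_line(train_descriptor, train_property)

# Forward prediction for an unseen compound, by interpolation.
predicted = slope * 2.5 + intercept

# Empirical backward optimization: which descriptor value gives property 5.0?
needed_descriptor = invert_for_target(slope, intercept, 5.0)
```

In realistic settings the relation is rarely linear and involves many descriptors, which is where the more advanced regression schemes discussed next come in.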
Given that the correlations between structural descriptors and chemical properties are often nonlinear, 10 interpolation schemes call for advanced statistical methodologies, e.g., machine learning. Machine learning consists of a set of classification and regression schemes whose predictive performance improves with increasing training set size. 11,12 This motivates the development of solutions that can explore portions of chemical space faster by reducing the computational investment associated with each compound. Two main avenues have traditionally prevailed: technological improvements of computer hardware 13 and the development of more efficient algorithms. 14,15
In contrast to the early successes of machine learning in predicting electronic properties of molecules, 16-18 phenomena where thermal fluctuations play an essential role have proven much more challenging. Soft-matter systems, for which the characteristic interaction energies are comparable to thermal energy, can exhibit fluctuations over a wide range of length (∼nm to µm) and time (∼ps to s) scales. 19,21-25
The present perspective discusses the challenges associated with designing soft-matter materials using computational tools. We highlight two sources of difficulties: (i) the accurate prediction of thermodynamic properties and (ii) the identification of clear descriptors for backward optimization. These difficulties are so severe that computational materials design has made relatively little progress in soft matter compared to other systems, e.g., thermoelectrics, 26 crystal-structure prediction, 27,28 or electrocatalytic materials. 29
The present review aims at bridging two largely disconnected fields, namely, computational materials design and soft matter, which are expected to rapidly grow closer with hardware and methodological developments. We illustrate the issues associated with predicting soft-matter materials by means of examples in the field of computational drug design, and we provide examples of rational design guided by our understanding of the physics and chemistry and assisted by data mining. We conclude by proposing an outlook on data-driven methods in soft matter.

A. Statistical mechanics and computer simulations
To better understand the challenges associated with predicting properties of soft-matter systems, we first recall basic aspects of statistical mechanics and computer simulations of molecular systems.
While quantum and classical physics provide laws to describe the evolution of all degrees of freedom in the system, large systems (on the order of $10^{23}$ molecules or more) call for coarser approaches. Statistical mechanics provides a framework to link macroscopic parameters with the microscopic details, or ensemble of states; the probability distribution of this collection of states is called a statistical ensemble. In what follows, we only consider equilibrium ensembles, i.e., those with no explicit time dependence in the phase-space distribution. Fixing different macroscopic parameters gives rise to specific statistical ensembles. For instance, the canonical ensemble describes a mechanical system at fixed temperature, volume, and number of particles. The probability of sampling microstate $i$ at temperature $T$ is then given by the Boltzmann distribution
$$p_i = \frac{e^{-\beta E_i}}{Z}, \tag{1}$$
where $\beta^{-1} \equiv k_B T$, $k_B$ is Boltzmann's constant, $E_i$ is the energy of microstate $i$, and $Z$, the partition function, normalizes the probability. The canonical ensemble thus weighs each microstate according to the corresponding Boltzmann factor. Rather than studying a single microstate, one typically probes a (large) subset $S$ that represents specific values of coarse-grained coordinates of the system, e.g., the folded state of a protein. In this case, the stability of this subset is often described in terms of its (projected) free energy
$$F(S) = -k_B T \ln \sum_{i \in S} e^{-\beta E_i}, \tag{2}$$
where $S$ encapsulates the collection of microstates corresponding to the subset of interest. The free-energy difference between different subsets, $S_1$ and $S_2$, provides a measure of their relative stability: $\Delta F = F(S_1) - F(S_2)$. Practically, evaluating Eq. (2) requires the energy of every microstate of the system, which then permits a direct evaluation of observables at a particular state point. Unfortunately, the sheer number of microstates is often colossal, making these configurational sums intractable for all but the simplest of systems.
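For a system small enough to enumerate, these definitions can be evaluated directly. The sketch below assumes the standard forms $p_i = e^{-\beta E_i}/Z$ and $F(S) = -\beta^{-1} \ln \sum_{i \in S} e^{-\beta E_i}$; the four-state energy list and the function names are hypothetical choices made for illustration.

```python
import math


def boltzmann_probabilities(energies, beta):
    """Canonical probabilities p_i = exp(-beta * E_i) / Z."""
    e_min = min(energies)  # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    z = sum(weights)
    return [w / z for w in weights]


def projected_free_energy(energies, subset, beta):
    """F(S) = -(1/beta) * ln( sum over i in S of exp(-beta * E_i) )."""
    e_min = min(energies[i] for i in subset)
    s = sum(math.exp(-beta * (energies[i] - e_min)) for i in subset)
    return e_min - math.log(s) / beta


# Toy system: one low-energy state and three degenerate higher-energy states.
# Energies are in units of k_B T, so beta = 1.
energies = [0.0, 1.0, 1.0, 1.0]
beta = 1.0

p = boltzmann_probabilities(energies, beta)
delta_f = (projected_free_energy(energies, [0], beta)
           - projected_free_energy(energies, [1, 2, 3], beta))
```

Here `delta_f` equals $\ln 3 - 1 \approx 0.1\,k_B T$: the three-state subset is slightly more stable than the single low-energy state, although each of its microstates is higher in energy. The number of microstates matters as much as their energies.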
Instead of exhaustively enumerating microstates, computer simulations tackle the problem by sampling the most relevant regions of the statistical ensemble. How can one sample efficiently out of an enormous pool of microstates? A simple but extremely efficient procedure draws microstates according to the Boltzmann distribution itself (Eq. (1)). This importance sampling scheme has given rise to the two main types of molecular simulations: Markov chain Monte Carlo with Metropolis update and molecular dynamics. 30,31 Both algorithms rely on an evaluation of the system's energy for each microstate. In soft matter, the system is often modeled by a multi-dimensional potential energy surface, expressed in terms of interatomic interaction potentials that need to be parametrized (i.e., the so-called force field). A Monte Carlo simulation samples states by proposing a new trial system configuration at each step. It then evaluates the energy difference between subsequent configurations and finally accepts or rejects the trial configuration based on the Boltzmann factor of this energy difference. Molecular dynamics, on the other hand, numerically integrates the classical equations of motion of the system according to the interatomic forces (i.e., spatial derivatives of the interaction energies).
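The Monte Carlo procedure described above can be sketched as follows. This is a minimal illustration, not a production sampler: the one-dimensional harmonic energy function, the uniform trial move, and all function names are assumptions chosen to keep the example self-contained.

```python
import math
import random


def metropolis_sample(energy, propose, x0, beta, n_steps, rng):
    """Metropolis Monte Carlo: returns the list of visited configurations."""
    x = x0
    e = energy(x)
    samples = []
    for _ in range(n_steps):
        x_trial = propose(x, rng)      # propose a trial configuration
        e_trial = energy(x_trial)
        delta = e_trial - e            # energy difference to current state
        # Accept with probability min(1, exp(-beta * delta)).
        if delta <= 0.0 or rng.random() < math.exp(-beta * delta):
            x, e = x_trial, e_trial
        samples.append(x)              # a rejected move repeats the old state
    return samples


def harmonic(x):
    """Hypothetical energy surface: E(x) = x^2 / 2, in units of k_B T."""
    return 0.5 * x * x


def uniform_step(x, rng):
    """Symmetric trial move: uniform displacement in [-1, 1]."""
    return x + rng.uniform(-1.0, 1.0)


rng = random.Random(0)
samples = metropolis_sample(harmonic, uniform_step, 0.0, 1.0, 50000, rng)
mean_energy = sum(harmonic(x) for x in samples) / len(samples)
```

For this harmonic toy model, equipartition predicts an average energy of $k_B T/2$, which gives a simple consistency check on the sampled ensemble.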
The predictability of (classical) computer simulations depends on three main aspects: the modeling of the relevant physics, the accuracy of the force fields, and the simulation time. We avoid discussions of the first topic, which is entirely system-dependent, to focus on the other two. The ability of force fields to accurately represent the energy landscape is of critical importance. 33,34 Collaborative open-data approaches to improvements, testing, and validation of force fields provide a novel framework to refine force fields 35 and could later take advantage of a data-driven approach. Beyond accuracy, force-field-based methods rely on their ability to quickly evaluate the energy or forces of a configuration, thus sampling a representative ensemble of configurations. Failure to thoroughly sample the statistical ensemble easily leads to drastic errors and artifacts in the derived thermodynamic quantities. 36 The simulation time achievable for a given system is inherently limited by the computational power at hand. The last three decades have yielded a roughly exponential increase in the simulation time available for fixed system size. 15 Though the most straightforward approaches to molecular dynamics would scale the computational cost with the square of the number of atoms, $N^2$ (i.e., commensurate with the number of interactions), the use of interaction cutoffs and neighbor lists can reduce the scaling to $N \ln N$, typical of long-range electrostatics algorithms based on fast Fourier transforms. 30 While the performance achieved strongly depends on many parameters (e.g., hardware, algorithms, or system), several benchmarks report values in the range of 10 to 100 ns/day for solvated proteins with $N \sim 10^4$-$10^5$ atoms using roughly $10^2$-$10^3$ CPU cores. 37,38 Dedicated hardware, designed specifically to run molecular dynamics, has shown significant improvements over more common high-performance-computing solutions. 34,39,40 The recent development of graphics processing units (GPUs) for scientific computing has provided wider access to significant computational performance, as a modern GPU can match the performance of 10-100 CPU cores. 38,41 Boosting the computational performance of a single simulation using parallelization inherently requires large systems, as the system is divided spatially into smaller units. 44,45

B. Example: Protein-ligand binding in computational drug design
Drugs are (small) molecules that alter certain biochemical pathways by interacting with specific macromolecules, e.g., proteins. The binding of a ligand to a protein pocket will stabilize certain conformational states of the macromolecule, thereby affecting its biological function. Strong binding is thus essential to ensure significant interactions between the two entities at reasonable ligand concentrations. Although other physiological aspects play a major role (e.g., specificity to a single protein receptor, permeability through cell membranes, or time before the drug gets metabolized 49), we specifically focus on protein-ligand binding.
Stable binding, in terms of the abovementioned statistical mechanics formalism, corresponds to a significant free-energy difference between the bound complex and the weakly (or non-)interacting dissociated pair, see Fig. 2(a). What does this imply at the microscopic level? Remember that free energies average over the individual Boltzmann-weighted energies of many microstates (Eq. (2)). In terms of sheer numbers of microstates, the subset of conformations that describe the bound state will be extremely small compared to the vast number of geometries describing two unbound molecules. Given the form of the Boltzmann factor, the free-energy difference can only favor the bound state if its few microstates have significantly lower energies compared to the others. The competition between the weight of each microstate and their number (energy and entropy, respectively) completely dictates the performance of a ligand.
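A two-level toy model makes this competition explicit. Assume, purely for illustration, that the bound state consists of $n_b$ microstates of energy $-\varepsilon$ and the unbound state of $n_u$ microstates of energy $0$; the symbols $n_b$, $n_u$, and $\varepsilon$ are introduced here and are not part of the original discussion. With the projected free energy taken as $F(S) = -k_B T \ln \sum_{i \in S} e^{-\beta E_i}$:

```latex
\begin{align}
F_{\mathrm{bound}}   &= -k_B T \ln\!\left(n_b\, e^{\beta\varepsilon}\right)
                      = -\varepsilon - k_B T \ln n_b, \\
F_{\mathrm{unbound}} &= -k_B T \ln n_u, \\
\Delta F = F_{\mathrm{bound}} - F_{\mathrm{unbound}}
                     &= -\varepsilon + k_B T \ln\frac{n_u}{n_b}.
\end{align}
% Binding is favorable (\Delta F < 0) only if
%   \varepsilon > k_B T \ln (n_u / n_b).
% Example: n_u / n_b = 10^6 requires \varepsilon > \ln(10^6)\, k_B T \approx 13.8\, k_B T.
```

The energetic gain $\varepsilon$ must therefore outweigh an entropic penalty that grows logarithmically with the ratio of unbound to bound microstates.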
Practically, optimizing protein-ligand binding is a formidable challenge: not only is the number of putative drug candidates tremendous ($10^{60}$ or so drug-like small molecules 3), but a careful characterization of any one of them also requires the evaluation of extensive sums (Eq. (2)). Classical all-atom computer simulations can provide accurate estimates of the binding free energy (i.e., down to ≈1 $k_B T$ error), 50 at the expense of significant computational investment: up to 10-100 compounds in a single study, at most. 50,51 This strongly limits the number of trial compounds that can be simulated, such that computer simulations are more typically applied to optimize an already-identified promising molecular scaffold. Identifying the said scaffold instead requires alternative methods.
Given difficulties in accurately simulating binding events for many compounds, virtual high-throughput screening efforts (i.e., which test many compounds) have so far primarily employed statistical modeling. 52 The prediction of protein-ligand binding affinity given a protein receptor and the chemical structure of a ligand relies on correlations between physical descriptors and binding stability. 7 Establishing correlations requires large datasets to train the statistical models. Given the scarcity of computer simulations, most models are trained against experimental measurements. The predictability of these models has remained extremely limited: typical benchmarks report linear correlation coefficients between predicted and experimentally measured binding affinities around 50%-60%. 53 As such, statistical scoring fails to efficiently filter out most ligands to focus on key promising candidates. Ideally, high predictability would ensure a reliable identification of a few compounds that would be worth either simulating using accurate force fields or synthesizing in the laboratory. A number of reasons can be brought forward for the limited success of statistical scoring methods:
• Experimental measurements carry errors that may strongly vary between different protocols and can be significant compared to the binding affinity itself.
• Statistical models rely on examples to learn why certain ligands will bind to a protein. On the other hand, very few compounds will bind significantly to any protein receptor, significantly skewing the amount of information toward poor binders. These low hit rates hinder the training process.
• The ligands used in the training set amount to synthesized compounds available in the laboratory. These libraries are not only extremely sparse compared to the size of chemical space (i.e., $10^3$-$10^5$ vs. $10^{60}$); available compounds also tend to be extremely correlated and fail to represent the overall diversity of compound chemistry. 54 As a result, scoring methods that are trained against these datasets typically show poor transferability between protein receptors and classes of ligands.
• Statistical scoring approaches, though often employing advanced methodologies (e.g., machine learning), aim at directly correlating ligand and receptor geometry with binding free energy. Critically, the inclusion of explicit thermodynamics, and most importantly of entropic contributions, appears most challenging.
These points highlight the difficulties in correlating geometrical and chemical aspects of soft-matter systems to their thermodynamics. Recently, efforts have turned to deep-learning algorithms to leverage information from many distinct biological sources: multitask networks that simultaneously predict binding against different protein receptors improve the performance, as compared to independent models. 55 Finally, we note that beyond mere binding constants, computer simulations provide an atomic description of the binding modes and pathways, 45,56-58 providing the means for a rational design of drugs without tens of thousands of training samples, but rather using chemical insight. 39,40 Unlike traditional statistical scoring methods, computer simulations also provide receptor flexibility to fully take into account the diversity of binding poses and associated entropic contributions. 59 For a more detailed review on the statistical mechanics of protein-ligand binding, see the work of Wereszczynski and McCammon. 60

III. ADVANCED STATISTICAL TECHNIQUES IN SOFT-MATTER SIMULATIONS
The challenges associated with accurately evaluating thermodynamic quantities (Eq. (2)) of soft-matter systems have limited the advancement of data-driven predictions of new materials. Instead, other types of advanced statistical methods have already provided enormous help in extracting information from computer simulations. Rather than interpolating across many instances of the forward process (Fig. 1), these methods focus on improving the quality of individual samples. In the following, we briefly outline a few examples of data-driven methods that augment computer simulations through enhanced analysis or sampling.

A. Analysis of multiple thermodynamic states
Molecular dynamics or Metropolis Monte Carlo simulations that generate a canonical ensemble will primarily sample microstates that are within a few $k_B T$ of the local energetic minimum. The system may hop between distinct local minima as long as there are no significant free-energy barriers in between; otherwise, the system may remain trapped for long times. Enhanced-sampling methods address this problem by altering the distribution function used in generating microstates. 61,62 At its simplest, increasing the temperature, $T^{(0)} \to T^{(1)}$, will provide a wider exploration of conformational space, thanks to more thermal energy. On the other hand, analyzing the higher-temperature simulation at $T^{(1)}$ will yield thermodynamic quantities for that particular thermodynamic state, which may not be of interest. Instead, one seeks the evaluation of thermodynamic quantities at temperature $T^{(0)}$ using simulations run at elevated temperatures, systematically disentangling the thermodynamic states used for sampling and analysis.
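The simplest way to separate the sampling state from the analysis state is to reweight each sampled configuration by the ratio of Boltzmann factors at the two temperatures. The sketch below illustrates this single-simulation reweighting; the function name and the two-level test data are hypothetical, and the estimator is only reliable when the two canonical distributions overlap well.

```python
import math


def reweight_average(energies, observable, beta_sim, beta_target):
    """Estimate <observable> at beta_target from samples drawn at beta_sim.

    Each sample gets the weight exp(-(beta_target - beta_sim) * E).
    """
    delta_beta = beta_target - beta_sim
    exponents = [-delta_beta * e for e in energies]
    shift = max(exponents)  # avoid overflow in the exponentials
    weights = [math.exp(x - shift) for x in exponents]
    norm = sum(weights)
    return sum(w * a for w, a in zip(weights, observable)) / norm


# Two-level system with energies 0 and 1 (units of k_B T^(0)).
# At beta_sim = 0 (infinite temperature) both levels are equally populated,
# so an idealized sample contains each level once.
sampled_energies = [0.0, 1.0]
estimate = reweight_average(sampled_energies, sampled_energies,
                            beta_sim=0.0, beta_target=math.log(2.0))
```

For this idealized sample the reweighted mean energy is 1/3, which matches the exact canonical average $e^{-\beta}/(1 + e^{-\beta})$ at $\beta = \ln 2$.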
It helps to consider the canonical probability distribution (Eq. (1)) in terms of energies rather than states,
$$p(E) = \frac{\Omega(E)\, e^{-\beta E}}{Z}, \tag{3}$$
which highlights the density of states, $\Omega(E)$. $\Omega(E)$ is a material property of the system: it does not depend on its thermodynamic state. Determination of the density of states provides access to all thermodynamic quantities. Given an infinitely long simulation, one could simply divide out the Boltzmann factor to isolate $\Omega(E)$. In practice, the exponentially suppressed tails of a canonical distribution will limit a reliable estimation of the density of states to a narrow energy interval. On the other hand, simulations run at different temperatures will provide partial, but complementary, information about $\Omega(E)$: low (high) temperatures will preferentially sample the low (high) energy regions of the density of states. A series of simulations run at different temperatures will each contribute to estimating the density of states in different energy ranges. Given sampling difficulties, leveraging the information contained in all simulations can dramatically help in accurately estimating $\Omega(E)$.
Effectively, we seek a statistical model that best reproduces the data provided by $S$ canonical simulation trajectories at different inverse temperatures $\beta^{(j)} = (k_B T^{(j)})^{-1}$. Each simulation $j$ records the distribution of energies sampled by means of a histogram, $H^{(j)}$, each of size $N$. Since each histogram represents a canonical distribution, the associated (unknown) probability of sampling energy $E_i$ is given by $p_i^{(j)} = \Omega_i \exp(-\beta^{(j)} E_i)/Z^{(j)}$, where $\Omega_i \equiv \Omega(E_i)$. Finding the set of probabilities $p_i^{(j)}$ (i.e., optimizing $\Omega(E)$) that best reproduce the data $H^{(j)}$ corresponds to maximizing the conditional probability $P(p^{(j)} \mid H^{(j)})$. From Bayes' theorem, we obtain
$$P(p^{(j)} \mid H^{(j)}) = \frac{P(H^{(j)} \mid p^{(j)})\, P(p^{(j)})}{P(H^{(j)})}, \tag{4}$$
where the left-hand side is the posterior and the two terms in the numerator are the likelihood and the prior, respectively. We drop the denominator, as it simply contributes a normalizing factor. In the following, we further consider all models equally likely, i.e., we apply a constant prior, such that $P(p^{(j)} \mid H^{(j)}) \propto P(H^{(j)} \mid p^{(j)})$.
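Though the posterior is, in general, difficult to evaluate directly, we can express the likelihood as a multinomial distribution (assuming no correlation between histogram counts)
$$P(H^{(j)} \mid p^{(j)}) \propto \prod_i \left(p_i^{(j)}\right)^{H_i^{(j)}}. \tag{5}$$
In other words, the model that best reproduces the data would maximize the likelihood given by the set of probabilities applied to the content of the sampled histograms. Maximizing the likelihood function is more easily performed by working with its logarithm. Further, we add Lagrange multipliers, $\lambda^{(k)}$, to constrain the resulting models to yield normalized probabilities. The resulting log-likelihood function, summed over all simulations, reads
$$L = \sum_{j=1}^{S} \sum_i H_i^{(j)} \ln\!\left[\Omega_i\, e^{-\beta^{(j)} E_i - f^{(j)}}\right] + \sum_{k=1}^{S} \lambda^{(k)} \left[\sum_i \Omega_i\, e^{-\beta^{(k)} E_i - f^{(k)}} - 1\right], \tag{6}$$
where $f^{(j)} = \ln Z^{(j)}$. Maximizing $L$ is obtained by setting its derivatives with respect to $f^{(j)}$, $\lambda^{(j)}$, and $\Omega_i$ to 0, which yields the set of self-consistent equations
$$f^{(j)} = \ln \sum_i \Omega_i\, e^{-\beta^{(j)} E_i} \tag{7}$$
and
$$\Omega_i = \frac{\sum_{j=1}^{S} H_i^{(j)}}{\sum_{k=1}^{S} N^{(k)}\, e^{-\beta^{(k)} E_i - f^{(k)}}}, \tag{8}$$
where $N^{(k)} = \sum_i H_i^{(k)}$ is the total number of counts in histogram $k$. Eq. (8) provides a minimum variance estimator for the density of states from the $S$ simulation trajectories. This weighted histogram analysis method 63,64 inscribes itself within a series of methods known as extended bridge sampling estimators, reviewed by Shirts and Chodera. 65,68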
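The self-consistent solution of the weighted histogram equations can be sketched as a fixed-point iteration. The example assumes the standard form of these equations, $f^{(j)} = \ln \sum_i \Omega_i e^{-\beta^{(j)} E_i}$ and $\Omega_i = \sum_j H_i^{(j)} / \sum_k N^{(k)} e^{-\beta^{(k)} E_i - f^{(k)}}$ with $N^{(k)}$ the total counts of histogram $k$. The function names, the five-level test system, and the use of exact (noise-free) histograms are illustrative assumptions, and every energy bin is assumed to be visited at least once.

```python
import math


def logsumexp(values):
    """Numerically stable ln(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def wham(histograms, energies, betas, tol=1e-12, max_iter=20000):
    """Iterate the WHAM equations; returns (ln Omega_i, f_j).

    ln Omega is normalized so that sum_i Omega_i = 1, which fixes the
    arbitrary multiplicative constant of the density of states.
    """
    n_sim, n_bins = len(histograms), len(energies)
    totals = [sum(h) for h in histograms]                      # N^(k)
    counts = [sum(h[i] for h in histograms) for i in range(n_bins)]
    f = [0.0] * n_sim
    ln_omega = [0.0] * n_bins
    for _ in range(max_iter):
        # Density-of-states update from the current f values.
        for i in range(n_bins):
            ln_den = logsumexp([math.log(totals[k]) - betas[k] * energies[i] - f[k]
                                for k in range(n_sim)])
            ln_omega[i] = math.log(counts[i]) - ln_den
        shift = logsumexp(ln_omega)
        ln_omega = [v - shift for v in ln_omega]
        # Update of f^(j) = ln Z^(j) from the current density of states.
        f_new = [logsumexp([ln_omega[i] - betas[j] * energies[i]
                            for i in range(n_bins)])
                 for j in range(n_sim)]
        converged = max(abs(a - b) for a, b in zip(f_new, f)) < tol
        f = f_new
        if converged:
            break
    return ln_omega, f


# Test system with a known density of states and exact expected histograms.
true_omega = [1.0, 3.0, 6.0, 10.0, 15.0]
energies = [0.0, 1.0, 2.0, 3.0, 4.0]
betas = [0.5, 1.0, 1.5]
n_samples = 1000.0
histograms = []
for beta in betas:
    weights = [w * math.exp(-beta * e) for w, e in zip(true_omega, energies)]
    z = sum(weights)
    histograms.append([n_samples * w / z for w in weights])

ln_omega, f = wham(histograms, energies, betas)
# Density of states relative to the first bin, to compare with true_omega.
recovered = [math.exp(v - ln_omega[0]) for v in ln_omega]
```

Because the density of states is only defined up to a constant, the comparison is made on ratios. With noisy histograms from real simulations, the same iteration applies but the recovered values carry statistical errors.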

B. Markov state models
Beyond static equilibrium properties, computer simulations aim at a kinetic understanding of molecular processes. Solute-solvent reorganization, protein-ligand binding, and protein folding all operate at different time scales. 69 Among the many ways one can analyze the kinetics of a molecular system, Markov state models systematically link microscopic transitions with the slow processes of the system (e.g., the folding of a protein or the flip-flop rate of a lipid in a bilayer). 44,70 These kinetic models describe the evolution of a system in terms of microstate transitions under a Markovian (i.e., memoryless) assumption. The construction of these Markov state models typically corresponds to a Bayesian framework similar to the one presented above: a probabilistic model of the transition probability matrix is optimized to best reproduce the underlying simulation trajectory, i.e., a count matrix of state-to-state transitions. Though normalizing this count matrix should formally yield the transition probability matrix, finite sampling often leads to inconsistencies between forward and backward microscopic transitions, violating detailed balance. Enforcing detailed balance in the transition probability matrix greatly helps alleviate finite-sampling limitations. The ability to consistently reconstruct robust kinetic models from several simulations has opened a new paradigm in computer simulations: extremely long time scales (up to the ms time scale) may be reconstructed from many short simulations that collectively sample the relevant part of conformational space. 24,43

C. Enhanced sampling from Bayesian inference
Finally, we mention a recently proposed enhanced sampling scheme based on Bayesian inference that balances contributions from a computer simulation and structural biology data. MacCallum et al. 71 sample the system from a biased energy function, $E_{\mathrm{bias}} = -\beta^{-1} \ln P(x \mid D)$, which corresponds to a probabilistic model of configuration $x$ that best reproduces external data $D$. From Bayes' theorem (Eq. (4)), they express $E_{\mathrm{bias}}$ in terms of a prior, which corresponds to the Boltzmann-weighted energy as obtained from the force field, while the likelihood incorporates ambiguous or uncertain experimental information, $D$, by means of restraints. The combination of physical models with experimental information efficiently leads to the correct structure of several proteins. We emphasize here that while experimental information can help drive the simulation toward relevant conformations, their selection relies on the accuracy of the (physics-based) force field.
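Written out, the decomposition of the biased energy takes the following form. The symbol $E_{\mathrm{ff}}(x)$ for the force-field energy is introduced here for illustration, and the prior is assumed to be exactly the Boltzmann distribution of the force field; the specific restraint form used for the likelihood is not reproduced.

```latex
\begin{align}
P(x \mid D) &= \frac{P(D \mid x)\, P(x)}{P(D)},
\qquad P(x) \propto e^{-\beta E_{\mathrm{ff}}(x)}, \\
E_{\mathrm{bias}}(x) &= -\beta^{-1} \ln P(x \mid D)
  = E_{\mathrm{ff}}(x) - \beta^{-1} \ln P(D \mid x) + \mathrm{const}.
\end{align}
% The first term is the physics-based force field (prior);
% the second acts as a restraint energy derived from the data (likelihood);
% the constant collects \beta^{-1} \ln P(D) and the normalization of the prior,
% neither of which depends on the configuration x.
```

The data thus enter as an additive restraint on top of the force field, which is why the final selection among data-compatible conformations still rests on the force-field energy.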

D. Machine learning in molecular simulations
Though not yet widely used, machine learning has made a number of contributions to assist the construction, execution, and analysis of molecular simulations. The following exclusively focuses on the link between the two fields. Machine learning models have helped parametrize molecular force fields for small molecules of both coarse-grained models 72 and static electrostatic multipole coefficients. 73,75,76 Finally, data-driven methods have been proposed to automatically detect molecular patterns 77 and transition-state dividing surfaces. 78

IV. RATIONAL DESIGN BASED ON PHYSICAL/CHEMICAL UNDERSTANDING
We now illustrate how cheminformatic concepts can be applied to design materials. Our goal here is to use a particular case study in order to exemplify soft-matter-related issues: the complexity of the forward problem, the limited accuracy of in silico predictions, and the small training set sizes available from experimental sources.
The specific case study we are going to discuss is an organic solar cell. In these optoelectronic devices, light absorption leads to the generation of excited, strongly bound electron-hole pairs, or excitons. Excitons dissociate into free charge carriers at interfaces between two materials, one of which is the electron donor and the other the acceptor. Free charges then diffuse towards the electrodes and are injected into the external circuit. The main task of the compound design algorithm is to identify pairs of donor/acceptor molecules maximizing solar cell efficiency, as depicted in Fig. 2(b).
The straightforward cheminformatic approach would be to collect available experimental data and to correlate donor-acceptor pairs with solar cell efficiencies. In practice, such a brute-force strategy never works. First, a single figure of merit, such as solar cell efficiency in our case, is too complex to correlate directly to the chemical structure. Second, the number of experimental samples is normally limited to a few hundred, which is not enough to train the model or refine the descriptors.
A thorough physical understanding of the problem is key to overcoming the first issue: by being familiar with all elementary processes occurring in an organic solar cell, we can factorize its efficiency into several experimentally accessible quantities, such as the open-circuit voltage $V_{oc}$, short-circuit current $J_{sc}$, and fill factor FF. These quantities have been used to train a model of 33 descriptors on approximately 50 compounds. 79 It has been found that $V_{oc}$ correlates well with the gas-phase ionization potential of the donor. Since the provided descriptors were trained only on a small set of donors, the model could not be used to predict efficiencies of compounds outside the training set. In soft-matter systems, synthesis, device optimization, and characterization are often time consuming; hence, substantially extending the training set experimentally is not feasible. This motivates using computer simulations to solve the forward problem in silico.
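The factorization mentioned above is the standard expression for the power conversion efficiency of a solar cell. The symbol $P_{\mathrm{in}}$ (incident light power density) and the numerical values below are illustrative assumptions, not data from the cited study.

```latex
\begin{equation}
\eta = \frac{V_{oc}\, J_{sc}\, \mathrm{FF}}{P_{\mathrm{in}}}
\end{equation}
% Illustrative numbers:
%   V_oc = 0.8 V,  J_sc = 15 mA/cm^2,  FF = 0.65,  P_in = 100 mW/cm^2
%   \eta = (0.8 \times 15 \times 0.65) / 100 = 7.8 / 100 = 7.8\%
```

Each factor is governed by different elementary processes, so each can be correlated with its own, simpler set of descriptors.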
For organic solar cells, the simulation strategy for solving the forward problem is, in principle, well-defined: One needs to predict the materials morphology, calculate the energetic landscape for charges and excitons, evaluate rates of charge/exciton transfer, electron-hole, and exciton-electron recombination, solve the time-dependent master equation, and analyze distributions of currents and occupation probabilities to extract/optimize measurable properties such as the open-circuit voltage and short-circuit current for solar cells. 80,81 As formulated, this roadmap is practically impossible to implement. Already the prediction of the materials morphology of partially crystalline systems is not feasible from first principles. Moreover, one is forced to use multiscale approaches: On a molecular scale, every molecule has its own unique environment created by its neighbors, with local electric fields leading to level shifts, broadening, and spatial correlations of charge/exciton energies. On a mesoscopic scale, the size of phase-separated domains of the donor and acceptor governs the efficiency of exciton splitting and charge percolation. On a macroscopic scale, light in-coupling needs to be accounted for to maximize absorption. Likewise, the typical time scales of dynamic processes such as charge and energy transfer span several orders of magnitude. Hence, charge/exciton kinetics cannot be treated via numerical methods with a fixed time step, but rate-based descriptions must be employed instead. Finally, to be predictive, the methods need to be quantitative. That is, not only is each particular method required to provide accurate values for the quantities of interest, but the entire procedure of "bottom-up scaling" from the micro- to the macroscopic world should also be robust.
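One common rate-based description is kinetic Monte Carlo, in which the time increment adapts to the rates of the currently available events instead of being fixed. The sketch below is an illustration of that idea only: the two-site model, the rate values, and the function name are hypothetical and do not represent the workflow of the cited studies.

```python
import math
import random


def kmc_step(state, rates, rng):
    """One rejection-free kinetic Monte Carlo step.

    rates[state] maps each reachable target state to its rate (1/time).
    Returns the next state and the waiting time spent in the current state.
    """
    events = list(rates[state].items())
    total_rate = sum(rate for _, rate in events)
    # Waiting time is exponentially distributed with mean 1 / total_rate.
    waiting_time = -math.log(1.0 - rng.random()) / total_rate
    # Choose an event with probability proportional to its rate.
    threshold = rng.random() * total_rate
    cumulative = 0.0
    for target, rate in events:
        cumulative += rate
        if threshold < cumulative:
            return target, waiting_time
    return events[-1][0], waiting_time  # guard against round-off


# Hypothetical two-site model with rates six orders of magnitude apart:
# fast hop 0 -> 1 and slow return 1 -> 0 (rates in 1/s).
rates = {0: {1: 1.0e12}, 1: {0: 1.0e6}}
rng = random.Random(1)
state, elapsed = 0, 0.0
time_in_state = {0: 0.0, 1: 0.0}
for _ in range(1000):
    new_state, dt = kmc_step(state, rates, rng)
    time_in_state[state] += dt
    elapsed += dt
    state = new_state
occupancy_1 = time_in_state[1] / elapsed
```

A fixed time step small enough to resolve the fast hop would need on the order of a million steps per slow event; here each event costs one step regardless of its rate.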
At this level of complexity, simulations become as demanding as experiments. The benefit is of course that we learn a lot more details, especially at a molecular scale. This, however, is possible only for a few selected systems, e.g., P3HT:PCBM or DCV:C60 bulk heterojunction solar cells, 82,83 and does not help us to extend the training set or to train a better model. As a compromise, one often adopts simplified solutions to the forward problem. Simplified schemes are, however, prone to systematic errors, since they are normally constructed phenomenologically or employ assumptions that are not valid for all systems. 84 These models can be used to formalize our understanding of experimental systems but, again, they are incapable of predicting properties of new systems.
Hence, cheminformatics does not seem to be a very useful tool for systems with a complex forward problem, such as soft-matter systems. There is, however, a hidden bonus: data-driven approaches help to gain a better physical and chemical understanding of the system. Indeed, in this particular example, they helped to establish a link between the optical gap in the solid state, which determines the amount of the harvested solar spectrum, and the gas-phase ionization potential of the donor. This correlation is of course rather trivial: for chemically similar compounds, the optical gap and the ionization energy minus electron affinity do correlate, at least in the gas phase. Analyzing this correlation further, we can conclude that it ignores the electron-hole interaction and environmental effects. These can then be added by using additional descriptors, e.g., the molecular quadrupole moment 83,85 or the packing motif of the crystal structure. Hence, we have gained valuable information as to what (supra)molecular quantities are relevant to refine or simplify the solution of the forward problem.
The design strategy for donor/acceptor combinations for solar cells is of course not an isolated example: pre-screening of host-guest systems in organic light-emitting diodes (OLEDs, see Fig. 2(c)) has evolved in a similar manner. 86 Here, simple molecular descriptors are indeed used to rank the compounds of interest, 87 but the main simulation effort is dedicated to solving the forward problem (in this case predicting current-voltage-luminescence characteristics of an OLED) for a handful of systems. 88 Thus, by identifying the set of relevant descriptors, cheminformatics has helped us to better understand the physics of the system, to link it to molecular properties (chemistry), and to build and design a better model for solving the forward problem, as schematically illustrated in Fig. 3. Even though this does not necessarily lead to a prediction of new compounds, it does contribute to our physical and chemical understanding of the system and hence generates new ideas and models.

V. OUTLOOK
Computational materials design of soft-matter systems faces two problems: First, the computational investment necessary to accurately estimate thermodynamic properties from simulations remains significant. This limits the number of putative materials one can study and has so far prevented the development of datasets of simulation results to build machine-learning models. Second, the complexity of soft-matter systems prevents the identification of a few relevant descriptors. Instead, these materials typically exhibit complex behavior at many length and time scales, due to their small characteristic energy, as well as experimental characterization issues, e.g., polydispersity or stereoregular defects. Rational design is inherently challenging due to the many variables soft-matter systems depend on.
We foresee that the onset of high-throughput computer simulations, enabled by both hardware (e.g., GPUs) and methodological (e.g., coarse-graining) advancements, will lead to data-driven approaches that better balance energetic and entropic contributions to thermodynamics. In the meantime, we showed that data-driven approaches already augment computer simulations by improving different aspects (e.g., parametrization, sampling, or analysis). This combination of data and physics-based models makes for more than the sum of its parts. Setting aside our understanding of physics and chemistry to make way for statistics alone would be wasteful: machine learning models merely interpolate training sets, and learning the complexity of these systems would require colossal amounts of data to correlate behavior that our laws of physics can predict anyway. Probabilistic models of simulation trajectories (e.g., histogram reweighting, Markov state models) help us enforce laws of physics we already know, e.g., detailed balance in equilibrium systems, to help alleviate finite-statistics issues. Finally, data-driven approaches can be extremely beneficial when our understanding of physics and chemistry is lacking.
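The remark on enforcing detailed balance can be made concrete for a Markov state model. The sketch below uses the simplest possible estimator, symmetrizing the transition count matrix before row normalization. This is a stand-in chosen for brevity, not the maximum-likelihood reversible estimator used in practice, and it is only appropriate when the counts come from equilibrium sampling. The count values and function name are hypothetical.

```python
def reversible_transition_matrix(counts):
    """Estimate a transition matrix that obeys detailed balance.

    counts[i][j] is the number of observed transitions i -> j.
    Symmetrizing the counts guarantees pi_i * T_ij == pi_j * T_ji.
    """
    n = len(counts)
    sym = [[0.5 * (counts[i][j] + counts[j][i]) for j in range(n)]
           for i in range(n)]
    row_sums = [sum(row) for row in sym]
    total = sum(row_sums)
    transition = [[sym[i][j] / row_sums[i] for j in range(n)]
                  for i in range(n)]
    stationary = [s / total for s in row_sums]
    return transition, stationary


# Hypothetical count matrix for three states; finite sampling makes the
# forward and backward counts between states 0 and 1 unequal (10 vs. 6).
counts = [[90, 10, 0],
          [6, 80, 14],
          [0, 10, 90]]
t, pi = reversible_transition_matrix(counts)

# Largest violation of detailed balance, pi_i T_ij - pi_j T_ji.
max_violation = max(abs(pi[i] * t[i][j] - pi[j] * t[j][i])
                    for i in range(3) for j in range(3))
```

Plain row normalization of `counts` would preserve the 10-versus-6 imbalance; the symmetrized estimate trades that raw information for a model consistent with a law of physics already known to hold.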

FIG. 1. Schematic link between chemical structure (left) and material properties (right). The forward process (colored in blue) consists of evaluating (via experiments, simulations, analytical theory, or statistical modeling) a number of material properties $\phi_i$ for a chemical structure. The identification of simple descriptors, $\vartheta$, which link the chemistry to material properties, enables an empirical backward optimization, highlighted here in magenta.

FIG. 2. Cartoon representations of several examples presented in this review: (a) the competition between the unbound protein-ligand system and the complex; (b) a donor/acceptor interface for solar-cell design; and (c) emitter-host interactions in an organic light-emitting diode. Common properties of interest are denoted to the right.

FIG. 3. Role of cheminformatics in soft-matter compound design: identification of relevant descriptors for phenomenological models and (eventually) prediction of material properties for large training sets.
microscopic details, or ensemble of states-the probability distribution of this collection of states is called statistical ensemble.In what follows, we only consider equilibrium ensembles, i.e., those with no explicit time dependence in the phase-space distribution.