Design of intrinsically disordered proteins that undergo phase transitions with lower critical solution temperatures

Many naturally occurring elastomers are intrinsically disordered proteins (IDPs) built up of repeating units and they can demonstrate two types of thermoresponsive phase behavior. Systems characterized by lower critical solution temperatures (LCST) undergo phase separation above the LCST whereas systems characterized by upper critical solution temperatures (UCST) undergo phase separation below the UCST. There is congruence between thermoresponsive coil-globule transitions and phase behavior. Specifically, the theta temperatures above or below which the IDPs transition from coils to globules serve as useful proxies for the LCST / UCST values. This implies that one can design sequences with desired values for the theta temperature with either increasing (UCST) or decreasing radii of gyration (LCST) above the theta temperature. Here, we show that the Monte Carlo simulations performed in the so-called intrinsic solvation (IS) limit version of the temperature-dependent ABSINTH implicit solvation model, yields a robust heuristic for discriminating between sequences with known LCST versus UCST phase behavior. Accordingly, we use this heuristic in a supervised approach, integrate it with a genetic algorithm, combine this with IS limit simulations, and show how novel sequences can be designed that have LCST phase behavior. These calculations are aided by direct estimates of temperature dependent free energies of solvation for model compounds that are derived using the polarizable AMOEBA forcefield. To demonstrate the validity of our designs, we calculate coil-globule transition profiles using the full ABSINTH model and combine these with the Gaussian Cluster Theory to show that the designed IDPs do show LCST phase behavior.


Introduction
Intrinsically disordered proteins (IDPs) that undergo thermoresponsive phase transitions are the basis of many naturally occurring elastomeric materials 1 . These naturally occurring scaffold IDPs 2 serve as the basis of ongoing design efforts to design thermoresponsive materials 3 . Well-known examples of disordered regions derived from elastomeric proteins 4 , include the repetitive sequences from proteins such as resilins 5 , elastins 6 , proteins from spider silks 7 , titin 8 , and neurofilament sidearms 9 . Elastin-like polypeptides have served as the benchmark systems for the development of responsive disordered proteins that can be adapted for use in various biotechnology settings 10 . The interplay between sequence-encoded intermolecular and chainsolvent interactions combined with the interplay between chain and solvent entropy gives rise to thermoresponsive phase transitions that lead to the formation of coacervates 1 . Here, we show that one can expand the "materials genome" 11 through de novo design strategies that are based on heuristics anchored in the physics of thermoresponsive transitions and efficient simulation engines that apply the learned heuristics in a supervised approach. We report the development of a genetic algorithm (GA) and show how it can be applied in conjunction with multiscale computations to design thermoresponsive IDPs with LCST phase behavior.
Conformational heterogeneity is a defining hallmark of IDPs 12 . Work over the past decadeand-a-half has shown that naturally occurring IDPs come in distinct sequence flavors 12 . Indeed, IDPs can be distinguished based on their sequence-encoded interplay between intramolecular and chain-solvent interactions that can be altered through changes in solution conditions. Recent studies have shown that IDPs can be drivers or regulators of reversible phase transitions in simple and complex mixtures of protein and nucleic acid molecules 13 . These transitions are driven primarily by the multivalence of interaction motifs that engage in reversible physical crosslinks 14 .
IDPs can serve as scaffolds for interaction motifs (stickers), interspersed by spacers. Alternatively, they can modulate multivalent interactions mediated by stickers that are situated on the surfaces 15 of autonomously foldable protein domains 16 .
Thermoresponsive phase transitions arise either by increasing the solution temperature above a lower critical solution temperature (LCST) or by lowering the temperature below an upper critical solution temperature (UCST) 1  Studies of synthetic polymer systems have helped in elucidating the origins of the driving forces for and the mechanisms of LCST phase behavior 22 . A well-known example is poly-Nisopropylacrylamide (PNIPAM) 23 . Here, the dispersed single phase is stabilized at temperatures below ~32˚C by the favorable hydration of amides in the sidechains. Solvation of amides requires that the solvent be organized around the hydrophobic moieties that include the backbone carbon chain and the isopropyl group in the sidechain. The entropic cost for organizing solvent molecules around individual chains increases with increasing temperature. Accordingly, above the LCST of ~32˚C, and for volume fractions that are greater than a threshold value, the system phase separates to form a polymer-rich coacervate phase that coexists with a polymer-poor dilute phase. The driving forces for phase separation are the gain in solvent entropy through the release of solvent molecules from the polymer and the gain of favorable inter-chain interactions, such as hydrogenbonding interactions between amides in the polymer.
Tanaka and coworkers have developed a cooperative hydration approach, inspired by the Zimm and Bragg theories for helix-coil transitions 24 , to model the physics of phase transitions with LCST 25 . Cooperative hydration refers to the cooperative association (below the LCST) or dissociation (above the LCST) of water molecules that are bound to repeating units along the polymer chain 26 . Cooperativity is captured using the Zimm-Bragg formalism by modeling each repeating unit as being in one of two states viz., solvated or desolvated. In the solvated state, the repeating unit has a defined interaction strength with solvent molecules. In the desolvate state, pairs of such repeating units have defined exchange interactions. In addition, desolvation is associated with a gain in solvent entropy. The three-way interplay of direct solvent-chain interactions, the interactions among desolvated pairs of units, and the gain in solvent entropy above the LCST can be captured in a suitable physical framework that can be parameterized to describe system-specific phase transitions. Accordingly, if one has prior knowledge of the interaction energies associated with each repeat unit, one can use the framework of Tanaka and coworkers to design novel sequences with LCST behavior.
An alternative approach, which we adopt in this work, is to leverage the corollary of LCST behavior at the single chain limit 27 . At temperatures that are proximal to the LCST, the system of chain of interest will undergo a coil-to-globule transition in a dilute solution 28 . This is because the chain collapse is a manifestation of the physics of phase separation at the single chain limit. Here, we leverage this connection between phase separation and chain collapse of isolated polymer chains in ultra-dilute solutions to design novel IDPs that are predicted to undergo phase transitions with LCST phase behavior. To do so by using a multi-pronged approach that starts with improved estimates of the temperature dependencies of free energies of solvation of model compounds that mimic amino acid sidechain and backbone moieties. For this, we use free energy calculations based on the AMOEBA forcefield 29 , which is built on a second-generation polarizable model for water molecules and proteins. We incorporate these temperature dependent free energies of solvation into the ABSINTH implicit solvation model and show that thermoresponsive changes to chain dimensions, calculated in the efficient "intrinsic solvation (IS) limit" 30 yields robust heuristics that discriminates sequences with known LCST phase behavior from those that show UCST behavior. We then describe the development of a GA, an adaptation of the GADIS approach, to design novel sequences that relies on all-atom simulations, performed using the ABSINTH model in the IS limit, and learned heuristics as fitness scores. We show that distinct classes of designed sequences emerge from our approach. These are screened to filter out sequences with low disorder scores as assessed using the IUPRED2 algorithm 31 . The resulting set of sequences are analyzed using simulations based on the full ABSINTH model, which show that the designed sequences do undergo collapse transitions above a threshold temperature. The contraction ratio, defined as the ratio of chain dimensions at temperature T to the dimensions at the theta temperature, is computed as a function of simulation temperature is analyzed to extract temperature dependent two-body interaction parameters and athermal three-body interaction parameters that are used in conjunction with the Gaussian Cluster Theory (GCT) 32 to calculate system-specific phase diagrams 28 . The upshot is multiscale pipeline whereby a GA, aided by supervised learning in the form of a derived heuristic and IS limit simulations, leads to the design of novel sequences with predicted LCST phase behavior. Following a post-processing step that selects for sequences with a high confidence of being intrinsically disordered, we combine all-atom ABSINTH-T based simulations with Gaussian Cluster Theory to obtain sequence-specific phase diagrams. These last two steps allow further pruning of the sequence space derived from the designs and provide further confidence regarding the authenticity of the predicted LCST phase behavior. (1)

Results and Discussion
Here, ∆h is the enthalpy of solvation at a reference temperature T0, which is typically set to be 298K, and ∆cP is the molar heat capacity change associated with the solvation process. For simplicity, this is assumed to be independent of temperature 36 .
We build on the approach of Wuttke et al., which leverages the flexibility of the ABSINTH Gibbs-Helmholtz equation, which we then use in ABSINTH-T based simulations to design novel sequences that are predicted to show LCST behavior. Direct calculations of free energies of solvation obviate the need for making a priori assumptions when using experimentally derived values, which become especially problematic for model compounds with ionizable groups.
Results from AMOEBA-based free energy calculations for model compounds: We performed temperature dependent free energy calculations based on the Bennett Acceptance Ratio (BAR) free energy estimator 37 for direct investigation of how ∆µh varies with temperature. These calculations were performed for nineteen different model compounds that mimic the twenty sidechain moieties and the backbone peptide unit. For the free energy calculations, we used the AMOEBA forcefield, which uses atomic polarizabilities and atom-centered multipoles up to quadrupole moments 29 . The AMOEBA forcefield represents the state-of-the-art in modeling protein and peptide units in aqueous solutions using polarizability and higher-order electrostatics without having to resort to quantum mechanical calculations. Importantly, the AMOEBA water model reproduces the temperature-dependent anomalies of liquid water 38 a feature that is directly relevant for extracting temperature-dependent parameters to describe solvation. Additionally, the AMOEBA model yields accurate free energies of solvation of ions 39 and model compounds in aqueous solvents 29,38,40 . Details of the parameterization for model compounds used in this study, and the design of the free energy calculations are provided in the methods section.
Results from temperature dependent calculations of µh for the nineteen relevant model compounds are shown in Figure 1. These panels are grouped by the chemistries of functional groups within the different model compounds. The enthalpy of hydration (∆h) at T0 = 298K and the temperature independent heat capacities of hydration (∆cP) were extracted for each model compound by fitting the calculated temperature dependent free energies of solvation to the integral of the Gibbs-Helmholtz equation -see equation (1). The results are summarized in Table 1.
As expected for hydrophobic hydration 36 Table 1. In the legends we use the three letter abbreviations for each of the amino acids. Here, BB in panel (d) refers to the backbone moiety, modeled using N-methylacetamide, that mimics the peptide unit. * As with the default ABSINTH model, in ABSINTH-T, the rFoS values we used for ionizable residues are offset from the calculated ∆µh by a fixed constant of -30 kcal/mol. This, as was shown in the original work, is required to avoid the chelation of solution ions around ionizable residues. This "feature" remains a continuing weakness of the ABSINTH paradigm and one that we hope to remedy through suitable generalization of the model used in ABSINTH to interpolate between fully solvated and fully desolvated states.

Intrinsic solvation (IS) model, a non-electrostatics version of ABSINTH-T as an efficient heuristic for discriminating IDPs with LCST versus UCST behavior
In the single chain limit, accessible in dilute solutions, polypeptides that show LCST phase behavior undergo collapse above a system specific theta temperature, whereas polypeptides that show UCST phase behavior expand above the system specific theta temperature 1,28 . A GADISlike strategy 21 for de novo design of polypeptide sequences with LCST phase behavior would involve ABSINTH-T based all-atom simulations to evaluate whether an increase in temperature leads to chain collapse. In effect, the fitness function in a GA comes from evaluation of the simulated ensembles as a function of temperature. This becomes prohibitively expensive computationally. Accordingly, we pursued a pared down version of ABSINTH-T, which is referred to as the intrinsic solvation (IS) limit of the model 30 . The IS limit was introduced to set up sequence and composition specific reference models with respect to which one can use meanfield models to uncover how desolvation impacts IDP ensembles 30,49 . In effect, the IS limit helps us map conformations in the maximally solvated ensemble and assess how this ensemble changes as a function of temperature. In the IS limit, the energy in a specific configuration for the sequence of interest is written as: ; The only difference between the full model, see equation (2), and the IS limit is the omission of the Wel term. This increases the speed of simulations by roughly two orders of magnitude. Next, we asked if ensembles obtained from temperature dependent simulations performed in the IS limit could be used to obtain a suitable heuristic that discriminates sequences with LCST versus UCST behavior. These simulations were performed for a set of thirty sequences (see Table S1) that were previously shown by Garcia Quiroz and Chilkoti to have LCST and UCST phase behavior 17 . The results are summarized in Figure 2. As shown in panel (a) of Figure   2, the radii of gyration (Rg), suitably normalized for comparisons across different sequences of Given the range of sequences covered in the calibration based on the IS limit, we pursued an approach whereby we use slopes of vs. T as a heuristic to guide the design of a genetic algorithm to find new sequences with LCST phase behavior.  Table S1 in the supporting information. (b) The slope m of the RgN 0.5 vs. temperature profiles. These slopes fall into two distinct categories, one for those with LCST phase behavior (blue) and another for those with UCST phase behavior (red). The gray region corresponds to the values of m that clearly demarcate the two categories of sequences.

GA for the design of IDPs that are likely to have LCST phase behavior
Motivated by previous successes using the GADIS algorithm 21 for designing sequences with bespoke amounts of intrinsic secondary structure contents, we adapted a GA for exploring sequence space to discover candidate IDPs with predicted LCST phase behavior. To introduce the GA and demonstrate its usage, we set about designing novel sequences that are repeats of pentapeptide motifs. We focused on designing 55-mers, i.e., sequences with 11 pentapeptides. To keep the exercise simple, we focused on designing polymers that are perfect repeats of the pentapeptide in question. The GA used in this work is summarized in Figure 3 and the details are described below.
Here, N is the number of amino acids in each sequence, n is the number of replicas used in the simulation, and Ti is the temperature associated with replica i. The slope m was used to select 100 out of the 200 sequences that were chosen at random initially. The picking probability p was based on the following criterion: ; Here, c = 400 in units that are reciprocal to m, and m0 is set to -6.9 ´ 10 - allowed to evolve for multiple iterations until the convergence criteria were met. These include the generation of at least 250 new sequences, each with a value of m being less than -5.0 ´ 10 -3 åK -1 .
For the results presented here, six iterations were sufficient to meet the prescribed convergence criteria. The picking probability p determines the selection pressure encoded into the GA. There needs to be an optimal balance between the two extremes in selection pressure. High selection pressures can lead to early convergence to a local optimum whereas low selection pressures can drastically slow down convergence 51 . The use of a single evolutionary operator can lead to a single sequence becoming the dominant choice. The number of iterations that pass before the emergence of a single sequence is known as the takeover time 51 . High selection pressures lead to low takeover times and vice versa. The issue of a single dominant individual emerging because less of a concern in sequence design given the high dimensionality of sequence space. We tuned the choice of c and m0 to ensure that candidate sequences with putative UCST phase behavior can be part of the offspring, thus lending diversity to sequence evolution by the GA. categories based on their sidechain chemistries i.e., basic residues in blue bars, acidic residues in red bars (although these are not visible since they are not selected), polar residues in green, Pro and Gly in purple, and aliphatic as well as aromatic residues in cyan. Within each group, the bars are sorted in descending order of the mean numbers of occurrences in the designs. Figure 4 quantifies the progress of the GA through each iteration of the design process. The quantification is performed in terms of cumulative distribution functions, which for each iteration will quantify the probability that the emerging sequences have associated slope values that are less than or equal to a specific value. The rightward shift in each iteration is indicative of the improved fitness vis-à-vis the selection criterion, which is the lowering of m.

Panel (a) in
As a final step in the sequence design, we added a post-processing step to increase the likelihood that the designed sequences are bona fide IDPs. Accordingly, we used the disorder predictor IUPRED2 31 to quantify the disorder scores for each of the designed sequences.
IUPRED2 yields a score between 0 and 1 for each residue, and only sequences where over half of the residues in the repeat are above 0.5 were selected as the final set of designs as sequences predicted to have LCST phase behavior. A particular concern with designing sequences for experimental prototyping is the issue of aggregation / precipitation. To ensure that designs were unlikely to create such problems, we calculated predicted solubility scores using the CamSol program 52 and found that all sequences that were selected after the post-processing step also have high solubility scores. This provides confidence that the designed IDPs are likely to show phase behavior via liquid-liquid phase separation above system-specific LCST values without creating problems of precipitation / aggregation. preference for Arg over Lys, which is concordant with the distinct temperature dependent profiles for ∆µh (Figure 1) and the large positive heat capacity of Arg, which is roughly three times larger than that of Lys (Table 1). Interestingly, if we fix the positions of Pro and Gly and select for residues in XPXXG or other types of motifs that are inspired by previous work on elastin-like polypeptides, the design process often converges on repeats that are known to be generators of polypeptides with bona fide LCST phase behavior (data not shown). This observation, and the statistics summarized in Figure 4b indicate that the design process uncovers sequences that are likely to have LCST phase behavior. The actual sequences of the repeats, color-coded by their Hamming distance-based groupings, are shown in Figure 6. There are two features that stand out. First, sequences deviate from being repeats of VPGVG, which is the elastin-like motif. Second, we find that different sequence permutations on identical or similar composition manifolds emerge as candidates for LCST phase behavior. This observation suggests that at least in the IS limit it is the composition of each motif rather than the precise sequence that underlies adherence to the selection pressure in the GA.

Panel (b) in
Interestingly, our observations are in accord with results from large-scale in vitro characterizations of sequences with LCST phase behavior 53 . These experiments show that composition, rather than the precise sequence, is a defining feature of LCST phase behavior -a feature that is distinct from sequences that show UCST phase behavior 3 .   Next, we used the estimates of Tq in conjunction with the Gaussian Cluster Theory of Raos and Allegra. We extracted the two and three-body interaction coefficients by fitting the contraction ratio as calculated from simulations using the formalism of the Gaussian Cluster Theory and this yields sequence-specific estimates of B, the two-body interaction coefficient, and w, the threebody interaction coefficient (see panels (a) -(c) in Figure 8). These parameters were then deployed to compute full phase diagrams using the numerical approach developed by Zeng et al., 28 and adapted by others 54 . The results are shown in panels (d) -(f) of Figure 8. The abscissae in these diagrams denote the bulk polymer volume fractions whereas the ordinates quantify temperature in terms of the thermal interaction parameter . Here, which is positive for T > Tq, B is the temperature-dependent two-body interaction coefficient inferred from analysis of the contraction ratio, and nK is the number of Kuhn segment in the single chain, which we set to be five. Note that B is negative for temperatures above Tq. Accordingly, the thermal interaction parameter is positive above Tq as well as the critical temperature Tc. Therefore, comparative assessments of the driving forces for LCST phase behavior can be gleaned by comparing the sequence-specific values of and the volume fraction at the critical point.
It follows that the sequences can be arranged in descending order of the driving forces as (TPTGM)11, (RTAMG)11, and (PTPLV)11, respectively. Importantly, full characterization of the phase behavior using a combination of all-atom simulations and numerical adaptation of the Gaussian Cluster Theory shows that, in general, sequences designed to have LCST phase behavior, do match the predictions (see Figure 8).

Discussion
In this work, we have adapted a GA to design novel sequences of repetitive IDPs that we predict to have LCST phase behavior. Our method is aided by a learned heuristic that was shown to provide clear segregation between sequences with known LCST vs. UCST phase behavior. This heuristic is the slope m of the change in RgN 0.5 vs. T from simulations of sequences performed in the IS limit of the ABSINTH-T model. We use the heuristic in conjunction with IS limit simulations to incorporate a selection pressure into the GA, thereby allowing the selection of sequences that are "fit" as assessed by the heuristic to be predictive of LCST phase behavior.
Here, we presented one instantiation of the GA and used it to uncover 64 novel sequences that can be grouped into four major classes and several minor classes (Figure 6). We then focused on four sequences, one each from each of the four major classes and characterized temperature dependent coil-globule transitions. These profiles, analyzed in conjunction with recent adaptations of the Gaussian Cluster Theory 32 , allowed us to extract sequence-specific values for theta temperatures, temperature dependent values of the two body interaction coefficients, and threebody interaction coefficients. We incorporated these parameters into our numerical implementation 28 of the Gaussian Cluster Theory to calculate full phase diagrams for three sequences. These affirm the predictions of LCST phase behavior and the sequence-specificity in control over the driving forces for thermoresponsive phase behavior.
Our overall approach is aided by the following advances: We used the AMOEBA forcefield 29 to obtain direct estimates of temperature dependent free energies of solvation for model compounds used to mimic sidechain and backbone moieties. These temperature dependent free energies of solvation were used in conjunction with the integral of the Gibbs-Helmholtz equation to obtain model compound specific values for the enthalpy and heat capacity of hydration. The AMOEBA calculations represent the first direct estimates of temperature dependent hydration free energies using a polarizable forcefield, and they help circumvent many of the assumptions that have to be made in dissecting thermodynamic data for whole salts or protonated / deprotonated versions of acidic / basic groups in order to obtain experimentally derived estimates for temperature independent 20 as well as temperature dependent µh values 19 . The only simplifications we make here are (a) the usage of the integral of Gibbs-Helmholtz equation to fit the temperature dependent µh values and (b) assuming that ∆cP is independent of temperature. To test the validity of these assumptions we will need to deploy novel BAR based estimators that allow us to quantify the temperature dependencies of enthalpies and entropies 55 . However, obtaining accurate estimates of the decompositions requires orders-of-magnitude more sampling as shown by Wyczalkowski et al., 55 and we will need to extract second moments of the enthalpies and entropies in order to obtain independent estimates of the temperature dependence of ∆cP.
The methods we present here are a start toward the integration of supervised learning to leverage information gleaned from systematic characterizations of IDP phase behavior and physical chemistry based computations that combined all-atom simulations with improvements such as ABSINTH-T, and theoretical calculations that allow us to connect single chain coil-globule transitions to full phase diagrams 28 . The heuristic we have extracted from IS limit simulations helps with discriminating sequences with LCST vs. UCST phase behavior. These simulations are sufficient for IS limit driven and GA aided designs of sequences that are expected to have LCST phase behavior. This is because composition as opposed to the syntactic details of sequences play a determining role of LCST phase behavior 3 . Recent studies have shown that even the simplest changes to sequence syntax can have profound impacts on UCST phase behavior 56 . This makes it challenging to guide the design of sequences with predicted UCST phase behavior that relies exclusively on IS limit simulations. One will need to incorporate simulations based on either transferrable 5757 or learned coarse-grained models 58 as a substitution for the IS limit simulations.
This approach, although easy to articulate, comes with challenges because one has to be sure that the coarse-grained models afford the requisite sequence specificity without compromising  65 . Of course, the proof of the validity / accuracy of designs and predictions will have to come from experimental work geared toward testing the predictions / designs. These efforts -that leverage high-throughput expression of these de novo sequences in E. coli and in situ characterization of their phase behavior -are underway 66 . Initial experimental investigations suggest that the designs reported here and those that will emerge from application of the methods deployed in this work do indeed show LCST phase behavior. Detailed reports of these experimental characterizations will follow in separate work.

AMOEBA forcefield parameterization for the model compounds of interest
To obtain values of free energies of solvation from AMOEBA simulations, we first derived the AMOEBA force field parameters for the model compounds listed in Table 1 of the main text. The parameters for methane, methanol, ethanol, toluene and p-Cresol are taken from previous work 38 , which is part of the amoeba09.prm parameter file in the released TINKER package 67 . The parameters for other model compounds are derived as follows. The AMOEBA potential energy function is composed of bonded and non-boned terms 29 . Bonded terms include bond stretching, angle bending, bond-angle stretch-bending coupling, out-of-plane bending, and torsional rotation.
Non-bonded terms include van der Waals, permanent electrostatics and induced dipole polarization. The molecular structures were first fully optimized at MP2/6-31G* level of theory 68 .
The molecular structures were first fully optimized at MP2/6-31G* level of theory. Force constant parameters for bonded interactions, van der Waals interactions, and polarizabilities were assigned from the existing Poltype library, and the equilibrium bonded values were from optimized geometry. Based on the optimized geometry for each model compound, we performed single point energy calculations at the MP2/cc-pvtz level of theory to obtain compound specific electron densities. The initial multipole (charge, dipole and quadrupole) parameters were obtained from those electron densities by performing distributed multipole analysis calculation using GDMA program 69 . The atomic dipole and quadrupole moment parameters were further refined to reproduce the MP2/aug-cc-pvtz electrostatic potential (ESP). It is worth mentioning that harmonic restraint was applied to dipole and quadrupole in the ESP fitting, which is the new feature of the potential program in the TINKER software 67 . All the derived parameters involving bonded and non-bonded terms were then collected together. Finally, torsional parameters were obtained by comparing the conformational energy profile of quantum mechanical and AMOEBA based calculations. Along each torsional angle of the molecule, 12 conformers were generated. Torsionrestrained optimization was performed by employing HF method combining with 6-31G* basis set, followed by single point energy calculation using ωB97XD functional 70 with a larger basis function 6-311++G(d,p). All the quantum mechanics calculations were performed using the Gaussian 09 software package 71 . The parameterization procedure has been automated in the Poltype tool (https://github.com/pren/poltype/tree/poltype2).

Set up of molecular dynamics simulations using AMOEBA
All the AMOEBA simulations were performed using the TINKER-OpenMM package 72 .
Each model compound was solvated in a cubic water box with periodic boundary conditions. The initial dimensions of the central cell were set to be 30×30×30 Å 3 . Following energy minimization, molecular dynamics simulations were performed using integrators designed for the isothermalisobaric ensemble (NPT) with the target temperature being between 273 and 400 K depending on the temperature of interest and the target pressure being 1 bar. Integrations of the equations of motion were performed using the multiple time step RESPA integrator 73 . The temperature and pressure were controlled using a stochastic velocity rescaling thermostat 74 and a Monte Carlo constant pressure algorithm 75 , respectively. The particle mesh Ewald method 76 , with B-spline interpolation 77 , with a real space cutoff of 7 Å was used to compute long-range corrections to electrostatic interactions. The cutoff for van der Waals interactions was set to be 12 Å. The integrating timestep is set to be 2.0 fs and coordinates are saved every 1.0 ps.

Free energy calculations
We used the BAR 37 method to quantify the free energies of solvation for the model

Supporting Information
Please see supporting information for the sequences shown by Garcia Quiroz and Chilkoti to have UCST and LCST phase behavior that were used for IS limit simulations in this study.

Data Availability Statement
The data that support the findings in this study are available within the article and in the Supporting Information.