Leveraging multimodal microscopy to optimize deep learning models for cell segmentation

Deep learning provides an opportunity to automatically segment and extract cellular features from high-throughput microscopy images. Many labeling strategies have been developed for this purpose, ranging from the use of fluorescent markers to label-free approaches. However, differences in the channels available to each respective training dataset make it difficult to directly compare the effectiveness of these strategies across studies. Here, we explore training models using subimage stacks composed of channels sampled from larger, “hyper-labeled,” image stacks. This allows us to directly compare a variety of labeling strategies and training approaches on identical cells. This approach revealed that fluorescence-based strategies generally provide higher segmentation accuracies but were less accurate than label-free models when labeling was inconsistent. The relative strengths of label and label-free techniques could be combined through the use of merging fluorescence channels and using out-of-focus brightfield images. Beyond comparing labeling strategies, using subimage stacks for training was also found to provide a method of simulating a wide range of labeling conditions, increasing the ability of the final model to accommodate a greater range of candidate cell labeling strategies.


I. INTRODUCTION
As high-content, high-spatiotemporal cellular imaging becomes more widespread, the ability to perform cellular segmentation both quickly and accurately becomes increasingly critical for efficient cellular analysis and feature extraction. Advances in deep learning have positioned neural networks as a powerful alternative to traditional approaches such as manual or algorithmic-based segmentation. 1,2 In particular, the development of the U-Net architecture provided a significant boost to segmentation performance 3,4 and has now become the template for many modern segmentation models. [5][6][7] Advancements in our understanding of deep learning have also made the technique more accessible for smaller-scale operations. Techniques such as data augmentation have significantly reduced dataset size requirements, while improvements to training (e.g., transfer learning, initialization, dropout, hyperparameter schedulers, and optimizers) have reduced training times considerably. [8][9][10][11] Adapting deep learning techniques for cellular segmentation, however, presents some unique challenges due to the high levels of incongruity across images of cells. For instance, images of cells may appear significantly different as a result of either the imaging techniques 12 or even the imaging settings used (e.g., using a different numerical aperture or magnification of the objective lens). Furthermore, the cells themselves may demonstrate an enormous range of morphologies depending on factors such as cell type, cell confluency, and local environment. Cell appearance may also vary significantly depending on the cellular structures targeted through the experimental labels or dyes used. To address this variety, many segmentation algorithms are highly tailored to their target application and, therefore, do not experience widespread adoption. 13 The specialized nature of each dataset and the resulting solution also mean that it is difficult to compare individual labeling approaches or segmentation strategies and establish best practices. The challenges when developing a cell segmentation approach for novel applications are, therefore, threefold: (i) developing a method of assessing which imaging or labeling strategies produce the greatest segmentation accuracies, (ii) developing strategies to efficiently train models capable of maintaining an acceptable level of segmentation accuracy across a wide range of potential input configurations, and (iii) keeping dataset requirements small enough that they are feasible to retrain for new applications when necessary.
Traditional approaches to automated cell segmentation from microscope images generally fall into two main categories: fluorescence-based and label-free approaches. Fluorescence-based approaches often boast higher segmentation accuracies but require the addition of fluorescent markers. However, reliance on specific fluorescent markers confers some significant disadvantages as microscopy tends toward multiparametric, high-throughput imaging. Most notably, fluorescence-based segmentation limits multiparametric imaging by dedicating a portion of the fluorescence spectra for segmentation that might, otherwise, be used. Fluorescent markers can also induce stress on the cell, either directly or as a by-product of imaging, and are, therefore, best avoided when possible. [14][15][16] For genetically encoded sensors, the successful co-expression of the desired sensors and markers in a single cell becomes increasingly difficult in hard-totransfect cell lines, which limits the population of cells that can be both successfully segmented and analyzed. In contrast, label-free approaches (e.g., brightfield imaging) have the advantage of not requiring a fluorescent marker but often struggle with reduced performance in high confluency when the boundaries between cells are not distinct. 3 As biology moves further into multiplexed imaging, including using localized sensors of the microenvironment, it is often not feasible to sacrifice the bandwidth to an ideal segmentation marker. Thus, we envision using multiple sensors that are biologically relevant but nonideal for segmentation. For instance, genetically encoded sensors expressed in either the cytoplasm or the mitochondria may each be used to help demarcate and mask individual cells. However, the information provided by these signals is very different: cytoplasmic markers can clearly define cell boundaries in isolation but may become indistinct across adjacent cells; mitochondrial markers do not reach the limits of cell boundaries but provide a gap in fluorescence, which can be used to more broadly separate adjacent cells. The primary challenge when using this information is that the combination of auxiliary fluorescence signals available to the segmentation model may vary from experiment to experiment or cell to cell. This requires that a segmentation model is trained to maintain performance across a wide range of potential experimental labeling conditions, including the absence of fluorescence labels of any kind. This would be difficult under the traditional approach to training deep learning models, as accounting for all possible experimental configurations would require collecting and labeling a prohibitively large and expensive dataset. Fortunately, microscope images possess a unique property that may be exploited to substantially reduce this dataset requirement: the channels of microscope images exist as stacks of independent images. This means that a subset of channels from a larger image stack can be assembled to create an entirely new representative microscope image. For example, an image stack of a cell composed of channels capturing a cytoplasmic marker, a membrane marker, a mitochondrial marker, and a brightfield image can be used to simulate a cellular image where only the cytoplasmic marker is present. The ability to create representative image substacks from a larger image stack is in stark contrast to other vision-based image modalities (e.g., object detection using camera footage), where removing a specific color channel would produce an image that is no longer representative of the target data. Here, we use source image stacks composed of three fluorescence labels and brightfield images imaged at seven different focal planes to simulate a wide range of expected experimental images (see supplementary material Fig. S2) to train a robust cell segmentation model. For simplicity, we, henceforth, refer to the source image stack as the hyperlabeled image stack and subsets of images used during training as subimage stacks. Beyond reducing the dataset size, this approach to training confers some additional advantages, including the ability to compare different labeling approaches on an identical dataset. In particular, we use this dataset to explore new approaches to preprocess data entering segmentation models including the use of out-of-focus (OoF) brightfield images and the concept of merging fluorescence channels into a single input channel. To keep the implementation practical, our approach uses fewer than 300 labeled cell examples and can be trained in less than a day on a modern Graphics Processing Unit.

II. RESULTS
A. Comparing segmentation accuracies when using common cell labeling approaches Segmentation models were trained using three-channel subimage stacks generated from the ten-channel hyper-labeled image stacks [ Fig. 1(a) and Sec. IV D]. This approach allows a direct comparison of labeling strategies using a diverse set of source inputs [e.g., label-free segmentation using only brightfield images or fluorescence-based segmentation using a combination of cytoplasmic, membrane, and mitochondrial markers; Figs. 1(b) and 1(c)]. Generation of subimage stacks in this manner also permits more advanced features such as channel randomization, channel merge, channel dropout, and merge dropout [Figs. 1(d)-1(f)]. In particular, adding a channel dropout rate can be used to simulate varying expressions of a particular fluorescence channel across cells [e.g., cells that are variably labeled with a cytoplasmic tag; Fig. 1(e)]. The ability to simulate variable expression is critical for training models where any fluorescence channel may vary across cells, experiments, or even channels. This is particularly true when auxiliary fluorescence signals are used for cell segmentation.
Using this approach, we compared the performance of segmentation models trained on distinct subimage stacks loaded with a single fluorescent marker (the cytoplasmic, membrane, and mitochondria models), the in-focus brightfield channel (the brightfield model), or a combination of all three fluorescent markers (the fluorescence model), random channels (the random model), or black channels as a negative control (the "all black" model). Supplementary material Fig. S2 shows an example of each subimage stack. Training was performed for 105 epochs, which was sufficient for all models to reach at least 90% of their peak accuracy [ Fig. 2(a)]. Segmentation performance of the fluorescence, cytoplasmic, and membrane models was significantly better than that of the other approaches [ Fig. 2(b)]. More generally, singlechannel fluorescence images performed well when the fluorescence touched the cell boundary (cytoplasm and membrane, >96%), but poorly when this was not the case (mitochondria, reaching $92%). The brightfield model performed only slightly better than the mitochondrial equivalent (reaching $93%) but is not influenced by labeling conditions as is the case for the models based on fluorescent markers. To determine how variability in labeling would affect the performance of fluorescence-based models, a channel dropout rate [as outlined in Fig. 1(e)] was added to each fluorescence channel ranging from 0% to 100% with the latter representing black input channels [Figs. 2(c) and 2(d)]. As the level of dropout increased, performance converged to that of the all-black control in all models with the exception of brightfield. This was particularly devastating for models relying on a single fluorescent marker (cytoplasm, membrane, and mitochondria), where brightfield performance began to surpass that of the cytoplasmic or membrane models at $20% and $30%, respectively. In contrast, the use of three distinct fluorescent markers in the fluorescence model allowed it to suffer a dropout rate of $50% before performance dipped below that of the brightfield model. These data highlight the value of using specific fluorescence signals to improve segmentation performance as compared to brightfield alone; however, relying exclusively on fluorescence may significantly impact performance when labeling is inconsistent. Furthermore, these data suggest that only certain fluorescence signals offer an improvement over label-free approaches (e.g., membrane and cytoplasmic markers) and that brightfield may be the more effective option in others (e.g., the mitochondria).

B. Improving brightfield performance using out-offocus (OoF) brightfield imaging
Despite the aforementioned advantages of using label-free segmentation, the brightfield segmentation model performed significantly worse than the cytoplasmic, membrane, and three-marker fluorescence models. The brightfield model was particularly poor at distinguishing cell boundaries when cells were highly confluent [ Fig. 2 Examining intensity profiles revealed the likely cause here: cell boundaries are much more difficult to discern using brightfield images than their fluorescence counterparts, especially when cells are in close Once training is complete, model accuracy is assessed by comparing the average accuracy of each model over the final 10 training epochs. Note: the prediction image shown is from the example hyper-labeled image for simplicity; however, the actual validation is composed of images not seen during training. (d) Individual channels are constructed using three core techniques: (i) simple loading, which loads the channel directly, (ii) randomization, which chooses a channel at random from a preset subset of channels, and (iii) channel merging, which averages the intensities from multiple channels into a single channel. (e) Channels may also be affected by a "channel dropout" rate, which determines the probability that an individual channel will be replaced by a blank channel. (f) Channel merge may also be affected by its own "merge dropout" rate, which determines (https://www.overleaf.com/ project/5efdeef2f61ddb0001c6ddf6) the probability that a blank channel will be substituted in place of that particular channel during the merging.   Fig. 3(a)]. This mirrors human performance, where segmentation of confluent cells was much less accurate in brightfield images than in either cytoplasmic or membrane images (supplementary material Fig. S9). The segmentation of brightfield images can be impacted by focus. 22 As the focal plane shifts creating an out-of-focus (OoF) brightfield image, diffraction patterns begin to occur near the cell boundary. These diffraction patterns serve to either highlight or darken these edges, producing an intensity pattern with either peaks or valleys at the cell boundary [Figs. 3(a) and 3(b)].
Supplementing the in-focus brightfield channel with one OoF image above the plane of focus (þ3, þ5, or þ10 lm) and one below (À3, À5, or À10 lm) slightly improved performance compared to the in-focus brightfield model [Figs. 3(c) and 3(d)]. However, it was found that the optimal focal distance was cell-type dependent (supplementary material Fig. S6), with thinner AD293 cells performing better at lower offsets and thicker INS1E cells performing best at higher offsets. To account for these differences, a segmentation model (RAND Br AþB) was trained using subimage stacks composed of one of the lower OoF brightfield channels chosen at random, the in-focus brightfield image, and one of the higher OoF brightfield channels chosen at random. Although this did not result in the best overall performance of the models tested, it performed reasonably well on both cell types (supplementary material Fig. S6) presenting a more robust approach to training cells of variable morphologies or heights. These data suggest that using OoF brightfield images provide a method of improving the baseline performance of label-free segmentation models although care must be taken to account for the range of expected cell thicknesses and confluencies found across samples.

C. Fluorescence merging as a solution to uneven labeling
Despite the performance advantage afforded by incorporating OoF imaging, peak brightfield performance alone was still significantly below that of the fluorescence, cytoplasm, and membrane models [comparing Figs. 2(b) with 3(d)]. To combine the performance advantage of fluorescence with the reliability of brightfield under label-free conditions, we explored training segmentation models using combinations of fluorescence and brightfield channels (Fig. 4). For these models, subimage stacks were composed of one fluorescence channel, one in-focus brightfield channel, and one OoF brightfield channel, with the fluorescence channel composed of either an individual channel or a combination of all the fluorescence channels available [see supplementary material Fig. S2 and Figs. 1(d) and 1(e)]. In all cases, the introduction of a fluorescence channel either maintained or improved the accuracy of the brightfield alone [ Fig. 4(b)], with the addition of a cytoplasmic, membrane, or merged fluorescence channel conferring the largest advantages. The performance of fluorescence-based models (cytoplasm, mitochondria, membrane, and fluorescence) was previously found to substantially decrease under variable labeling conditions. To determine whether the inclusion of brightfield channels in the subimage stacks would guard against this effect, models were trained in a range of dropout rates as before. As the rate of dropout was increased, the fluorescence model dropped in performance significantly, while the combination models each converged to the performance of the OoF brightfield model [RND Br AþB, Fig. 4(c)]. These data suggest that training cell segmentation models using a combination of fluorescence and brightfield channels can produce a model that can effectively segment cells in the absence of fluorescent markers, while also being capable of capitalizing on fluorescent markers to improve performance when they are available. In particular, performing cell segmentation using subimage stacks composed of both a merged fluorescence and an OoF brightfield channel (the Merge þ Br model) presents itself as a promising strategy to maximize performance across a range of labeling conditions.
Cell segmentation models are robust if they can maintain segmentation accuracy across a wide range of labeling conditions, which can be simulated by altering the rate of dropout during the validation set independent of the rate used to train that model. For instance, a model trained without any dropout (i.e., a training dropout rate of 0%) can then be independently validated using a dataset where either all fluorescence channels are replaced by black equivalents (i.e., a validation dropout rate of 100%), no channels are replaced (i.e., a validation dropout rate of 0%), or a mixture (1%-99%). The robustness of a model can, therefore, be determined by its performance across the range of validation dropout rates spanning 0% to 100%. To determine how to maximize the robustness of the Merge þ Br model, 11 models were trained using a single merge dropout rate (ranging from 0% to 100% in 10% increments). Each was then validated across the full range of dropout rates [Figs. 4(d) and 4(e)]. Although models trained using low dropout rates had the greatest overall accuracy on welllabeled datasets, they quickly lost accuracy as labeling became more sparse [ Fig. 4(d)]. Similarly, the model trained without fluorescence (i.e., a merge dropout rate of 100%) lost performance as labeling was introduced, indicating that the presence of novel information was disruptive to the model if not previously encountered. Instead, models trained using a merge dropout rate between 40% and 60% were the most consistent across the labeling spectrum [ Fig. 4(e)]. A closer examination of the segmentation outputs [ Fig. 4(f)] from models trained using dropout rates of 0%, 60%, and 100% revealed that training a model using only one type of source input (i.e., 0% or 100%) obliterated its ability to segment cells when not presented with the same input. These data highlight the importance of maximizing the range of experimental conditions experienced by the model during training. These results were further validated using a test set (supplementary material Fig. S7).
To determine whether the benefits of combining a fluorescence merge channel with the brightfield were also beneficial to other cellbased datasets, an external dataset was sourced and labeled as described in Sec. IV A. For this dataset, segmentation using fluorescence alone was not significantly better than segmentation using the in-focus brightfield channel [supplementary material Fig. S8

III. DISCUSSION
The unique nature of microscope images provides many exciting opportunities for innovation when adapting deep learning techniques for cell segmentation. Notable among them is the independence of channels within a microscope image stack, such that representative subimage stacks can be generated from a larger source, or hyperlabeled, image stack [ Fig. 1(a)]. Training segmentation models from these subimage stacks confers some key advantages, including the ability to (i) directly compare labeling approaches using identical cells . Here, we demonstrate these advantages using a dataset composed of image stacks constructed from three fluorescent tags (cytoplasmic, membrane, and mitochondrial) and seven brightfield images (each at different focal planes) to both compare the relative advantages of fluorescence-and brightfield-based segmentation approaches and explore novel strategies. Central to this comparison was the trade-off between peak segmentation accuracy and consistency.
Fluorescence-based approaches boast strong accuracies for fully labeled cells [ Figs. 2(a) and 2(b)], but performed poorly as labeling became increasingly sparse [Figs. 2(c) and 2(d)]. In contrast, brightfield approaches had lower peak accuracy scores, but were labelindependent. Improving general performance was, therefore, accomplished using a two-pronged approach: first, by improving the base performance of brightfield images and second, making use of fluorescence information when available without relying on it explicitly. Improvements to brightfield performance were accomplished by augmenting brightfield images with out-of-focus (OoF) brightfield channels (Fig. 3), while fluorescence information was added through a bulk combination of all fluorescence signals (Fig. 4). What resulted was a APL Bioengineering ARTICLE scitation.org/journal/apb label-agnostic. In other words, source images with differing numbers of available fluorescence channels can be processed identically during either model training or predictive purposes. These results also serve to highlight the utility of using subimage stacks generated from the larger source image stacks to minimize dataset size requirements and compare approaches. To account for all the possible labeling combinations tested here without using subimage stacks would not only have required a prohibitively large dataset but also lost the statistical power gained by comparison strategies on identical cells. The ability to simulate various labeling conditions using subimage stacks was found to be particularly important when training models that are required to segment a more diverse set of experimental setups. For instance, we found that models trained on perfectly labeled cells boasted the highest accuracies but were also the quickest to lose performance as labeling became more inconsistent [ Fig. 4(c)]. In contrast, models trained on cells with varying degrees of fluorescence labeling (simulated using dropout rates between 40% and 60%) maintained a more consistent performance across the full range of labeling scenarios [Figs. 4(d) and 4(e)]. Although training models in this manner may not produce the highest academic segmentation accuracies, it is critical for segmentation models used in image analysis pipelines where the set of available fluorescence labels may vary considerably across experiments.
A more subtle advantage of collecting hyper-labeled image stacks for training is that they expedite the creation of ground-truth labels. In certain cases, this may even allow ground-truth labeling to surpass human level performance. For instance, segmenting brightfield images of cells in close proximity is difficult for manual operators, but simple when a cytoplasmic or membrane marker is present (see supplementary material Fig. S9). Even if the final objective is to create a label-free segmentation model, imaging cells with a cytoplasmic or membrane marker in addition to the relevant brightfield channels allows more precise ground-truth labels to be created than through the brightfield channels alone. The label-free model can then be generated by training the model using subimage stacks containing only the relevant brightfield channels. By extension, combinations of markers may even be used in conjunction with other algorithmic or deep learning segmentation approaches to automate label generation for larger datasets, whether or not those specific markers will be used to train the model.

IV. METHODS A. Dataset collection
Cells were labeled with three distinct subcellular markers for the cytoplasm (Turquoise2-tagged Apollo-NADP þ ), 17 the membrane (YFP-Mem), and the mitochondrial matrix (Mitotracker Deep-Red FM). Imaging was performed using an ASI widefield fluorescence microscope equipped with three excitation LEDs (405 nm, 505 nm, and 590 nm), polarizing slider, high-speed filter wheels (ASI) with emission filters (470/24, 535/30, and 632/60), automated stage, 60Â/1.43NA oil objective (Olympus), and an IRIS 15 camera. Additionally, brightfield images were taken of each cell at seven different focal depths (À10, À5, À3, 0, þ3, þ5, and þ 10 lm), with 0 lm representing cells in manual focus. Altogether, these 10-channels were used to form our hyper-labeled image stacks. From these images, 275 fully labeled cells (cells containing all three fluorescent markers) were manually isolated using a custom Fiji plugin to form our training and validation datasets. Ground-truth segmentation labels were generated in Gimp at the pixel level by two independent operators to minimize inter-operator bias and by alternating between the fluorescence and near-focus (À3 to 3 lm) brightfield channels to minimize biasing performance toward either imaging method. This final dataset represents a combination of 106 AD293 and 169 INS1E cells, with an additional 175 cell-free images added to the dataset as negative control. Of the cellular images, approximately 26% represented isolated cells (i.e., no direct contact with other cells), while the remaining 74% had at least one neighboring cell in the field of view. Cell images were resized to a final size of 400 Â 400 px (9.28 Â 9.28 lm 2 ) before entering the model.
An additional test set was created as described above using cells that were labeled with subcellular markers for the endoplasmic reticulum (Turquoise2-tagged Apollo-NADP þ ), the membrane (YFP-Mem), and the mitochondrial matrix (Mitotracker Deep-Red FM). This dataset was composed of 96 fully labeled cell images and 60 cellfree negative controls.
The overall methodology was confirmed on an externally sourced dataset. 18 Briefly, the macrophage cells in the dataset were labeled with two distinct subcellular markers for the cytoplasm (BODIPY 493/503) and the nucleus (Sytox). Image stacks of these cells were composed of both fluorescence channels with additional brightfield images taken at five different focal depths. The final dataset represents a combination of 264 labeled cells and an additional 122 cell-free images as negative controls.

B. Model and training parameters
Training was performed on a variation of the U-Net model, 3 which employs a descending arc (contracting path) to increase feature information followed by an ascending arc (expansion path) to combine feature and spatial information. The model used here was composed of a traditional ResNet34 19 architecture for the descending arc path and a custom ascending arc that used pixel shuffling during upsampling to reduce checkerboard artifacts 20 (see supplementary material Fig. S1). To make use of transfer learning, pretrained weights (provided in Ref. 21) from the ResNet34 model were used for the descending arc, while the ascending arc was randomly initialized.
The majority of hyperparameters used represent best-practice recommendations. 21 However, the learning rate, number of training epochs, and cross entropy weights were determined experimentally. The learning rate was scheduled as a variation of the 1 cycle policy 11,21 [supplementary material Figs. S3(a) and S3(b)]. The maximum learning rate was chosen by training the model over 100 iterations while gradually increasing the learning rate from 1 Â 10 À7 to 1 Â 10 À1 and recording the training loss. The learning rate chosen was found in the area of the steepest downward slope before loss started to rapidly increase for any of the configurations tested [supplementary material Figs. S3(c) and S3(d)]. As training was performed in two parts (5 epochs with the pretrained descending arc frozen and then 100 epochs with the entire network unfrozen), two maximum learning rates were chosen. Based on these results, the maximum learning rate was chosen conservatively to be 2 Â 10 À4 for the first 5 epochs and then increased from 2 Â 10 À6 to 1 Â 10 À4 across the network's parameter groups for the final 100 epochs.
Models were trained using a two-class weighted cross entropy loss, where weight½cls and x½cls are the weight and prediction for a particular class and x½j are the predictions for each individual class (cell or background). The use of cross-entropy loss provides the opportunity to weigh the losses related to the cell or the background differently.
Altering the background weight with respect to the cell weight produced conservative segmentations at low values and over-eager segmentations at high values [supplementary material Fig. S3(e)]. Overall accuracy declined in either direction; however, lower values were associated with fewer background pixels being misclassified as belonging to the cell, while the inverse was true for larger weights (supplementary material Fig. S4). As many cell masking applications would prefer falling slightly short of the cell boundary (false negative pixels) over surpassing the cell boundary (false positive pixels), a scaling factor of 0.5 was chosen to minimize false positive pixels without significantly impacting overall accuracy. Training and validation sets were divided using a randomized 80:20 split of the cells from the complete dataset. To determine how the split between the training and validation set may impact model accuracy, models were trained on all-black images using either a consistent (100 models) or a re-randomized (100 models) trainingvalidation split [supplementary material Figs. S3(f) and S3(g)]. Significant, normally distributed variability was found in the final training accuracy when the dataset was randomly split. When the dataset was consistently split, the accuracy did not vary. This indicates that it is critical to use identically split training and validation datasets across each model tested during a single experimental replicate. This also suggests that the relative performance of each model is a more useful metric than an absolute segmentation accuracy percentage. Based on these findings, splits were kept consistent across models during a run to reduce the impact of training-validation splits on relative performance.

C. Data augmentation
Significant data augmentation was used to keep the dataset relatively small. As outlined in supplementary material Fig. S5, examples of brightfield and fluorescence channels were passed through a variety of transforms including brightness and contrast, dihedral transforms, image flipping, image jitter, perspective warping, image rotation, skew, and symmetric warping. Parameter ranges for each transform were chosen that produced realistic cell images for both fluorescence and brightfield images. Despite the unnatural appearance, zero-padding was used when necessary to avoid twin-cell artifacts in the groundtruth labels (i.e., the presence of two labeled cells). Data augmentation beyond a squaring crop and resize was not applied during validation.

D. Subimage stack generation
Training was performed on three-channel subimage stacks assembled from the ten-channel, hyper-labeled image stacks. Creation of these three-channel subimage stacks was performed during training using model-specific loading codes. These loading codes provided independent, channel-specific instructions to a custom dataloader directing how each of the three channels would be assembled from the source dataset (see Fig. 1). The use of loading codes permitted more complex interactions with the source image stack's channels, providing the dataloader with the following abilities: • Simple loading: load a specific channel from the ten-channel source image stack. • Randomization: load a single random channel from a predefined subset of channels. • Channel dropout: perform a randomized test against a dropout percentage. Load the channel normally if passed, load a blank channel otherwise. • Merge: merge the contents from multiple channels into a single channel before loading. • Merge dropout: perform a randomized test against a dropout percentage and only include that particular channel in the merge if passed.
A complete representation of the loading combinations used can be found in supplementary material Fig. S2.

E. Plotting and metrics
Model performance was determined using their average segmentation accuracy, which was calculated as the percentage of pixels that were accurately classified as compared to the ground-truth label (also known as pixel accuracy). Unless otherwise noted, line graphs were plotted using a 10-point moving average for clarity (with empty padding for early epochs) and error bars represent the standard error of the mean. Significance was assessed using a paired t-test.

F. Ethics
No ethics approval was required for the work in this paper.

SUPPLEMENTARY MATERIAL
See the supplementary material for additional data on U-Net architecture, segmentation parameters, and test sets referenced in this article.