Designing lipid nanoparticles using a transformer-based neural network

The RNA medicine revolution has been spurred by lipid nanoparticles (LNPs). The effectiveness of an LNP is determined by its lipid components and their ratios; however, experimental optimization is laborious and does not explore the full design space. Computational approaches such as deep learning can be greatly beneficial, but the composite nature of LNPs limits the applicability of existing single-molecule-based algorithms. Addressing this, our approach integrates the multi-component and multimodal features of composite formulations such as LNPs to predict their performance in an end-to-end manner. Here we generate one of the largest LNP datasets (LANCE) by varying LNP formulations to train our deep learning model, COMET. This transformer-based neural network not only accurately predicts the efficacy of LNPs but is also adaptable to non-canonical LNP formulations such as those with two ionizable lipids and polymeric materials. Furthermore, COMET can predict LNP performance in a cell line outside of LANCE and predict LNP stability during lyophilization using only small training datasets. Experimental validation showed that our approach can identify LNPs that exhibit strong protein expression in vitro and in vivo, promising accelerated development of nucleic acid therapies with extensive potential across therapeutic and manufacturing applications.

Main

For clinical1, logistical2 and translational3 success, most drug substances are formulated into drug products with multiple ingredients. Our analysis shows that, on average, eight excipients are present in commercial products4. Given choices of ingredients and their ratios, formulation design presents a vast search space. While high-throughput approaches exist5,6,7,8, they become intractable with increasing formulation complexity.

Deep learning, a branch of machine learning suited for multifactorial data, can help address this challenge. Although widely used in drug discovery and materials science9,10,11, its application to multi-component drug products is limited. We apply deep learning to RNA-based lipid nanoparticles (LNPs), a promising class of drug products12,13,14, underscored by the success of SARS-CoV-2 messenger RNA vaccines15,16,17,18. LNPs comprise four lipid classes, each crucial for cytosolic RNA delivery12,14,19. Their function depends on lipid structures and ratios20,21,22, with composition requiring re-optimization per application1,23.

Given these challenges, early efforts applying machine learning to drug delivery have emerged, including recent work from our group24 and others25. As lipid chemical structure has a major impact on transfection, one line of modelling approaches focuses primarily on individual molecules25,26. These approaches have been remarkably successful in identifying new lipids and chemical substructures that would otherwise not be expected to produce high transfection efficacy26. Some models rely on manually selected features such as physicochemical properties26,27,28,29. These approaches face limitations: restricted LNP scope, underuse of raw data, synthetic feasibility constraints and lack of formulation composition insights. To unlock deep learning’s full utility for LNP design, a model must represent complete formulations and generalize across predictive scenarios.

We introduce the Composite Material Transformer (COMET), which encodes molecular structures, molar percentages and synthesis parameters in a transformer-based architecture. COMET is trained on the Lipid–RNA Nanoparticle Composition and Efficacy (LANCE) dataset of over 3,000 LNPs, including those with dual-ionizable lipids. COMET accurately predicted excluded samples and, via in silico screening of 50 million virtual LNPs, identified top candidates with high in vitro and in vivo expression. Unlike lipid-focused models, COMET showed versatility with polymeric materials, predicting efficacy from limited data. Using two smaller datasets (~10% of LANCE)—one in a gastrointestinal cell line, another post-lyophilization—we demonstrate COMET’s robust adaptability. With its flexibility and broad utility, COMET promises to accelerate complex drug product development.

Results

LNP design with COMET

Composite materials such as LNPs comprise multiple components defined by the identity of constituent compounds, their relative ratios and formulation parameters such as nitrogen-to-phosphate (N/P) and mixing ratio (Fig. 1a). Previous studies often focused on a single component (for example, ionizable lipid)25; by contrast, we developed COMET to holistically represent LNPs and predict their efficacy via a flexible neural architecture (Fig. 1b). Lipid structures are encoded into molecular embeddings, while molar percentages are transformed into composition embeddings. These are concatenated to represent each lipid. Formulation-wide features, such as N/P and phase mixing ratios, are also embedded and fed into the model (Methods).

Fig. 1: COMET predicts an LNP’s efficacy by inferring its formulation-wide properties, components’ molecular structures and compositions.

a, LNPs are synthesized by mixing nucleic acid (for example, mRNA) with a lipid solution typically composed of four lipid classes. Key properties, such as efficacy, depend not only on the lipids’ structure but also on their relative ratios and other mixing parameters (for example, N/P and aqueous/organic ratio). b, COMET can predict properties of composite materials such as LNPs from their components, compositions and other parameters. c, With high-throughput screening, COMET’s training data are made up of four parts, each spanning a complementary LNP formulation space. d, Thirteen lipid molar ratios used in the majority of the LNP training dataset. Created with BioRender.com.

COMET adopts a transformer design similar to language models such as ChatGPT30,31, where chemical components and formulation features act as discrete tokens. It accommodates arbitrary numbers of components, including dual-ionizable lipid formulations. An LNP-level Classify ([CLS]) token attends to component and formulation vectors via self-attention32, with the final prediction made through a task-specific prediction head. For multitask learning, distinct CLS tokens and heads are used per task to capture differences across cell types while sharing model knowledge.

To train COMET, we use a pairwise ranking objective that learns to rank LNPs by efficacy (Methods). Noise augmentation improves robustness against experimental noise, and a label margin captures efficacy differences. In multitask settings, CAGrad helps align gradients across tasks33. We further enhance performance with an ensemble of COMET models, especially beneficial in low-data regimes34,35.

LANCE dataset

To train COMET, we developed a high-throughput pipeline to generate the LANCE dataset. Each LNP in this dataset encapsulated a firefly luciferase (FLuc) messenger RNA (mRNA), and transfection efficacy was quantified by bioluminescence readouts. The LNPs were synthesized using automated fluid handling and tested in vitro. The LANCE dataset spans a wide design space structured in four parts: lipid identities (parts 1 and 2), synthesis parameters such as N/P and aqueous/organic mixing ratio (part 3), and lipid molar percentages (part 4) (Fig. 1c). Thirteen distinct molar ratios were used (Fig. 1d and Supplementary Table 13), generating over 6,000 labelled data points, including 3,028 LNPs evaluated in mouse DC2.4 and B16-F10 cells. Bioluminescence values were log-transformed and normalized between 0 and 1. Full methodological details are in Methods.
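
The label preprocessing described above can be sketched as follows. The text states only that bioluminescence values were log-transformed and normalized between 0 and 1; the base-10 logarithm, the min-max form of the normalization and the function name `normalize_log_signal` are assumptions made for illustration, and the readouts are toy values.

```python
import math

def normalize_log_signal(readouts):
    """Log-transform raw bioluminescence readouts and rescale them to [0, 1].

    Assumptions (not stated in the text): base-10 logarithm, min-max scaling.
    """
    logs = [math.log10(r) for r in readouts]
    lo, hi = min(logs), max(logs)
    if hi == lo:  # all readouts identical: there is no spread to rescale
        return [0.0 for _ in logs]
    return [(v - lo) / (hi - lo) for v in logs]

# Toy readouts spanning four orders of magnitude (placeholder values).
labels = normalize_log_signal([1e2, 1e4, 1e6])
```

Working in log space keeps a few very bright wells from dominating the label range, which is one plausible reason for this kind of transform.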

In DC2.4 cells, LNPs with CKK-E12 or C12-200 as ionizable lipids outperformed those with DLin-MC3-DMA (Fig. 2a). Helper lipids (for example, 1,2-dioleoyl-sn-glycero-3-phosphoethanolamine (DOPE)), sterols (cholesterol/beta-sitosterol) and polyethylene glycol (PEG) lipids (C14-PEG) also had substantial effects on efficacy. Hence, the LANCE dataset successfully captured these previous observations12,20,36,37. Molar ratios had notable but formulation-specific effects on efficacy; no single ratio consistently outperformed across all lipid combinations. This highlights the need for context-specific optimization. Altering aqueous/organic phase ratios from 3:1 to 1:1 affected efficacy in helper-lipid-rich LNPs (Fig. 2b), although the effect diminished at lower helper lipid content (Fig. 2c) and was negligible for 1,2-distearoyl-sn-glycero-3-phosphocholine (DSPC)-based formulations. The N/P ratio, by contrast, showed no clear efficacy association (Fig. 2d). We further evaluated five-component formulations by adding a second ionizable lipid (3:2 ratio) alongside DOPE, cholesterol and C14-PEG. Potent ionizable lipids such as CKK-E12 and C12-200 enhanced weak lipids such as L319 or DLin-MC3-DMA (Fig. 2e). Notably, CKK-E12/L319 combinations outperformed both CKK-E12-only and dual-strong-lipid pairings, particularly at 25% total ionizable lipid content.

Fig. 2: Effect of formulation parameters on LNP efficacy.

Cross-cell line comparison of DC2.4 and B16-F10 results revealed 772 formulations that were in the top 30th percentile in both (Fig. 2f), commonly containing C12-200, DOPE, cholesterol and C14-PEG. Formulations with selective activity were also identified: SM102 appeared frequently in DC2.4-high but B16-F10-low cases, while DC-cholesterol was enriched in the opposite group. In summary, transfection efficacy is governed not only by lipid identities but also by molar ratios and synthesis conditions—motivating the need for models such as COMET that can integrate and learn from multifactorial design spaces.

Performance of COMET

We evaluated COMET on a random 20% test split of LNPs, with 10% used for validation and the remaining 70% for training. When trained to predict DC2.4 efficacy, COMET accurately ranked the test samples, achieving a Spearman coefficient of 0.873 and a Pearson coefficient of 0.866 (Fig. 3a). To simulate a more realistic drug discovery scenario, we curated a ‘hits-test’ split where the top 10% of DC2.4 LNPs were withheld as ‘hits’, alongside a random 10% of ‘non-hits’. COMET retained strong predictive power, yielding a Spearman coefficient of 0.725 and a Pearson coefficient of 0.820 (Fig. 3a). Its ability to classify ‘hits’ into the top half of ranked predictions reached 79.6% accuracy.

Fig. 3: COMET predicts efficacy accurately and finds new hits.

a, Performance of COMET on different DC2.4 test data splits after training on DC2.4 LNP efficacy data. b,c, Ablation results showing how different modules contribute to COMET’s ranking performance (b) and accuracy (c) on DC2.4 ‘hits-test’ test set. d, Schematic of in silico hit selection that begins with a large virtual LNP library, in silico screening with COMET and filtering based on LNPs’ properties such as efficacy and diversity. e,f, In vitro validation of exploratory in silico hits in DC2.4 (e) and B16-F10 (f) cells. g–i, Lead optimization around three top-performing LANCE LNPs in DC2.4 cells, namely LA-388 (g), LA-580 (h) and LA-2791 (i), was performed with COMET, each of which yielded three new formulations (denoted with the prefix DO) and evaluated experimentally. j–l, Lead optimization around three top-performing LNPs in B16-F10, namely LA-4 (j), LA-2638 (k) and LA-3062 (l), was performed with COMET, each of which yielded three new formulations (denoted with the prefix BO) and evaluated experimentally. Twenty replicates of training run with different random seeds were used for evaluation in a–c. Four technical replicates were used for e–l. Error bars are s.e.m. Statistical significances in e–l were determined using a one-way analysis of variance (ANOVA) with post-hoc Tukey test. MT, multitask; RO, regression objective; PO, pairwise ranking objective; CG, CAGrad; NA, noise augmentation; LM, label margin. Panel d created with BioRender.com.
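
The evaluation metrics quoted in this section can be illustrated with a small, dependency-free sketch. The Pearson and Spearman definitions are the standard ones. The function `hit_top_half_accuracy` is one possible reading of "classify hits into the top half of ranked predictions" and is an assumption, as are all function names and the toy numbers.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            r[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return r

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

def hit_top_half_accuracy(scores, is_hit):
    """Fraction of true hits whose predicted score lands in the top half."""
    cutoff = sorted(scores, reverse=True)[len(scores) // 2 - 1]
    hit_scores = [s for s, h in zip(scores, is_hit) if h]
    return sum(s >= cutoff for s in hit_scores) / len(hit_scores)

# Toy predictions and labels (placeholders, not LANCE data).
predicted = [0.9, 0.2, 0.8, 0.1]
measured = [1.0, 0.9, 0.3, 0.2]
is_hit = [True, True, False, False]
rho = spearman(predicted, measured)
acc = hit_top_half_accuracy(predicted, is_hit)
```

Because the model is trained to rank rather than to regress, rank-based measures such as Spearman are the more direct check of what it was optimized for.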

In a multitask learning set-up using both DC2.4 and B16-F10 labels, COMET’s performance improved on the DC2.4 test set, achieving a Spearman coefficient of 0.762 and a Pearson coefficient of 0.860 (the ‘pairwise ranking objective + multitask’ model; Supplementary Fig. 1). These improvements scaled with the size of the additional B16-F10 data (Supplementary Fig. 2), highlighting the benefit of shared representation learning across related tasks38. We performed ablation studies to understand the contribution of each modelling component. Replacing the pairwise ranking objective with a regression objective slightly reduced performance (Fig. 3b,c). The model enhancements (ensemble learning, noise augmentation, label margin and CAGrad regularization) each contributed to performance gains, with ensembling having the greatest effect (Fig. 3b,c). Gains from ensembling plateaued beyond five models (Supplementary Fig. 3), and so we used an ensemble of five COMETs for all in silico screening.

To probe whether COMET learns meaningful structure–activity relationships, we performed adversarial perturbations. When lipid identities in training samples were partially shuffled, model performance degraded monotonically (Supplementary Fig. 4). More aggressive shuffling across lipid classes led to a further drop, and corrupting additional formulation parameters (for example, N/P ratio, molar % and phase ratios) impaired performance even more (Supplementary Fig. 5). COMET’s generalizability was tested by excluding LNPs containing selected ionizable lipids (MC3, SM-102 and CKK-E12) and a sterol (beta-sitosterol) from training. The model maintained good performance on this chemically distinct test set, with Pearson/Spearman correlations of 0.779/0.776 for DC2.4 and 0.502/0.509 for B16-F10 efficacy prediction (Supplementary Fig. 6a,b).
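
The identity-shuffling perturbation can be sketched as below. This is a minimal illustration under stated assumptions: formulations are represented as plain dictionaries, the shuffle permutes one lipid class among a randomly chosen fraction of samples while leaving labels untouched, and the function name, field names and label values are all hypothetical.

```python
import random

def shuffle_lipid_identities(formulations, lipid_class, fraction, seed=0):
    """Return a copy in which `fraction` of samples exchange their `lipid_class` entry.

    Labels are not changed, so a shuffled sample may carry a lipid identity
    that no longer matches its measured efficacy.
    """
    rng = random.Random(seed)
    corrupted = [dict(f) for f in formulations]  # shallow copies; input is not mutated
    n_pick = int(round(fraction * len(corrupted)))
    picked = rng.sample(range(len(corrupted)), n_pick)
    identities = [corrupted[i][lipid_class] for i in picked]
    rng.shuffle(identities)
    for i, lipid in zip(picked, identities):
        corrupted[i][lipid_class] = lipid
    return corrupted

# Toy records; the label values are placeholders, not measurements.
data = [
    {"ionizable": "C12-200", "helper": "DOPE", "label": 0.9},
    {"ionizable": "CKK-E12", "helper": "DOPE", "label": 0.8},
    {"ionizable": "DLin-MC3-DMA", "helper": "DSPC", "label": 0.3},
    {"ionizable": "L319", "helper": "DSPC", "label": 0.2},
]
noisy = shuffle_lipid_identities(data, "ionizable", fraction=0.5, seed=1)
```

Sweeping `fraction` from 0 to 1 and retraining at each level would give the kind of degradation curve the paragraph describes.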

Lastly, we compared COMET against simpler baselines. Both random forest and COMET outperformed k-nearest neighbours, with COMET’s single-model performance comparable to random forest (Supplementary Fig. 7). An ensemble of COMETs offered improved correlation metrics for both cell lines, although top 50% classification accuracy remained similar to that of random forest.

To determine whether COMET learns transfection efficacy rather than proxying classic LNP properties, we analysed correlations between COMET predictions and nanoparticle characteristics such as encapsulation efficiency, size, polydispersity and zeta potential. COMET’s predictions were only weakly correlated with these properties—except for particle size (correlation 0.6530)—but were highly correlated with actual transfection data (>0.95; Supplementary Figs. 8 and 9). This confirms that COMET predicts efficacy itself, not just physical proxies.

Experimental validation of COMET in silico hits

To evaluate COMET’s ability to discover effective formulations beyond LANCE, we screened a virtual library of nearly 50 million LNPs and validated selected in silico ‘hits’ experimentally (Fig. 3). Exploratory hits were chosen by excluding LNPs similar to top-performing LANCE formulations and then selecting chemically diverse candidates predicted by COMET to be highly efficacious (Methods, Fig. 3d and Supplementary Fig. 10).
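
The screen-then-filter procedure can be sketched in a few lines. Everything specific here is an assumption made for illustration: the option lists are a tiny stand-in for the roughly 50-million-member library, similarity is measured as the number of shared formulation fields, `toy_score` is a placeholder and not COMET, and `max_shared` is a hypothetical threshold.

```python
from itertools import product

def n_shared(a, b):
    """Number of formulation fields on which two candidates agree."""
    return sum(a[k] == b[k] for k in a)

def select_exploratory_hits(library, score_fn, known_hits, n_select, max_shared=2):
    """Greedily pick high-scoring candidates that differ from known hits and from each other."""
    ranked = sorted(library, key=score_fn, reverse=True)
    selected = []
    for cand in ranked:
        if any(n_shared(cand, ref) > max_shared for ref in known_hits + selected):
            continue  # too close to a known hit or to an earlier pick
        selected.append(cand)
        if len(selected) == n_select:
            break
    return selected

# Enumerate a small virtual library from per-field options.
options = {
    "ionizable": ["C12-200", "CKK-E12", "L319"],
    "helper": ["DOPE", "DSPC"],
    "sterol": ["cholesterol", "beta-sitosterol"],
    "np_ratio": [5, 10, 20],
}
library = [dict(zip(options, combo)) for combo in product(*options.values())]

def toy_score(lnp):
    """Placeholder scorer standing in for a trained model's prediction."""
    return lnp["np_ratio"] + (3 if lnp["helper"] == "DOPE" else 0)

known = [{"ionizable": "C12-200", "helper": "DOPE",
          "sterol": "cholesterol", "np_ratio": 20}]
hits = select_exploratory_hits(library, toy_score, known, n_select=3)
```

The greedy filter trades a little predicted efficacy for diversity, which matches the stated goal of exploring beyond the neighbourhood of existing LANCE hits.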

All exploratory hits outperformed clinically approved LNPs39 (SM-102 (ref. 15), ALC-0315 (ref. 16) and DLin-MC3-DMA40) in both DC2.4 and B16-F10 cells (Fig. 3e,f). The top DC2.4 exploratory hit matched two of three top LANCE hits (Supplementary Fig. 11), while the best B16-F10 exploratory hit exceeded all three LANCE hits tested (Supplementary Fig. 12).

We next evaluated COMET’s ability to refine existing leads. Around selected LANCE hits, virtual candidates were generated by modifying lipid ratios, substituting components or changing N/P ratios (Methods and Fig. 3d). In DC2.4, COMET identified optimized formulations outperforming their parent in two of three cases (Fig. 3g–i); in B16-F10, all three optimized LNPs outperformed their respective parents (Fig. 3j–l).

Adapting to new materials

To assess COMET’s adaptability beyond lipids, we extended it to branched poly(beta-amino esters) (PBAEs)3,41 (Fig. 4a), a class of polymers. A dataset of 454 polymer–LNPs (13 unique PBAEs) was added to LANCE. Each PBAE was represented by its diacrylate–amine unit and branching agent (Methods and Fig. 4b). In the ‘hits-test’ setting, COMET achieved Spearman coefficients of 0.767 (DC2.4) and 0.756 (B16-F10) (Fig. 4c,d). Notably, PBAE LNPs represented only 13% of the training data.

Fig. 4: COMET can be adapted to applications of new material, a new cell type and stability.

Even when trained on just 17 PBAE LNPs plus LANCE data, COMET achieved a mean Spearman of 0.660 across both cell types, improving to 0.824 with the full 352-sample PBAE set (Fig. 4e). We selected top-performing PBAE LNPs for further optimization using COMET (Methods). Optimized candidates showed higher efficacy than their parent formulations in both DC2.4 cases (Fig. 4f,g) and in one B16-F10 case (Fig. 4h,i).

Adapting to new target cell and payload

To evaluate its capability to adapt to a new cell type, COMET was tested on a dataset of 295 LNPs screened in human Caco-2 cells. Activity in Caco-2 cells correlated poorly with activity in the mouse cell lines (Supplementary Fig. 15), indicating a need for learning-based strategies that can quickly identify formulations optimal for new settings. COMET trained on Caco-2 data achieved a Spearman coefficient of 0.639, which improved to 0.713 with LANCE multitask training, and to 0.794 (Pearson 0.806) using a five-model ensemble (Fig. 4j,k).

We further tested COMET on HepG2 cells transfected with interleukin (IL)-15 mRNA using 98 LNPs. With multitask ensemble training, we observed strong predictions (Pearson 0.709 and Spearman 0.775; Fig. 4l,m). Additional LANCE data did not improve single-model accuracy (Supplementary Fig. 16) but enhanced ensemble performance (Fig. 4l,m), likely owing to increased diversity among the COMET models when trained with additional LANCE data.

Application in stabilization of LNPs

To address the instability of LNPs at ambient temperatures, we trained COMET to predict efficacy loss post-lyophilization (Methods). We synthesized 168 LNPs with variable lipids and 20% (w/v) sucrose as the stabilizer. Post-lyophilization data (Supplementary Fig. 17a) revealed that top performers included CKK-E12 and C12-200 as ionizable lipids, and DOPE as the helper lipid. Interestingly, these were not always top performers pre-lyophilization. DC-cholesterol-containing LNPs ranked higher post-lyophilization. Using 148 samples for training/validation and 20 for testing, COMET achieved a Spearman of 0.492. With LANCE multitask data, this increased to 0.705, and to 0.788 with five-model ensembles (Fig. 4n,o).

In vivo screening of in silico hits

We selected one hit from each virtual LNP group (‘Experimental validation of COMET in silico hits’ section) for in vivo validation in mice. Compared with DLin-MC3-DMA and SM-102 clinical benchmarks, COMET hits yielded >40-fold and >5-fold higher bioluminescence, respectively (Fig. 5a,b), with faster transfection kinetics (Fig. 5c). Top-performing COMET hits (DE-4, DO-388-1 and BE-1) also had higher encapsulation efficiency (Supplementary Fig. 19a) and lower cytotoxicity than SM-102 (Supplementary Fig. 20).

Fig. 5: COMET-designed LNPs demonstrate in vivo efficacy.

In 1:1 in vivo comparisons with LANCE top hits, COMET-optimized LNPs matched DC2.4 benchmarks (Supplementary Fig. 18a,b), but underperformed slightly in B16-F10 (not significant), possibly owing to poor correlation between in vitro B16-F10 data and subcutaneous in vivo response (Supplementary Fig. 18c,d).

Interpretation of COMET’s predictions

We used t-distributed stochastic neighbour embedding (t-SNE) to visualize how COMET encodes LNP compositional features to predict efficacy. As shown in Fig. 6a,b (for virtual LNPs) and Supplementary Fig. 22 (for LANCE LNPs), high-efficacy LNPs form distinct regional clusters within COMET. Notably, a green cluster contains LNPs efficacious in both DC2.4 and B16-F10, while yellow and blue clusters are specific to DC2.4 and B16-F10, respectively (Fig. 6a). Although ionizable lipid choices vary within each group, the DC2.4-specific cluster has a higher prevalence of SM-102 (Fig. 6c), and the B16-F10-specific cluster contains more LNPs with DC-cholesterol (Fig. 6d). The green cluster is enriched in high N/P ratio (25–30) formulations (Fig. 6e). Some of these patterns were observed in the LANCE data (Fig. 2), but others emerged only through COMET’s predictions.

Fig. 6: Interpreting COMET.

In general, DOPE and C14-PEG are dominant among highly scored LNPs, suggesting that they are more optimal than DSPC and C18-PEG. Beta-sitosterol is overrepresented in high-scoring LNPs (Fig. 6d), supporting previous findings36,42,43,44. Two composition trends stood out: ionizable lipid % above 50% (Fig. 6f) and sterol % below 10% (Fig. 6g) were both detrimental to efficacy. Additional LNP feature visualizations are shown in Supplementary Fig. 21.

Using integrated gradients45, we identified features most influential to COMET’s predictions (Methods). For DC2.4, lipid identity was the most important factor, followed by N/P ratio and molar percentages (Fig. 6h). Among lipid classes, PEG lipid choice had the highest influence (Fig. 6i), indicating that switching from C18-PEG to C14-PEG improves efficacy predictions. Ionizable lipid choice was the next most critical, consistent with previous work12,46,47,48. For molar composition, ionizable and helper lipid percentages were most impactful (Fig. 6j). Similar trends held for B16-F10 (Fig. 6k–m), except that helper lipid choice ranked slightly higher than ionizable lipid choice.

We also analysed interactions between PBAE (Supplementary Section A.1) and synergistic ionizable lipid combinations (Supplementary Section A.2) with other formulation features. These interactions revealed that optimal material choices and molar percentages depend not only on the target cell type but also on the broader compositional context of the formulation.

Conclusion

The design of COMET is motivated by the importance of not only the molecular structure of individual ingredients (for example, lipids) in drug products but also the interactions among compounds and their relative ratios. Its transformer-based architecture integrates multimodal features—including molecular structures, molar percentages and synthesis parameters—into a unified artificial intelligence framework. This enables COMET to learn LNP formulation features in a data-driven manner, without relying on manually selected physicochemical descriptors27,28. COMET accurately predicts LNP efficacy after training on LANCE, one of the largest LNP datasets so far49,50,51,52, and can distinguish top formulations from less efficacious ones.

While COMET consistently outperforms k-nearest neighbours, its advantage over random forest depends on dataset size and complexity. As larger, more diverse datasets emerge—especially with broader lipid chemistries—COMET’s deep learning architecture will likely offer increasing benefits. High-throughput methods are poised to accelerate this growth.

COMET’s flexible input format enables exploration of non-canonical formulations, such as dual-ionizable lipid LNPs or polymer–lipid hybrids (for example, branched PBAEs). It can screen massive virtual libraries to find formulations that differ substantially from known hits yet yield high performance—such as the L319-based BE-1 LNP. As LNP designs grow in complexity, COMET makes discovery more tractable.

In lead optimization for DC2.4 cells, COMET identified stronger formulations in two out of three cases. In the one failure (LA-580), the parent formulation already had very high efficacy (Fig. 3h). This highlights that while COMET distinguishes top from mediocre LNPs well (‘Performance of COMET’ section), optimizing within a high-performing region requires even greater discriminative power. Adding more data from high-performing LNPs, especially through active learning, could improve this. COMET-predicted hits were validated across in vitro and in vivo settings. Since COMET is trained on in vitro data, and in vitro–in vivo correlation is known to be weak for LNPs22,49, not all predicted hits will succeed in vivo. Future integration with in vivo screening data5,8 may improve performance.

Beyond efficacy, COMET also predicts formulation stability post-lyophilization, despite limited data. This accuracy improves with multitask training using LANCE. Similar gains were observed in adapting COMET to new cell types (for example, Caco-2), underscoring the broad applicability of our approach. This is especially useful in contexts where assays are low throughput and datasets are small. The flexibility of COMET to handle multi-component inputs also allows for its extension beyond conventional LNPs. We demonstrated the model’s adaptability to formulations with non-lipid materials (for example, branched PBAEs) and its utility across multiple cell types. COMET’s architecture may also support links to other areas of nanotechnology where multi-component formulations are critical, such as co-delivery of multiple cargos, immunomodulatory nanoparticle design or materials for tissue engineering. In such contexts, COMET’s compositional encoding and multitask learning structure could be adapted to jointly predict multiple endpoints, including efficacy, toxicity or stability.

Coupled with advances in high-throughput science, we hope that COMET will become an essential tool for formulation development and discovery of knowledge in this field.

Methods

COMET details

This section describes the model architecture and training algorithms of COMET. Pseudocode for inference is provided in Algorithm S1.

COMET model architecture

Lipid molecular structures are encoded into high-dimensional vectors (molecular embeddings), while scalar compositional features are encoded using a Gaussian-based encoder53. Continuous formulation-wide parameters (for example, N/P ratio and volumetric mix ratio) are encoded with Gaussian layers; categorical inputs use one-hot embeddings.
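
To illustrate what a Gaussian-based encoding of a scalar can look like, the sketch below expands a value into activations of Gaussian basis functions. This is a generic radial-basis expansion and not necessarily the encoder of ref. 53: the fixed, evenly spaced centres, the shared width and the function name `gaussian_encode` are assumptions, and in a trained model such parameters would typically be learned.

```python
import math

def gaussian_encode(value, centers, width):
    """Expand a scalar into a vector of Gaussian basis activations in (0, 1]."""
    return [math.exp(-0.5 * ((value - c) / width) ** 2) for c in centers]

# Eight evenly spaced centres on [0, 1]; the text uses 128 or 256 dimensions.
centers = [i / 7 for i in range(8)]
vec = gaussian_encode(0.5, centers, width=0.1)  # e.g. a molar fraction of 50%
```

Compared with feeding the raw number into the network, an expansion of this kind lets nearby values share activations while still allowing non-monotonic dependence on the scalar.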

The transformer uses a [CLS] token to aggregate input features across multiple attention layers. For multitask learning, each cell type is assigned a separate [CLS] token and prediction head, enabling task-specific outputs while sharing LNP-level representation learning.

Molecular encoder

COMET is compatible with various molecular encoders; here we use Uni-Mol11, pretrained to recover masked atom types and corrupted three-dimensional coordinates. It offers strong property prediction performance and is used with default hyperparameters (from https://github.com/dptech-corp/Uni-Mol/tree/main/unimol). Pretrained weights are frozen during COMET training. Each compound is encoded into a 512-dimensional vector using atom types and coordinates.

Lipid molar percentages are encoded into 128-dimensional vectors using a shared Gaussian layer. Each component is further assigned a 128-dimensional one-hot embedding ($z_k^{\mathrm{type}}$) to distinguish lipid classes. These are concatenated and projected through a two-layer MLP into a 256-dimensional component representation.

N/P ratio and volumetric ratio

N/P ratio is encoded using a separate 256-dimensional Gaussian layer (zN/P). Aqueous/organic ratios, treated as categorical variables, are one-hot encoded (zphase) with 256 dimensions.

CLS token and prediction head

Each cell type uses a learned [CLS] token (zCLS) of dimension 256. These aggregate component and formulation-wide token representations across Nblock transformer layers via attention54. Final predictions are made by passing the [CLS] token through a two-layer MLP (MLPpredict).
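
The way a [CLS] token aggregates a variable number of component and formulation tokens can be sketched with single-head dot-product attention. For brevity the sketch uses identity projections for queries, keys and values and two-dimensional toy vectors; both are assumptions, whereas a trained layer would use learned projections and 256-dimensional tokens.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cls_attend(cls_token, tokens):
    """Attention with the CLS token as query over `tokens` as keys and values.

    Returns the attention-weighted average of the tokens and the weights.
    """
    d = len(cls_token)
    scores = [sum(q * k for q, k in zip(cls_token, tok)) / math.sqrt(d)
              for tok in tokens]
    weights = softmax(scores)
    pooled = [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(d)]
    return pooled, weights

cls = [1.0, 0.0]
four_components = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.5, 0.5]]
pooled4, w4 = cls_attend(cls, four_components)
# A fifth token (e.g. a second ionizable lipid) needs no architectural change.
pooled5, w5 = cls_attend(cls, four_components + [[0.2, 0.1]])
```

Because the softmax is taken over however many tokens are present, the same function handles four-component and five-component formulations, which is the property that makes dual-ionizable formulations straightforward to represent.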

Transformer blocks

Each block follows a Pre-LayerNorm structure55 composed of layernorm → self-attention → MLP with residual connections.
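
The Pre-LayerNorm ordering can be written compactly as x ← x + Attention(LN(x)) followed by x ← x + MLP(LN(x)). The sketch below shows only this wiring: the layer norm has no learnable gain or bias, and `mean_attention` and `relu_mlp` are simple stand-ins (assumptions) for the real self-attention and MLP sublayers.

```python
def layer_norm(vec, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / (var + eps) ** 0.5 for v in vec]

def pre_ln_block(tokens, attention, mlp):
    """Apply x <- x + attention(LN(x)), then x <- x + mlp(LN(x))."""
    normed = [layer_norm(t) for t in tokens]
    attended = attention(normed)  # operates on the whole token list
    tokens = [[a + b for a, b in zip(t, u)] for t, u in zip(tokens, attended)]
    normed = [layer_norm(t) for t in tokens]
    return [[a + b for a, b in zip(t, mlp(n))] for t, n in zip(tokens, normed)]

def mean_attention(tokens):
    """Stand-in for self-attention: every token receives the mean token."""
    d = len(tokens[0])
    mean = [sum(t[i] for t in tokens) / len(tokens) for i in range(d)]
    return [mean[:] for _ in tokens]

def relu_mlp(vec):
    """Stand-in for the position-wise MLP."""
    return [max(0.0, v) for v in vec]

tokens = [[1.0, 2.0, 3.0], [0.0, 1.0, 0.0]]
out = pre_ln_block(tokens, mean_attention, relu_mlp)
```

Normalizing before each sublayer leaves the residual path untouched, so a block whose sublayers output zeros reduces to the identity.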

Training details

The model is trained with a binary ranking objective56: given a pair of LNP samples, the model learns to assign a larger efficacy score to the LNP with the higher efficacy label:

(1)

where xh and xl are high- and low-efficacy LNPs and fθ is COMET’s scoring function. Training uses a batch size of 64 (2,016 pairwise comparisons per batch).
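
A common form of a binary ranking objective that is consistent with these definitions is the logistic pairwise loss, −log σ(fθ(xh) − fθ(xl)). The sketch below uses that form as an assumption; the objective actually used may differ, for example through the label margin mentioned earlier. It also checks the batch arithmetic: 64 samples give 64 × 63 / 2 = 2,016 unordered pairs.

```python
import math
from itertools import combinations

def pairwise_ranking_loss(scores, labels):
    """Mean logistic ranking loss over all sample pairs with different labels."""
    losses = []
    for i, j in combinations(range(len(scores)), 2):
        if labels[i] == labels[j]:
            continue  # no preferred order for tied labels
        hi, lo = (i, j) if labels[i] > labels[j] else (j, i)
        diff = scores[hi] - scores[lo]
        losses.append(math.log1p(math.exp(-diff)))  # equals -log(sigmoid(diff))
    return sum(losses) / len(losses) if losses else 0.0

# Number of unordered pairs in a batch of 64 samples.
n_pairs = len(list(combinations(range(64), 2)))
```

The loss depends only on score differences, so the model is free to choose any score scale as long as the ordering is right.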

Conflict-averse gradient descent

Conflict-averse gradient descent (CAGrad)33 mitigates conflicting gradients in multitask settings. We apply CAGrad with a coefficient of 0.2 to stabilize training across tasks.
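
For two tasks, the CAGrad update can be sketched as follows, assuming the standard formulation: with average gradient g0 and a convex combination gw = w·g1 + (1 − w)·g2, choose w to minimize gw·g0 + c‖g0‖‖gw‖, then update along d = g0 + (c‖g0‖ / ‖gw‖)·gw. The grid search over w is a simplification for illustration (reference implementations use a numerical optimizer), and the function name is hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cagrad_two_tasks(g1, g2, c=0.2, steps=1000):
    """Conflict-averse update direction for two task gradients."""
    g0 = [(a + b) / 2 for a, b in zip(g1, g2)]  # average gradient
    radius = c * norm(g0)
    best_w, best_val = 0.0, float("inf")
    for s in range(steps + 1):  # grid search over the task weight w in [0, 1]
        w = s / steps
        gw = [w * a + (1 - w) * b for a, b in zip(g1, g2)]
        val = dot(gw, g0) + radius * norm(gw)
        if val < best_val:
            best_w, best_val = w, val
    gw = [best_w * a + (1 - best_w) * b for a, b in zip(g1, g2)]
    gw_norm = norm(gw)
    if gw_norm == 0.0:
        return g0  # degenerate case: fall back to the average gradient
    return [a + radius / gw_norm * b for a, b in zip(g0, gw)]

# Two partially conflicting toy gradients, with the coefficient 0.2 from the text.
g1, g2 = [1.0, 0.0], [-0.5, 1.0]
direction = cagrad_two_tasks(g1, g2, c=0.2)
```

With c = 0 the direction reduces to the plain average gradient; larger c shifts the update towards the task that the average gradient serves worst.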

Animal experiments

Animal experiments for this study were approved by the Massachusetts Institute of Technology Institutional Animal Care and Use Committee and were consistent with local, state and federal regulations as applicable. Female C57BL/6J mice (000664, The Jackson Laboratory) were used in the experiments. For imaging, d-luciferin (LUCK-1G, Gold Biotechnology) solubilized in PBS was administered via intraperitoneal injection and the mice were imaged using an IVIS imaging system (PerkinElmer).

t-SNE visualization

We selected the COMET model most correlated (Spearman) with ensemble scores across a random virtual LNP subset. LNP features for t-SNE were the final [CLS] token representations. To ensure even distribution across ionizable types, dual-ionizable lipid LNPs were treated as a distinct class, and 1,250 LNPs per class (8 total) were randomly sampled (10,000 total).
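
The class-balanced sampling step can be sketched as below: 1,250 items from each of 8 classes gives 10,000 points. The function name, the tuple representation of an LNP and the synthetic pool are assumptions for illustration, and the t-SNE step itself is not shown.

```python
import random
from collections import defaultdict

def sample_per_class(items, class_of, n_per_class, seed=0):
    """Draw the same number of items, without replacement, from every class."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[class_of(item)].append(item)
    sampled = []
    for label in sorted(groups):
        sampled.extend(rng.sample(groups[label], n_per_class))
    return sampled

# Synthetic pool: 8 classes (ionizable-lipid types, with dual-ionizable LNPs as
# their own class) and 2,000 placeholder LNPs per class.
pool = [(cls, idx) for cls in range(8) for idx in range(2000)]
subset = sample_per_class(pool, class_of=lambda item: item[0], n_per_class=1250)
```

Balancing the classes before embedding keeps the most common ionizable lipid from dominating the map.
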

Integrated gradients implementation

To execute integrated gradients (IG) with COMET’s multimodal inputs, we adapted the Captum library. IG computes attribution by integrating gradients along a path from reference to input. Feature attributions were computed per LNP, baseline-subtracted and averaged across each group. Non-PBAE LANCE LNPs were used as the baseline. Attribution scores were normalized (max = 1) and averaged across ensemble models.
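
The attribution computation can be illustrated without any library. The sketch below approximates the IG path integral with a midpoint Riemann sum and then rescales so that the largest attribution has magnitude 1. It is not the Captum API: the toy quadratic model, the analytic gradient, the zero baseline and the use of the maximum absolute value for normalization are all assumptions made for illustration.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG: (x - baseline) times the path-averaged gradient."""
    total = [0.0] * len(x)
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint of the s-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        total = [t + g for t, g in zip(total, grad_fn(point))]
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]

def normalize_max_abs(attr):
    """Rescale attributions so the largest magnitude equals 1."""
    top = max(abs(a) for a in attr)
    return [a / top for a in attr] if top else attr

# Toy differentiable model f(x) = sum(w_i * x_i**2); this is NOT COMET.
weights = [1.0, -2.0, 0.5]

def f(x):
    return sum(w * xi ** 2 for w, xi in zip(weights, x))

def grad_f(x):
    return [2 * w * xi for w, xi in zip(weights, x)]

x, baseline = [1.0, 2.0, 3.0], [0.0, 0.0, 0.0]
attr = integrated_gradients(grad_f, x, baseline)
scaled = normalize_max_abs(attr)
```

A useful sanity check is completeness: the attributions should sum to f(x) − f(baseline), which holds here up to floating-point error.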
