TEDDY: A Family Of Foundation Models For Understanding Single Cell Biology
- URL: http://arxiv.org/abs/2503.03485v1
- Date: Wed, 05 Mar 2025 13:24:57 GMT
- Authors: Alexis Chevalier, Soumya Ghosh, Urvi Awasthi, James Watkins, Julia Bieniewska, Nichita Mitrea, Olga Kotova, Kirill Shkura, Andrew Noble, Michael Steinbaugh, Julien Delile, Christoph Meier, Leonid Zhukov, Iya Khalil, Srayanta Mukherjee, Judith Mueller
- Abstract summary: Existing foundation models either do not improve or only modestly improve over task-specific models in downstream applications. We scaled the pre-training dataset to 116 million cells, which is larger than those used by previous models. We trained the TEDDY family of models comprising six transformer-based state-of-the-art single-cell foundation models with 70 million, 160 million, and 400 million parameters.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding the biological mechanism of disease is critical for medicine, and in particular drug discovery. AI-powered analysis of genome-scale biological data holds great potential in this regard. The increasing availability of single-cell RNA sequencing data has enabled the development of large foundation models for disease biology. However, existing foundation models either do not improve or only modestly improve over task-specific models in downstream applications. Here, we explored two avenues for improving the state-of-the-art. First, we scaled the pre-training dataset to 116 million cells, which is larger than those used by previous models. Second, we leveraged the availability of large-scale biological annotations as a form of supervision during pre-training. We trained the TEDDY family of models comprising six transformer-based state-of-the-art single-cell foundation models with 70 million, 160 million, and 400 million parameters. We vetted our models on two downstream evaluation tasks -- identifying the underlying disease state of held-out donors not seen during training and distinguishing healthy cells from diseased ones for disease conditions and donors not seen during training. Scaling experiments showed that performance improved predictably with both data volume and parameter count. Our models showed substantial improvement over existing work on the first task and more muted improvements on the second.
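The claim that performance improves predictably with data volume and parameter count is the kind of trend usually summarized by fitting a power law to evaluation metrics in log-log space. A minimal sketch of such a fit, using synthetic illustrative losses (not numbers from the paper) at the three TEDDY model sizes:

```python
import numpy as np

# Hypothetical held-out losses at the three TEDDY parameter counts;
# the loss values are illustrative only, not results from the paper.
params = np.array([70e6, 160e6, 400e6])
losses = np.array([0.52, 0.47, 0.43])

# Fit loss ~ a * N^b (b < 0): a straight line in log-log space.
b, log_a = np.polyfit(np.log(params), np.log(losses), 1)
a = np.exp(log_a)

# Extrapolate to a hypothetical 1B-parameter model.
predicted = a * (1e9) ** b
print(f"exponent b = {b:.3f}, predicted loss at 1B params = {predicted:.3f}")
```

The fitted exponent quantifies how quickly the metric improves with scale; whether the trend actually extrapolates beyond the measured sizes is an empirical question.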
Related papers
- Celler: A Genomic Language Model for Long-Tailed Single-Cell Annotation
We introduce Celler, a state-of-the-art generative pre-training model crafted specifically for the annotation of single-cell data.
By dynamically adjusting sample weights, its GInf Loss objective significantly enhances the model's ability to learn from rare categories.
We have constructed a large-scale single-cell dataset: Celler-75, which encompasses 40 million cells distributed across 80 human tissues and 75 specific diseases.
arXiv Detail & Related papers (2025-03-28T02:04:26Z)
- Deep Learning Approaches for Blood Disease Diagnosis Across Hematopoietic Lineages
We present a foundation modeling framework that leverages deep learning to uncover latent genetic signatures across the hematopoietic hierarchy.
Our approach trains a fully connected autoencoder on multipotent progenitor cells, reducing over 20,000 gene features to a 256-dimensional latent space.
We validate the quality of these embeddings by training feed-forward, transformer, and graph convolutional architectures for blood disease diagnosis tasks.
Our models achieve greater than 95% accuracy for multi-class classification, and in the zero-shot setting, we achieve greater than 0.7 F1-score on the binary classification task.
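The compression step described above, from more than 20,000 gene features to a 256-dimensional latent space, can be sketched with an untrained, randomly initialized autoencoder. The sizes come from the abstract; the single dense layer in each direction and the ReLU activation are illustrative assumptions, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sizes from the abstract: >20,000 gene features -> 256-dim latent space.
# Single-layer encoder/decoder and ReLU are assumptions for illustration.
n_genes, latent = 20000, 256

w_enc = rng.normal(0.0, 0.01, (n_genes, latent))  # encoder weights (untrained)
w_dec = rng.normal(0.0, 0.01, (latent, n_genes))  # decoder weights (untrained)

x = rng.random((8, n_genes))            # a batch of 8 cells' expression vectors
z = np.maximum(x @ w_enc, 0.0)          # encode: 20000 -> 256
x_hat = z @ w_dec                       # decode: 256 -> 20000
mse = np.mean((x - x_hat) ** 2)         # reconstruction loss to be minimized

print(z.shape, x_hat.shape)             # (8, 256) (8, 20000)
```

Training would minimize the reconstruction loss by gradient descent; the 256-dimensional `z` then serves as the embedding fed to the downstream diagnosis classifiers.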
arXiv Detail & Related papers (2025-03-25T20:11:10Z)
- Biomedical Foundation Model: A Survey
Foundation models are large-scale pre-trained models that learn from extensive unlabeled datasets. These models can be adapted to various applications such as question answering and visual understanding. This survey explores the potential of foundation models across diverse domains within biomedical fields.
arXiv Detail & Related papers (2025-03-03T22:42:00Z)
- METAGENE-1: Metagenomic Foundation Model for Pandemic Monitoring
We pretrain a metagenomic foundation model, METAGENE-1, on a novel corpus of diverse metagenomic DNA and RNA sequences. This dataset is sourced from a large collection of human wastewater samples, processed and sequenced using deep metagenomic sequencing methods. We show results of pretraining this model on our metagenomic dataset, providing details about our losses, system metrics, and training stability over the course of pretraining.
arXiv Detail & Related papers (2025-01-03T18:44:43Z)
- How Good Are We? Evaluating Cell AI Foundation Models in Kidney Pathology with Human-in-the-Loop Enrichment
Training AI foundation models has emerged as a promising large-scale learning approach for addressing real-world healthcare challenges.
While many of these models have been developed for tasks like disease diagnosis and tissue quantification, their readiness for deployment on some of the arguably simplest tasks, such as nuclei segmentation within a single organ, remains uncertain.
This paper seeks to answer this key question, "How good are we?" by thoroughly evaluating the performance of recent cell foundation models on a curated dataset.
arXiv Detail & Related papers (2024-10-31T17:00:33Z)
- Benchmarking foundation models as feature extractors for weakly-supervised computational pathology
We benchmarked 19 histopathology foundation models on 13 patient cohorts with 6,818 patients and 9,528 slides from lung, colorectal, gastric, and breast cancers. We show that a vision-language foundation model, CONCH, yielded the highest performance when compared to vision-only foundation models, with Virchow2 a close second.
arXiv Detail & Related papers (2024-08-28T14:34:45Z)
- Using Pre-training and Interaction Modeling for ancestry-specific disease prediction in UK Biobank
Recent genome-wide association studies (GWAS) have uncovered the genetic basis of complex traits, but show an under-representation of non-European descent individuals.
Here, we assess whether we can improve disease prediction across diverse ancestries using multiomic data.
arXiv Detail & Related papers (2024-04-26T16:39:50Z)
- DinoBloom: A Foundation Model for Generalizable Cell Embeddings in Hematology
We introduce DinoBloom, the first foundation model for single cell images in hematology.
Our model is built upon an extensive collection of 13 diverse, publicly available datasets of peripheral blood and bone marrow smears.
A family of four DinoBloom models can be adapted for a wide range of downstream applications.
arXiv Detail & Related papers (2024-04-07T17:25:52Z)
- Tertiary Lymphoid Structures Generation through Graph-based Diffusion
In this work, we leverage state-of-the-art graph-based diffusion models to generate biologically meaningful cell-graphs.
We show that the adopted graph diffusion model is able to accurately learn the distribution of cells in terms of their tertiary lymphoid structures (TLS) content.
arXiv Detail & Related papers (2023-10-10T14:37:17Z)
- SANSformers: Self-Supervised Forecasting in Electronic Health Records with Attention-Free Models
This work aims to forecast the demand for healthcare services, by predicting the number of patient visits to healthcare facilities.
We introduce SANSformer, an attention-free sequential model designed with specific inductive biases to cater for the unique characteristics of EHR data.
Our results illuminate the promising potential of tailored attention-free models and self-supervised pretraining in refining healthcare utilization predictions across various patient demographics.
arXiv Detail & Related papers (2021-08-31T08:23:56Z)
- A multi-stage machine learning model on diagnosis of esophageal manometry
The framework includes deep-learning models at the swallow-level stage and feature-based machine learning models at the study-level stage.
This is the first artificial-intelligence-style model to automatically predict CC diagnosis of HRM study from raw multi-swallow data.
arXiv Detail & Related papers (2021-06-25T20:09:23Z)
- Select-ProtoNet: Learning to Select for Few-Shot Disease Subtype Prediction
We focus on few-shot disease subtype prediction problem, identifying subgroups of similar patients.
We introduce meta-learning techniques to develop a new model that can extract common experience or knowledge from interrelated clinical tasks.
Our new model is built upon a carefully designed meta-learner, called Prototypical Network, a simple yet effective meta-learning method for few-shot image classification.
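The core of a Prototypical Network is simple enough to sketch: each class prototype is the mean embedding of its support examples, and a query is assigned the label of the nearest prototype. The embedding dimension and the two synthetic patient-embedding clusters below are illustrative assumptions, not data from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def prototypes(support, labels):
    # One prototype per class: the mean embedding of its support examples.
    return np.stack([support[labels == c].mean(axis=0)
                     for c in np.unique(labels)])

def predict(queries, protos):
    # Squared Euclidean distance to every prototype; pick the closest.
    d = ((queries[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

# Two well-separated synthetic "patient embedding" clusters (16-dim).
support = np.concatenate([rng.normal(0, 0.1, (5, 16)),
                          rng.normal(3, 0.1, (5, 16))])
labels = np.array([0] * 5 + [1] * 5)
queries = np.concatenate([rng.normal(0, 0.1, (3, 16)),
                          rng.normal(3, 0.1, (3, 16))])

protos = prototypes(support, labels)
print(predict(queries, protos))  # expected: [0 0 0 1 1 1]
```

In the few-shot disease-subtype setting, the embeddings would come from a learned encoder rather than random draws, but the prototype-and-nearest-neighbor classification step is the same.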
arXiv Detail & Related papers (2020-09-02T02:50:30Z)
- Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks
We show the importance of this problem to the medical community.
We present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification.
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences.