Parameter-free representations outperform single-cell foundation models on downstream benchmarks
- URL: http://arxiv.org/abs/2602.16696v1
- Date: Wed, 18 Feb 2026 18:42:29 GMT
- Title: Parameter-free representations outperform single-cell foundation models on downstream benchmarks
- Authors: Huan Souza, Pankaj Mehta
- Abstract summary: Single-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure. Large-scale foundation models, such as TranscriptFormer, learn a generative model for gene expression by embedding genes into a latent vector space. We ask whether similar performance can be achieved without utilizing computationally intensive deep learning-based representations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Single-cell RNA sequencing (scRNA-seq) data exhibit strong and reproducible statistical structure. This has motivated the development of large-scale foundation models, such as TranscriptFormer, that use transformer-based architectures to learn a generative model for gene expression by embedding genes into a latent vector space. These embeddings have been used to obtain state-of-the-art (SOTA) performance on downstream tasks such as cell-type classification, disease-state prediction, and cross-species learning. Here, we ask whether similar performance can be achieved without utilizing computationally intensive deep learning-based representations. Using simple, interpretable pipelines that rely on careful normalization and linear methods, we obtain SOTA or near SOTA performance across multiple benchmarks commonly used to evaluate single-cell foundation models, including outperforming foundation models on out-of-distribution tasks involving novel cell types and organisms absent from the training data. Our findings highlight the need for rigorous benchmarking and suggest that the biology of cell identity can be captured by simple linear representations of single cell gene expression data.
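The abstract describes pipelines built from careful normalization and linear methods. As an illustration only, not the paper's exact recipe, the sketch below shows one common pipeline of this kind: library-size normalization to a fixed count, a log1p transform, PCA, and a linear classifier for cell-type prediction. The 10,000-count scaling, 50 principal components, logistic regression, and the synthetic count matrix are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

rng = np.random.default_rng(0)

# Synthetic stand-in for a cells x genes count matrix with two cell types,
# each with an opposite half-up / half-down expression signature.
n_cells, n_genes = 600, 200
labels = rng.integers(0, 2, n_cells)
signature = np.concatenate([np.full(n_genes // 2, 3.0), np.full(n_genes // 2, 1.0)])
rates = np.where(labels[:, None] == 1, signature, signature[::-1])
counts = rng.poisson(rates).astype(float)

def normalize_log1p(X):
    """Scale each cell to 10,000 total counts, then apply log1p."""
    X = X * (1e4 / X.sum(axis=1, keepdims=True))
    return np.log1p(X)

# Normalization -> linear dimensionality reduction -> linear classifier.
pipeline = make_pipeline(
    FunctionTransformer(normalize_log1p),
    PCA(n_components=50),
    LogisticRegression(max_iter=1000),
)

X_tr, X_te, y_tr, y_te = train_test_split(counts, labels, random_state=0)
pipeline.fit(X_tr, y_tr)
acc = pipeline.score(X_te, y_te)
print(f"held-out accuracy: {acc:.2f}")
```

The point of such a baseline is that every step is interpretable (the PCA loadings and classifier weights live directly in gene space) and there are no learned deep representations anywhere in the pipeline.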
Related papers
- Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data [17.440176654185095]
We introduce Cell2Text, a framework that translates scRNA-seq profiles into structured natural language descriptions. By integrating gene-level embeddings with pretrained large language models, Cell2Text generates coherent summaries that capture cellular identity, tissue origin, disease associations, and pathway activity.
arXiv Detail & Related papers (2025-09-29T14:20:50Z) - Hyperbolic Genome Embeddings [0.6656737591902598]
We develop a novel application of hyperbolic CNNs that exploits the evolutionarily-informed structure of biological systems. Our strategy circumvents the need for explicit phylogenetic mapping while discerning key properties of sequences. Our approach even surpasses state-of-the-art performance on seven GUE benchmark datasets.
arXiv Detail & Related papers (2025-07-29T10:06:17Z) - GENERator: A Long-Context Generative Genomic Foundation Model [66.46537421135996]
We present GENERator, a generative genomic foundation model featuring a context length of 98k base pairs (bp) and 1.2B parameters. Trained on an expansive dataset comprising 386B bp of DNA, the GENERator demonstrates state-of-the-art performance across both established and newly proposed benchmarks. It also shows significant promise in sequence optimization, particularly through the prompt-responsive generation of enhancer sequences with specific activity profiles.
arXiv Detail & Related papers (2025-02-11T05:39:49Z) - Stochastic gradient descent estimation of generalized matrix factorization models with application to single-cell RNA sequencing data [39.146761527401424]
Single-cell RNA sequencing allows the quantification of gene expression at the individual cell level. Dimensionality reduction is a common preprocessing step critical for the visualization, clustering, and phenotypic characterization of samples. We present a generalized matrix factorization model assuming a general exponential dispersion family distribution. We show that our method scales seamlessly to millions of cells, enabling dimensionality reduction in large single-cell datasets.
arXiv Detail & Related papers (2024-12-29T16:02:15Z) - Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen [76.02070962797794]
This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data. CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics.
arXiv Detail & Related papers (2024-07-16T14:05:03Z) - Semantically Rich Local Dataset Generation for Explainable AI in Genomics [0.716879432974126]
Black box deep learning models trained on genomic sequences excel at predicting the outcomes of different gene regulatory mechanisms.
We propose using Genetic Programming to generate datasets by evolving perturbations in sequences that contribute to their semantic diversity.
arXiv Detail & Related papers (2024-07-03T10:31:30Z) - BEACON: Benchmark for Comprehensive RNA Tasks and Language Models [60.02663015002029]
We introduce the first comprehensive RNA benchmark BEACON (BEnchmArk for COmprehensive RNA Tasks and Language Models). First, BEACON comprises 13 distinct tasks derived from extensive previous work covering structural analysis, functional studies, and engineering applications. Second, we examine a range of models, including traditional approaches like CNNs, as well as advanced RNA foundation models based on language models, offering valuable insights into the task-specific performances of these models. Third, we investigate the vital RNA language model components
arXiv Detail & Related papers (2024-06-14T19:39:19Z) - Scalable Amortized GPLVMs for Single Cell Transcriptomics Data [9.010523724015398]
Dimensionality reduction is crucial for analyzing large-scale single-cell RNA-seq data.
We introduce an improved model, the amortized variational model (BGPLVM).
BGPLVM is tailored for single-cell RNA-seq with specialized encoder, kernel, and likelihood designs.
arXiv Detail & Related papers (2024-05-06T21:54:38Z) - scBiGNN: Bilevel Graph Representation Learning for Cell Type Classification from Single-cell RNA Sequencing Data [62.87454293046843]
Graph neural networks (GNNs) have been widely used for automatic cell type classification.
scBiGNN comprises two GNN modules to identify cell types.
scBiGNN outperforms a variety of existing methods for cell type classification from scRNA-seq data.
arXiv Detail & Related papers (2023-12-16T03:54:26Z) - Universal Domain Adaptation from Foundation Models: A Baseline Study [58.51162198585434]
We make empirical studies of state-of-the-art UniDA methods using foundation models.
We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
arXiv Detail & Related papers (2023-05-18T16:28:29Z) - Goal-directed Generation of Discrete Structures with Conditional Generative Models [85.51463588099556]
We introduce a novel approach to directly optimize a reinforcement learning objective, maximizing an expected reward.
We test our methodology on two tasks: generating molecules with user-defined properties and identifying short Python expressions which evaluate to a given target value.
arXiv Detail & Related papers (2020-10-05T20:03:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.