Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data
- URL: http://arxiv.org/abs/2509.24840v2
- Date: Fri, 10 Oct 2025 12:41:57 GMT
- Title: Cell2Text: Multimodal LLM for Generating Single-Cell Descriptions from RNA-Seq Data
- Authors: Oussama Kharouiche, Aris Markogiannakis, Xiao Fei, Michail Chatzianastasis, Michalis Vazirgiannis,
- Abstract summary: We introduce Cell2Text, a framework that translates scRNA-seq profiles into structured natural language descriptions.<n>By integrating gene-level embeddings with pretrained large language models, Cell2Text generates coherent summaries that capture cellular identity, tissue origin, disease associations, and pathway activity.
- Score: 17.440176654185095
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Single-cell RNA sequencing has transformed biology by enabling the measurement of gene expression at cellular resolution, providing information for cell types, states, and disease contexts. Recently, single-cell foundation models have emerged as powerful tools for learning transferable representations directly from expression profiles, improving performance on classification and clustering tasks. However, these models are limited to discrete prediction heads, which collapse cellular complexity into predefined labels that fail to capture the richer, contextual explanations biologists need. We introduce Cell2Text, a multimodal generative framework that translates scRNA-seq profiles into structured natural language descriptions. By integrating gene-level embeddings from single-cell foundation models with pretrained large language models, Cell2Text generates coherent summaries that capture cellular identity, tissue origin, disease associations, and pathway activity, generalizing to unseen cells. Empirically, Cell2Text outperforms baselines on classification accuracy, demonstrates strong ontological consistency using PageRank-based similarity metrics, and achieves high semantic fidelity in text generation. These results demonstrate that coupling expression data with natural language offers both stronger predictive performance and inherently interpretable outputs, pointing to a scalable path for label-efficient characterization of unseen cells.
Related papers
- Contrastive Learning Enhances Language Model Based Cell Embeddings for Low-Sample Single Cell Transcriptomics [3.7907528918903797]
Large language models (LLMs) have shown ability in generating rich representations across domains such as natural language processing and generation, computer vision, and multimodal learning.<n>We present a computational framework that integrates single-cell RNA sequencing (scRNA-seq) with LLMs to derive knowledge-informed gene embeddings.
arXiv Detail & Related papers (2025-09-28T00:45:39Z) - Bridging Large Language Models and Single-Cell Transcriptomics in Dissecting Selective Motor Neuron Vulnerability [1.9638866836733835]
We generate biologically contextualized cell embeddings using gene-specific textual annotations from the NCBI Gene database.<n>For each cell in a single-cell RNA sequencing (scRNA-seq) dataset, we rank genes by expression level, retrieve their NCBI Gene descriptions, and transform these descriptions into vector embedding representations.
arXiv Detail & Related papers (2025-05-12T03:39:33Z) - CellVerse: Do Large Language Models Really Understand Cell Biology? [74.34984441715517]
We introduce CellVerse, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data.<n>We systematically evaluate the performance across 14 open-source and closed-source LLMs ranging from 160M to 671B on CellVerse.
arXiv Detail & Related papers (2025-05-09T06:47:23Z) - Language-Enhanced Representation Learning for Single-Cell Transcriptomics [27.33236345953242]
We propose scMMGPT, a novel framework designed for language-enhanced representation learning in single-cell transcriptomics.<n> scMMGPT employs robust cell representation extraction, preserving quantitative gene expression data, and introduces an innovative two-stage pre-training strategy.
arXiv Detail & Related papers (2025-03-12T14:26:16Z) - A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following [32.67347401145835]
Large language models excel at interpreting complex natural language instructions, enabling them to perform a wide range of tasks.<n>We present InstructCell, a multi-modal AI copilot that leverages natural language as a medium for more direct and flexible single-cell analysis.<n>InstructCell empowers researchers to accomplish critical tasks-such as cell type annotation, conditional pseudo-cell generation, and drug sensitivity prediction-using straightforward natural language commands.
arXiv Detail & Related papers (2025-01-14T15:12:19Z) - scReader: Prompting Large Language Models to Interpret scRNA-seq Data [12.767105992391555]
We propose an innovative hybrid approach that integrates the general knowledge capabilities of large language models with domain-specific representation models for single-cell omics data interpretation.<n>By inputting single-cell gene-level expression data with prompts, we effectively model cellular representations based on the differential expression levels of genes across various species and cell types.
arXiv Detail & Related papers (2024-12-24T04:28:42Z) - Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen [76.02070962797794]
This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data.<n>CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics.
arXiv Detail & Related papers (2024-07-16T14:05:03Z) - VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling [60.91599380893732]
VQDNA is a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning.
By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings.
arXiv Detail & Related papers (2024-05-13T20:15:03Z) - Efficient and Scalable Fine-Tune of Language Models for Genome
Understanding [49.606093223945734]
We present textscLingo: textscLanguage prefix ftextscIne-tuning for textscGentextscOmes.
Unlike DNA foundation models, textscLingo strategically leverages natural language foundation models' contextual cues.
textscLingo further accommodates numerous downstream fine-tune tasks by an adaptive rank sampling method.
arXiv Detail & Related papers (2024-02-12T21:40:45Z) - scBiGNN: Bilevel Graph Representation Learning for Cell Type
Classification from Single-cell RNA Sequencing Data [62.87454293046843]
Graph neural networks (GNNs) have been widely used for automatic cell type classification.
scBiGNN comprises two GNN modules to identify cell types.
scBiGNN outperforms a variety of existing methods for cell type classification from scRNA-seq data.
arXiv Detail & Related papers (2023-12-16T03:54:26Z) - Mixed Models with Multiple Instance Learning [51.440557223100164]
We introduce MixMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL)
Our empirical results reveal that MixMIL outperforms existing MIL models in single-cell datasets.
arXiv Detail & Related papers (2023-11-04T16:42:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.