Related papers: LangCell: Language-Cell Pre-training for Cell Identity Understanding

URL: http://arxiv.org/abs/2405.06708v5
Date: Tue, 11 Jun 2024 07:31:13 GMT
Title: LangCell: Language-Cell Pre-training for Cell Identity Understanding
Authors: Suyuan Zhao, Jiahuan Zhang, Yushuai Wu, Yizhen Luo, Zaiqing Nie
Abstract summary: We introduce LangCell, a unified representation of single-cell data and natural language during the pre-training phase. Results show that LangCell is the only single-cell PLM that can work effectively in zero-shot cell identity understanding scenarios.
Score: 3.6518971609937068
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cell identity encompasses various semantic aspects of a cell, including cell type, pathway information, disease information, and more, which are essential for biologists to gain insights into its biological characteristics. Understanding cell identity from the transcriptomic data, such as annotating cell types, has become an important task in bioinformatics. As these semantic aspects are determined by human experts, it is impossible for AI models to effectively carry out cell identity understanding tasks without the supervision signals provided by single-cell and label pairs. The single-cell pre-trained language models (PLMs) currently used for this task are trained only on a single modality, transcriptomics data, and lack an understanding of cell identity knowledge. As a result, they have to be fine-tuned for downstream tasks and struggle when lacking labeled data with the desired semantic labels. To address this issue, we propose an innovative solution by constructing a unified representation of single-cell data and natural language during the pre-training phase, allowing the model to directly incorporate insights related to cell identity. More specifically, we introduce LangCell, the first Language-Cell pre-training framework. LangCell utilizes texts enriched with cell identity information to gain a profound comprehension of cross-modal knowledge. Results from experiments conducted on different benchmarks show that LangCell is the only single-cell PLM that can work effectively in zero-shot cell identity understanding scenarios, and also significantly outperforms existing models in few-shot and fine-tuning cell identity understanding scenarios.

Related papers

Cell-o1: Training LLMs to Solve Single-Cell Reasoning Puzzles with Reinforcement Learning [44.91329557101423]
We introduce the CellPuzzles task, where the objective is to assign unique cell types to a batch of cells. This benchmark spans diverse tissues, diseases, and donor conditions, and requires reasoning across the batch-level cellular context to ensure label uniqueness. We propose Cell-o1, a 7B LLM trained via supervised fine-tuning on distilled reasoning traces, followed by reinforcement learning with batch-level rewards.
arXiv Detail & Related papers (2025-06-03T14:16:53Z)
CellVerse: Do Large Language Models Really Understand Cell Biology? [74.34984441715517]
We introduce CellVerse, a unified language-centric question-answering benchmark that integrates four types of single-cell multi-omics data. We systematically evaluate the performance across 14 open-source and closed-source LLMs ranging from 160M to 671B on CellVerse.
arXiv Detail & Related papers (2025-05-09T06:47:23Z)
A Multi-Modal AI Copilot for Single-Cell Analysis with Instruction Following [32.67347401145835]
Large language models excel at interpreting complex natural language instructions, enabling them to perform a wide range of tasks. We present InstructCell, a multi-modal AI copilot that leverages natural language as a medium for more direct and flexible single-cell analysis. InstructCell empowers researchers to accomplish critical tasks, such as cell type annotation, conditional pseudo-cell generation, and drug sensitivity prediction, using straightforward natural language commands.
arXiv Detail & Related papers (2025-01-14T15:12:19Z)
Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data [13.56585855722118]
Large language models (LLMs) have demonstrated their ability to efficiently process and synthesize vast corpora of text to automatically extract biological knowledge. Our study explores the potential of LLMs to accurately classify and annotate cell types in single-cell RNA sequencing (scRNA-seq) data. The results demonstrate that LLMs can provide robust interpretations of single-cell data without requiring additional fine-tuning.
arXiv Detail & Related papers (2024-12-03T23:58:35Z)
Cell as Point: One-Stage Framework for Efficient Cell Tracking [54.19259129722988]
This paper proposes the novel end-to-end CAP framework to achieve efficient and stable cell tracking in one stage. CAP abandons detection or segmentation stages and simplifies the process by exploiting the correlation among the trajectories of cell points to track cells jointly. CAP demonstrates strong cell tracking performance while also being 10 to 55 times more efficient than existing methods.
arXiv Detail & Related papers (2024-11-22T10:16:35Z)
How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities [46.671834972945874]
We propose a vision of leveraging advances in AI to construct virtual cells. We discuss desired capabilities of such AI Virtual Cells, including generating universal representations of biological entities. We envision a future where AI Virtual Cells help identify new drug targets, predict cellular responses to perturbations, as well as scale hypothesis exploration.
arXiv Detail & Related papers (2024-09-18T02:41:50Z)
Cell-ontology guided transcriptome foundation model [18.51941953027685]
We present scCello, a single-cell, Cell-ontology guided TFM. Our TFM demonstrates competitive performance and transferability over existing TFMs on biologically important tasks.
arXiv Detail & Related papers (2024-08-22T13:15:49Z)
Multi-Modal and Multi-Attribute Generation of Single Cells with CFGen [76.02070962797794]
This work introduces CellFlow for Generation (CFGen), a flow-based conditional generative model that preserves the inherent discreteness of single-cell data. CFGen generates whole-genome multi-modal single-cell data reliably, improving the recovery of crucial biological data characteristics.
arXiv Detail & Related papers (2024-07-16T14:05:03Z)
UniCell: Universal Cell Nucleus Classification via Prompt Learning [76.11864242047074]
We propose a universal cell nucleus classification framework (UniCell). It employs a novel prompt learning mechanism to uniformly predict the corresponding categories of pathological images from different dataset domains. In particular, our framework adopts an end-to-end architecture for nuclei detection and classification, and utilizes flexible prediction heads for adapting to various datasets.
arXiv Detail & Related papers (2024-02-20T11:50:27Z)
ChatCell: Facilitating Single-Cell Analysis with Natural Language [40.4429032376233]
ChatCell is a tool for facilitating single-cell analysis with natural language. ChatCell has acquired profound expertise in single-cell biology. Our project homepage is available at https://zjunlp.io/project/ChatCell.
arXiv Detail & Related papers (2024-02-13T09:06:14Z)
Prediction of Cellular Identities from Trajectory and Cell Fate Information [0.40964539027092917]
We propose an innovative approach to cell identification during early C. elegans embryogenesis using machine learning. We employ random forest, embryo, and LSTM models, and test cell classification accuracy on 3D time-lapse datasets spanning the first 4 hours of embryogenesis. Our research demonstrates the success of predicting cell identities in time-lapse imaging sequences directly from simple spatial-temporal features.
arXiv Detail & Related papers (2024-01-11T03:28:13Z)
Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells. During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation. This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z)
Mixed Models with Multiple Instance Learning [51.440557223100164]
We introduce MixMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL). Our empirical results reveal that MixMIL outperforms existing MIL models in single-cell datasets.
arXiv Detail & Related papers (2023-11-04T16:42:42Z)
Revolutionizing Single Cell Analysis: The Power of Large Language Models for Cell Type Annotation [0.0]
Large language models such as ChatGPT and New Bing provide accurate annotations of cell types. By using ChatGPT to annotate single-cell data, we can relate rare cell types to their functions. This can have important applications in understanding cancer progression, mammalian development, and stem cell differentiation.
arXiv Detail & Related papers (2023-04-05T18:45:54Z)
OCELOT: Overlapped Cell on Tissue Dataset for Histopathology [13.691924123273004]
We release OCELOT, a dataset dedicated to the study of cell-tissue relationships for cell detection in histopathology. We propose multi-task learning approaches that benefit from learning both cell and tissue tasks simultaneously. On the OCELOT test set in particular, we show up to 6.79 improvement in F1-score.
arXiv Detail & Related papers (2023-03-23T08:57:11Z)
Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel machine learning architecture that allows us to infuse a deep neural network with human-powered abstraction at the level of data. Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.