LangCell: Language-Cell Pre-training for Cell Identity Understanding
- URL: http://arxiv.org/abs/2405.06708v5
- Date: Tue, 11 Jun 2024 07:31:13 GMT
- Title: LangCell: Language-Cell Pre-training for Cell Identity Understanding
- Authors: Suyuan Zhao, Jiahuan Zhang, Yushuai Wu, Yizhen Luo, Zaiqing Nie
- Abstract summary: We introduce LangCell, which constructs a unified representation of single-cell data and natural language during the pre-training phase.
Results show that LangCell is the only single-cell PLM that can work effectively in zero-shot cell identity understanding scenarios.
- Score: 3.6518971609937068
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cell identity encompasses various semantic aspects of a cell, including cell type, pathway information, disease information, and more, which are essential for biologists to gain insights into its biological characteristics. Understanding cell identity from transcriptomic data, such as annotating cell types, has become an important task in bioinformatics. As these semantic aspects are determined by human experts, it is impossible for AI models to effectively carry out cell identity understanding tasks without the supervision signals provided by single-cell and label pairs. The single-cell pre-trained language models (PLMs) currently used for this task are trained only on a single modality, transcriptomics data, and lack an understanding of cell identity knowledge. As a result, they have to be fine-tuned for downstream tasks and struggle when labeled data with the desired semantic labels is lacking. To address this issue, we propose an innovative solution by constructing a unified representation of single-cell data and natural language during the pre-training phase, allowing the model to directly incorporate insights related to cell identity. More specifically, we introduce LangCell, the first Language-Cell pre-training framework. LangCell utilizes texts enriched with cell identity information to gain a profound comprehension of cross-modal knowledge. Results from experiments conducted on different benchmarks show that LangCell is the only single-cell PLM that can work effectively in zero-shot cell identity understanding scenarios, and that it also significantly outperforms existing models in few-shot and fine-tuning cell identity understanding scenarios.
Related papers
- How to Build the Virtual Cell with Artificial Intelligence: Priorities and Opportunities [46.671834972945874]
We propose a vision of leveraging advances in AI to construct virtual cells.
We discuss desired capabilities of such AI Virtual Cells, including generating universal representations of biological entities.
We envision a future where AI Virtual Cells help identify new drug targets, predict cellular responses to perturbations, as well as scale hypothesis exploration.
arXiv Detail & Related papers (2024-09-18T02:41:50Z) - Cell-ontology guided transcriptome foundation model [18.51941953027685]
We present scCello, a single-cell, Cell-ontology guided TFM.
Our TFM demonstrates competitive performance and transferability over existing TFMs on biologically important tasks.
arXiv Detail & Related papers (2024-08-22T13:15:49Z) - UniCell: Universal Cell Nucleus Classification via Prompt Learning [76.11864242047074]
We propose a universal cell nucleus classification framework (UniCell).
It employs a novel prompt learning mechanism to uniformly predict the corresponding categories of pathological images from different dataset domains.
In particular, our framework adopts an end-to-end architecture for nuclei detection and classification, and utilizes flexible prediction heads for adapting various datasets.
arXiv Detail & Related papers (2024-02-20T11:50:27Z) - ChatCell: Facilitating Single-Cell Analysis with Natural Language [40.4429032376233]
ChatCell is a tool for facilitating single-cell analysis with natural language.
ChatCell has acquired profound expertise in single-cell biology.
Our project homepage is available at https://zjunlp.io/project/ChatCell.
arXiv Detail & Related papers (2024-02-13T09:06:14Z) - Prediction of Cellular Identities from Trajectory and Cell Fate Information [0.40964539027092917]
We propose an innovative approach to cell identification during early C. elegans embryogenesis using machine learning.
We employ random forest, embryo, and LSTM models, and test cell classification accuracy on 3D time-lapse datasets spanning the first 4 hours of embryogenesis.
Our research demonstrates the success of predicting cell identities in time-lapse imaging sequences directly from simple spatial-temporal features.
arXiv Detail & Related papers (2024-01-11T03:28:13Z) - Single-Cell Deep Clustering Method Assisted by Exogenous Gene Information: A Novel Approach to Identifying Cell Types [50.55583697209676]
We develop an attention-enhanced graph autoencoder, which is designed to efficiently capture the topological features between cells.
During the clustering process, we integrated both sets of information and reconstructed the features of both cells and genes to generate a discriminative representation.
This research offers enhanced insights into the characteristics and distribution of cells, thereby laying the groundwork for early diagnosis and treatment of diseases.
arXiv Detail & Related papers (2023-11-28T09:14:55Z) - Mixed Models with Multiple Instance Learning [51.440557223100164]
We introduce MixMIL, a framework integrating Generalized Linear Mixed Models (GLMM) and Multiple Instance Learning (MIL).
Our empirical results reveal that MixMIL outperforms existing MIL models in single-cell datasets.
arXiv Detail & Related papers (2023-11-04T16:42:42Z) - Causal machine learning for single-cell genomics [94.28105176231739]
We discuss the application of machine learning techniques to single-cell genomics and their challenges.
We first present the model that underlies most current causal approaches to single-cell biology.
We then identify open problems in the application of causal approaches to single-cell data.
arXiv Detail & Related papers (2023-10-23T13:35:24Z) - Revolutionizing Single Cell Analysis: The Power of Large Language Models for Cell Type Annotation [0.0]
Large language models such as ChatGPT and New Bing provide accurate annotations of cell types.
By using ChatGPT to annotate single-cell data, we can relate rare cell types to their functions.
This can have important applications in understanding cancer progression, mammalian development, and stem cell differentiation.
arXiv Detail & Related papers (2023-04-05T18:45:54Z) - OCELOT: Overlapped Cell on Tissue Dataset for Histopathology [13.691924123273004]
We release OCELOT, a dataset dedicated to the study of cell-tissue relationships for cell detection in histopathology.
We propose multi-task learning approaches that benefit from learning both cell and tissue tasks simultaneously.
On the OCELOT test set in particular, we show up to 6.79 improvement in F1-score.
arXiv Detail & Related papers (2023-03-23T08:57:11Z) - Towards an Automatic Analysis of CHO-K1 Suspension Growth in Microfluidic Single-cell Cultivation [63.94623495501023]
We propose a novel machine learning architecture that allows us to infuse a deep neural network with human-powered abstraction at the level of data.
Specifically, we train a generative model simultaneously on natural and synthetic data, so that it learns a shared representation, from which a target variable, such as the cell count, can be reliably estimated.
arXiv Detail & Related papers (2020-10-20T08:36:51Z)