scInterpreter: Training Large Language Models to Interpret scRNA-seq
Data for Cell Type Annotation
- URL: http://arxiv.org/abs/2402.12405v1
- Date: Sun, 18 Feb 2024 05:39:00 GMT
- Title: scInterpreter: Training Large Language Models to Interpret scRNA-seq
Data for Cell Type Annotation
- Authors: Cong Li, Meng Xiao, Pengfei Wang, Guihai Feng, Xin Li, Yuanchun Zhou
- Abstract summary: This research focuses on how to train and adapt the Large Language Model with the capability to interpret and distinguish cell types in single-cell RNA sequencing data.
- Score: 15.718901418627366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the inherent limitations of existing Large Language Models in
directly reading and interpreting single-cell omics data, they demonstrate
significant potential and flexibility as foundation models. This research
focuses on how to train and adapt a large language model so that it can
interpret and distinguish cell types in single-cell RNA sequencing data. Our
preliminary research results indicate that these foundational models excel in
accurately categorizing known cell types, demonstrating the potential of the
Large Language Models as effective tools for uncovering new biological
insights.
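To make the recipe concrete, here is a minimal sketch of one common way to hand an expression profile to a language model: rank genes by expression and serialize the top of the ranking as text for the model to classify. The `cell_to_gene_prompt` helper, the ranking heuristic, and the prompt wording are illustrative assumptions, not scInterpreter's actual encoding.

```python
import numpy as np

def cell_to_gene_prompt(expression: np.ndarray, gene_names: list[str], top_k: int = 50) -> str:
    """Rank genes by expression and serialize the top-k as a text prompt.

    This mirrors the common 'cell sentence' trick for feeding scRNA-seq
    profiles to a language model; the exact encoding used by scInterpreter
    may differ (assumption).
    """
    order = np.argsort(expression)[::-1][:top_k]  # indices of most-expressed genes
    ranked = " ".join(gene_names[i] for i in order)
    return (
        "The following genes are ordered from highest to lowest expression "
        f"in a single cell: {ranked}. What is the most likely cell type?"
    )

# Hypothetical usage with a toy profile of five genes.
genes = ["CD3D", "CD8A", "MS4A1", "NKG7", "LYZ"]
profile = np.array([5.2, 4.8, 0.1, 0.3, 0.0])
print(cell_to_gene_prompt(profile, genes, top_k=3))
```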
Related papers
- Transformer-based Single-Cell Language Model: A Survey [5.228439173541588]
We provide a detailed introduction to the structure and principles of transformers.
We review the single-cell language models and large language models for single-cell data analysis.
We discuss the challenges of single-cell language models and provide promising research directions.
arXiv Detail & Related papers (2024-07-18T06:43:12Z)
- Scalable Amortized GPLVMs for Single Cell Transcriptomics Data [9.010523724015398]
Dimensionality reduction is crucial for analyzing large-scale single-cell RNA-seq data.
We introduce an improved model, an amortized variational Bayesian Gaussian process latent variable model (BGPLVM).
BGPLVM is tailored for single-cell RNA-seq with specialized encoder, kernel, and likelihood designs.
arXiv Detail & Related papers (2024-05-06T21:54:38Z)
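As a companion to the BGPLVM entry above, a minimal PyTorch sketch of the amortization idea: a shared encoder maps each cell directly to its variational latent parameters, so inference cost does not grow with the number of cells. The layer sizes are assumptions, and the GP decoder that maps latents back to expression is omitted for brevity.

```python
import torch
import torch.nn as nn

class AmortizedEncoder(nn.Module):
    """Map a cell's expression vector to the mean and log-variance of its
    latent coordinate. One shared network serves every cell, which is the
    'amortized' part; the BGPLVM decoder is omitted here."""

    def __init__(self, n_genes: int, latent_dim: int = 10):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_genes, 128), nn.ReLU())
        self.mu = nn.Linear(128, latent_dim)
        self.log_var = nn.Linear(128, latent_dim)

    def forward(self, x: torch.Tensor):
        h = self.net(x)
        return self.mu(h), self.log_var(h)

# Toy batch: 32 cells x 2000 genes (illustrative sizes).
enc = AmortizedEncoder(n_genes=2000)
mu, log_var = enc(torch.randn(32, 2000))
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization trick
```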
- Critical Data Size of Language Models from a Grokking Perspective [35.029074833552656]
We formalize the phase transition under the grokking configuration into the Data Efficiency Hypothesis.
We show that generalization occurs only when language models reach a critical data size.
Our results deepen the understanding of language model training, offering a novel perspective on the role of data in the learning mechanism of language models.
arXiv Detail & Related papers (2024-01-19T03:24:36Z)
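Operationally, the grokking entry above is a sweep over training-set sizes: train identical models on growing fractions of the data and find the fraction at which held-out accuracy jumps from chance to near-perfect. The sketch below illustrates that protocol with a toy stand-in for the train/eval loop; the logistic curve is a placeholder, not the paper's data.

```python
import math

def train_and_eval(train_fraction: float) -> float:
    """Toy stand-in for a full training run: a sharp logistic in the data
    fraction, mimicking the phase transition the paper formalizes. Replace
    with a real train/eval loop in practice (assumption)."""
    return 1.0 / (1.0 + math.exp(-40 * (train_fraction - 0.35)))

def find_critical_fraction(fractions, threshold=0.9):
    """Smallest data fraction at which held-out accuracy clears `threshold`."""
    for frac in sorted(fractions):
        if train_and_eval(frac) >= threshold:
            return frac  # first size where generalization (grokking) appears
    return None

print(find_critical_fraction([i / 20 for i in range(1, 21)]))
```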
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
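To illustrate the TEMED-LLM entry above, a minimal extract-then-parse sketch: prompt a text-completion model for a fixed JSON schema and parse the reply into a table row. The field set, prompt wording, and `call_llm` callable are hypothetical, not the paper's actual template.

```python
import json

EXTRACTION_PROMPT = """Extract the following fields from the medical report
as JSON with keys "age", "sex", and "diagnosis". Report:
{report}
JSON:"""

def extract_table_row(report: str, call_llm) -> dict:
    """Prompt an LLM to emit one structured row per report, then parse it.
    `call_llm` is any text-completion callable (assumption)."""
    raw = call_llm(EXTRACTION_PROMPT.format(report=report))
    return json.loads(raw)

# Hypothetical usage with a canned model response.
fake_llm = lambda prompt: '{"age": 54, "sex": "F", "diagnosis": "T2 diabetes"}'
print(extract_table_row("54-year-old woman with type 2 diabetes.", fake_llm))
```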
- Revolutionizing Single Cell Analysis: The Power of Large Language Models for Cell Type Annotation [0.0]
Large language models such as ChatGPT and New Bing provide accurate annotations of cell types.
By using ChatGPT to annotate single cell data, we can relate rare cell types to their functions.
This can have important applications in understanding cancer progression, mammalian development, and stem cell differentiation.
arXiv Detail & Related papers (2023-04-05T18:45:54Z)
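The annotation workflow in the entry above largely reduces to formatting per-cluster marker genes into a question for the chat model. The sketch below shows that formatting step; the prompt wording is an illustrative assumption.

```python
def cluster_annotation_prompt(markers: dict[str, list[str]], tissue: str) -> str:
    """Format per-cluster marker genes into a single annotation query for a
    chat-style LLM; the exact wording is an illustrative assumption."""
    lines = [f"Cluster {cid}: {', '.join(genes)}" for cid, genes in markers.items()]
    return (
        f"Identify the cell type of each {tissue} cluster "
        "given its top marker genes.\n" + "\n".join(lines)
    )

# Toy marker sets for three PBMC clusters.
markers = {"0": ["CD3D", "IL7R"], "1": ["MS4A1", "CD79A"], "2": ["LYZ", "CD14"]}
print(cluster_annotation_prompt(markers, tissue="PBMC"))
```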
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
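A toy PyTorch reading of the Sparse Latent Typing entry above: give each token a gated distribution over latent types and penalize the gate so that only a few keyword tokens carry a type. The paper's actual objective is more involved; this is only the skeleton, under assumed layer sizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseTyper(nn.Module):
    """Each token gets a distribution over K latent types plus a keep-token
    gate; an L1 penalty on the gate pushes most tokens to carry no type,
    so only keywords end up typed (sketch, not the paper's exact loss)."""

    def __init__(self, hidden: int, n_types: int = 16):
        super().__init__()
        self.type_logits = nn.Linear(hidden, n_types)
        self.gate = nn.Linear(hidden, 1)

    def forward(self, token_states: torch.Tensor):
        types = F.softmax(self.type_logits(token_states), dim=-1)
        gate = torch.sigmoid(self.gate(token_states))  # keep-token probability
        sparsity_loss = gate.mean()                    # L1 penalty (gate >= 0)
        return gate * types, sparsity_loss

typer = SparseTyper(hidden=64)
typed, penalty = typer(torch.randn(2, 12, 64))  # batch x tokens x hidden
```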
- CCRL: Contrastive Cell Representation Learning [0.0]
We propose the Contrastive Cell Representation Learning (CCRL) model for cell identification in H&E slides.
We show that this model can outperform all currently available cell clustering models by a large margin across two datasets from different tissue types.
arXiv Detail & Related papers (2022-08-12T18:12:03Z)
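The CCRL entry above builds on the standard self-supervised contrastive core. Below is a minimal SimCLR-style NT-Xent loss over two augmented views of a batch of cell embeddings; it approximates that generic core rather than CCRL's full framework.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """SimCLR-style contrastive loss: each view's positive is the other view
    of the same cell; all remaining batch entries act as negatives."""
    z = F.normalize(torch.cat([z1, z2]), dim=1)   # 2N x d, unit-norm
    sim = z @ z.t() / tau                         # pairwise similarities
    n = z1.size(0)
    sim.fill_diagonal_(float("-inf"))             # exclude self-pairs
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

# Toy batch of 8 cells, 128-dim embeddings for two augmented views.
loss = nt_xent(torch.randn(8, 128), torch.randn(8, 128))
```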
- Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data.
Based on experimental results, neural semantic parsers that leverage GAP obtain new state-of-the-art results on both the Spider and Criteria-to-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)
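The data side of the GAP entry above can be sketched as: serialize each table schema to text, let a generation model invent an utterance for it, and keep the (utterance, schema) pair as pre-training data. `generate_utterance` is a hypothetical stand-in for the finetuned generator, and the serialization format is an assumption.

```python
def serialize_schema(tables: dict[str, list[str]]) -> str:
    """Flatten a schema to the text form a seq2seq generator can consume."""
    return " | ".join(f"{t}: {', '.join(cols)}" for t, cols in tables.items())

def make_pretrain_pairs(schemas, generate_utterance):
    """Pair each schema with a synthesized utterance for joint pre-training."""
    pairs = []
    for tables in schemas:
        schema_text = serialize_schema(tables)
        pairs.append((generate_utterance(schema_text), schema_text))
    return pairs

# Hypothetical usage with a canned generator output.
schema = {"singer": ["name", "age", "country"]}
fake_gen = lambda s: "How many singers are from each country?"
print(make_pretrain_pairs([schema], fake_gen))
```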
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
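For the decomposition entry above, here is only the architectural skeleton: two projection heads split a pretrained cross-lingual vector into candidate domain-invariant and domain-specific parts. The mutual-information training signal the paper uses to enforce the split is omitted.

```python
import torch
import torch.nn as nn

class FeatureDecomposer(nn.Module):
    """Split a pretrained cross-lingual representation into domain-invariant
    and domain-specific parts with two projection heads. The paper trains
    these heads with a mutual-information estimator, not shown here."""

    def __init__(self, dim: int):
        super().__init__()
        self.invariant = nn.Linear(dim, dim)
        self.specific = nn.Linear(dim, dim)

    def forward(self, h: torch.Tensor):
        return self.invariant(h), self.specific(h)

# Toy batch of 4 sentence vectors at an assumed 768-dim encoder width.
dec = FeatureDecomposer(dim=768)
h_inv, h_spec = dec(torch.randn(4, 768))
```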
- Multi-timescale Representation Learning in LSTM Language Models [69.98840820213937]
Language models must capture statistical dependencies between words at timescales ranging from very short to very long.
We derived a theory for how the memory gating mechanism in long short-term memory language models can capture power law decay.
Experiments showed that LSTM language models trained on natural English text learn to approximate this theoretical distribution.
arXiv Detail & Related papers (2020-09-27T02:13:38Z)
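The gating theory in the LSTM entry above has a compact quantitative core: a forget gate held at value f decays stored information as f^t, giving a characteristic timescale of roughly -1/ln(f). The sketch below computes this, showing how a spread of gate values yields the spread of timescales needed to approximate power-law decay.

```python
import math

def forget_gate_timescale(f: float) -> float:
    """A unit whose forget gate sits at f decays memory as f**t, so its
    characteristic timescale is T = -1 / ln(f): gates near 1 give long
    memories, gates near 0 give short ones."""
    return -1.0 / math.log(f)

# A spread of gate values produces a spread of timescales.
for f in (0.5, 0.9, 0.99):
    print(f"f = {f:.2f} -> timescale ~ {forget_gate_timescale(f):.1f} steps")
```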
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
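The augmentation entry above boils down to expanding each labeled utterance with generated surface variants while carrying the label over. The sketch below shows that expansion loop; `paraphrase` is a hypothetical stand-in for the pretrained generation model the paper finetunes.

```python
def augment(dataset, paraphrase, n_variants: int = 3):
    """Expand (utterance, intent) pairs with generated surface variants,
    keeping the original label for each variant."""
    augmented = list(dataset)
    for utterance, intent in dataset:
        augmented.extend((paraphrase(utterance, k), intent) for k in range(n_variants))
    return augmented

# Hypothetical usage with a canned paraphraser.
seed = [("book a table for two at 7pm", "make_reservation")]
fake_paraphrase = lambda u, k: f"{u} please ({k})"  # stand-in generator output
print(augment(seed, fake_paraphrase))
```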