scInterpreter: Training Large Language Models to Interpret scRNA-seq
Data for Cell Type Annotation
- URL: http://arxiv.org/abs/2402.12405v1
- Date: Sun, 18 Feb 2024 05:39:00 GMT
- Title: scInterpreter: Training Large Language Models to Interpret scRNA-seq
Data for Cell Type Annotation
- Authors: Cong Li, Meng Xiao, Pengfei Wang, Guihai Feng, Xin Li, Yuanchun Zhou
- Abstract summary: This research focuses on how to train and adapt Large Language Models so that they can interpret and distinguish cell types in single-cell RNA sequencing data.
- Score: 15.718901418627366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the inherent limitations of existing Large Language Models in directly reading and interpreting single-cell omics data, they demonstrate significant potential and flexibility as foundation models. This research focuses on how to train and adapt Large Language Models so that they can interpret and distinguish cell types in single-cell RNA sequencing data. Our preliminary results indicate that these foundation models excel at accurately categorizing known cell types, demonstrating the potential of Large Language Models as effective tools for uncovering new biological insights.
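As a rough illustration of the setting described in the abstract, the sketch below shows one plausible way to present a single cell's expression profile to an LLM for cell type annotation: rank genes by expression and phrase the top genes as a textual prompt. The gene panel, candidate labels, and prompt wording are illustrative assumptions, not the paper's actual training setup.

```python
# Minimal sketch (not the authors' code): one plausible way to turn an
# scRNA-seq expression profile into an LLM prompt for cell type annotation.
# Gene names, the label set, and the prompt format are illustrative assumptions.
import numpy as np

genes = np.array(["CD3D", "CD19", "LYZ", "NKG7", "PPBP", "MS4A1"])  # toy gene panel
candidate_types = ["T cell", "B cell", "Monocyte", "NK cell"]       # toy label set

def cell_to_prompt(expression: np.ndarray, top_k: int = 3) -> str:
    """Rank genes by expression and describe the cell by its top-expressed genes."""
    top = genes[np.argsort(expression)[::-1][:top_k]]
    return (
        "The following genes are most highly expressed in a single cell: "
        + ", ".join(top)
        + ". Which of these cell types fits best: "
        + ", ".join(candidate_types)
        + "? Answer with one label."
    )

# Toy expression vector for one cell (counts per gene in the panel above).
cell = np.array([8.0, 0.5, 0.2, 1.0, 0.0, 0.3])
prompt = cell_to_prompt(cell)
print(prompt)  # this prompt would then be passed to the (fine-tuned) LLM
```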
Related papers
- A generative framework to bridge data-driven models and scientific theories in language neuroscience [84.76462599023802]
We present generative explanation-mediated validation, a framework for generating concise explanations of language selectivity in the brain.
We show that explanatory accuracy is closely related to the predictive power and stability of the underlying statistical models.
arXiv Detail & Related papers (2024-10-01T15:57:48Z)
- Boosting the Capabilities of Compact Models in Low-Data Contexts with Large Language Models and Retrieval-Augmented Generation [2.9921619703037274]
We propose a retrieval augmented generation (RAG) framework backed by a large language model (LLM) to correct the output of a smaller model for the linguistic task of morphological glossing.
We leverage linguistic information to make up for the lack of data and trainable parameters, while allowing for inputs from written descriptive grammars interpreted and distilled through an LLM.
We show that a compact, RAG-supported model is highly effective in data-scarce settings, achieving a new state-of-the-art for this task and our target languages.
arXiv Detail & Related papers (2024-10-01T04:20:14Z)
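As a rough illustration of the RAG-plus-LLM correction pipeline summarized above, the sketch below retrieves grammar passages by simple word overlap and assembles a correction prompt for a draft gloss produced by a smaller model. The retrieval heuristic, prompt wording, and the toy gloss example are assumptions, not the paper's implementation.

```python
# Minimal sketch (not the paper's implementation) of the RAG idea described above:
# retrieve relevant passages from a descriptive grammar and ask an LLM to correct
# the draft gloss produced by a smaller model.
def retrieve(query: str, grammar_passages: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank passages by word overlap with the query."""
    def overlap(p: str) -> int:
        return len(set(query.lower().split()) & set(p.lower().split()))
    return sorted(grammar_passages, key=overlap, reverse=True)[:k]

def build_correction_prompt(sentence: str, draft_gloss: str, passages: list[str]) -> str:
    """Assemble the prompt that the backing LLM would receive."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        f"Grammar notes:\n{context}\n\n"
        f"Sentence: {sentence}\n"
        f"Draft gloss from a small model: {draft_gloss}\n"
        "Correct the gloss if needed and return the final gloss."
    )

# Toy grammar passages and a toy sentence in an invented language.
passages = [
    "The suffix -ka marks past tense on verbs.",
    "Nouns take the plural suffix -lar.",
    "The particle ma negates the verb.",
]
prompt = build_correction_prompt(
    "bala-lar kel-ka", "child-PL come-PRS", retrieve("past tense suffix", passages)
)
print(prompt)  # the prompt would be sent to the backing LLM for correction
```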
- EEG-Language Modeling for Pathology Detection [0.0]
This study pioneers EEG-language models trained on clinical reports and 15,000 EEGs.
Our results indicate that models learn richer representations from being exposed to a variety of report segments.
Representations of EEG-language models can significantly improve pathology detection compared to those of EEG-only models.
arXiv Detail & Related papers (2024-09-02T10:03:03Z)
- Critical Data Size of Language Models from a Grokking Perspective [35.029074833552656]
We formalize the phase transition under the grokking configuration into the Data Efficiency Hypothesis.
We show that generalization occurs only when language models reach a critical size.
Our results deepen the understanding of language model training, offering a novel perspective on the role of data in the learning mechanism of language models.
arXiv Detail & Related papers (2024-01-19T03:24:36Z)
- Interpretable Medical Diagnostics with Structured Data Extraction by Large Language Models [59.89454513692417]
Tabular data is often hidden in text, particularly in medical diagnostic reports.
We propose a novel, simple, and effective methodology for extracting structured tabular data from textual medical reports, called TEMED-LLM.
We demonstrate that our approach significantly outperforms state-of-the-art text classification models in medical diagnostics.
arXiv Detail & Related papers (2023-06-08T09:12:28Z)
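As a rough illustration of the TEMED-LLM pipeline summarized above, the sketch below builds an extraction prompt that asks an LLM to return a fixed tabular schema as JSON and parses the reply into a row for a downstream classifier. The schema, prompt wording, and mocked reply are illustrative assumptions, not the paper's actual method.

```python
# Minimal sketch of the idea summarized above (not the authors' code):
# prompt an LLM to turn a free-text medical report into a fixed tabular schema,
# then feed the resulting rows to a standard classifier.
import json

SCHEMA = ["age", "tumor_size_mm", "lymph_nodes_positive", "diagnosis"]  # assumed schema

def build_extraction_prompt(report: str) -> str:
    return (
        "Extract the following fields from the medical report and answer with JSON "
        f"using exactly these keys: {', '.join(SCHEMA)}.\n\nReport:\n{report}"
    )

def parse_llm_answer(answer: str) -> dict:
    """Parse the model's JSON answer into one table row, keeping only known keys."""
    row = json.loads(answer)
    return {key: row.get(key) for key in SCHEMA}

report = "58-year-old patient, 22 mm lesion, 1 of 12 nodes positive, invasive carcinoma."
prompt = build_extraction_prompt(report)
# A real system would send `prompt` to an LLM; here the reply is mocked.
mock_answer = '{"age": 58, "tumor_size_mm": 22, "lymph_nodes_positive": 1, "diagnosis": "invasive carcinoma"}'
print(parse_llm_answer(mock_answer))  # structured row ready for a downstream classifier
```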
- Revolutionizing Single Cell Analysis: The Power of Large Language Models for Cell Type Annotation [0.0]
Large language models such as ChatGPT and New Bing provide accurate annotations of cell types.
By using ChatGPT to annotate single-cell data, we can relate rare cell types to their functions.
This can have important applications in understanding cancer progression, mammalian development, and stem cell differentiation.
arXiv Detail & Related papers (2023-04-05T18:45:54Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- CCRL: Contrastive Cell Representation Learning [0.0]
We propose the Contrastive Cell Representation Learning (CCRL) model for cell identification in H&E slides.
We show that this model can outperform all currently available cell clustering models by a large margin across two datasets from different tissue types.
arXiv Detail & Related papers (2022-08-12T18:12:03Z)
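As a rough illustration of the contrastive objective behind CCRL, the sketch below computes an InfoNCE loss over paired embeddings of two augmented views of the same cells: matching cells are positives and the other cells in the batch act as negatives. The temperature and toy embeddings are assumptions, not the paper's configuration.

```python
# Minimal sketch of a contrastive (InfoNCE-style) objective, as used in
# self-supervised representation learning; not the CCRL authors' code.
import numpy as np

def info_nce(z_a: np.ndarray, z_b: np.ndarray, temperature: float = 0.1) -> float:
    """InfoNCE loss for a batch of paired embeddings, each of shape (batch, dim)."""
    # L2-normalize so the dot product is cosine similarity.
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature          # (batch, batch) similarity matrix
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal (view A of cell i matches view B of cell i).
    return float(-np.mean(np.diag(log_probs)))

rng = np.random.default_rng(0)
z_view_a = rng.normal(size=(8, 16))                      # toy embeddings, view A
z_view_b = z_view_a + 0.05 * rng.normal(size=(8, 16))    # slightly perturbed view B
print(info_nce(z_view_a, z_view_b))
```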
- Learning Contextual Representations for Semantic Parsing with Generation-Augmented Pre-Training [86.91380874390778]
We present Generation-Augmented Pre-training (GAP), which jointly learns representations of natural language utterances and table schemas by leveraging generation models to generate pre-training data.
Based on experimental results, neural semantic parsers that leverage the GAP model obtain new state-of-the-art results on both the SPIDER and CRITERIA-TO-SQL benchmarks.
arXiv Detail & Related papers (2020-12-18T15:53:50Z)
- Unsupervised Domain Adaptation of a Pretrained Cross-Lingual Language Model [58.27176041092891]
Recent research indicates that pretraining cross-lingual language models on large-scale unlabeled texts yields significant performance improvements.
We propose a novel unsupervised feature decomposition method that can automatically extract domain-specific features from the entangled pretrained cross-lingual representations.
Our proposed model leverages mutual information estimation to decompose the representations computed by a cross-lingual model into domain-invariant and domain-specific parts.
arXiv Detail & Related papers (2020-11-23T16:00:42Z)
- Data Augmentation for Spoken Language Understanding via Pretrained Language Models [113.56329266325902]
Training of spoken language understanding (SLU) models often faces the problem of data scarcity.
We put forward a data augmentation method using pretrained language models to boost the variability and accuracy of generated utterances.
arXiv Detail & Related papers (2020-04-29T04:07:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.