ReCellTy: Domain-specific knowledge graph retrieval-augmented LLMs workflow for single-cell annotation
- URL: http://arxiv.org/abs/2505.00017v1
- Date: Thu, 24 Apr 2025 01:05:22 GMT
- Title: ReCellTy: Domain-specific knowledge graph retrieval-augmented LLMs workflow for single-cell annotation
- Authors: Dezheng Han, Yibin Jia, Ruxiao Chen, Wenjie Han, Shuaishuai Guo, Jianbo Wang
- Abstract summary: We developed a graph-structured feature marker database to retrieve entities linked to differential genes for cell reconstruction. Our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across 11 tissue types.
- Score: 8.31906400360507
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: To enable precise and fully automated cell type annotation with large language models (LLMs), we developed a graph-structured feature marker database to retrieve entities linked to differential genes for cell reconstruction. We further designed a multi-task workflow to optimize the annotation process. Compared to general-purpose LLMs, our method improves human evaluation scores by up to 0.21 and semantic similarity by 6.1% across 11 tissue types, while aligning more closely with the cognitive logic of manual annotation.
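As a rough illustration of the retrieval step described in the abstract, a minimal sketch might look like the following; the marker graph contents, gene symbols, and the `query_llm` call are hypothetical stand-ins, not the paper's actual database or workflow:

```python
# Toy stand-in for a graph-structured feature marker database:
# edges map a marker gene to the cell types it is linked to.
MARKER_GRAPH = {
    "CD3D": ["T cell"],
    "CD3E": ["T cell"],
    "CD19": ["B cell"],
    "MS4A1": ["B cell"],
    "LYZ": ["Monocyte"],
    "CD14": ["Monocyte"],
}

def retrieve_candidates(diff_genes):
    """Walk the graph from differential genes to linked cell type entities."""
    scores = {}
    for gene in diff_genes:
        for cell_type in MARKER_GRAPH.get(gene, []):
            scores[cell_type] = scores.get(cell_type, 0) + 1
    # Rank candidate cell types by how many differential genes support them.
    return sorted(scores, key=scores.get, reverse=True)

def build_prompt(diff_genes, candidates):
    return (
        f"Differential genes: {', '.join(diff_genes)}.\n"
        f"Candidate cell types from the marker database: {', '.join(candidates)}.\n"
        "Choose the most likely cell type and justify briefly."
    )

diff_genes = ["CD3D", "CD3E", "LYZ"]
prompt = build_prompt(diff_genes, retrieve_candidates(diff_genes))
# The prompt would then be sent to an LLM, e.g. answer = query_llm(prompt).
print(prompt)
```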
Related papers
- GRIT: Graph-Regularized Logit Refinement for Zero-shot Cell Type Annotation [15.465706196179676]
Cell type annotation is a fundamental step in the analysis of single-cell RNA sequencing (scRNA-seq) data. Recent advances in CLIP-style models offer a promising path toward automating cell type annotation. In this paper, we propose to refine the zero-shot logits produced by LangCell through a graph-regularized optimization framework.
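The snippet does not spell out the objective; one plausible reading of graph-regularized logit refinement (a generic Laplacian-smoothing sketch, not necessarily GRIT's exact formulation) solves min_Z ||Z - Z0||^2 + lam * tr(Z^T L Z), which has the closed form Z = (I + lam*L)^(-1) Z0:

```python
# Smooth zero-shot logits over a cell-cell similarity graph.
# z0: raw logits; adj: symmetric adjacency; L = D - A is the graph Laplacian.
import numpy as np

def refine_logits(z0, adj, lam=1.0):
    """z0: (n_cells, n_types) zero-shot logits; adj: (n, n) symmetric adjacency."""
    laplacian = np.diag(adj.sum(axis=1)) - adj
    n = adj.shape[0]
    # Closed-form minimizer of the regularized objective.
    return np.linalg.solve(np.eye(n) + lam * laplacian, z0)

rng = np.random.default_rng(0)
z0 = rng.normal(size=(5, 3))               # toy logits for 5 cells, 3 types
adj = (rng.random((5, 5)) > 0.6).astype(float)
adj = np.triu(adj, 1)
adj = adj + adj.T                          # symmetric, zero diagonal
z = refine_logits(z0, adj, lam=0.5)
print(z.argmax(axis=1))                    # refined cell type predictions
```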
arXiv Detail & Related papers (2025-08-06T07:09:46Z)
- Large Language Models are Good Relational Learners [55.40941576497973]
We introduce Rel-LLM, a novel architecture that utilizes a graph neural network (GNN)-based encoder to generate structured relational prompts for large language models (LLMs). Unlike traditional text-based serialization approaches, our method preserves the inherent relational structure of databases while enabling LLMs to process and reason over complex entity relationships.
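A hedged sketch of the general pattern, not Rel-LLM's actual architecture: one round of mean-neighbor message passing yields structure-aware embeddings, which are then serialized into a prompt rather than flattening rows to raw text. The toy graph and all names here are illustrative:

```python
import numpy as np

def message_pass(feats, edges):
    """feats: (n, d) node features; edges: list of (src, dst) pairs."""
    out = feats.copy()
    for node in range(feats.shape[0]):
        neighbors = [s for s, d in edges if d == node]
        if neighbors:
            # Mix each node's own features with the mean of its neighbors'.
            out[node] = 0.5 * feats[node] + 0.5 * feats[neighbors].mean(axis=0)
    return out

feats = np.eye(4)                      # toy one-hot features for 4 entities
edges = [(0, 1), (2, 1), (3, 2)]       # e.g. customer -> order relations
emb = message_pass(feats, edges)

# Serialize a structured prompt that keeps the relational context explicit.
prompt = "\n".join(
    f"entity {i}: embedding={np.round(e, 2).tolist()}" for i, e in enumerate(emb)
)
print("Answer using the relational context below:\n" + prompt)
```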
arXiv Detail & Related papers (2025-06-06T04:07:55Z)
- OrderChain: A General Prompting Paradigm to Improve Ordinal Understanding Ability of MLLM [28.249198952483685]
This paper presents OrderChain, a novel and general prompting paradigm that improves the ordinal understanding ability of MLLMs through specificity and commonality modeling. Comprehensive experiments show that a Large Language and Vision Assistant model with OrderChain improves baseline LLaVA significantly on diverse ordinal regression (OR) datasets. To the best of our knowledge, OrderChain is the first work to augment MLLMs for OR tasks, and its effectiveness holds across a spectrum of OR datasets.
arXiv Detail & Related papers (2025-04-07T07:53:44Z)
- Model Generalization on Text Attribute Graphs: Principles with Large Language Models [14.657522068231138]
Large language models (LLMs) have been introduced to graph learning, aiming to extend their zero-shot generalization success to tasks where labeled graph data is scarce. We develop a framework for inference over text-attributed graphs (TAGs) based on task-adaptive embeddings and a generalizable graph information aggregation mechanism. Evaluations on 11 real-world TAG benchmarks demonstrate that LLM-BP significantly outperforms existing approaches.
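A minimal sketch of the aggregation idea, assuming a label-propagation-style scheme over an embedding-similarity graph; the paper's actual mechanism may differ, and `embed` is a hypothetical stand-in for a real LLM embedding API:

```python
import numpy as np

def embed(texts):
    """Hypothetical stand-in for an LLM embedding API."""
    rng = np.random.default_rng(42)
    return rng.normal(size=(len(texts), 8))

texts = ["paper on GNNs", "paper on graph transformers", "paper on cooking"]
x = embed(texts)
x = x / np.linalg.norm(x, axis=1, keepdims=True)
w = np.maximum(x @ x.T, 0.0)           # cosine-similarity edge weights
np.fill_diagonal(w, 0.0)
w = w / np.maximum(w.sum(axis=1, keepdims=True), 1e-9)  # row-normalize

labels = {0: 0, 2: 1}                  # two labeled nodes, two classes
beliefs = np.zeros((3, 2))
for i, c in labels.items():
    beliefs[i, c] = 1.0
for _ in range(10):                    # iterative neighbor aggregation
    beliefs = 0.5 * beliefs + 0.5 * (w @ beliefs)
    for i, c in labels.items():        # clamp the labeled nodes
        beliefs[i] = 0.0
        beliefs[i, c] = 1.0
print(beliefs.argmax(axis=1))          # predicted class per node
```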
arXiv Detail & Related papers (2025-02-17T14:31:00Z)
- Single-Cell Omics Arena: A Benchmark Study for Large Language Models on Cell Type Annotation Using Single-Cell Data [13.56585855722118]
Large language models (LLMs) have demonstrated their ability to efficiently process and synthesize vast corpora of text to automatically extract biological knowledge. Our study explores the potential of LLMs to accurately classify and annotate cell types in single-cell RNA sequencing (scRNA-seq) data. The results demonstrate that LLMs can provide robust interpretations of single-cell data without requiring additional fine-tuning.
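A minimal sketch of the kind of zero-shot annotation query such a benchmark would issue; the prompt wording, cluster data, and the `chat_completion` helper are hypothetical:

```python
def make_annotation_prompt(tissue, top_genes):
    """Build a zero-shot cell type annotation prompt from marker genes."""
    return (
        f"You are annotating {tissue} scRNA-seq clusters.\n"
        f"Top differentially expressed genes: {', '.join(top_genes)}.\n"
        "Reply with the single most likely cell type."
    )

clusters = {
    "cluster_0": ["CD3D", "IL7R", "CCR7"],
    "cluster_1": ["CD79A", "MS4A1"],
}
for name, genes in clusters.items():
    prompt = make_annotation_prompt("peripheral blood", genes)
    # prediction = chat_completion(prompt)  # any off-the-shelf LLM, no fine-tuning
    print(name, "->", prompt.splitlines()[1])
```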
arXiv Detail & Related papers (2024-12-03T23:58:35Z)
- How to Make LLMs Strong Node Classifiers? [70.14063765424012]
Language Models (LMs) are challenging the dominance of domain-specific models, such as Graph Neural Networks (GNNs) and Graph Transformers (GTs). We propose a novel approach that empowers off-the-shelf LMs to achieve performance comparable to state-of-the-art (SOTA) GNNs on node classification tasks.
arXiv Detail & Related papers (2024-10-03T08:27:54Z)
- Interpretable Target-Feature Aggregation for Multi-Task Learning based on Bias-Variance Analysis [53.38518232934096]
Multi-task learning (MTL) is a powerful machine learning paradigm designed to leverage shared knowledge across tasks to improve generalization and performance.
We propose an MTL approach at the intersection between task clustering and feature transformation based on a two-phase iterative aggregation of targets and features.
In both phases, a key aspect is to preserve the interpretability of the reduced targets and features through the aggregation with the mean, which is motivated by applications to Earth science.
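A simplified single-pass sketch of mean-based target aggregation, assuming correlation-thresholded grouping; the paper describes a two-phase iterative procedure over both targets and features, and this only illustrates the target side:

```python
import numpy as np

def aggregate_by_correlation(y, threshold=0.9):
    """y: (n_samples, n_targets). Group highly correlated targets and
    replace each group with its mean, keeping the reduced targets
    interpretable as averages of the originals."""
    corr = np.corrcoef(y.T)
    groups, assigned = [], set()
    for i in range(y.shape[1]):
        if i in assigned:
            continue
        group = [j for j in range(y.shape[1])
                 if j not in assigned and corr[i, j] >= threshold]
        assigned.update(group)
        groups.append(group)
    reduced = np.column_stack([y[:, g].mean(axis=1) for g in groups])
    return reduced, groups

rng = np.random.default_rng(1)
base = rng.normal(size=(100, 1))
y = np.hstack([base + 0.05 * rng.normal(size=(100, 2)),  # two correlated targets
               rng.normal(size=(100, 1))])               # one independent target
y_red, groups = aggregate_by_correlation(y)
print(groups, y_red.shape)   # e.g. [[0, 1], [2]] (100, 2)
```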
arXiv Detail & Related papers (2024-06-12T08:30:16Z)
- Learning to Extract Structured Entities Using Language Models [52.281701191329]
Recent advances in machine learning have significantly impacted the field of information extraction.
We reformulate the task to be entity-centric, enabling the use of diverse metrics.
We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP metric.
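A sketch in the spirit of an approximate entity-set overlap metric, using simplified greedy matching; this is illustrative only, not the paper's exact AESOP definition:

```python
def entity_sim(pred, gold):
    """Fraction of property keys on which two entities agree."""
    keys = set(pred) | set(gold)
    if not keys:
        return 0.0
    return sum(pred.get(k) == gold.get(k) for k in keys) / len(keys)

def approx_set_overlap(preds, golds):
    """Greedily match predicted entities to gold entities, then average
    the property-level similarities over the larger of the two sets."""
    if not preds and not golds:
        return 1.0
    scores, used = [], set()
    for p in preds:
        best, best_j = 0.0, None
        for j, g in enumerate(golds):
            s = entity_sim(p, g)
            if j not in used and s > best:
                best, best_j = s, j
        if best_j is not None:
            used.add(best_j)
        scores.append(best)
    # Unmatched gold entities count as zero-score slots in the denominator.
    return sum(scores) / max(len(preds), len(golds))

gold = [{"name": "Ada Lovelace", "field": "mathematics"}]
pred = [{"name": "Ada Lovelace", "field": "computing"}]
print(approx_set_overlap(pred, gold))  # 0.5: name matches, field does not
```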
arXiv Detail & Related papers (2024-02-06T22:15:09Z)
- Distantly Supervised Morpho-Syntactic Model for Relation Extraction [0.27195102129094995]
We present a method for the extraction and categorisation of an unrestricted set of relationships from text.
We evaluate our approach on six datasets built on Wikidata and Wikipedia.
arXiv Detail & Related papers (2024-01-18T14:17:40Z)
- From Quantity to Quality: Boosting LLM Performance with Self-Guided Data Selection for Instruction Tuning [52.257422715393574]
We introduce a self-guided methodology for Large Language Models (LLMs) to autonomously discern and select cherry samples from open-source datasets.
Our key innovation, the Instruction-Following Difficulty (IFD) metric, identifies discrepancies between a model's expected responses and its intrinsic generation capability.
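A minimal sketch of the IFD idea, assuming it is computed as the ratio of the model's loss on the answer conditioned on the instruction to its loss on the answer alone; GPT-2 serves purely as a small stand-in model, and the paper's exact template may differ:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def answer_nll(prompt, answer):
    """Average cross-entropy over the answer tokens, given an optional prompt."""
    prompt_ids = tok(prompt).input_ids if prompt else []
    answer_ids = tok(answer).input_ids
    ids = torch.tensor([prompt_ids + answer_ids])
    labels = ids.clone()
    labels[0, : len(prompt_ids)] = -100   # score only the answer span
    with torch.no_grad():
        return model(ids, labels=labels).loss.item()

def ifd(instruction, answer):
    # IFD near or above 1 means the instruction barely helps the model
    # predict the answer, i.e. a harder, more informative training sample.
    return answer_nll(instruction, answer) / answer_nll("", answer)

print(ifd("Translate to French: Hello, world.", " Bonjour, le monde."))
```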
arXiv Detail & Related papers (2023-08-23T09:45:29Z)
- Controllable Data Augmentation for Few-Shot Text Mining with Chain-of-Thought Attribute Manipulation [35.33340453046864]
Chain-of-Thought Attribute Manipulation (CoTAM) is a novel approach that generates new data from existing examples.
We leverage chain-of-thought prompting to directly edit the text in three steps: (1) attribute decomposition, (2) manipulation proposal, and (3) sentence reconstruction.
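A hedged sketch of that three-step chain; the prompt wording and the `chat` helper are stand-ins, not the paper's actual prompts:

```python
def cotam_manipulate(chat, sentence, target_attribute):
    """chat: any callable that sends a prompt to an LLM and returns text."""
    # (1) attribute decomposition: list the attributes the sentence exhibits.
    attrs = chat(f"List the key attributes (e.g. sentiment, topic, tense) "
                 f"of this sentence:\n{sentence}")
    # (2) manipulation proposal: describe how to flip one attribute only.
    plan = chat(f"Given attributes:\n{attrs}\nPropose how to rewrite the "
                f"sentence so that only this changes: {target_attribute}.")
    # (3) sentence reconstruction: produce the new sentence from the plan.
    return chat(f"Rewrite the sentence following this plan:\n{plan}\n"
                f"Original: {sentence}\nRewritten:")

# Usage with any chat LLM wrapper:
# new_example = cotam_manipulate(chat, "The movie was wonderful.",
#                                "sentiment -> negative")
```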
arXiv Detail & Related papers (2023-07-14T00:10:03Z)
- Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering [52.09178018466104]
We introduce Context-Aware Automated Feature Engineering (CAAFE) to generate semantically meaningful features for datasets.
Despite being methodologically simple, CAAFE improves performance on 11 out of 14 datasets.
We highlight the significance of context-aware solutions that can extend the scope of AutoML systems to semantic AutoML.
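A hedged sketch of a CAAFE-style loop, assuming the LLM returns pandas feature-engineering code given the dataset's textual context, which is kept only when cross-validated accuracy improves; `propose_feature_code` is a hypothetical stand-in for the LLM call, and a RandomForest replaces the paper's downstream evaluator:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def evaluate(df, target):
    """Cross-validated accuracy on the numeric columns of df."""
    x = df.drop(columns=[target]).select_dtypes("number").fillna(0)
    return cross_val_score(RandomForestClassifier(random_state=0),
                           x, df[target], cv=3).mean()

def caafe_loop(df, target, context, n_rounds=5):
    best = evaluate(df, target)
    for _ in range(n_rounds):
        # Hypothetical LLM call: returns pandas code that adds one feature,
        # e.g. 'df["bmi"] = df.weight / df.height ** 2'.
        code = propose_feature_code(context, list(df.columns))
        candidate = df.copy()
        exec(code, {"pd": pd, "np": np, "df": candidate})  # run LLM-written code
        score = evaluate(candidate, target)
        if score > best:                # keep only features that help
            df, best = candidate, score
    return df, best
```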
arXiv Detail & Related papers (2023-05-05T09:58:40Z)