Related papers: RareBench: Can LLMs Serve as Rare Diseases Specialists?

URL: http://arxiv.org/abs/2402.06341v2
Date: Thu, 4 Jul 2024 09:10:17 GMT
Title: RareBench: Can LLMs Serve as Rare Diseases Specialists?
Authors: Xuanzhong Chen, Xiaohao Mao, Qihan Guo, Lun Wang, Shuyang Zhang, Ting Chen,
Abstract summary: Generalist Large Language Models (LLMs) have shown considerable promise in various domains, including medical diagnosis. Rare diseases, affecting approximately 300 million people worldwide, often have unsatisfactory clinical diagnosis rates. RareBench is a pioneering benchmark designed to evaluate the capabilities of LLMs on 4 critical dimensions within the realm of rare diseases. We present an exhaustive comparative study of GPT-4's diagnostic capabilities against those of specialist physicians.
Score: 11.828142771893443
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generalist Large Language Models (LLMs), such as GPT-4, have shown considerable promise in various domains, including medical diagnosis. Rare diseases, affecting approximately 300 million people worldwide, often have unsatisfactory clinical diagnosis rates primarily due to a lack of experienced physicians and the complexity of differentiating among many rare diseases. In this context, recent news such as "ChatGPT correctly diagnosed a 4-year-old's rare disease after 17 doctors failed" underscore LLMs' potential, yet underexplored, role in clinically diagnosing rare diseases. To bridge this research gap, we introduce RareBench, a pioneering benchmark designed to systematically evaluate the capabilities of LLMs on 4 critical dimensions within the realm of rare diseases. Meanwhile, we have compiled the largest open-source dataset on rare disease patients, establishing a benchmark for future studies in this domain. To facilitate differential diagnosis of rare diseases, we develop a dynamic few-shot prompt methodology, leveraging a comprehensive rare disease knowledge graph synthesized from multiple knowledge bases, significantly enhancing LLMs' diagnostic performance. Moreover, we present an exhaustive comparative study of GPT-4's diagnostic capabilities against those of specialist physicians. Our experimental findings underscore the promising potential of integrating LLMs into the clinical diagnostic process for rare diseases. This paves the way for exciting possibilities in future advancements in this field.

Related papers

An Agentic System for Rare Disease Diagnosis with Traceable Reasoning [58.78045864541539]
We introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM). DeepRare generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning. The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1013 diseases.
arXiv Detail & Related papers (2025-06-25T13:42:26Z)
Rare Disease Differential Diagnosis with Large Language Models at Scale: From Abdominal Actinomycosis to Wilson's Disease [8.81420331399616]
Large language models (LLMs) have demonstrated impressive capabilities in disease diagnosis. Rare disease performance is critical with the increasing use of LLMs in healthcare settings. We propose RareScale to combine the knowledge of LLMs with expert systems.
arXiv Detail & Related papers (2025-02-20T22:02:52Z)
RareAgents: Advancing Rare Disease Care through LLM-Empowered Multi-disciplinary Team [13.330661181655493]
Rare diseases collectively impact around 300 million people worldwide due to the vast number of diseases. Recent agents powered by large language models (LLMs) have demonstrated notable applications across various domains. RareAgents integrates advanced Multidisciplinary Team (MDT) coordination, memory mechanisms, and medical tools utilization, leveraging Llama-3.1-8B/70B as the base model.
arXiv Detail & Related papers (2024-12-17T02:22:24Z)
MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.765466293296186]
Recent progress in Medical Large Vision-Language Models (Med-LVLMs) has opened up new possibilities for interactive diagnostic tools. Med-LVLMs often suffer from factual hallucination, which can lead to incorrect diagnoses. We propose a versatile multimodal RAG system, MMed-RAG, designed to enhance the factuality of Med-LVLMs.
arXiv Detail & Related papers (2024-10-16T23:03:27Z)
Prompting Large Language Models for Supporting the Differential Diagnosis of Anemia [0.8602553195689511]
In practice, clinicians achieve a diagnosis by following a sequence of steps, such as laboratory exams, observations, or imaging. The pathways to reach diagnosis decisions are documented by guidelines authored by expert organizations, which guide clinicians to reach a correct diagnosis through these sequences of steps. Our study aimed to develop pathways similar to those that can be obtained in clinical guidelines. We tested three Large Language Models (LLMs), namely Generative Pretrained Transformer 4 (GPT-4), Large Language Model Meta AI (LLaMA), and Mistral, on a synthetic yet realistic dataset to differentially diagnose anemia and its subtypes.
arXiv Detail & Related papers (2024-09-20T06:47:36Z)
Assessing and Enhancing Large Language Models in Rare Disease Question-answering [64.32570472692187]
We introduce a rare disease question-answering (ReDis-QA) dataset to evaluate the performance of Large Language Models (LLMs) in diagnosing rare diseases. We collected 1360 high-quality question-answer pairs within the ReDis-QA dataset, covering 205 rare diseases. We then benchmarked several open-source LLMs, revealing that diagnosing rare diseases remains a significant challenge for these models. Experiment results demonstrate that ReCOP can effectively improve the accuracy of LLMs on the ReDis-QA dataset by an average of 8%.
arXiv Detail & Related papers (2024-08-15T21:09:09Z)
GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals. GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date. It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv Detail & Related papers (2024-08-06T17:59:21Z)
Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision. This study evaluated the performance of the Gemini, GPT-4, and 4 popular large models for an exhaustive evaluation across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z)
Digital Diagnostics: The Potential Of Large Language Models In Recognizing Symptoms Of Common Illnesses [0.2995925627097048]
This study evaluates each model's diagnostic abilities by interpreting a user's symptoms and determining diagnoses that fit well with common illnesses. GPT-4 demonstrates higher diagnostic accuracy from its deep and complete history of training on medical data. Gemini performs with high precision as a critical tool in disease triage, demonstrating its potential to be a reliable model.
arXiv Detail & Related papers (2024-05-09T15:12:24Z)
AutoRD: An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontologies-enhanced Large Language Models [25.966454809890227]
Rare diseases affect millions worldwide but often face limited research focus due to their low prevalence. Recent advancements in Large Language Models (LLMs) have shown promise in automating the extraction of medical information. We propose an end-to-end system called AutoRD, which automates the extraction of information from medical texts about rare diseases.
arXiv Detail & Related papers (2024-03-01T20:06:39Z)
Asclepius: A Spectrum Evaluation Benchmark for Medical Multi-Modal Large Language Models [59.60384461302662]
We introduce Asclepius, a novel benchmark for evaluating Medical Multi-Modal Large Language Models (Med-MLLMs). Asclepius rigorously and comprehensively assesses model capability in terms of distinct medical specialties and different diagnostic capacities. We also provide an in-depth analysis of 6 Med-MLLMs and compare them with 5 human specialists.
arXiv Detail & Related papers (2024-02-17T08:04:23Z)
Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis [59.35504779947686]
GPT-4V is OpenAI's newest model for multimodal medical diagnosis. Our evaluation encompasses 17 human body systems. GPT-4V demonstrates proficiency in distinguishing between medical image modalities and anatomy. It faces significant challenges in disease diagnosis and generating comprehensive reports.
arXiv Detail & Related papers (2023-10-15T18:32:27Z)
Exploring deep learning methods for recognizing rare diseases and their clinical manifestations from texts [1.6328866317851187]
Approximately 300 million people are affected by a rare disease. The early and accurate diagnosis of these conditions is a major challenge for general practitioners, who do not have enough knowledge to identify them. Natural Language Processing (NLP) and Deep Learning can help to extract relevant information to facilitate their diagnosis and treatments.
arXiv Detail & Related papers (2021-09-01T12:35:26Z)

This list is automatically generated from the titles and abstracts of the papers on this site.