ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model
- URL: http://arxiv.org/abs/2504.09421v2
- Date: Tue, 15 Apr 2025 07:52:40 GMT
- Title: ClinicalGPT-R1: Pushing reasoning capability of generalist disease diagnosis with large language model
- Authors: Wuyang Lan, Wenzheng Wang, Changwei Ji, Guoxing Yang, Yongbo Zhang, Xiaohong Liu, Song Wu, Guangyu Wang
- Abstract summary: We introduce ClinicalGPT-R1, a reasoning-enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 leverages diverse training strategies to enhance diagnostic reasoning.
- Score: 7.058358371583673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in reasoning with large language models (LLMs) have shown remarkable capabilities in domains such as mathematics and coding, yet their application to clinical diagnosis remains underexplored. Here, we introduce ClinicalGPT-R1, a reasoning-enhanced generalist large language model for disease diagnosis. Trained on a dataset of 20,000 real-world clinical records, ClinicalGPT-R1 leverages diverse training strategies to enhance diagnostic reasoning. To benchmark performance, we curated MedBench-Hard, a challenging dataset spanning seven major medical specialties and representative diseases. Experimental results demonstrate that ClinicalGPT-R1 outperforms GPT-4o in Chinese diagnostic tasks and achieves performance comparable to GPT-4 in English settings. This comparative study validates the superior performance of ClinicalGPT-R1 on disease diagnosis tasks. Resources are available at https://github.com/medfound/medfound.
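The abstract does not detail the training pipeline, so the following is only a minimal sketch of one plausible ingredient: supervised fine-tuning of a causal LM on clinical records paired with reasoning traces and final diagnoses. The record schema, file name, and base checkpoint are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: supervised fine-tuning a causal LM on clinical records
# paired with reasoning traces. The record schema, file name, and base
# checkpoint are assumptions for illustration only.
import json

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2-0.5B"  # stand-in base model, not the paper's

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


def format_record(rec: dict) -> str:
    # Hypothetical schema: presentation -> chain-of-thought -> diagnosis.
    return (
        f"Patient presentation: {rec['presentation']}\n"
        f"Reasoning: {rec['reasoning']}\n"
        f"Diagnosis: {rec['diagnosis']}{tokenizer.eos_token}"
    )


with open("clinical_records.jsonl") as f:  # assumed local training file
    texts = [format_record(json.loads(line)) for line in f]

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model.train()
for text in texts:  # toy loop; real training would batch, pad, and shuffle
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=1024)
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

A real pipeline would add batching, multiple epochs, and evaluation on held-out cases; any reinforcement-learning or other "diverse training strategies" the paper mentions would come on top of this SFT step.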
Related papers
- ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports.
Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z)
- MedRAG: Enhancing Retrieval-augmented Generation with Knowledge Graph-Elicited Reasoning for Healthcare Copilot [47.77948063906033]
Retrieval-augmented generation (RAG) is a well-suited technique for retrieving privacy-sensitive Electronic Health Records.
This paper proposes MedRAG, a RAG model enhanced by knowledge graph (KG)-elicited reasoning for the medical domain.
Tests show MedRAG provides more specific diagnostic insights and outperforms state-of-the-art models in reducing misdiagnosis rates.
arXiv Detail & Related papers (2025-02-06T12:27:35Z)
- Superhuman performance of a large language model on the reasoning tasks of a physician [10.043418251604624]
Performance of large language models (LLMs) on medical tasks has traditionally been evaluated using multiple-choice question benchmarks. We evaluate OpenAI's o1-preview model, which is designed to spend more run-time on chain-of-thought reasoning before generating a response.
arXiv Detail & Related papers (2024-12-14T14:46:18Z)
- Evaluating the Impact of Lab Test Results on Large Language Models Generated Differential Diagnoses from Clinical Case Vignettes [20.651573628726148]
This study assesses the impact of lab test results on differential diagnoses (DDx) made by large language models (LLMs).
The LLMs GPT-4, GPT-3.5, Llama-2-70b, Claude-2, and Mixtral-8x7B were tested on generating Top 10, Top 5, and Top 1 DDx with and without lab data.
GPT-4 performed best, achieving 55% accuracy for Top 1 diagnoses and 60% for Top 10 with lab data, with lenient accuracy up to 80%.
Lab tests, including liver function, metabolic/toxicology panels, and serology/immune tests, were generally interpreted correctly by the LLMs.
arXiv Detail & Related papers (2024-11-01T02:48:32Z)
- Fine-Tuning In-House Large Language Models to Infer Differential Diagnosis from Radiology Reports [1.5972172622800358]
This study introduces a pipeline for developing in-house LLMs tailored to identify differential diagnoses from radiology reports.
Evaluated on a set of 1,067 reports annotated by clinicians, the proposed model achieves an average F1 score of 92.1%, on par with GPT-4.
arXiv Detail & Related papers (2024-10-11T20:16:25Z)
- Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study evaluated the performance of Gemini, GPT-4, and four other popular large models across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z)
- Can GPT-4V(ision) Serve Medical Applications? Case Studies on GPT-4V for Multimodal Medical Diagnosis [59.35504779947686]
GPT-4V is OpenAI's newest multimodal model; this work evaluates its use for medical diagnosis.
Our evaluation encompasses 17 human body systems.
GPT-4V demonstrates proficiency in distinguishing between medical image modalities and anatomy.
It faces significant challenges in diagnosing disease and generating comprehensive reports.
arXiv Detail & Related papers (2023-10-15T18:32:27Z)
- ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, including ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z)
- The Potential and Pitfalls of using a Large Language Model such as ChatGPT or GPT-4 as a Clinical Assistant [12.017491902296836]
ChatGPT and GPT-4 have demonstrated promising performance on several medical domain tasks.
We performed two analyses using ChatGPT and GPT-4, one of which identified patients with specific medical diagnoses in a large real-world electronic health record database.
For patient assessment, GPT-4 produced an accurate diagnosis three out of four times.
arXiv Detail & Related papers (2023-07-16T21:19:47Z)
- ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation [5.690250818139763]
Large language models have exhibited exceptional performance on various Natural Language Processing (NLP) tasks.
Despite these advances, their effectiveness in medical applications is limited by challenges such as factual inaccuracies, weak reasoning, and a lack of grounding in real-world experience.
We present ClinicalGPT, a language model explicitly designed and optimized for clinical scenarios.
arXiv Detail & Related papers (2023-06-16T16:56:32Z)
- Predicting Clinical Diagnosis from Patients Electronic Health Records Using BERT-based Neural Networks [62.9447303059342]
We show the importance of this problem to the medical community.
We present a modification of the Bidirectional Encoder Representations from Transformers (BERT) model for sequence classification (a minimal illustrative sketch follows this entry).
We use a large-scale Russian EHR dataset consisting of about 4 million unique patient visits.
arXiv Detail & Related papers (2020-07-15T09:22:55Z)
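As a companion to the last entry above, here is a minimal sketch of BERT-based diagnosis prediction from EHR text. The checkpoint, label set, and input note are illustrative assumptions; the paper itself fine-tunes on a Russian EHR corpus of roughly 4 million visits.

```python
# Hedged sketch: BERT-style sequence classification over an EHR note.
# The checkpoint, label set, and input text are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # stand-in checkpoint
DIAGNOSES = ["pneumonia", "hypertension", "diabetes"]  # toy label set

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(DIAGNOSES)
)  # the classification head is randomly initialized until fine-tuned

visit_note = "Patient reports persistent cough, fever, and chest pain."
inputs = tokenizer(visit_note, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(DIAGNOSES[logits.argmax(dim=-1).item()])
```

Until the head is fine-tuned on labeled visits, the predicted label is effectively random; the sketch only shows the inference-time data flow of this kind of classifier.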