MAP: Evaluation and Multi-Agent Enhancement of Large Language Models for Inpatient Pathways
 - URL: http://arxiv.org/abs/2503.13205v1
 - Date: Mon, 17 Mar 2025 14:14:28 GMT
 - Title: MAP: Evaluation and Multi-Agent Enhancement of Large Language Models for Inpatient Pathways
 - Authors: Zhen Chen, Zhihao Peng, Xusheng Liang, Cheng Wang, Peigan Liang, Linsheng Zeng, Minjie Ju, Yixuan Yuan
 - Abstract summary: Inpatient pathways demand complex clinical decision-making based on comprehensive patient information. We propose the Multi-Agent Inpatient Pathways (MAP) framework to accomplish inpatient pathways with three clinical agents. Extensive experiments showed our MAP improved the diagnosis accuracy by 25.10% compared to the state-of-the-art LLM HuatuoGPT2-13B.
 - Score: 26.013336927642765
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract:   Inpatient pathways demand complex clinical decision-making based on comprehensive patient information, posing critical challenges for clinicians. Despite advancements in large language models (LLMs) in medical applications, limited research focused on artificial intelligence (AI) inpatient pathways systems, due to the lack of large-scale inpatient datasets. Moreover, existing medical benchmarks typically concentrated on medical question-answering and examinations, ignoring the multifaceted nature of clinical decision-making in inpatient settings. To address these gaps, we first developed the Inpatient Pathway Decision Support (IPDS) benchmark from the MIMIC-IV database, encompassing 51,274 cases across nine triage departments and 17 major disease categories alongside 16 standardized treatment options. Then, we proposed the Multi-Agent Inpatient Pathways (MAP) framework to accomplish inpatient pathways with three clinical agents, including a triage agent managing the patient admission, a diagnosis agent serving as the primary decision maker at the department, and a treatment agent providing treatment plans. Additionally, our MAP framework includes a chief agent overseeing the inpatient pathways to guide and promote these three clinician agents. Extensive experiments showed our MAP improved the diagnosis accuracy by 25.10% compared to the state-of-the-art LLM HuatuoGPT2-13B. It is worth noting that our MAP demonstrated significant clinical compliance, outperforming three board-certified clinicians by 10%-12%, establishing a foundation for inpatient pathways systems. 
 
       
      
        Related papers
        - Medical Reasoning in the Era of LLMs: A Systematic Review of Enhancement Techniques and Applications [59.721265428780946]
Large Language Models (LLMs) in medicine have enabled impressive capabilities, yet a critical gap remains in their ability to perform systematic, transparent, and verifiable reasoning. This paper provides the first systematic review of this emerging field. We propose a taxonomy of reasoning enhancement techniques, categorized into training-time strategies and test-time mechanisms.
arXiv (2025-08-01T14:41:31Z)
        - An Agentic System for Rare Disease Diagnosis with Traceable Reasoning [58.78045864541539]
We introduce DeepRare, the first rare disease diagnosis agentic system powered by a large language model (LLM). DeepRare generates ranked diagnostic hypotheses for rare diseases, each accompanied by a transparent chain of reasoning. The system demonstrates exceptional diagnostic performance among 2,919 diseases, achieving 100% accuracy for 1013 diseases.
arXiv (2025-06-25T13:42:26Z)
        - DiagnosisArena: Benchmarking Diagnostic Reasoning for Large Language Models [25.13622249539088]
DiagnosisArena is a benchmark designed to rigorously assess professional-level diagnostic competence. DiagnosisArena consists of 1,113 pairs of segmented patient cases and corresponding diagnoses, spanning 28 medical specialties. Our study reveals that even the most advanced reasoning models, o3, o1, and DeepSeek-R1, achieve only 51.12%, 31.09%, and 17.79% accuracy, respectively.
arXiv (2025-05-20T09:14:53Z)
        - ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports.
Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv (2025-04-29T16:48:23Z)
        - Towards Conversational AI for Disease Management [29.189384095061722]
Articulate Medical Intelligence Explorer (AMIE) is an agentic system optimised for clinical management and dialogue. AMIE is non-inferior to PCPs in management reasoning as assessed by specialist physicians. AMIE's strong performance across evaluations marks a significant step towards conversational AI as a tool in disease management.
arXiv (2025-03-08T05:48:58Z)
        - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references. We propose a framework encompassing three critical stages: examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey. Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking.
arXiv (2025-03-06T18:35:39Z)
        - Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking [58.25862290294702]
We present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow. We also propose MedChain-Agent, an AI system that integrates a feedback mechanism and an MCase-RAG module to learn from previous cases and adapt its responses.
arXiv (2024-12-02T15:25:02Z)
        - Towards Evaluating and Building Versatile Large Language Models for Medicine [57.49547766838095]
We present MedS-Bench, a benchmark designed to evaluate the performance of large language models (LLMs) in clinical contexts.
MedS-Bench spans 11 high-level clinical tasks, including clinical report summarization, treatment recommendations, diagnosis, named entity recognition, and medical concept explanation.
MedS-Ins comprises 58 medically oriented language corpora, totaling 13.5 million samples across 122 tasks.
arXiv (2024-08-22T17:01:34Z)
        - Development of a Large Language Model-based Multi-Agent Clinical Decision Support System for Korean Triage and Acuity Scale (KTAS)-Based Triage and Treatment Planning in Emergency Departments [0.0]
This study presents an LLM-driven CDSS to assist ED physicians and nurses in patient triage, treatment planning, and overall emergency care management.
The system comprises four AI agents emulating key ED roles: Triage Nurse, Emergency Physician, Pharmacist, and ED Coordinator.
It incorporates the Korean Triage and Acuity Scale (KTAS) for triage assessment and integrates with the RxNorm API for medication management.
arXiv (2024-08-14T13:03:41Z)
        - GMAI-MMBench: A Comprehensive Multimodal Evaluation Benchmark Towards General Medical AI [67.09501109871351]
Large Vision-Language Models (LVLMs) are capable of handling diverse data types such as imaging, text, and physiological signals.
GMAI-MMBench is the most comprehensive general medical AI benchmark with well-categorized data structure and multi-perceptual granularity to date.
It is constructed from 284 datasets across 38 medical image modalities, 18 clinical-related tasks, 18 departments, and 4 perceptual granularities in a Visual Question Answering (VQA) format.
arXiv (2024-08-06T17:59:21Z)
        - Beyond Direct Diagnosis: LLM-based Multi-Specialist Agent Consultation for Automatic Diagnosis [30.943705201552643]
We propose a framework to model the diagnosis process in the real world by adaptively fusing probability distributions of agents over potential diseases.
Our approach requires significantly less parameter updating and training time, enhancing efficiency and practical utility.
arXiv (2024-01-29T12:25:30Z)
        - Towards Accurate Differential Diagnosis with Large Language Models [37.48155380562073]
Interactive interfaces powered by Large Language Models (LLMs) present new opportunities to both assist and automate aspects of differential diagnosis.
20 clinicians evaluated 302 challenging, real-world medical cases sourced from the New England Journal of Medicine.
Our study suggests that our LLM has potential to improve clinicians' diagnostic reasoning and accuracy in challenging cases.
arXiv (2023-11-30T19:55:51Z)
        - RECAP-KG: Mining Knowledge Graphs from Raw GP Notes for Remote COVID-19 Assessment in Primary Care [45.43645878061283]
We present a framework that performs knowledge graph construction from raw GP medical notes written during or after patient consultations.
Our knowledge graphs include information about existing patient symptoms, their duration, and their severity.
We apply our framework to consultation notes of COVID-19 patients in the UK.
arXiv (2023-06-17T23:35:51Z)
        - Inheritance-guided Hierarchical Assignment for Clinical Automatic Diagnosis [50.15205065710629]
Clinical diagnosis, which aims to assign diagnosis codes for a patient based on the clinical note, plays an essential role in clinical decision-making.
We propose a novel framework to combine the inheritance-guided hierarchical assignment and co-occurrence graph propagation for clinical automatic diagnosis.
arXiv (2021-01-27T13:16:51Z)
        - SmartTriage: A system for personalized patient data capture, documentation generation, and decision support [9.09817311390571]
We developed a machine-learning-backed system, SmartTriage, which goes beyond conventional symptom checking through a tight bi-directional integration with the electronic medical record (EMR).
SmartTriage identifies the patient's chief complaint from a free-text entry and then asks a series of discrete questions to obtain relevant symptomatology.
The patient-specific data are used to predict detailed ICD-10-CM codes as well as medication, laboratory, and imaging orders.
arXiv (2020-10-19T22:45:27Z)
        This list is automatically generated from the titles and abstracts of the papers on this site.
       
     