Related papers: Few shot chain-of-thought driven reasoning to prompt LLMs for open ended medical question answering

Related papers

Beyond MedQA: Towards Real-world Clinical Decision Making in the Era of LLMs [37.6690828097719]
Large language models (LLMs) show promise for clinical use.<n>Many medical datasets rely on simplified Question-Answering (QA) that underrepresents real-world clinical decision-making.<n>We propose a unifying paradigm that characterizes clinical decision-making tasks along two dimensions: Clinical Backgrounds and Clinical Questions.
arXiv Detail & Related papers (2025-10-22T20:06:10Z)
Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs)<n>Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a (oral) examination in medical training.<n>Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z)
GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning [60.03671205298294]
Medical visual question answering aims to support clinical decision-making by enabling models to answer natural language questions based on medical images.<n>Current methods still suffer from limited answer reliability and poor interpretability.<n>This work first proposes a Region-Aware Multimodal Chain-of-Thought dataset, in which the process of producing an answer is preceded by a sequence of intermediate reasoning steps.
arXiv Detail & Related papers (2025-06-22T08:09:58Z)
Enhancing Step-by-Step and Verifiable Medical Reasoning in MLLMs [23.50838763761289]
We propose Mentor-Intern Collaborative Search (MICS) to generate rigorous and effective medical chain-of-thought data.<n>The reasoning performance is determined by an MICS-Score, which assesses the quality of generated reasoning paths.<n>Eventually, we construct MMRP, a multi-task medical reasoning dataset with ranked difficulty, and Chiron-o1, a new medical MLLM devised via a curriculum learning strategy.
arXiv Detail & Related papers (2025-06-20T12:51:19Z)
Language Agents for Hypothesis-driven Clinical Decision Making with Reinforcement Learning [38.49879425944787]
We propose to model clinical decision-making for diagnosis with a hypothesis-driven uncertainty-aware language agent, LA-CDM.<n>We train LA-CDM with three objectives targeting critical aspects of clinical decision-making: accurate hypothesis generation, hypothesis uncertainty estimation, and efficient decision-making.<n>We evaluate our methodology on MIMIC-CDM, a real-world dataset covering four abdominal diseases.
arXiv Detail & Related papers (2025-06-16T13:32:01Z)
MAGI: Multi-Agent Guided Interview for Psychiatric Assessment [50.6150986786028]
We present MAGI, the first framework that transforms the gold-standard Mini International Neuropsychiatric Interview (MINI) into automatic computational navigation. We show that MAGI advances LLM- assisted mental health assessment by combining clinical rigor, conversational adaptability, and explainable reasoning.
arXiv Detail & Related papers (2025-04-25T11:08:27Z)
Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions. We propose a novel approach utilizing structured medical reasoning. Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z)
ClinKD: Cross-Modal Clinical Knowledge Distiller For Multi-Task Medical Images [4.353855760968461]
Cross-Modal Clinical Knowledge Distiller (ClinKD) designed to enhance image-text alignment and establish more effective medical knowledge adaptation mechanisms. ClinKD achieves state-of-the-art performance on the Med-GRIT-270k dataset, a challenging medical benchmark containing fine-grained multi-task QA pairs.
arXiv Detail & Related papers (2025-02-09T15:08:10Z)
MedCoT: Medical Chain of Thought via Hierarchical Expert [48.91966620985221]
This paper presents MedCoT, a novel hierarchical expert verification reasoning chain method. It is designed to enhance interpretability and accuracy in biomedical imaging inquiries. Experimental evaluations on four standard Med-VQA datasets demonstrate that MedCoT surpasses existing state-of-the-art approaches.
arXiv Detail & Related papers (2024-12-18T11:14:02Z)
Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking [58.25862290294702]
We present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow. We also propose MedChain-Agent, an AI system that integrates a feedback mechanism and a MCase-RAG module to learn from previous cases and adapt its responses.
arXiv Detail & Related papers (2024-12-02T15:25:02Z)
Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs) We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets. Our experimental results reveals current models' limited ability to handle noise and misinformation in the retrieved documents.
arXiv Detail & Related papers (2024-11-14T06:19:18Z)
Large Language Models in the Clinic: A Comprehensive Benchmark [63.21278434331952]
We build a benchmark ClinicBench to better understand large language models (LLMs) in the clinic. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. We then construct six novel datasets and clinical tasks that are complex but common in real-world practice. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings.
arXiv Detail & Related papers (2024-04-25T15:51:06Z)
IryoNLP at MEDIQA-CORR 2024: Tackling the Medical Error Detection & Correction Task On the Shoulders of Medical Agents [0.0]
This paper presents MedReAct'N'MedReFlex, which leverages a suite of four medical agents to detect and correct errors in clinical notes. One core component of our method is our RAG pipeline based on our ClinicalCorp corpora. Our results demonstrate the central role of our RAG approach with ClinicalCorp leveraged through the MedReAct'N'MedReFlex framework.
arXiv Detail & Related papers (2024-04-23T20:00:37Z)
ArgMed-Agents: Explainable Clinical Decision Reasoning with LLM Disscusion via Argumentation Schemes [7.950883198425716]
ArgMed-Agents is a framework to enable large language models (LLMs) to make explainable clinical decision reasoning through interaction. We construct a formal model of ArgMed-Agents and present conjectures for theoretical guarantees. setup experiments show that ArgMed-Agents not only improves accuracy in complex clinical decision reasoning problems compared to other prompt methods, but more importantly, it provides users with decision explanations that increase their confidence.
arXiv Detail & Related papers (2024-03-10T19:47:00Z)
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions [19.436999992810797]
We construct two new datasets: JAMA Clinical Challenge and Medbullets. JAMA Clinical Challenge consists of questions based on challenging clinical cases, while Medbullets comprises simulated clinical questions. We evaluate seven LLMs on the two datasets using various prompts.
arXiv Detail & Related papers (2024-02-28T05:44:41Z)
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
Large Language Models are Clinical Reasoners: Reasoning-Aware Diagnosis Framework with Prompt-Generated Rationales [15.362903610463285]
We present a "reasoning-aware" diagnosis framework that rationalizes the diagnostic process via prompt-based learning. We propose a novel set of criteria for evaluating machine-generated rationales' potential for real-world clinical settings.
arXiv Detail & Related papers (2023-12-12T16:14:45Z)
CLIP in Medical Imaging: A Comprehensive Survey [59.429714742927956]
Contrastive Language-Image Pre-training successfully introduces text supervision to vision models. It has shown promising results across various tasks, attributable to its generalizability and interpretability. Use of CLIP has recently gained increasing interest in the medical imaging domain.
arXiv Detail & Related papers (2023-12-12T15:21:57Z)
Diagnostic Reasoning Prompts Reveal the Potential for Large Language Model Interpretability in Medicine [4.773117448586697]
We develop novel diagnostic reasoning prompts to study whether large language models (LLMs) can perform clinical reasoning to accurately form a diagnosis. We find GPT4 can be prompted to mimic the common clinical reasoning processes of clinicians without sacrificing diagnostic accuracy.
arXiv Detail & Related papers (2023-08-13T19:04:07Z)
VBridge: Connecting the Dots Between Features, Explanations, and Data for Healthcare Models [85.4333256782337]
VBridge is a visual analytics tool that seamlessly incorporates machine learning explanations into clinicians' decision-making workflow. We identified three key challenges, including clinicians' unfamiliarity with ML features, lack of contextual information, and the need for cohort-level evidence. We demonstrated the effectiveness of VBridge through two case studies and expert interviews with four clinicians.
arXiv Detail & Related papers (2021-08-04T17:34:13Z)
Inheritance-guided Hierarchical Assignment for Clinical Automatic Diagnosis [50.15205065710629]
Clinical diagnosis, which aims to assign diagnosis codes for a patient based on the clinical note, plays an essential role in clinical decision-making. We propose a novel framework to combine the inheritance-guided hierarchical assignment and co-occurrence graph propagation for clinical automatic diagnosis.
arXiv Detail & Related papers (2021-01-27T13:16:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.