DocCHA: Towards LLM-Augmented Interactive Online diagnosis System
- URL: http://arxiv.org/abs/2507.07870v1
- Date: Thu, 10 Jul 2025 15:52:04 GMT
- Title: DocCHA: Towards LLM-Augmented Interactive Online diagnosis System
- Authors: Xinyi Liu, Dachun Sun, Yi R. Fung, Dilek Hakkani-Tür, Tarek Abdelzaher,
- Abstract summary: DocCHA is a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages.<n> evaluated on two real-world Chinese consultation datasets.
- Score: 17.975659876934895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the impressive capabilities of Large Language Models (LLMs), existing Conversational Health Agents (CHAs) remain static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. This hinders their real-world applicability in clinical diagnosis, where iterative and structured dialogue is essential. We propose DocCHA, a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages: (1) symptom elicitation, (2) history acquisition, and (3) causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall, with only modest increase in dialogue turns. These results demonstrate the effectiveness of DocCHA in enabling structured, transparent, and efficient diagnostic conversations -- paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings.
Related papers
- M3CoTBench: Benchmark Chain-of-Thought of MLLMs in Medical Image Understanding [66.78251988482222]
Chain-of-Thought (CoT) reasoning has proven effective in enhancing large language models by encouraging step-by-step intermediate reasoning.<n>Current benchmarks for medical image understanding generally focus on the final answer while ignoring the reasoning path.<n>M3CoTBench aims to foster the development of transparent, trustworthy, and diagnostically accurate AI systems for healthcare.
arXiv Detail & Related papers (2026-01-13T17:42:27Z) - Benchmarking Egocentric Clinical Intent Understanding Capability for Medical Multimodal Large Language Models [48.95516224614331]
We introduce MedGaze-Bench, the first benchmark leveraging clinician gaze as a Cognitive Cursor to assess intent understanding across surgery, emergency simulation, and diagnostic interpretation.<n>Our benchmark addresses three fundamental challenges: visual homogeneity of anatomical structures, strict temporal-causal dependencies in clinical, and implicit adherence to safety protocols.
arXiv Detail & Related papers (2026-01-11T02:20:40Z) - MedKGI: Iterative Differential Diagnosis with Medical Knowledge Graphs and Information-Guided Inquiring [39.776843923694244]
We propose MedKGI, a diagnostic framework grounded in clinical practices.<n>We show that MedKGI outperforms strong LLM baselines in both diagnostic accuracy and inquiry efficiency.
arXiv Detail & Related papers (2025-12-30T12:31:53Z) - ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning [58.01333341218153]
We propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues.<n>Our method generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent.<n>Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs.
arXiv Detail & Related papers (2025-12-29T12:58:58Z) - Timely Clinical Diagnosis through Active Test Selection [49.091903570068155]
We propose ACTMED (Adaptive Clinical Test selection via Model-based Experimental Design) to better emulate real-world diagnostic reasoning.<n>LLMs act as flexible simulators, generating plausible patient state distributions and supporting belief updates without requiring structured, task-specific training data.<n>We evaluate ACTMED on real-world datasets and show it can optimize test selection to improve diagnostic accuracy, interpretability, and resource use.
arXiv Detail & Related papers (2025-10-21T18:10:45Z) - Trustworthy AI Psychotherapy: Multi-Agent LLM Workflow for Counseling and Explainable Mental Disorder Diagnosis [11.025486717604972]
DSM5AgentFlow is the first LLM-based agent workflow designed to autonomously generate DSM-5 Level-1 diagnostic questionnaires.<n>By simulating therapist-client dialogues with specific client profiles, the framework delivers transparent, step-by-step disorder predictions.<n>This workflow serves as a complementary tool for mental health diagnosis, ensuring adherence to ethical and legal standards.
arXiv Detail & Related papers (2025-08-15T11:08:32Z) - Uncertainty-Driven Expert Control: Enhancing the Reliability of Medical Vision-Language Models [52.2001050216955]
Existing methods aim to enhance the performance of Medical Vision Language Model (MedVLM) by adjusting model structure, fine-tuning with high-quality data, or through preference fine-tuning.<n>We propose an expert-in-the-loop framework named Expert-Controlled-Free Guidance (Expert-CFG) to align MedVLM with clinical expertise without additional training.
arXiv Detail & Related papers (2025-07-12T09:03:30Z) - Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making [80.94208848596215]
We present a new concept called Catfish Agent, a role-specialized LLM designed to inject structured dissent and counter silent agreement.<n>Inspired by the catfish effect'' in organizational psychology, the Catfish Agent is designed to challenge emerging consensus to stimulate deeper reasoning.
arXiv Detail & Related papers (2025-05-27T17:59:50Z) - 3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark [0.29987253996125257]
3MDBench is an open-source framework for simulating and evaluating LVLM-driven telemedical consultations.<n> multimodal dialogue with internal reasoning improves F1 score by 6.5% over non-dialogue settings.<n> injecting predictions from a diagnostic convolutional network into the LVLM's context boosts F1 by up to 20%.
arXiv Detail & Related papers (2025-03-26T07:32:05Z) - Quantifying the Reasoning Abilities of LLMs on Real-world Clinical Cases [48.87360916431396]
We introduce MedR-Bench, a benchmarking dataset of 1,453 structured patient cases, annotated with reasoning references.<n>We propose a framework encompassing three critical examination recommendation, diagnostic decision-making, and treatment planning, simulating the entire patient care journey.<n>Using this benchmark, we evaluate five state-of-the-art reasoning LLMs, including DeepSeek-R1, OpenAI-o3-mini, and Gemini-2.0-Flash Thinking, etc.
arXiv Detail & Related papers (2025-03-06T18:35:39Z) - Structured Outputs Enable General-Purpose LLMs to be Medical Experts [50.02627258858336]
Large language models (LLMs) often struggle with open-ended medical questions.<n>We propose a novel approach utilizing structured medical reasoning.<n>Our approach achieves the highest Factuality Score of 85.8, surpassing fine-tuned models.
arXiv Detail & Related papers (2025-03-05T05:24:55Z) - The Reliability of LLMs for Medical Diagnosis: An Examination of Consistency, Manipulation, and Contextual Awareness [0.0]
Large Language Models (LLMs) offer promise for democratizing healthcare with advanced diagnostics.<n>This study assesses their diagnostic reliability focusing on consistency, manipulation resilience, and contextual integration.<n>LLMs' vulnerability to manipulation and limited contextual awareness pose challenges in clinical use.
arXiv Detail & Related papers (2025-03-02T11:50:16Z) - Ask Patients with Patience: Enabling LLMs for Human-Centric Medical Dialogue with Grounded Reasoning [25.068780967617485]
Large language models (LLMs) offer a potential solution but struggle in real-world clinical interactions.<n>We propose Ask Patients with Patience (APP), a multi-turn LLM-based medical assistant designed for grounded reasoning, transparent diagnoses, and human-centric interaction.<n>APP enhances communication by eliciting user symptoms through empathetic dialogue, significantly improving accessibility and user engagement.
arXiv Detail & Related papers (2025-02-11T00:13:52Z) - SemioLLM: Evaluating Large Language Models for Diagnostic Reasoning from Unstructured Clinical Narratives in Epilepsy [45.2233252981348]
Large Language Models (LLMs) have been shown to encode clinical knowledge.<n>We present SemioLLM, an evaluation framework that benchmarks 6 state-of-the-art models.<n>We show that most LLMs are able to accurately and confidently generate probabilistic predictions of seizure onset zones in the brain.
arXiv Detail & Related papers (2024-07-03T11:02:12Z) - MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning [36.400896909161006]
We develop systems that proactively ask questions to gather more information and respond reliably.
We introduce a benchmark - MediQ - to evaluate question-asking ability in LLMs.
arXiv Detail & Related papers (2024-06-03T01:32:52Z) - Beyond Self-Consistency: Ensemble Reasoning Boosts Consistency and Accuracy of LLMs in Cancer Staging [0.33554367023486936]
Cancer staging status is available in clinical reports, but it requires natural language processing to extract it.
With the advance in clinical-oriented large language models, it is promising to extract such status without extensive efforts in training the algorithms.
In this study, we propose an ensemble reasoning approach with the aim of improving the consistency of the model generations.
arXiv Detail & Related papers (2024-04-19T19:34:35Z) - KNSE: A Knowledge-aware Natural Language Inference Framework for
Dialogue Symptom Status Recognition [69.78432481474572]
We propose a novel framework called KNSE for symptom status recognition (SSR)
For each mentioned symptom in a dialogue window, we first generate knowledge about the symptom and hypothesis about status of the symptom, to form a (premise, knowledge, hypothesis) triplet.
The BERT model is then used to encode the triplet, which is further processed by modules including utterance aggregation, self-attention, cross-attention, and GRU to predict the symptom status.
arXiv Detail & Related papers (2023-05-26T11:23:26Z) - Clinical Camel: An Open Expert-Level Medical Language Model with
Dialogue-Based Knowledge Encoding [31.884600238089405]
We present Clinical Camel, an open large language model (LLM) explicitly tailored for clinical research.
Fine-tuned from LLaMA-2 using QLoRA, Clinical Camel achieves state-of-the-art performance across medical benchmarks among openly available medical LLMs.
arXiv Detail & Related papers (2023-05-19T23:07:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.