Automated Clinical Problem Detection from SOAP Notes using a Collaborative Multi-Agent LLM Architecture
- URL: http://arxiv.org/abs/2508.21803v1
- Date: Fri, 29 Aug 2025 17:31:24 GMT
- Title: Automated Clinical Problem Detection from SOAP Notes using a Collaborative Multi-Agent LLM Architecture
- Authors: Yeawon Lee, Xiaoyang Wang, Christopher C. Yang,
- Abstract summary: We introduce a collaborative multi-agent system (MAS) that models a clinical consultation team to address this gap.<n>The system is tasked with identifying clinical problems by analyzing only the Subjective (S) and Objective (O) sections of SOAP notes.<n>A Manager agent orchestrates a dynamically assigned team of specialist agents who engage in a hierarchical, iterative debate to reach a consensus.
- Score: 8.072932739333309
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate interpretation of clinical narratives is critical for patient care, but the complexity of these notes makes automation challenging. While Large Language Models (LLMs) show promise, single-model approaches can lack the robustness required for high-stakes clinical tasks. We introduce a collaborative multi-agent system (MAS) that models a clinical consultation team to address this gap. The system is tasked with identifying clinical problems by analyzing only the Subjective (S) and Objective (O) sections of SOAP notes, simulating the diagnostic reasoning process of synthesizing raw data into an assessment. A Manager agent orchestrates a dynamically assigned team of specialist agents who engage in a hierarchical, iterative debate to reach a consensus. We evaluated our MAS against a single-agent baseline on a curated dataset of 420 MIMIC-III notes. The dynamic multi-agent configuration demonstrated consistently improved performance in identifying congestive heart failure, acute kidney injury, and sepsis. Qualitative analysis of the agent debates reveals that this structure effectively surfaces and weighs conflicting evidence, though it can occasionally be susceptible to groupthink. By modeling a clinical team's reasoning process, our system offers a promising path toward more accurate, robust, and interpretable clinical decision support tools.
Related papers
- A Multi-Agent Framework for Interpreting Multivariate Physiological Time Series [9.72130666902599]
We present Vivaldi, a role-structured multi-agent system that explains multivariate physiological time series.<n>Our experiments show that agentic pipelines substantially benefit non-thinking and medically fine-tuned models.<n>We find that explicit tool-based computation is decisive for codifiable clinical metrics, whereas subjective targets, such as pain scores and length of stay, show limited or inconsistent changes.
arXiv Detail & Related papers (2026-03-04T14:55:46Z) - MedCollab: Causal-Driven Multi-Agent Collaboration for Full-Cycle Clinical Diagnosis via IBIS-Structured Argumentation [6.334763475104128]
We present MedCollab, a novel multi-agent framework that emulates the hierarchical consultation workflow of modern hospitals.<n>The framework incorporates a dynamic specialist recruitment mechanism that adaptively assembles clinical and examination agents according to patient-specific symptoms and examination results.
arXiv Detail & Related papers (2026-03-01T14:25:58Z) - AgentsEval: Clinically Faithful Evaluation of Medical Imaging Reports via Multi-Agent Reasoning [73.50200033931148]
We introduce AgentsEval, a multi-agent stream reasoning framework that emulates the collaborative diagnostic workflow of radiologists.<n>By dividing the evaluation process into interpretable steps including criteria definition, evidence extraction, alignment, and consistency scoring, AgentsEval provides explicit reasoning traces and structured clinical feedback.<n> Experimental results demonstrate that AgentsEval delivers clinically aligned, semantically faithful, and interpretable evaluations that remain robust under paraphrastic, semantic, and stylistic perturbations.
arXiv Detail & Related papers (2026-01-23T11:59:13Z) - MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models [15.91764739198419]
We present MedDialogRubrics, a novel benchmark comprising 5,200 synthetically constructed patient cases and over 60,000 fine-grained evaluation rubrics.<n>Our framework employs a multi-agent system to synthesize realistic patient records and chief complaints without accessing real-world electronic health records.
arXiv Detail & Related papers (2026-01-06T13:56:33Z) - ClinDEF: A Dynamic Evaluation Framework for Large Language Models in Clinical Reasoning [58.01333341218153]
We propose ClinDEF, a dynamic framework for assessing clinical reasoning in LLMs through simulated diagnostic dialogues.<n>Our method generates patient cases and facilitates multi-turn interactions between an LLM-based doctor and an automated patient agent.<n>Experiments show that ClinDEF effectively exposes critical clinical reasoning gaps in state-of-the-art LLMs.
arXiv Detail & Related papers (2025-12-29T12:58:58Z) - CP-Env: Evaluating Large Language Models on Clinical Pathways in a Controllable Hospital Environment [29.48544328813161]
We introduce CP-Env, a controllable agentic hospital environment designed to evaluate large language models (LLMs) across end-to-end clinical pathways.<n>Following real hospital adaptive flow of healthcare, it enables branching, long-horizon task execution.<n>Results reveal that most models struggle with pathway hallucinations, exhibiting complexity and losing critical diagnostic details.
arXiv Detail & Related papers (2025-12-11T01:54:55Z) - Simulating Viva Voce Examinations to Evaluate Clinical Reasoning in Large Language Models [51.91760712805404]
We introduce VivaBench, a benchmark for evaluating sequential clinical reasoning in large language models (LLMs)<n>Our dataset consists of 1762 physician-curated clinical vignettes structured as interactive scenarios that simulate a (oral) examination in medical training.<n>Our analysis identified several failure modes that mirror common cognitive errors in clinical practice.
arXiv Detail & Related papers (2025-10-11T16:24:35Z) - Auto-TA: Towards Scalable Automated Thematic Analysis (TA) via Multi-Agent Large Language Models with Reinforcement Learning [3.3212706551453155]
Congenital heart disease (CHD) presents complex, lifelong challenges underrepresented in traditional clinical metrics.<n>We propose a fully automated large language model (LLM) pipeline that performs end-to-end thematic analysis on clinical narratives.
arXiv Detail & Related papers (2025-06-30T16:02:28Z) - Silence is Not Consensus: Disrupting Agreement Bias in Multi-Agent LLMs via Catfish Agent for Clinical Decision Making [80.94208848596215]
We present a new concept called Catfish Agent, a role-specialized LLM designed to inject structured dissent and counter silent agreement.<n>Inspired by the catfish effect'' in organizational psychology, the Catfish Agent is designed to challenge emerging consensus to stimulate deeper reasoning.
arXiv Detail & Related papers (2025-05-27T17:59:50Z) - A Multimodal Multi-Agent Framework for Radiology Report Generation [2.1477122604204433]
Radiology report generation (RRG) aims to automatically produce diagnostic reports from medical images.<n>We propose a multimodal multi-agent framework for RRG that aligns with the stepwise clinical reasoning workflow.
arXiv Detail & Related papers (2025-05-14T20:28:04Z) - TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews [54.35097932763878]
Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data.<n>Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-Agent LLMs for clinical interviews.<n>We demonstrate that TAMA outperforms existing LLM-assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness.
arXiv Detail & Related papers (2025-03-26T15:58:16Z) - Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking [58.25862290294702]
We present MedChain, a dataset of 12,163 clinical cases that covers five key stages of clinical workflow.<n>We also propose MedChain-Agent, an AI system that integrates a feedback mechanism and a MCase-RAG module to learn from previous cases and adapt its responses.
arXiv Detail & Related papers (2024-12-02T15:25:02Z) - AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments [2.567146936147657]
We introduce AgentClinic, a multimodal agent benchmark for evaluating large language models (LLM) in simulated clinical environments.<n>We find that solving MedQA problems in the sequential decision-making format of AgentClinic is considerably more challenging, resulting in diagnostic accuracies that can drop to below a tenth of the original accuracy.
arXiv Detail & Related papers (2024-05-13T17:38:53Z) - AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs.
This setup allows for realistic assessments of LLMs in clinical scenarios.
We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.