Related papers: Lessons Learned from Evaluation of LLM based Multi-agents in Safer Therapy Recommendation

Lessons Learned from Evaluation of LLM based Multi-agents in Safer Therapy Recommendation

URL: http://arxiv.org/abs/2507.10911v1
Date: Tue, 15 Jul 2025 02:01:38 GMT
Title: Lessons Learned from Evaluation of LLM based Multi-agents in Safer Therapy Recommendation
Authors: Yicong Wu, Ting Chen, Irit Hochberg, Zhoujian Sun, Ruth Edry, Zhengxing Huang, Mor Peleg,
Abstract summary: This study investigated the feasibility and value of using a Large Language Model (LLM)-based multi-agent system for safer therapy recommendations.<n>We designed a single agent and a MAS framework simulating multidisciplinary team (MDT) decision-making.<n>We compared MAS performance with single-agent approaches and real-world benchmarks.
Score: 9.84660526673816
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Therapy recommendation for chronic patients with multimorbidity is challenging due to risks of treatment conflicts. Existing decision support systems face scalability limitations. Inspired by the way in which general practitioners (GP) manage multimorbidity patients, occasionally convening multidisciplinary team (MDT) collaboration, this study investigated the feasibility and value of using a Large Language Model (LLM)-based multi-agent system (MAS) for safer therapy recommendations. We designed a single agent and a MAS framework simulating MDT decision-making by enabling discussion among LLM agents to resolve medical conflicts. The systems were evaluated on therapy planning tasks for multimorbidity patients using benchmark cases. We compared MAS performance with single-agent approaches and real-world benchmarks. An important contribution of our study is the definition of evaluation metrics that go beyond the technical precision and recall and allow the inspection of clinical goals met and medication burden of the proposed advices to a gold standard benchmark. Our results show that with current LLMs, a single agent GP performs as well as MDTs. The best-scoring models provide correct recommendations that address all clinical goals, yet the advices are incomplete. Some models also present unnecessary medications, resulting in unnecessary conflicts between medication and conditions or drug-drug interactions.

Related papers

MMedAgent-RL: Optimizing Multi-Agent Collaboration for Multimodal Medical Reasoning [63.63542462400175]
We propose MMedAgent-RL, a reinforcement learning-based multi-agent framework that enables dynamic, optimized collaboration among medical agents.<n> Specifically, we train two GP agents based on Qwen2.5-VL via RL: the triage doctor learns to assign patients to appropriate specialties, while the attending physician integrates the judgments from multi-specialists.<n>Experiments on five medical VQA benchmarks demonstrate that MMedAgent-RL not only outperforms both open-source and proprietary Med-LVLMs, but also exhibits human-like reasoning patterns.
arXiv Detail & Related papers (2025-05-31T13:22:55Z)
DoctorAgent-RL: A Multi-Agent Collaborative Reinforcement Learning System for Multi-Turn Clinical Dialogue [5.0037050098387805]
Large language models (LLMs) have demonstrated excellent capabilities in the field of biomedical question answering, but their application in real-world clinical consultations still faces core challenges.<n>We propose DoctorAgent-RL, a reinforcement learning (RL)-based multi-agent collaborative framework that models medical consultations as a dynamic decision-making process under uncertainty.<n>Experiments demonstrate that DoctorAgent-RL outperforms existing models in both multi-turn reasoning capability and final diagnostic performance.
arXiv Detail & Related papers (2025-05-26T07:48:14Z)
Med-CoDE: Medical Critique based Disagreement Evaluation Framework [72.42301910238861]
The reliability and accuracy of large language models (LLMs) in medical contexts remain critical concerns.<n>Current evaluation methods often lack robustness and fail to provide a comprehensive assessment of LLM performance.<n>We propose Med-CoDE, a specifically designed evaluation framework for medical LLMs to address these challenges.
arXiv Detail & Related papers (2025-04-21T16:51:11Z)
TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews [54.35097932763878]
Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data.<n>Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-Agent LLMs for clinical interviews.<n>We demonstrate that TAMA outperforms existing LLM-assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness.
arXiv Detail & Related papers (2025-03-26T15:58:16Z)
Addressing Overprescribing Challenges: Fine-Tuning Large Language Models for Medication Recommendation Tasks [46.95099594570405]
Medication recommendation systems have garnered attention within healthcare for their potential to deliver personalized and efficacious drug combinations based on patient's clinical data.<n>Existing methodologies encounter challenges in adapting to diverse Electronic Health Records (EHR) systems.<n>We propose Language-Assisted Medication Recommendation (LAMO), which employs a parameter-efficient fine-tuning approach to tailor open-source LLMs for optimal performance in medication recommendation scenarios.
arXiv Detail & Related papers (2025-03-05T17:28:16Z)
LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.<n>We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.<n>Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
Demystifying Large Language Models for Medicine: A Primer [50.83806796466396]
Large language models (LLMs) represent a transformative class of AI tools capable of revolutionizing various aspects of healthcare. This tutorial aims to equip healthcare professionals with the tools necessary to effectively integrate LLMs into clinical practice.
arXiv Detail & Related papers (2024-10-24T15:41:56Z)
Exploring LLM-based Data Annotation Strategies for Medical Dialogue Preference Alignment [22.983780823136925]
This research examines the use of Reinforcement Learning from AI Feedback (RLAIF) techniques to improve healthcare dialogue models. We argue that the primary challenges in current RLAIF research for healthcare are the limitations of automated evaluation methods. We present a new evaluation framework based on standardized patient examinations.
arXiv Detail & Related papers (2024-10-05T10:29:19Z)
Empowering Clinicians with Medical Decision Transformers: A Framework for Sepsis Treatment [5.0005174003014865]
We propose the medical decision transformer (MeDT) to solve tasks in safety-critical settings. MeDT uses the decision transformer architecture to learn a policy for drug dosage recommendation. MeDT captures complex dependencies among a patient's medical history, treatment decisions, outcomes, and short-term effects on stability.
arXiv Detail & Related papers (2024-07-28T03:40:00Z)
AI Hospital: Benchmarking Large Language Models in a Multi-agent Medical Interaction Simulator [69.51568871044454]
We introduce textbfAI Hospital, a framework simulating dynamic medical interactions between emphDoctor as player and NPCs. This setup allows for realistic assessments of LLMs in clinical scenarios. We develop the Multi-View Medical Evaluation benchmark, utilizing high-quality Chinese medical records and NPCs.
arXiv Detail & Related papers (2024-02-15T06:46:48Z)
Adaptive Multi-Agent Deep Reinforcement Learning for Timely Healthcare Interventions [17.405080523382235]
We propose a novel AI-driven patient monitoring framework using multi-agent deep reinforcement learning (DRL) Our approach deploys multiple learning agents, each dedicated to monitoring a specific physiological feature, such as heart rate, respiration, and temperature. We evaluate the performance of the proposed multi-agent DRL framework using real-world physiological and motion data from two datasets.
arXiv Detail & Related papers (2023-09-20T00:42:08Z)
An Automatic Evaluation Framework for Multi-turn Medical Consultations Capabilities of Large Language Models [22.409334091186995]
Large language models (LLMs) often suffer from hallucinations, leading to overly confident but incorrect judgments. This paper introduces an automated evaluation framework that assesses the practical capabilities of LLMs as virtual doctors during multi-turn consultations.
arXiv Detail & Related papers (2023-09-05T09:24:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.