Related papers: An Agentic AI System for Multi-Framework Communication Coding

An Agentic AI System for Multi-Framework Communication Coding

URL: http://arxiv.org/abs/2512.08659v1
Date: Tue, 09 Dec 2025 14:46:16 GMT
Title: An Agentic AI System for Multi-Framework Communication Coding
Authors: Bohao Yang, Rui Yang, Joshua M. Biro, Haoyuan Wang, Jessica L. Handley, Brianna Richardson, Sophia Bessias, Nicoleta Economou-Zavlanos, Armando D. Bedoya, Monica Agrawal, Michael M. Zavlanos, Anand Chowdhury, Raj M. Ratwani, Kai Sun, Kathryn I. Pollak, Michael J. Pencina, Chuan Hong,
Abstract summary: We developed a Multi-framework Structured Agentic AI system for Clinical Communication (MOSAIC)<n>MOSAIC is built on a LangGraph-based architecture that orchestrates four core agents, including a Plan Agent for codebook selection and workflow planning.<n>To evaluate performance, we compared MOSAIC outputs against gold-standard annotations created by trained human coders.
Score: 17.846847341760675
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Clinical communication is central to patient outcomes, yet large-scale human annotation of patient-provider conversation remains labor-intensive, inconsistent, and difficult to scale. Existing approaches based on large language models typically rely on single-task models that lack adaptability, interpretability, and reliability, especially when applied across various communication frameworks and clinical domains. In this study, we developed a Multi-framework Structured Agentic AI system for Clinical Communication (MOSAIC), built on a LangGraph-based architecture that orchestrates four core agents, including a Plan Agent for codebook selection and workflow planning, an Update Agent for maintaining up-to-date retrieval databases, a set of Annotation Agents that applies codebook-guided retrieval-augmented generation (RAG) with dynamic few-shot prompting, and a Verification Agent that provides consistency checks and feedback. To evaluate performance, we compared MOSAIC outputs against gold-standard annotations created by trained human coders. We developed and evaluated MOSAIC using 26 gold standard annotated transcripts for training and 50 transcripts for testing, spanning rheumatology and OB/GYN domains. On the test set, MOSAIC achieved an overall F1 score of 0.928. Performance was highest in the Rheumatology subset (F1 = 0.962) and strongest for Patient Behavior (e.g., patients asking questions, expressing preferences, or showing assertiveness). Ablations revealed that MOSAIC outperforms baseline benchmarking.

Related papers

MedDialogRubrics: A Comprehensive Benchmark and Evaluation Framework for Multi-turn Medical Consultations in Large Language Models [15.91764739198419]
We present MedDialogRubrics, a novel benchmark comprising 5,200 synthetically constructed patient cases and over 60,000 fine-grained evaluation rubrics.<n>Our framework employs a multi-agent system to synthesize realistic patient records and chief complaints without accessing real-world electronic health records.
arXiv Detail & Related papers (2026-01-06T13:56:33Z)
DispatchMAS: Fusing taxonomy and artificial intelligence agents for emergency medical services [49.70819009392778]
Large Language Models (LLMs) and Multi-Agent Systems (MAS) offer opportunities to augment dispatchers.<n>This study aimed to develop and evaluate a taxonomy-grounded, multi-agent system for simulating realistic scenarios.
arXiv Detail & Related papers (2025-10-24T08:01:21Z)
Retrieval-Augmented Guardrails for AI-Drafted Patient-Portal Messages: Error Taxonomy Construction and Large-Scale Evaluation [5.555479009357263]
Asynchronous patient-clinician messaging via EHR portals is a growing source of clinician workload.<n>Our contributions are threefold: (1) we introduce a clinically grounded error ontology comprising 5 domains and 59 granular error codes; (2) we develop a retrieval-augmented evaluation pipeline; and (3) we provide a two-stage prompting architecture using DSPy to enable scalable, interpretable, and hierarchical error detection.
arXiv Detail & Related papers (2025-09-26T16:42:43Z)
Automated Clinical Problem Detection from SOAP Notes using a Collaborative Multi-Agent LLM Architecture [8.072932739333309]
We introduce a collaborative multi-agent system (MAS) that models a clinical consultation team to address this gap.<n>The system is tasked with identifying clinical problems by analyzing only the Subjective (S) and Objective (O) sections of SOAP notes.<n>A Manager agent orchestrates a dynamically assigned team of specialist agents who engage in a hierarchical, iterative debate to reach a consensus.
arXiv Detail & Related papers (2025-08-29T17:31:24Z)
MedQARo: A Large-Scale Benchmark for Medical Question Answering in Romanian [50.767415194856135]
We introduce MedQARo, the first large-scale medical QA benchmark in Romanian.<n>We construct a high-quality and large-scale dataset comprising 102,646 QA pairs related to cancer patients.
arXiv Detail & Related papers (2025-08-22T13:48:37Z)
AMRG: Extend Vision Language Models for Automatic Mammography Report Generation [4.366802575084445]
Mammography report generation is a critical yet underexplored task in medical AI.<n>We introduce AMRG, the first end-to-end framework for generating narrative mammography reports.<n>We train and evaluate AMRG on DMID, a publicly available dataset of paired high-resolution mammograms and diagnostic reports.
arXiv Detail & Related papers (2025-08-12T06:37:41Z)
MedAgentBoard: Benchmarking Multi-Agent Collaboration with Conventional Methods for Diverse Medical Tasks [27.717720332927296]
We introduce MedAgentBoard, a comprehensive benchmark for the systematic evaluation of multi-agent collaboration, single-LLM, and conventional approaches.<n> MedAgentBoard encompasses four diverse medical task categories: medical (visual) question answering, lay summary generation, structured Electronic Health Record (EHR) predictive modeling, and clinical workflow automation.<n>Our extensive experiments reveal a nuanced landscape: while multi-agent collaboration demonstrates benefits in specific scenarios, it does not consistently outperform advanced single LLMs.
arXiv Detail & Related papers (2025-05-18T11:28:17Z)
TAMA: A Human-AI Collaborative Thematic Analysis Framework Using Multi-Agent LLMs for Clinical Interviews [54.35097932763878]
Thematic analysis (TA) is a widely used qualitative approach for uncovering latent meanings in unstructured text data.<n>Here, we propose TAMA: A Human-AI Collaborative Thematic Analysis framework using Multi-Agent LLMs for clinical interviews.<n>We demonstrate that TAMA outperforms existing LLM-assisted TA approaches, achieving higher thematic hit rate, coverage, and distinctiveness.
arXiv Detail & Related papers (2025-03-26T15:58:16Z)
LlaMADRS: Prompting Large Language Models for Interview-Based Depression Assessment [75.44934940580112]
This study introduces LlaMADRS, a novel framework leveraging open-source Large Language Models (LLMs) to automate depression severity assessment.<n>We employ a zero-shot prompting strategy with carefully designed cues to guide the model in interpreting and scoring transcribed clinical interviews.<n>Our approach, tested on 236 real-world interviews, demonstrates strong correlations with clinician assessments.
arXiv Detail & Related papers (2025-01-07T08:49:04Z)
Self-supervised Answer Retrieval on Clinical Notes [68.87777592015402]
We introduce CAPR, a rule-based self-supervision objective for training Transformer language models for domain-specific passage matching. We apply our objective in four Transformer-based architectures: Contextual Document Vectors, Bi-, Poly- and Cross-encoders. We report that CAPR outperforms strong baselines in the retrieval of domain-specific passages and effectively generalizes across rule-based and human-labeled passages.
arXiv Detail & Related papers (2021-08-02T10:42:52Z)
MedDG: An Entity-Centric Medical Consultation Dataset for Entity-Aware Medical Dialogue Generation [86.38736781043109]
We build and release a large-scale high-quality Medical Dialogue dataset related to 12 types of common Gastrointestinal diseases named MedDG. We propose two kinds of medical dialogue tasks based on MedDG dataset. One is the next entity prediction and the other is the doctor response generation. Experimental results show that the pre-train language models and other baselines struggle on both tasks with poor performance in our dataset.
arXiv Detail & Related papers (2020-10-15T03:34:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.