Can Modern LLMs Act as Agent Cores in Radiology Environments?
- URL: http://arxiv.org/abs/2412.09529v2
- Date: Thu, 19 Dec 2024 03:05:27 GMT
- Title: Can Modern LLMs Act as Agent Cores in Radiology Environments?
- Authors: Qiaoyu Zheng, Chaoyi Wu, Pengcheng Qiu, Lisong Dai, Ya Zhang, Yanfeng Wang, Weidi Xie,
- Abstract summary: Large language models (LLMs) offer enhanced accuracy and interpretability across various domains. This paper investigates the prerequisite question for building concrete radiology agents. First, we present RadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based agents. Second, we propose RadABench-EvalPlat, a novel evaluation platform for agents featuring a prompt-driven workflow.
- Score: 54.36730060680139
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advancements in large language models (LLMs) have paved the way for LLM-based agent systems that offer enhanced accuracy and interpretability across various domains. Radiology, with its complex analytical requirements, is an ideal field for the application of these agents. This paper investigates the prerequisite question for building concrete radiology agents: "Can modern LLMs act as agent cores in radiology environments?" To answer it, we introduce RadABench with three-fold contributions: First, we present RadABench-Data, a comprehensive synthetic evaluation dataset for LLM-based agents, generated from an extensive taxonomy encompassing 6 anatomies, 5 imaging modalities, 10 tool categories, and 11 radiology tasks. Second, we propose RadABench-EvalPlat, a novel evaluation platform for agents featuring a prompt-driven workflow and the capability to simulate a wide range of radiology toolsets. Third, we assess the performance of 7 leading LLMs on our benchmark from 5 perspectives with multiple metrics. Our findings indicate that while current LLMs demonstrate strong capabilities in many areas, they are still not sufficiently advanced to serve as the central agent core in a fully operational radiology agent system. Additionally, we identify key factors influencing the performance of LLM-based agent cores, offering insights for clinicians on how to apply agent systems effectively in real-world radiology practice. All of our code and data are open-sourced at https://github.com/MAGIC-AI4Med/RadABench.
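As a concrete reading of the "agent core" framing, the following is a minimal sketch of a prompt-driven evaluation loop over a simulated toolset, in the spirit of RadABench-EvalPlat. The names (SimulatedTool, call_llm, evaluate_agent_core) are illustrative placeholders, not the benchmark's actual API; the repository linked above holds the real implementation.

```python
# Illustrative only: an LLM "agent core" must pick a simulated radiology tool
# for a task; the harness records the choice. Not RadABench's actual code.
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class SimulatedTool:
    name: str                   # e.g. "ChestCT_OrganSegmentation"
    description: str            # one-line capability statement shown to the LLM
    run: Callable[[str], str]   # returns a synthetic tool output for the task


def evaluate_agent_core(call_llm: Callable[[str], str],
                        task: str,
                        toolset: Dict[str, SimulatedTool]) -> dict:
    """Ask the LLM to choose the next tool for a radiology task and record the result."""
    tool_menu = "\n".join(f"- {t.name}: {t.description}" for t in toolset.values())
    prompt = (
        f"Radiology task: {task}\n"
        f"Available tools:\n{tool_menu}\n"
        "Reply with only the name of the single tool best suited to the next step."
    )
    choice = call_llm(prompt).strip()
    observation = toolset[choice].run(task) if choice in toolset else "INVALID_TOOL"
    return {"task": task, "chosen_tool": choice, "observation": observation}
```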
Related papers
- ChestX-Reasoner: Advancing Radiology Foundation Models with Reasoning through Step-by-Step Verification [57.22053411719822]
ChestX-Reasoner is a radiology diagnosis MLLM designed to leverage process supervision mined directly from clinical reports.
Our two-stage training framework combines supervised fine-tuning and reinforcement learning guided by process rewards to better align model reasoning with clinical standards.
arXiv Detail & Related papers (2025-04-29T16:48:23Z) - MedAgentBench: A Realistic Virtual EHR Environment to Benchmark Medical LLM Agents [20.96732566767587]
Recent large language models (LLMs) have demonstrated significant advancements, particularly in their ability to serve as agents.
We introduce MedAgentBench, a broad evaluation suite designed to assess the agent capabilities of large language models within medical records contexts.
The environment is built on the standard APIs and communication infrastructure of modern EMR systems, so it can be easily migrated into live EMR systems.
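As a rough illustration of what "standard EMR APIs" usually means in practice, the sketch below reads a patient record over a FHIR-style REST endpoint. The base URL, resource path, and the assumption that the environment exposes FHIR at all are mine, not details stated in the abstract.

```python
# Hypothetical sketch: fetching a Patient resource from a FHIR-style server.
# The endpoint and the use of FHIR are illustrative assumptions only.
import requests

FHIR_BASE = "http://localhost:8080/fhir"  # placeholder server URL


def get_patient(patient_id: str) -> dict:
    """Fetch a single Patient resource as JSON."""
    resp = requests.get(
        f"{FHIR_BASE}/Patient/{patient_id}",
        headers={"Accept": "application/fhir+json"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()
```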
arXiv Detail & Related papers (2025-01-24T17:21:01Z) - Enhancing LLMs for Impression Generation in Radiology Reports through a Multi-Agent System [10.502391082887568]
"RadCouncil" is a multi-agent Large Language Model (LLM) framework designed to enhance the generation of impressions in radiology reports from the finding section.<n>RadCouncil comprises three specialized agents: 1) a "Retrieval" Agent that identifies and retrieves similar reports from a vector database, 2) a "Radiologist" Agent that generates impressions based on the finding section of the given report plus the exemplar reports retrieved by the Retrieval Agent, and 3) a "Reviewer" Agent that evaluates the generated impressions and provides feedback.
arXiv Detail & Related papers (2024-12-06T21:33:03Z) - Medchain: Bridging the Gap Between LLM Agents and Clinical Practice through Interactive Sequential Benchmarking [58.25862290294702]
We present MedChain, a dataset of 12,163 clinical cases that covers five key stages of the clinical workflow. We also propose MedChain-Agent, an AI system that integrates a feedback mechanism and an MCase-RAG module to learn from previous cases and adapt its responses.
arXiv Detail & Related papers (2024-12-02T15:25:02Z) - Best Practices for Large Language Models in Radiology [4.972411560978282]
Nuanced use of language is key to many activities in radiology. The emergence of large language models (LLMs) offers an opportunity to improve the management and interpretation of the vast data in radiology.
arXiv Detail & Related papers (2024-12-02T07:54:55Z) - Comprehensive and Practical Evaluation of Retrieval-Augmented Generation Systems for Medical Question Answering [70.44269982045415]
Retrieval-augmented generation (RAG) has emerged as a promising approach to enhance the performance of large language models (LLMs)
We introduce Medical Retrieval-Augmented Generation Benchmark (MedRGB) that provides various supplementary elements to four medical QA datasets.
Our experimental results reveal current models' limited ability to handle noise and misinformation in the retrieved documents.
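One way to probe exactly that weakness is to mix distractor passages into the retrieved evidence before generation. The sketch below shows the idea generically; it is not MedRGB's actual protocol, and every name in it is a placeholder.

```python
# Generic noise-injection probe for a RAG pipeline; not MedRGB's protocol.
import random
from typing import Callable, List


def answer_with_noisy_context(question: str,
                              retrieved: List[str],
                              distractors: List[str],
                              call_llm: Callable[[str], str],
                              noise_ratio: float = 0.5,
                              seed: int = 0) -> str:
    rng = random.Random(seed)
    n_noise = min(int(len(retrieved) * noise_ratio), len(distractors))
    context = retrieved + rng.sample(distractors, n_noise)
    rng.shuffle(context)  # hide which passages are genuine
    prompt = ("Answer the medical question using only trustworthy evidence.\n"
              "Evidence:\n" + "\n".join(f"[{i}] {c}" for i, c in enumerate(context)) +
              f"\nQuestion: {question}")
    return call_llm(prompt)
```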
arXiv Detail & Related papers (2024-11-14T06:19:18Z) - Enhanced Electronic Health Records Text Summarization Using Large Language Models [0.0]
This project builds on prior work by creating a system that generates clinician-preferred, focused summaries.
The proposed system leverages the Flan-T5 model to generate tailored EHR summaries based on clinician-specified topics.
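Because Flan-T5 checkpoints are openly available, a topic-conditioned summary of this kind can be prototyped in a few lines with Hugging Face transformers. The checkpoint and prompt wording below are assumptions for illustration; the paper's fine-tuned weights and exact prompts may differ.

```python
# Minimal topic-conditioned EHR summarization with a public Flan-T5 checkpoint.
# The prompt template is an assumption, not the paper's exact setup.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")


def summarize_ehr(note: str, topic: str, max_new_tokens: int = 128) -> str:
    prompt = f"Summarize the following clinical note, focusing on {topic}:\n{note}"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example: summarize_ehr(note_text, topic="medication changes")
```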
arXiv Detail & Related papers (2024-10-12T19:36:41Z) - AIPatient: Simulating Patients with EHRs and LLM Powered Agentic Workflow [33.8495939261319]
We develop an advanced simulated patient system with AIPatient Knowledge Graph (AIPatient KG) as the input and Reasoning Retrieval-Augmented Generation (Reasoning RAG) as the generation backbone.
Reasoning RAG leverages six LLM-powered agents spanning retrieval, KG query generation, abstraction, checking, rewriting, and summarization.
Our system also exhibits high readability (median Flesch Reading Ease 77.23; median Flesch-Kincaid Grade 5.6), robustness (ANOVA F-value 0.6126, p>0.1), and stability (ANOVA F-value 0.782, p>0.1).
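The readability figures quoted above come from the standard Flesch formulas; for reference, a self-contained approximation is sketched below (published evaluations typically rely on a package such as textstat, whose syllable counting is more careful than this heuristic).

```python
# Approximate Flesch Reading Ease and Flesch-Kincaid Grade from raw text.
import re


def _count_syllables(word: str) -> int:
    # crude heuristic: each run of vowels counts as one syllable
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_scores(text: str) -> tuple:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(_count_syllables(w) for w in words)
    wps = len(words) / sentences           # words per sentence
    spw = syllables / max(1, len(words))   # syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    kincaid_grade = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, kincaid_grade
```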
arXiv Detail & Related papers (2024-09-27T17:17:15Z) - Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance.
Existing direct preference learning algorithms were originally designed for single-turn chat tasks.
We introduce a multi-turn direct preference learning framework, tailored for this context.
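For context, the single-turn objective these algorithms start from is the standard DPO loss; a common PyTorch formulation is sketched below. The paper's multi-turn, tool-integrated extension is more involved and is not reproduced here.

```python
# Standard single-turn DPO loss (reference formulation, not the paper's code).
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # implicit rewards: beta * log-ratio of policy to reference model
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # maximize the margin between preferred and dispreferred responses
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```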
arXiv Detail & Related papers (2024-09-04T02:41:04Z) - MGH Radiology Llama: A Llama 3 70B Model for Radiology [50.42811030970618]
This paper presents an advanced radiology-focused large language model: MGH Radiology Llama.
It is developed using the Llama 3 70B model, building upon previous domain-specific models like Radiology-GPT and Radiology-Llama2.
Our evaluation, incorporating both traditional metrics and a GPT-4-based assessment, highlights the enhanced performance of this work over general-purpose LLMs.
arXiv Detail & Related papers (2024-08-13T01:30:03Z) - Towards Generalizable Agents in Text-Based Educational Environments: A Study of Integrating RL with LLMs [22.568925103893182]
We aim to enhance the generalization capabilities of agents in open-ended text-based learning environments by integrating Reinforcement Learning (RL) with Large Language Models (LLMs).
We introduce PharmaSimText, a novel benchmark derived from the PharmaSim virtual pharmacy environment designed for practicing diagnostic conversations.
Our results show that RL-based agents excel at task completion but fall short in asking high-quality diagnostic questions.
arXiv Detail & Related papers (2024-04-29T14:53:48Z) - Exploring LLM Multi-Agents for ICD Coding [15.730751450511333]
The proposed multi-agent method for ICD coding effectively mimics the real-world coding process and improves performance on both common and rare codes.
Our method achieves results comparable to state-of-the-art ICD coding methods that require extensive pre-training or fine-tuning, and it outperforms them in rare-code accuracy and explainability.
arXiv Detail & Related papers (2024-04-01T15:17:39Z) - Towards a clinically accessible radiology foundation model: open-access and lightweight, with automated evaluation [113.5002649181103]
We train open-source small multimodal models (SMMs) to bridge competency gaps for unmet clinical needs in radiology.
For training, we assemble a large dataset of over 697 thousand radiology image-text pairs.
For evaluation, we propose CheXprompt, a GPT-4-based metric for factuality evaluation, and demonstrate its parity with expert evaluation.
LLaVA-Rad inference is fast and can be performed on a single V100 GPU in private settings, offering a promising state-of-the-art tool for real-world clinical applications.
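CheXprompt is only named here, so as a generic illustration of GPT-4-based factuality scoring, the sketch below asks the OpenAI chat API to compare a candidate report against a reference and count factual errors. The prompt, model name, and scoring scheme are my assumptions, not the actual CheXprompt protocol.

```python
# Generic LLM-as-judge factuality check; NOT the CheXprompt prompt or rubric.
# Requires the `openai` package and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()


def count_factual_errors(reference: str, candidate: str, model: str = "gpt-4o") -> str:
    prompt = (
        "Compare the candidate radiology report against the reference report.\n"
        f"Reference:\n{reference}\n\nCandidate:\n{candidate}\n\n"
        "List each factual error in the candidate, then state the total count."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content
```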
arXiv Detail & Related papers (2024-03-12T18:12:02Z) - End-to-End Breast Cancer Radiotherapy Planning via LMMs with Consistency Embedding [47.360760580820966]
We present RO-LMM, a comprehensive large multimodal model (LMM) tailored for the field of radiation oncology.
This model effectively manages a series of tasks within the clinical workflow, including clinical context summarization, radiation treatment plan suggestion, and plan-guided target volume segmentation.
We present a novel Consistency Embedding Fine-Tuning (CEFTune) technique, which boosts LMM's robustness to noisy inputs while preserving the consistency of handling clean inputs.
arXiv Detail & Related papers (2023-11-27T14:49:06Z) - ChatRadio-Valuer: A Chat Large Language Model for Generalizable Radiology Report Generation Based on Multi-institution and Multi-system Data [115.0747462486285]
ChatRadio-Valuer is a tailored model for automatic radiology report generation that learns generalizable representations.
The clinical dataset utilized in this study encompasses a remarkable total of 332,673 observations.
ChatRadio-Valuer consistently outperforms state-of-the-art models, including ChatGPT (GPT-3.5-Turbo) and GPT-4.
arXiv Detail & Related papers (2023-10-08T17:23:17Z) - An Iterative Optimizing Framework for Radiology Report Summarization with ChatGPT [80.33783969507458]
The 'Impression' section of a radiology report is a critical basis for communication between radiologists and other physicians.
Recent studies have achieved promising results in automatic impression generation using large-scale medical text data.
These models often require substantial amounts of medical text data and have poor generalization performance.
arXiv Detail & Related papers (2023-04-17T17:13:42Z) - Preoperative brain tumor imaging: models and software for segmentation and standardized reporting [0.0]
We investigate glioblastomas, lower grade gliomas, meningiomas, and metastases through four cohorts of up to 4000 patients.
Tumor segmentation models were trained using the AGU-Net architecture with different preprocessing steps and protocols.
Two software solutions have been developed, enabling an easy use of the trained models and standardized clinical reports.
arXiv Detail & Related papers (2022-04-29T16:29:17Z)