Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings
- URL: http://arxiv.org/abs/2505.01711v1
- Date: Sat, 03 May 2025 06:18:12 GMT
- Title: Knowledge-Augmented Language Models Interpreting Structured Chest X-Ray Findings
- Authors: Alexander Davis, Rafael Souza, Jia-Hao Lim
- Abstract summary: This paper introduces CXR-TextInter, a novel framework that repurposes powerful text-centric language models for chest X-ray interpretation. We augment this LLM-centric approach with an integrated medical knowledge module to enhance clinical reasoning. Our work validates an alternative paradigm for medical image AI, showcasing the potential of harnessing advanced LLM capabilities.
- Score: 44.99833362998488
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated interpretation of chest X-rays (CXR) is a critical task with the potential to significantly improve clinical workflow and patient care. While recent advances in multimodal foundation models have shown promise, effectively leveraging the full power of large language models (LLMs) for this visual task remains an underexplored area. This paper introduces CXR-TextInter, a novel framework that repurposes powerful text-centric LLMs for CXR interpretation by operating solely on a rich, structured textual representation of the image content, generated by an upstream image analysis pipeline. We augment this LLM-centric approach with an integrated medical knowledge module to enhance clinical reasoning. To facilitate training and evaluation, we developed the MediInstruct-CXR dataset, containing structured image representations paired with diverse, clinically relevant instruction-response examples, and the CXR-ClinEval benchmark for comprehensive assessment across various interpretation tasks. Extensive experiments on CXR-ClinEval demonstrate that CXR-TextInter achieves state-of-the-art quantitative performance across pathology detection, report generation, and visual question answering, surpassing existing multimodal foundation models. Ablation studies confirm the critical contribution of the knowledge integration module. Furthermore, blinded human evaluation by board-certified radiologists shows a significant preference for the clinical quality of outputs generated by CXR-TextInter. Our work validates an alternative paradigm for medical image AI, showcasing the potential of harnessing advanced LLM capabilities when visual information is effectively structured and domain knowledge is integrated.
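The abstract gives only the shape of the pipeline (an upstream image analysis step producing structured findings, a knowledge module, and a text-only LLM). The sketch below illustrates that shape; the `Finding` class, `build_prompt`, and `llm_chat` are hypothetical names for illustration, not functions from the paper's codebase.

```python
# Hypothetical sketch of the CXR-TextInter pipeline shape described above.
# All helper names here are placeholders, not the paper's implementation.
from dataclasses import dataclass

@dataclass
class Finding:
    label: str        # e.g. "cardiomegaly"
    location: str     # e.g. "cardiac silhouette"
    severity: str     # e.g. "moderate"

def findings_to_text(findings: list[Finding]) -> str:
    """Serialize upstream image-analysis output into the structured text the LLM consumes."""
    return "\n".join(
        f"- {f.label} | location: {f.location} | severity: {f.severity}" for f in findings
    )

def build_prompt(findings: list[Finding], knowledge_snippets: list[str], instruction: str) -> str:
    """Combine structured findings, retrieved medical knowledge, and the task instruction."""
    return (
        "Structured CXR findings:\n" + findings_to_text(findings) + "\n\n"
        "Relevant medical knowledge:\n" + "\n".join(f"* {s}" for s in knowledge_snippets) + "\n\n"
        f"Instruction: {instruction}"
    )

# Example usage with placeholder data:
findings = [Finding("cardiomegaly", "cardiac silhouette", "moderate")]
knowledge = ["Cardiomegaly is suggested by a cardiothoracic ratio above 0.5 on PA films."]
prompt = build_prompt(findings, knowledge, "Write the impression section of the report.")
# report = llm_chat(prompt)  # any text-only LLM; the paper's specific model is not specified here
```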
Related papers
- Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning [27.49826980862286]
We propose an in-context learning framework called PathGenIC that integrates context derived from the training set with a multimodal in-context learning mechanism. Our method dynamically retrieves semantically similar whole slide image (WSI)-report pairs and incorporates adaptive feedback to enhance contextual relevance and generation quality.
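The retrieval step described above (finding semantically similar WSI-report pairs for in-context prompting) could look roughly like the following; the embedding model, the adaptive feedback mechanism, and all names here are assumptions rather than PathGenIC's actual implementation.

```python
# Illustrative nearest-neighbour retrieval of WSI-report pairs for in-context
# learning, assuming precomputed slide-level embeddings. Names are hypothetical.
import numpy as np

def retrieve_similar_pairs(query_emb: np.ndarray,
                           train_embs: np.ndarray,
                           train_reports: list[str],
                           k: int = 3) -> list[str]:
    """Return the reports of the k training slides most similar to the query slide."""
    # Cosine similarity between the query embedding and every training embedding.
    sims = train_embs @ query_emb / (
        np.linalg.norm(train_embs, axis=1) * np.linalg.norm(query_emb) + 1e-8
    )
    top_k = np.argsort(-sims)[:k]
    return [train_reports[i] for i in top_k]

# The retrieved reports would then be placed in the multimodal prompt as
# in-context examples before asking the model to describe the query slide.
```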
arXiv Detail & Related papers (2025-06-21T08:56:45Z) - RadFabric: Agentic AI System with Reasoning Capability for Radiology [61.25593938175618]
RadFabric is a multi-agent, multimodal reasoning framework that unifies visual and textual analysis for comprehensive CXR interpretation. The system employs specialized CXR agents for pathology detection, an Anatomical Interpretation Agent to map visual findings to precise anatomical structures, and a Reasoning Agent powered by large multimodal reasoning models to synthesize visual, anatomical, and clinical data into transparent and evidence-based diagnoses.
arXiv Detail & Related papers (2025-06-17T03:10:33Z) - Bringing CLIP to the Clinic: Dynamic Soft Labels and Negation-Aware Learning for Medical Analysis [0.9944647907864256]
We propose a novel approach that integrates clinically-enhanced dynamic soft labels and medical graphical alignment. Our approach is easily integrated into the medical CLIP training pipeline and achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-05-28T08:00:18Z) - Zeus: Zero-shot LLM Instruction for Union Segmentation in Multimodal Medical Imaging [4.341503087761129]
Multimodal learning that combines visual and text modalities has been shown to be a solution, but collecting paired vision-language datasets is expensive and time-consuming. Inspired by the superior ability of Large Language Models (LLMs) in numerous cross-modal tasks, we propose a novel Vision-LLM union framework to address these issues.
arXiv Detail & Related papers (2025-04-09T23:33:35Z) - Fake It Till You Make It: Using Synthetic Data and Domain Knowledge for Improved Text-Based Learning for LGE Detection [11.532639713283226]
We use strategies rooted in domain knowledge to train a model for LGE detection using text from clinical reports. We standardize the orientation of the images in an anatomy-informed way to enable better alignment of spatial and text features. Ablation studies are carried out to elucidate the contributions of each design component to the overall performance of the model.
arXiv Detail & Related papers (2025-02-18T15:30:48Z) - A Generative Framework for Bidirectional Image-Report Understanding in Chest Radiography [1.2289361708127877]
Multi-Stage Adaptive Vision-Language Tuning (MAViLT) is a novel framework designed to enhance multimodal reasoning and generation for vision-based understanding. MAViLT incorporates a clinical gradient-weighted tokenization process and a hierarchical fine-tuning strategy, enabling it to generate accurate radiology reports, synthesize realistic CXRs from text, and answer vision-based clinical questions. We evaluate MAViLT on two benchmark datasets, MIMIC-CXR and Indiana University CXR, achieving state-of-the-art results across all tasks.
arXiv Detail & Related papers (2025-02-09T15:02:57Z) - Clinical Evaluation of Medical Image Synthesis: A Case Study in Wireless Capsule Endoscopy [63.39037092484374]
Synthetic Data Generation (SDG) based on Artificial Intelligence (AI) can transform the way clinical medicine is delivered. This study focuses on the clinical evaluation of medical SDG, with a proof-of-concept investigation on diagnosing Inflammatory Bowel Disease (IBD) using Wireless Capsule Endoscopy (WCE) images. The results show that TIDE-II generates clinically plausible, very realistic WCE images, with improved quality compared to relevant state-of-the-art generative models.
arXiv Detail & Related papers (2024-10-31T19:48:50Z) - Visual Prompt Engineering for Vision Language Models in Radiology [0.17183214167143138]
Contrastive Language-Image Pretraining (CLIP) offers a solution by enabling zero-shot classification through large-scale pretraining. Visual markers improve AUROC by up to 0.185, highlighting their effectiveness in enhancing classification performance. We release our code and preprocessing pipeline, providing a reference point for future work on localized classification in medical imaging.
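As a rough illustration of the visual-prompting idea, the snippet below draws a marker on the radiograph and scores label prompts with the public Hugging Face CLIP checkpoint; the paper's exact markers, prompts, and model may differ.

```python
# Sketch: draw a visual marker (red ellipse) around a region of interest, then
# run zero-shot classification with a generic CLIP model. Marker coordinates,
# file path, and label prompts are illustrative, not taken from the paper.
from PIL import Image, ImageDraw
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("cxr.png").convert("RGB")       # placeholder path to a local radiograph
draw = ImageDraw.Draw(image)
draw.ellipse((120, 80, 220, 180), outline="red", width=4)  # highlight the suspected region

labels = ["a chest x-ray with pneumothorax", "a chest x-ray without pneumothorax"]
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1)   # zero-shot class probabilities
print(dict(zip(labels, probs[0].tolist())))
```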
arXiv Detail & Related papers (2024-08-28T13:53:27Z) - MLVICX: Multi-Level Variance-Covariance Exploration for Chest X-ray Self-Supervised Representation Learning [6.4136876268620115]
MLVICX is an approach to capture rich representations in the form of embeddings from chest X-ray images.
We demonstrate the performance of MLVICX in advancing self-supervised chest X-ray representation learning.
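The variance-covariance idea can be illustrated with a generic VICReg-style regularizer over a batch of embeddings, shown below; this is a related standard formulation, not the exact MLVICX objective.

```python
# VICReg-style variance/covariance terms over a batch of embeddings, as a generic
# stand-in for variance-covariance regularization; not the exact MLVICX loss.
import torch

def variance_covariance_terms(z: torch.Tensor, gamma: float = 1.0, eps: float = 1e-4):
    """z: (batch, dim) embeddings. Returns (variance_loss, covariance_loss)."""
    z = z - z.mean(dim=0)                       # center each embedding dimension
    std = torch.sqrt(z.var(dim=0) + eps)        # per-dimension standard deviation
    var_loss = torch.relu(gamma - std).mean()   # push the std of every dimension above gamma
    n, d = z.shape
    cov = (z.T @ z) / (n - 1)                   # covariance matrix of the embeddings
    off_diag = cov - torch.diag(torch.diag(cov))
    cov_loss = (off_diag ** 2).sum() / d        # decorrelate different embedding dimensions
    return var_loss, cov_loss
```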
arXiv Detail & Related papers (2024-03-18T06:19:37Z) - Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest X-ray (CXR) images and associated patient demographic information.
The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
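A minimal sketch of this conditioning pattern, with CNN feature tokens and an embedded demographic vector fed to a transformer decoder, is given below; the module sizes and the fusion choice are assumptions, not the paper's exact architecture.

```python
# Sketch of fusing CNN features from a CXR with an embedding of patient
# demographics before a transformer decoder; dimensions are illustrative.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class ConditionedReportGenerator(nn.Module):
    def __init__(self, vocab_size: int = 10000, d_model: int = 256):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-2])  # keep the spatial feature map
        self.visual_proj = nn.Linear(512, d_model)
        self.demo_proj = nn.Linear(16, d_model)      # projected demographic feature vector
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, image, demographics, report_tokens):
        feats = self.cnn(image)                              # (B, 512, H', W')
        feats = feats.flatten(2).transpose(1, 2)             # (B, H'*W', 512)
        memory = self.visual_proj(feats)                     # visual memory tokens
        demo = self.demo_proj(demographics).unsqueeze(1)     # (B, 1, d_model)
        memory = torch.cat([memory, demo], dim=1)            # condition on demographics
        tgt = self.token_emb(report_tokens)                  # (causal mask omitted for brevity)
        return self.out(self.decoder(tgt, memory))
```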
arXiv Detail & Related papers (2023-11-18T14:52:26Z) - LLM-driven Multimodal Target Volume Contouring in Radiation Oncology [46.23891509553877]
Large language models (LLMs) can facilitate the integration of textual information and images.
We present a novel LLM-driven multimodal AI, namely LLMSeg, that is applicable to the challenging task of target volume contouring for radiation therapy.
We demonstrate that the proposed model exhibits markedly improved performance compared to conventional unimodal AI models.
arXiv Detail & Related papers (2023-11-03T13:38:42Z) - XrayGPT: Chest Radiographs Summarization using Medical Vision-Language
Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
arXiv Detail & Related papers (2023-06-13T17:59:59Z) - Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training [55.56609500764344]
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
arXiv Detail & Related papers (2023-05-13T13:53:48Z)