Libra: Leveraging Temporal Images for Biomedical Radiology Analysis
- URL: http://arxiv.org/abs/2411.19378v2
- Date: Sun, 16 Feb 2025 17:29:23 GMT
- Title: Libra: Leveraging Temporal Images for Biomedical Radiology Analysis
- Authors: Xi Zhang, Zaiqiao Meng, Jake Lever, Edmond S. L. Ho
- Abstract summary: Radiology report generation (RRG) requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. In this paper, we introduce Libra, a temporal-aware MLLM tailored for chest X-ray report generation. Libra combines a radiology-specific image encoder with a novel Temporal Alignment Connector (TAC), designed to accurately capture and integrate temporal differences between paired current and prior images.
- Score: 21.772106685777995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Radiology report generation (RRG) requires advanced medical image analysis, effective temporal reasoning, and accurate text generation. While multimodal large language models (MLLMs) align with pre-trained vision encoders to enhance visual-language understanding, most existing methods rely on single-image analysis or rule-based heuristics to process multiple images, failing to fully leverage temporal information in multi-modal medical datasets. In this paper, we introduce Libra, a temporal-aware MLLM tailored for chest X-ray report generation. Libra combines a radiology-specific image encoder with a novel Temporal Alignment Connector (TAC), designed to accurately capture and integrate temporal differences between paired current and prior images. Extensive experiments on the MIMIC-CXR dataset demonstrate that Libra establishes a new state-of-the-art benchmark among similarly scaled MLLMs, setting new standards in both clinical relevance and lexical accuracy.
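As an illustration of the temporal-difference idea, here is a minimal sketch of how a connector in the spirit of the TAC might fuse current and prior image features; the module names, dimensions, and attention-based alignment are assumptions, not Libra's actual implementation.

```python
# Hypothetical sketch of a temporal-alignment connector; shapes and the
# alignment mechanism are assumptions, not Libra's published design.
import torch
import torch.nn as nn

class TemporalAlignmentConnector(nn.Module):
    """Fuses current and prior image features via their difference."""
    def __init__(self, vis_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.align = nn.MultiheadAttention(vis_dim, num_heads=8, batch_first=True)
        self.proj = nn.Linear(2 * vis_dim, llm_dim)

    def forward(self, cur: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        # cur, prior: (batch, num_patches, vis_dim) from a radiology image encoder
        aligned, _ = self.align(cur, prior, prior)        # attend current onto prior
        diff = cur - aligned                              # temporal-difference signal
        return self.proj(torch.cat([cur, diff], dim=-1))  # tokens for the LLM

tac = TemporalAlignmentConnector()
tokens = tac(torch.randn(1, 196, 1024), torch.randn(1, 196, 1024))
print(tokens.shape)  # torch.Size([1, 196, 4096])
```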
Related papers
- MicarVLMoE: A Modern Gated Cross-Aligned Vision-Language Mixture of Experts Model for Medical Image Captioning and Report Generation [4.760537994346813]
Medical image reporting aims to generate structured clinical descriptions from radiological images.
We propose MicarVLMoE, a vision-language mixture-of-experts model with gated cross-aligned fusion.
We extend MIR to CT scans, retinal imaging, MRI scans, and gross pathology images, reporting state-of-the-art results.
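A gated mixture-of-experts fusion can be illustrated in a few lines; the layer below is a generic sketch with assumed dimensions and expert count, not the MicarVLMoE architecture.

```python
# Illustrative gated MoE fusion of pooled visual and text features;
# all names and sizes here are assumptions.
import torch
import torch.nn as nn

class GatedMoEFusion(nn.Module):
    def __init__(self, dim: int = 512, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        self.gate = nn.Linear(2 * dim, num_experts)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        # vis, txt: (batch, dim) pooled visual and text features
        weights = torch.softmax(self.gate(torch.cat([vis, txt], dim=-1)), dim=-1)
        expert_out = torch.stack([e(vis) for e in self.experts], dim=1)  # (B, E, dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)           # gated mix
```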
arXiv Detail & Related papers (2025-04-29T01:26:02Z) - SeLIP: Similarity Enhanced Contrastive Language Image Pretraining for Multi-modal Head MRI [6.714491893348051]
We propose to develop a foundation model for multi-modal head MRI by using contrastive learning on the images and the corresponding radiology findings.
Our proposed similarity enhanced contrastive language image pretraining (SeLIP) is able to effectively extract more useful features.
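One plausible reading of similarity-enhanced contrastive pretraining is to soften the usual one-hot CLIP targets with report-to-report similarity, so near-duplicate findings are not treated as hard negatives; the sketch below follows that assumption and may differ from SeLIP's exact loss.

```python
# Similarity-weighted contrastive objective (an interpretation, not SeLIP's code).
import torch
import torch.nn.functional as F

def selip_style_loss(img_emb, txt_emb, temperature: float = 0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    # Soft targets: reports with similar wording should not be hard negatives.
    with torch.no_grad():
        targets = F.softmax(txt @ txt.t() / temperature, dim=-1)
    return F.cross_entropy(logits, targets)  # accepts probability targets

loss = selip_style_loss(torch.randn(8, 256), torch.randn(8, 256))
```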
arXiv Detail & Related papers (2025-03-25T16:09:45Z) - RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining [48.21287619304126]
We propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities.
We construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for Chest X-rays and CTRATE-IR for CT scans.
We develop two retrieval systems, RadIR-CXR and RadIR-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks.
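The report-mining step can be pictured as turning extracted findings into graded image-image relevance; the helper below uses Jaccard overlap as an assumed stand-in for the paper's similarity definition.

```python
# Hypothetical graded relevance from report findings; field names assumed.
def report_similarity(findings_a: set, findings_b: set) -> float:
    """Jaccard overlap of extracted findings as a multi-grained relevance score."""
    if not findings_a and not findings_b:
        return 1.0
    return len(findings_a & findings_b) / len(findings_a | findings_b)

rel = report_similarity({"cardiomegaly", "effusion"}, {"effusion"})
print(rel)  # 0.5 -> a partial match, ranked between exact and unrelated pairs
```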
arXiv Detail & Related papers (2025-03-06T17:43:03Z) - Enhanced Contrastive Learning with Multi-view Longitudinal Data for Chest X-ray Report Generation [15.257119888131609]
We propose enhanced contrastive learning with Multi-view Longitudinal data to facilitate chest X-ray Report Generation, named MLRG.
Specifically, we introduce a multi-view longitudinal contrastive learning method that integrates spatial information from current multi-view images and temporal information from longitudinal data.
We present a tokenized absence encoding technique to handle missing patient-specific prior knowledge, allowing the model to produce more accurate radiology reports based on available prior knowledge.
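A sketch of the tokenized-absence idea: a learned placeholder stands in for a missing prior study so the model always sees a consistent input shape. All names and sizes below are assumptions, not the MLRG implementation.

```python
from typing import Optional

import torch
import torch.nn as nn

class PriorEncoder(nn.Module):
    def __init__(self, dim: int = 768, num_tokens: int = 196):
        super().__init__()
        # Learned tokens that encode "no prior study available".
        self.absence_tokens = nn.Parameter(torch.randn(1, num_tokens, dim) * 0.02)

    def forward(self, prior_feats: Optional[torch.Tensor], batch_size: int) -> torch.Tensor:
        if prior_feats is None:  # no prior study on record for this patient
            return self.absence_tokens.expand(batch_size, -1, -1)
        return prior_feats
```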
arXiv Detail & Related papers (2025-02-27T12:59:04Z) - HC-LLM: Historical-Constrained Large Language Models for Radiology Report Generation [89.3260120072177]
We propose a novel Historical-Constrained Large Language Models (HC-LLM) framework for Radiology report generation.
Our approach extracts both time-shared and time-specific features from longitudinal chest X-rays and diagnostic reports to capture disease progression.
Notably, our approach performs well even without historical data during testing and can be easily adapted to other multimodal large models.
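One way to picture the time-shared versus time-specific split is a pair of projections with a consistency constraint on the shared part; this toy decomposition reflects that reading of the abstract, not HC-LLM's actual modules.

```python
# Toy longitudinal feature decomposition; all module names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LongitudinalDecomposer(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.shared = nn.Linear(dim, dim)
        self.specific = nn.Linear(dim, dim)

    def forward(self, cur: torch.Tensor, prior: torch.Tensor):
        s_cur, s_prior = self.shared(cur), self.shared(prior)
        # Shared parts of the two studies should agree; specific parts carry change.
        consistency = F.mse_loss(s_cur, s_prior)
        return self.specific(cur), self.specific(prior), consistency
```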
arXiv Detail & Related papers (2024-12-15T06:04:16Z) - MRGen: Segmentation Data Engine For Underrepresented MRI Modalities [59.61465292965639]
Training medical image segmentation models for rare yet clinically significant imaging modalities is challenging due to the scarcity of annotated data.
This paper investigates leveraging generative models to synthesize training data, to train segmentation models for underrepresented modalities.
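The data-engine idea in miniature: a mask-conditioned generator produces synthetic image-mask pairs for the underrepresented modality. The `generate` function below is a hypothetical stub, not MRGen's API.

```python
import torch

def generate(mask: torch.Tensor, modality: str) -> torch.Tensor:
    """Stub standing in for a mask-conditioned generative model."""
    return torch.rand_like(mask, dtype=torch.float32)  # placeholder image

def build_synthetic_set(masks, target_modality: str = "T2-FLAIR"):
    # Reuse annotations from a well-covered modality to train a segmenter
    # for an underrepresented one.
    return [(generate(m, target_modality), m) for m in masks]

pairs = build_synthetic_set([torch.zeros(1, 256, 256)])
```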
arXiv Detail & Related papers (2024-12-04T16:34:22Z) - M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation [0.0]
M4CXR is a multi-modal large language model (MLLM) designed to enhance chest X-ray (CXR) interpretation.
The model supports multiple tasks such as medical report generation (MRG), visual grounding, and visual question answering (VQA).
M4CXR achieves state-of-the-art clinical accuracy in MRG by employing a chain-of-thought prompting strategy.
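Chain-of-thought prompting for report generation can be as simple as staging the reasoning before the final text; the wording below is invented for illustration, not M4CXR's actual prompt.

```python
# A hedged illustration of a chain-of-thought prompt for MRG.
cot_prompt = (
    "You are a radiology assistant. Think step by step:\n"
    "1. List the anatomical regions visible in the chest X-ray.\n"
    "2. Note abnormal findings in each region.\n"
    "3. Only then write the final FINDINGS section of the report.\n"
)
```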
arXiv Detail & Related papers (2024-08-29T02:12:58Z) - Unlocking the Power of Spatial and Temporal Information in Medical Multimodal Pre-training [99.2891802841936]
We introduce the Med-ST framework for fine-grained spatial and temporal modeling.
For spatial modeling, Med-ST employs the Mixture of View Expert (MoVE) architecture to integrate different visual features from both frontal and lateral views.
For temporal modeling, we propose a novel cross-modal bidirectional cycle consistency objective by forward mapping classification (FMC) and reverse mapping regression (RMR).
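One way to read the FMC/RMR pairing is as a forward map trained with a classification loss and a reverse map trained with a regression loss; the toy version below reflects that reading, not Med-ST's actual code.

```python
# Toy bidirectional cycle objective; an interpretation of FMC/RMR.
import torch
import torch.nn as nn
import torch.nn.functional as F

forward_map = nn.Linear(512, 512)   # image -> text space
reverse_map = nn.Linear(512, 512)   # text  -> image space

img = torch.randn(8, 512)
txt = torch.randn(8, 512)
labels = torch.arange(8)            # matched pairs sit on the diagonal

logits = forward_map(img) @ txt.t()             # FMC: classify the matching text
fmc_loss = F.cross_entropy(logits, labels)
rmr_loss = F.mse_loss(reverse_map(txt), img)    # RMR: regress back to the image
loss = fmc_loss + rmr_loss
```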
arXiv Detail & Related papers (2024-05-30T03:15:09Z) - MLIP: Medical Language-Image Pre-training with Masked Local Representation Learning [20.33625985769796]
Existing contrastive language-image pre-training aims to learn a joint representation by matching abundant image-text pairs.
We propose a Medical Language-Image Pre-training framework, which exploits the limited image-text medical data more efficiently.
Our evaluation results show that MLIP outperforms previous work in zero/few-shot classification and few-shot segmentation tasks by a large margin.
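The masked-local idea can be sketched as dropping a random subset of patch tokens before alignment, so the model must match the surviving local features to the text; the masking ratio and names below are assumptions.

```python
import torch

def mask_patches(patches: torch.Tensor, ratio: float = 0.5) -> torch.Tensor:
    # patches: (batch, num_patches, dim); zero out a random subset so the
    # model must align the remaining local features with the report text.
    keep = torch.rand(patches.shape[:2], device=patches.device) > ratio
    return patches * keep.unsqueeze(-1)

masked = mask_patches(torch.randn(4, 196, 768))
```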
arXiv Detail & Related papers (2024-01-03T07:54:13Z) - VALD-MD: Visual Attribution via Latent Diffusion for Medical Diagnostics [0.0]
Visual attribution in medical imaging seeks to make evident the diagnostically relevant components of a medical image.
We here present a novel generative visual attribution technique, one that leverages latent diffusion models in combination with domain-specific large language models.
The resulting system also exhibits a range of latent capabilities including zero-shot localized disease induction.
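The generative-attribution recipe in outline: synthesize a "healthy" counterfactual of the input and read the attribution map from the difference. `edit_to_healthy` below is a hypothetical stand-in for a latent diffusion editor guided by an LLM-derived prompt.

```python
import torch

def edit_to_healthy(image: torch.Tensor, prompt: str) -> torch.Tensor:
    return image.clone()  # placeholder for a diffusion-based counterfactual edit

def attribution_map(image: torch.Tensor) -> torch.Tensor:
    healthy = edit_to_healthy(image, "chest X-ray with no acute findings")
    return (image - healthy).abs()  # bright where pathology was removed
```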
arXiv Detail & Related papers (2024-01-02T19:51:49Z) - Generalized Implicit Neural Representation for Efficient MRI Parallel Imaging Reconstruction [16.63720411275398]
This study presents a generalized implicit neural representation (INR)-based framework for MRI PI reconstruction.
The framework's INR model treats fully sampled MR images as a continuous function of spatial coordinates and prior voxel-specific features.
Experiments on publicly available MRI datasets demonstrate the superior performance of the proposed method in reconstructing images at multiple acceleration factors.
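The INR formulation reduces to a coordinate network: an MLP maps spatial coordinates plus voxel-specific prior features to image intensity. The sketch below uses assumed sizes and omits the parallel-imaging physics.

```python
import torch
import torch.nn as nn

class CoordinateINR(nn.Module):
    def __init__(self, prior_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 + prior_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted intensity at this coordinate
        )

    def forward(self, coords: torch.Tensor, priors: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([coords, priors], dim=-1))

inr = CoordinateINR()
intensity = inr(torch.rand(1024, 2), torch.randn(1024, 32))  # one intensity per point
```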
arXiv Detail & Related papers (2023-09-12T09:07:03Z) - On Sensitivity and Robustness of Normalization Schemes to Input Distribution Shifts in Automatic MR Image Diagnosis [58.634791552376235]
Deep Learning (DL) models have achieved state-of-the-art performance in diagnosing multiple diseases using reconstructed images as input.
DL models are sensitive to varying artifacts, as these artifacts shift the input data distribution between the training and testing phases.
We propose to use other normalization techniques, such as Group Normalization and Layer Normalization, to inject robustness into model performance against varying image artifacts.
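In PyTorch this change is essentially a drop-in module swap; below is a standard pattern for replacing BatchNorm2d with GroupNorm (the group count is a tunable assumption), which avoids batch statistics that shift under changing artifacts.

```python
import torch.nn as nn

def bn_to_gn(model: nn.Module, groups: int = 8) -> nn.Module:
    """Recursively replace BatchNorm2d layers with GroupNorm."""
    for name, child in model.named_children():
        if isinstance(child, nn.BatchNorm2d):
            setattr(model, name, nn.GroupNorm(groups, child.num_features))
        else:
            bn_to_gn(child, groups)
    return model
```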
arXiv Detail & Related papers (2023-06-23T03:09:03Z) - XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [60.437091462613544]
We introduce XrayGPT, a novel conversational medical vision-language model.
It can analyze and answer open-ended questions about chest radiographs.
We generate 217k interactive and high-quality summaries from free-text radiology reports.
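Conversational medical VLMs of this kind typically bridge a frozen vision encoder to the LLM with a small trainable projection; the wiring below is a generic sketch with assumed dimensions, not XrayGPT's exact configuration.

```python
import torch
import torch.nn as nn

vision_dim, llm_dim = 768, 4096
project = nn.Linear(vision_dim, llm_dim)  # the only trainable bridge

image_feats = torch.randn(1, 49, vision_dim)  # frozen encoder output
visual_tokens = project(image_feats)          # prepended to the LLM prompt
```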
arXiv Detail & Related papers (2023-06-13T17:59:59Z) - LLM-CXR: Instruction-Finetuned LLM for CXR Image Understanding and Generation [51.08810811457617]
Vision-language alignment in LLMs is actively being researched to enable multimodal reasoning and visual IO.
We develop a method for instruction-tuning an LLM only on text to gain vision-language capabilities for medical images.
Our model, LLM-CXR, trained with this approach, shows better image-text alignment in both CXR understanding and generation tasks.
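One way to give an instruction-tuned LLM visual input and output is to discretize images into codebook indices and fold them into the text vocabulary so one model can both read and emit images; the sketch below illustrates that unified-token idea with invented sizes and should not be read as LLM-CXR's exact scheme.

```python
# Hypothetical unified token space; sizes are illustrative only.
text_vocab_size = 32000
image_codebook_size = 1024

def image_token(code: int) -> int:
    """Map a VQ codebook index into the extended LLM vocabulary."""
    assert 0 <= code < image_codebook_size
    return text_vocab_size + code

# An image becomes a plain token sequence the LLM can be tuned on.
image_codes = [3, 917, 44]
sequence = [image_token(c) for c in image_codes]
```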
arXiv Detail & Related papers (2023-05-19T07:44:39Z) - Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
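Text-to-image retrieval benchmarks of this kind are usually scored with Recall@K over a similarity matrix; a minimal evaluation helper (embedding sizes below are placeholders):

```python
import torch
import torch.nn.functional as F

def recall_at_k(txt_emb: torch.Tensor, img_emb: torch.Tensor, k: int = 5) -> float:
    # Rank images by cosine similarity to each report; matched pairs lie
    # on the diagonal of the similarity matrix.
    sims = F.normalize(txt_emb, dim=-1) @ F.normalize(img_emb, dim=-1).t()
    topk = sims.topk(k, dim=-1).indices                 # (N, k) ranked images
    targets = torch.arange(sims.size(0)).unsqueeze(-1)  # (N, 1) true indices
    return (topk == targets).any(dim=-1).float().mean().item()

print(recall_at_k(torch.randn(100, 256), torch.randn(100, 256)))
```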
arXiv Detail & Related papers (2023-03-30T18:20:00Z) - Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [53.89917396428747]
Self-supervised learning in vision-language processing exploits semantic alignment between imaging and text modalities.
We explicitly account for prior images and reports when available during both training and fine-tuning.
Our approach, named BioViL-T, uses a CNN-Transformer hybrid multi-image encoder trained jointly with a text model.
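A CNN-Transformer hybrid over current and prior images can be pictured as a shared CNN extracting per-image features and a small transformer attending across the two time points; sizes and layer counts below are assumptions, not BioViL-T's configuration.

```python
import torch
import torch.nn as nn

class MultiImageEncoder(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, dim, kernel_size=7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(7),                    # (B, dim, 7, 7) per image
        )
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, cur: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
        feats = [self.cnn(x).flatten(2).transpose(1, 2) for x in (cur, prior)]
        return self.temporal(torch.cat(feats, dim=1))   # joint temporal tokens

enc = MultiImageEncoder()
out = enc(torch.randn(2, 1, 224, 224), torch.randn(2, 1, 224, 224))
print(out.shape)  # torch.Size([2, 98, 256])
```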
arXiv Detail & Related papers (2023-01-11T16:35:33Z)