CoCa-CXR: Contrastive Captioners Learn Strong Temporal Structures for Chest X-Ray Vision-Language Understanding
- URL: http://arxiv.org/abs/2502.20509v1
- Date: Thu, 27 Feb 2025 20:39:03 GMT
- Title: CoCa-CXR: Contrastive Captioners Learn Strong Temporal Structures for Chest X-Ray Vision-Language Understanding
- Authors: Yixiong Chen, Shawn Xu, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Shravya Shetty, Daniel Golden, Alan Yuille, Lin Yang
- Abstract summary: Vision-language models have proven to be of great benefit for medical image analysis since they learn rich semantics from both images and reports. We propose two components to address aligning progression descriptions with the semantic differences in image pairs. CoCa-CXR incorporates a novel regional cross-attention module to identify local differences between paired CXR images.
- Score: 19.89997101064605
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-language models have proven to be of great benefit for medical image analysis since they learn rich semantics from both images and reports. Prior efforts have focused on better alignment of image and text representations to enhance image understanding. However, though explicit reference to a prior image is common in Chest X-Ray (CXR) reports, aligning progression descriptions with the semantic differences in image pairs remains under-explored. In this work, we propose two components to address this issue. (1) A CXR report processing pipeline to extract temporal structure. It processes reports with a large language model (LLM) to separate the description and comparison contexts, and extracts fine-grained annotations from reports. (2) A contrastive captioner model for CXR, namely CoCa-CXR, to learn how to both describe images and their temporal progressions. CoCa-CXR incorporates a novel regional cross-attention module to identify local differences between paired CXR images. Extensive experiments show the superiority of CoCa-CXR on both progression analysis and report generation compared to previous methods. Notably, on MS-CXR-T progression classification, CoCa-CXR obtains 65.0% average testing accuracy on five pulmonary conditions, outperforming the previous state-of-the-art (SOTA) model BioViL-T by 4.8%. It also achieves a RadGraph F1 of 24.2% on MIMIC-CXR, which is comparable to the Med-Gemini foundation model.
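The abstract reports an "average testing accuracy on five pulmonary conditions" for progression classification. As a minimal sketch of how such a figure could be computed, the snippet below takes per-example predictions, computes accuracy separately for each condition, and averages the per-condition values. Averaging per condition (rather than pooling all examples) is an assumption here, as are the function name `macro_average_accuracy`, the record layout, and the label strings; the toy records are invented for illustration and are not results from the paper.

```python
from collections import defaultdict


def macro_average_accuracy(records):
    """Return (per-condition accuracy, unweighted mean over conditions).

    `records` is an iterable of (condition, true_label, predicted_label)
    tuples. This layout is a hypothetical choice for illustration.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for condition, truth, pred in records:
        total[condition] += 1
        correct[condition] += int(truth == pred)
    if not total:
        raise ValueError("no records given")
    per_condition = {c: correct[c] / total[c] for c in total}
    mean_accuracy = sum(per_condition.values()) / len(per_condition)
    return per_condition, mean_accuracy


# Toy, made-up predictions (NOT data from the paper).
toy_records = [
    ("edema", "worsening", "worsening"),
    ("edema", "stable", "improving"),
    ("pneumonia", "stable", "stable"),
    ("pneumonia", "improving", "improving"),
]
per_condition, mean_accuracy = macro_average_accuracy(toy_records)
print(per_condition)
print(f"average accuracy: {mean_accuracy:.3f}")
```

Averaging per condition keeps a condition with many test pairs from dominating the headline number; if the paper pools examples instead, the two figures would differ whenever the condition sizes differ.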
Related papers
- RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining [48.21287619304126]
We propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities.
We construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for Chest X-rays and CTRATE-IR for CT scans.
We develop two retrieval systems, RadIR-CXR and RadIR-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks.
arXiv Detail & Related papers (2025-03-06T17:43:03Z)
- Activating Associative Disease-Aware Vision Token Memory for LLM-Based X-ray Report Generation [54.631356899598956]
We propose a novel associative memory-enhanced X-ray report generation model that effectively mimics the process of professional doctors writing medical reports.
We employ a visual Hopfield network to establish memory associations for disease-related tokens, and a report Hopfield network to retrieve report memory information.
arXiv Detail & Related papers (2025-01-07T01:19:48Z)
- EVOKE: Elevating Chest X-ray Report Generation via Multi-View Contrastive Learning and Patient-Specific Knowledge [21.596462896333733]
EVOKE is a novel chest X-ray report generation framework that incorporates multi-view contrastive learning and patient-specific knowledge.
We present a knowledge-guided report generation module that integrates available patient-specific indications.
Our proposed EVOKE surpasses recent state-of-the-art methods across multiple datasets.
arXiv Detail & Related papers (2024-11-15T14:38:13Z)
- TiBiX: Leveraging Temporal Information for Bidirectional X-ray and Report Generation [0.7381551917607596]
We propose TiBiX, which leverages temporal information for bidirectional X-ray and report generation.
arXiv Detail & Related papers (2024-03-20T07:00:03Z)
- WoLF: Wide-scope Large Language Model Framework for CXR Understanding [8.265578494822087]
We introduce WoLF, a Wide-scope Large Language Model Framework for Chest X-ray understanding.
We capture multi-faceted records of patients, which are utilized for accurate diagnoses in real-world clinical scenarios.
arXiv Detail & Related papers (2024-03-19T06:39:23Z)
- Fine-Grained Image-Text Alignment in Medical Imaging Enables Explainable Cyclic Image-Report Generation [91.63262242041695]
We propose a novel Adaptive patch-word Matching (AdaMatch) model to correlate chest X-ray (CXR) image regions with words in medical reports.
AdaMatch exploits the fine-grained relation between adaptive patches and words to provide explanations of specific image regions with corresponding words.
In order to provide explicit explainability for CXR-report generation task, we propose an AdaMatch-based bidirectional large language model for Cyclic CXR-report generation.
arXiv Detail & Related papers (2023-12-13T11:47:28Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
- Improving Classification Model Performance on Chest X-Rays through Lung Segmentation [63.45024974079371]
We propose a deep learning approach to enhance abnormal chest x-ray (CXR) identification performance through segmentations.
Our approach is designed in a cascaded manner and incorporates two modules: a deep neural network with criss-cross attention modules (XLSor) for localizing lung region in CXR images and a CXR classification model with a backbone of a self-supervised momentum contrast (MoCo) model pre-trained on large-scale CXR data sets.
arXiv Detail & Related papers (2022-02-22T15:24:06Z)
- Auxiliary Signal-Guided Knowledge Encoder-Decoder for Medical Report Generation [107.3538598876467]
We propose an Auxiliary Signal-Guided Knowledge Encoder-Decoder (ASGK) to mimic radiologists' working patterns.
ASGK integrates internal visual feature fusion and external medical linguistic information to guide medical knowledge transfer and learning.
arXiv Detail & Related papers (2020-06-06T01:00:15Z)
- Show, Describe and Conclude: On Exploiting the Structure Information of Chest X-Ray Reports [5.6070625920019825]
Chest X-Ray (CXR) images are commonly used for clinical screening and diagnosis.
The complex structures between and within sections of the reports pose a great challenge to automatic report generation.
We propose a novel framework that exploits the structure information between and within report sections for generating CXR imaging reports.
arXiv Detail & Related papers (2020-04-26T02:29:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.