Anatomy-Aware Conditional Image-Text Retrieval
- URL: http://arxiv.org/abs/2503.07456v1
- Date: Mon, 10 Mar 2025 15:36:49 GMT
- Title: Anatomy-Aware Conditional Image-Text Retrieval
- Authors: Meng Zheng, Jiajin Zhang, Benjamin Planche, Zhongpai Gao, Terrence Chen, Ziyan Wu
- Abstract summary: Image-Text Retrieval (ITR) finds broad applications in healthcare, aiding clinicians and radiologists by automatically retrieving relevant patient cases. We propose an Anatomical Location-Conditioned Image-Text Retrieval framework, which aims to retrieve similar patient cases in the same anatomical region. We show that our proposed RRA-VL achieves state-of-the-art localization performance in phrase-grounding tasks.
- Score: 29.872292146073207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-Text Retrieval (ITR) finds broad applications in healthcare, aiding clinicians and radiologists by automatically retrieving relevant patient cases from a database given a query image and/or report, enabling more efficient clinical diagnosis and treatment, especially for rare diseases. However, conventional ITR systems typically rely only on global image or text representations to measure patient image/report similarities, overlooking local distinctiveness across patient cases and often yielding suboptimal retrieval performance. In this paper, we propose an Anatomical Location-Conditioned Image-Text Retrieval (ALC-ITR) framework which, given a query image and the associated suspicious anatomical region(s), aims to retrieve similar patient cases exhibiting the same disease or symptoms in the same anatomical region. To perform location-conditioned multimodal retrieval, we learn a medical Relevance-Region-Aligned Vision Language (RRA-VL) model with semantic global-level and region-/word-level alignment to produce generalizable, well-aligned multi-modal representations. Additionally, we perform location-conditioned contrastive learning to further exploit cross-pair region-level contrastiveness for improved multi-modal retrieval. We show that the proposed RRA-VL achieves state-of-the-art localization performance on phrase-grounding tasks and satisfactory multi-modal retrieval performance with or without location conditioning. Finally, we thoroughly investigate the generalizability and explainability of the proposed ALC-ITR system in providing explanations and preliminary diagnosis reports from retrieved patient cases (conditioned on anatomical regions), using appropriate off-the-shelf LLM prompts.
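The abstract only names the training objectives; the PyTorch sketch below merely illustrates one way location conditioning could enter a contrastive image-report objective, by pooling patch features inside the queried anatomical region before a symmetric InfoNCE loss. The function names, masking scheme, and temperature are illustrative assumptions, not the authors' RRA-VL implementation.

```python
# Illustrative sketch (not the authors' code) of a region-conditioned
# image-report contrastive objective.
import torch
import torch.nn.functional as F

def pool_region(patch_feats, region_mask):
    """patch_feats: (N, P, D) patch/token features from an image encoder.
    region_mask: (N, P) float mask, 1 inside the queried anatomical region.
    Returns (N, D) region-conditioned image embeddings."""
    weights = region_mask / region_mask.sum(dim=1, keepdim=True).clamp(min=1.0)
    return torch.einsum('np,npd->nd', weights, patch_feats)

def region_conditioned_infonce(patch_feats, region_mask, report_emb, temperature=0.07):
    """Symmetric InfoNCE between region-pooled image features and report embeddings;
    matched image-report pairs sit on the diagonal of the similarity matrix."""
    img = F.normalize(pool_region(patch_feats, region_mask), dim=-1)
    txt = F.normalize(report_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    # NOTE: the paper's additional region-/word-level alignment and cross-pair
    # region-level contrastive terms are NOT reproduced in this toy sketch.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```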
Related papers
- RL4Med-DDPO: Reinforcement Learning for Controlled Guidance Towards Diverse Medical Image Generation using Vision-Language Foundation Models [0.7165255458140439]
Vision-Language Foundation Models (VLFM) have shown tremendous gains in generating high-resolution, photorealistic natural images.
We propose a multi-stage architecture where a pre-trained VLFM provides a cursory semantic understanding, while a reinforcement learning algorithm refines the alignment through an iterative process.
We demonstrate the effectiveness of our method on a medical skin-imaging dataset, where the generated images show improved quality and prompt alignment over a fine-tuned Stable Diffusion baseline.
arXiv Detail & Related papers (2025-03-20T01:51:05Z)
- RadIR: A Scalable Framework for Multi-Grained Medical Image Retrieval via Radiology Report Mining [48.21287619304126]
We propose a novel methodology that leverages dense radiology reports to define image-wise similarity ordering at multiple granularities.
We construct two comprehensive medical imaging retrieval datasets: MIMIC-IR for Chest X-rays and CTRATE-IR for CT scans.
We develop two retrieval systems, RadIR-CXR and model-ChestCT, which demonstrate superior performance in traditional image-image and image-report retrieval tasks.
arXiv Detail & Related papers (2025-03-06T17:43:03Z)
- SeCo-INR: Semantically Conditioned Implicit Neural Representations for Improved Medical Image Super-Resolution [25.078280843551322]
Implicit Neural Representations (INRs) have recently advanced the field of deep learning due to their ability to learn continuous representations of signals.
We propose a novel framework, referred to as the Semantically Conditioned INR (SeCo-INR), that conditions an INR using local priors from a medical image.
Our framework learns a continuous representation of the semantic segmentation features of a medical image and utilizes it to derive the optimal INR for each semantic region of the image.
arXiv Detail & Related papers (2024-09-02T07:45:06Z)
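As a rough illustration of conditioning an implicit neural representation on local semantic priors (an assumption-laden toy, not the SeCo-INR architecture), a coordinate MLP can take the coordinate together with a learned embedding of the segmentation label at that location:

```python
# Toy coordinate MLP conditioned on a semantic segmentation label per location.
import torch
import torch.nn as nn

class SemanticallyConditionedINR(nn.Module):
    """Input: a spatial coordinate concatenated with a learned embedding of the
    semantic (segmentation) label at that location; output: image intensity."""

    def __init__(self, num_regions, coord_dim=2, embed_dim=16, hidden=256):
        super().__init__()
        self.region_embed = nn.Embedding(num_regions, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(coord_dim + embed_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # predicted intensity at the queried coordinate
        )

    def forward(self, coords, region_labels):
        # coords: (N, coord_dim) normalized coordinates; region_labels: (N,) int labels
        cond = self.region_embed(region_labels)
        return self.mlp(torch.cat([coords, cond], dim=-1))
```

Such a model would be fit by minimizing reconstruction error at observed low-resolution locations and then queried on a denser coordinate grid to produce the super-resolved image.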
- Applying Conditional Generative Adversarial Networks for Imaging Diagnosis [3.881664394416534]
This study introduces an innovative application of Conditional Generative Adversarial Networks (C-GAN) integrated with Stacked Hourglass Networks (SHGN).
We address the problem of overfitting, common in deep learning models applied to complex imaging datasets, by augmenting data through rotation and scaling.
A hybrid loss function combining L1 and L2 reconstruction losses, enriched with adversarial training, is introduced to refine segmentation processes in intravascular ultrasound (IVUS) imaging.
arXiv Detail & Related papers (2024-07-17T23:23:09Z)
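The hybrid loss is only named above; the snippet below is a generic sketch of how an L1 + L2 reconstruction loss is commonly combined with an adversarial term for a generator. The weights and tensor names are placeholders, not values from the paper.

```python
# Generic composite generator loss: L1 + L2 reconstruction plus an adversarial term.
import torch
import torch.nn.functional as F

def hybrid_generator_loss(pred, target, disc_logits_fake, w_l1=1.0, w_l2=1.0, w_adv=0.01):
    """pred/target: generator output and ground truth (e.g. IVUS segmentation maps);
    disc_logits_fake: discriminator logits on the generator output."""
    l1 = F.l1_loss(pred, target)
    l2 = F.mse_loss(pred, target)
    # Non-saturating adversarial term: generator is rewarded when the
    # discriminator scores its output as "real" (label 1).
    adv = F.binary_cross_entropy_with_logits(
        disc_logits_fake, torch.ones_like(disc_logits_fake))
    return w_l1 * l1 + w_l2 * l2 + w_adv * adv
```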
- Multi-modality Regional Alignment Network for Covid X-Ray Survival Prediction and Report Generation [36.343753593390254]
This study proposes the Multi-modality Regional Alignment Network (MRANet), an explainable model for radiology report generation and survival prediction.
MRANet visually grounds region-specific descriptions, providing robust anatomical regions with a completion strategy.
A cross-LLM alignment is employed to enhance the image-to-text transfer process, resulting in sentences rich in clinical detail and improved explainability for radiologists.
arXiv Detail & Related papers (2024-05-23T02:41:08Z)
- Eye-gaze Guided Multi-modal Alignment for Medical Representation Learning [65.54680361074882]
The Eye-gaze Guided Multi-modal Alignment (EGMA) framework harnesses eye-gaze data for better alignment of medical visual and textual features.
We conduct downstream tasks of image classification and image-text retrieval on four medical datasets.
arXiv Detail & Related papers (2024-03-19T03:59:14Z)
- Anatomical Structure-Guided Medical Vision-Language Pre-training [21.68719061251635]
We propose an Anatomical Structure-Guided (ASG) framework for learning medical visual representations.
For anatomical region, we design an automatic anatomical region-sentence alignment paradigm in collaboration with radiologists.
For finding and existence, we regard them as image tags, applying an image-tag recognition decoder to associate image features with their respective tags within each sample.
arXiv Detail & Related papers (2024-03-14T11:29:47Z)
- Radiology Report Generation Using Transformers Conditioned with Non-imaging Data [55.17268696112258]
This paper proposes a novel multi-modal transformer network that integrates chest x-ray (CXR) images and associated patient demographic information.
The proposed network uses a convolutional neural network to extract visual features from CXRs and a transformer-based encoder-decoder network that combines the visual features with semantic text embeddings of patient demographic information.
arXiv Detail & Related papers (2023-11-18T14:52:26Z)
- Implicit Neural Representation in Medical Imaging: A Comparative Survey [3.478921293603811]
Implicit neural representations (INRs) have gained prominence as a powerful paradigm in scene reconstruction and computer graphics.
This survey aims to provide a comprehensive overview of INR models in the field of medical imaging.
arXiv Detail & Related papers (2023-07-30T06:39:25Z)
- Multi-task Paired Masking with Alignment Modeling for Medical Vision-Language Pre-training [55.56609500764344]
We propose a unified framework based on Multi-task Paired Masking with Alignment (MPMA) to integrate the cross-modal alignment task into the joint image-text reconstruction framework.
We also introduce a Memory-Augmented Cross-Modal Fusion (MA-CMF) module to fully integrate visual information to assist report reconstruction.
arXiv Detail & Related papers (2023-05-13T13:53:48Z)
- Region-based Contrastive Pretraining for Medical Image Retrieval with Anatomic Query [56.54255735943497]
We introduce a novel Region-based contrastive pretraining framework for Medical Image Retrieval (RegionMIR).
arXiv Detail & Related papers (2023-05-09T16:46:33Z)
- Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [70.04389979779195]
This paper explores training medical vision-language models (VLMs) where the visual and language inputs are embedded into a common space.
We explore several candidate methods to improve low-data performance, including adapting generic pre-trained models to novel image and text domains.
Using text-to-image retrieval as a benchmark, we evaluate the performance of these methods with variable-sized training datasets of paired chest X-rays and radiological reports.
arXiv Detail & Related papers (2023-03-30T18:20:00Z)
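Text-to-image retrieval benchmarks of this kind are typically scored with Recall@K over a text-image similarity matrix; the snippet below is a generic evaluation sketch and is not taken from the paper's code.

```python
# Generic Recall@K for text-to-image retrieval over paired embeddings.
import torch
import torch.nn.functional as F

def recall_at_k(text_emb, image_emb, k=5):
    """text_emb, image_emb: (N, D) paired embeddings, pair i on row i of both.
    Returns the fraction of reports whose paired image is among the k nearest images."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    sims = text_emb @ image_emb.t()                      # (N, N) cosine similarities
    topk = sims.topk(k, dim=1).indices                   # indices of the k best images per report
    targets = torch.arange(text_emb.size(0), device=text_emb.device).unsqueeze(1)
    return (topk == targets).any(dim=1).float().mean().item()
```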
- Few-shot Medical Image Segmentation using a Global Correlation Network with Discriminative Embedding [60.89561661441736]
We propose a novel method for few-shot medical image segmentation.
We construct our few-shot image segmentor using a deep convolutional network trained episodically.
We enhance the discriminability of the deep embedding to encourage clustering of the feature domains of the same class.
arXiv Detail & Related papers (2020-12-10T04:01:07Z)
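A common way to realize the same-class clustering idea mentioned above is prototype matching: support features are pooled under the support mask into a class prototype, and query pixels are scored by cosine similarity to it. The sketch below shows that generic mechanism only; it is not the paper's global correlation network.

```python
# Illustrative prototype-matching step for few-shot segmentation.
import torch
import torch.nn.functional as F

def masked_average_pool(support_feats, support_mask):
    """support_feats: (B, C, H, W) deep embeddings of the support images;
    support_mask: (B, 1, h, w) binary foreground mask (resized to feature size here).
    Returns a (C,) foreground prototype."""
    mask = F.interpolate(support_mask.float(), size=support_feats.shape[-2:], mode='nearest')
    return (support_feats * mask).sum(dim=(0, 2, 3)) / mask.sum().clamp(min=1.0)

def query_foreground_score(query_feats, prototype):
    """Cosine-similarity map between query embeddings and the class prototype;
    higher values indicate likely foreground pixels."""
    q = F.normalize(query_feats, dim=1)                      # (B, C, H, W)
    p = F.normalize(prototype, dim=0).view(1, -1, 1, 1)      # (1, C, 1, 1)
    return (q * p).sum(dim=1)                                # (B, H, W)
```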
This list is automatically generated from the titles and abstracts of the papers on this site.