Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models
- URL: http://arxiv.org/abs/2404.12920v4
- Date: Thu, 30 Jan 2025 16:31:27 GMT
- Title: Zero-Shot Medical Phrase Grounding with Off-the-shelf Diffusion Models
- Authors: Konstantinos Vilouras, Pedro Sanchez, Alison Q. O'Neil, Sotirios A. Tsaftaris
- Abstract summary: The task of performing localization with textual guidance is commonly referred to as phrase grounding. We use a publicly available Foundation Model, namely the Latent Diffusion Model, to perform this challenging task. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA on different types of pathology.
- Score: 12.264115733611058
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Localizing the exact pathological regions in a given medical scan is an important imaging problem that traditionally requires a large amount of bounding box ground truth annotations to be accurately solved. However, there exist alternative, potentially weaker, forms of supervision, such as accompanying free-text reports, which are readily available. The task of performing localization with textual guidance is commonly referred to as phrase grounding. In this work, we use a publicly available Foundation Model, namely the Latent Diffusion Model, to perform this challenging task. This choice is supported by the fact that the Latent Diffusion Model, despite being generative in nature, contains cross-attention mechanisms that implicitly align visual and textual features, thus leading to intermediate representations that are suitable for the task at hand. In addition, we aim to perform this task in a zero-shot manner, i.e., without any training on the target task, meaning that the model's weights remain frozen. To this end, we devise strategies to select features and also refine them via post-processing without extra learnable parameters. We compare our proposed method with state-of-the-art approaches which explicitly enforce image-text alignment in a joint embedding space via contrastive learning. Results on a popular chest X-ray benchmark indicate that our method is competitive with SOTA on different types of pathology, and even outperforms them on average in terms of two metrics (mean IoU and AUC-ROC). Source code will be released upon acceptance at https://github.com/vios-s.
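To make the idea above concrete, here is a minimal, self-contained sketch of the post-processing side of such a pipeline: given cross-attention maps already collected from a frozen latent diffusion model (e.g., via attention hooks), it pools the attention assigned to the query phrase's tokens into a spatial heatmap. The tensor shapes, the `phrase_heatmap` helper, and the aggregation choices are illustrative assumptions made for this summary, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def phrase_heatmap(cross_attn, token_idx, image_size=(224, 224)):
    """Aggregate cross-attention maps into a phrase-level grounding heatmap.

    cross_attn: tensor of shape (layers, heads, H*W, num_tokens) collected
                from a frozen latent diffusion model (shapes are illustrative).
    token_idx:  indices of the text tokens belonging to the query phrase.
    """
    layers, heads, hw, _ = cross_attn.shape
    side = int(hw ** 0.5)  # spatial resolution of the attention layer

    # Keep only the attention assigned to the phrase tokens, then average
    # over those tokens, the attention heads, and the selected layers.
    maps = cross_attn[..., token_idx].mean(dim=-1)      # (layers, heads, H*W)
    maps = maps.mean(dim=(0, 1)).reshape(1, 1, side, side)

    # Upsample to image resolution and min-max normalize to [0, 1].
    heat = F.interpolate(maps, size=image_size, mode="bilinear", align_corners=False)
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-8)
    return heat.squeeze()                               # (H, W) grounding heatmap

if __name__ == "__main__":
    # Dummy attention maps: 4 layers, 8 heads, 16x16 spatial positions, 77 text tokens.
    attn = torch.rand(4, 8, 16 * 16, 77)
    hmap = phrase_heatmap(attn, token_idx=[5, 6, 7])    # tokens of e.g. "left pleural effusion"
    print(hmap.shape, float(hmap.max()))
```

Thresholding such a heatmap (or taking its largest connected component) would yield a box for computing mean IoU or AUC-ROC against ground-truth annotations; the actual feature-selection and refinement strategies are those described in the paper itself.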
Related papers
- Generate to Ground: Multimodal Text Conditioning Boosts Phrase Grounding in Medical Vision-Language Models [6.408114351192012]
We show that generative text-to-image diffusion models can achieve superior zero-shot phrase grounding performance. Results establish generative approaches as a more effective paradigm for phrase grounding in the medical imaging domain.
arXiv Detail & Related papers (2025-07-16T13:48:32Z) - MedGround-R1: Advancing Medical Image Grounding via Spatial-Semantic Rewarded Group Relative Policy Optimization [19.70803794316208]
Medical Image Grounding (MIG) involves localizing specific regions in medical images based on textual descriptions. Existing Vision-Language Models (VLMs) for MIG often rely on Supervised Fine-Tuning (SFT) with large amounts of Chain-of-Thought (CoT) reasoning annotations. We propose the Spatial-Semantic Rewarded Group Relative Policy Optimization to train the model without CoT reasoning annotations.
arXiv Detail & Related papers (2025-07-01T21:51:42Z) - Anatomy-Grounded Weakly Supervised Prompt Tuning for Chest X-ray Latent Diffusion Models [8.94567513238762]
We show that a standard text-conditioned Latent Diffusion Model has not learned to align clinically relevant information in free-text radiology reports with the corresponding areas of the given scan. We propose a fine-tuning framework to improve multi-modal alignment in a pre-trained model such that it can be efficiently repurposed for downstream tasks such as phrase grounding.
arXiv Detail & Related papers (2025-06-12T12:19:18Z) - PLADIS: Pushing the Limits of Attention in Diffusion Models at Inference Time by Leveraging Sparsity [9.092404060771306]
Diffusion models have shown impressive results in generating high-quality conditional samples.
However, existing methods often require additional training or neural function evaluations (NFEs).
We propose a novel and efficient method, termed PLADIS, which boosts pre-trained models by leveraging sparse attention.
arXiv Detail & Related papers (2025-03-10T07:23:19Z) - Mediffusion: Joint Diffusion for Self-Explainable Semi-Supervised Classification and Medical Image Generation [3.046689922445082]
We introduce Mediffusion -- a new method for semi-supervised learning with explainable classification based on a joint diffusion model.
We show that our Mediffusion achieves results comparable to recent semi-supervised methods while providing more reliable and precise explanations.
arXiv Detail & Related papers (2024-11-12T23:14:36Z) - Unleashing the Potential of the Diffusion Model in Few-shot Semantic Segmentation [56.87049651707208]
Few-shot Semantic Segmentation has evolved into an in-context task, becoming a crucial element in assessing generalist segmentation models.
Our initial focus is on facilitating interaction between the query image and the support image, which leads us to propose a KV fusion method within the self-attention framework.
Based on our analysis, we establish a simple and effective framework named DiffewS, maximally retaining the original Latent Diffusion Model's generative framework.
arXiv Detail & Related papers (2024-10-03T10:33:49Z) - Information Theoretic Text-to-Image Alignment [49.396917351264655]
We present a novel method that relies on an information-theoretic alignment measure to steer image generation.
Our method is on par with or superior to the state-of-the-art, yet requires nothing but a pre-trained denoising network to estimate MI.
arXiv Detail & Related papers (2024-05-31T12:20:02Z) - Uncertainty-aware Medical Diagnostic Phrase Identification and Grounding [72.18719355481052]
We introduce a novel task called Medical Report Grounding (MRG). MRG aims to directly identify diagnostic phrases and their corresponding grounding boxes from medical reports in an end-to-end manner. We propose uMedGround, a robust and reliable framework that leverages a multimodal large language model to predict diagnostic phrases.
arXiv Detail & Related papers (2024-04-10T07:41:35Z) - Diffusion based Zero-shot Medical Image-to-Image Translation for Cross Modality Segmentation [18.895926089773177]
Cross-modality image segmentation aims to segment the target modalities using a method designed in the source modality.
Deep generative models can translate the target modality images into the source modality, thus enabling cross-modality segmentation.
arXiv Detail & Related papers (2024-04-01T13:23:04Z) - FreeSeg-Diff: Training-Free Open-Vocabulary Segmentation with Diffusion Models [56.71672127740099]
We focus on the task of image segmentation, which is traditionally solved by training models on closed-vocabulary datasets.
We leverage different and relatively small-sized, open-source foundation models for zero-shot open-vocabulary segmentation.
Our approach (dubbed FreeSeg-Diff), which does not rely on any training, outperforms many training-based approaches on both Pascal VOC and COCO datasets.
arXiv Detail & Related papers (2024-03-29T10:38:25Z) - Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS).
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z) - CrossEAI: Using Explainable AI to generate better bounding boxes for Chest X-ray images [0.0]
In medical imaging diagnosis, disease classification usually achieves high accuracy, but the generated bounding boxes have a much lower Intersection over Union (IoU).
Previous work shows that bounding boxes generated by these methods are usually larger than the ground truth and contain large non-disease areas.
This paper utilizes the advantages of post-hoc AI explainable methods to generate bounding boxes for chest x-ray image diagnosis.
arXiv Detail & Related papers (2023-10-29T17:48:39Z) - R&B: Region and Boundary Aware Zero-shot Grounded Text-to-image Generation [74.5598315066249]
We probe into zero-shot grounded T2I generation with diffusion models.
We propose a Region and Boundary (R&B) aware cross-attention guidance approach.
arXiv Detail & Related papers (2023-10-13T05:48:42Z) - Introducing Shape Prior Module in Diffusion Model for Medical Image Segmentation [7.7545714516743045]
We propose an end-to-end framework called VerseDiff-UNet, which leverages the denoising diffusion probabilistic model (DDPM).
Our approach integrates the diffusion model into a standard U-shaped architecture.
We evaluate our method on a single dataset of spine images acquired through X-ray imaging.
arXiv Detail & Related papers (2023-09-12T03:05:00Z) - Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption [73.98706049140098]
We propose a novel phasic content fusing few-shot diffusion model with directional distribution consistency loss.
Specifically, we design a phasic training strategy with phasic content fusion to help our model learn content and style information when the diffusion timestep t is large.
Finally, we propose a cross-domain structure guidance strategy that enhances structure consistency during domain adaptation.
arXiv Detail & Related papers (2023-09-07T14:14:11Z) - Distill-SODA: Distilling Self-Supervised Vision Transformer for Source-Free Open-Set Domain Adaptation in Computational Pathology [12.828728138651266]
Developing computational pathology models is essential for reducing the burden of manual tissue typing from whole slide images.
We address the key challenges of this task in one fell swoop by considering a practical setting, i.e., source-free open-set domain adaptation.
Our methodology focuses on adapting a pre-trained source model to an unlabeled target dataset.
arXiv Detail & Related papers (2023-07-10T14:36:51Z) - Discffusion: Discriminative Diffusion Models as Few-shot Vision and Language Learners [88.07317175639226]
We propose a novel approach, Discriminative Stable Diffusion (DSD), which turns pre-trained text-to-image diffusion models into few-shot discriminative learners.
Our approach mainly uses the cross-attention score of a Stable Diffusion model to capture the mutual influence between visual and textual information.
arXiv Detail & Related papers (2023-05-18T05:41:36Z) - Text-to-Image Diffusion Models are Zero-Shot Classifiers [8.26990105697146]
We investigate text-to-image diffusion models by proposing a method for evaluating them as zero-shot classifiers.
We apply our method to Stable Diffusion and Imagen, using it to probe fine-grained aspects of the models' knowledge.
They perform competitively with CLIP on a wide range of zero-shot image classification datasets; a rough sketch of this classification-by-denoising idea follows after this list.
arXiv Detail & Related papers (2023-03-27T14:15:17Z)
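As a companion to the last entry above (diffusion models as zero-shot classifiers), the following sketch illustrates the classification-by-denoising idea under stated assumptions: noise an image, ask the frozen model to predict that noise once per candidate class prompt, and pick the prompt with the lowest reconstruction error. The model id, prompts, timestep, and number of noise samples are illustrative choices, not the exact protocol of that paper or of the phrase-grounding work at the top of this page.

```python
import torch
from diffusers import StableDiffusionPipeline

# Minimal sketch of scoring class prompts by denoising error (illustrative only).
# Assumes a 512x512 RGB image tensor scaled to [-1, 1].
device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)

@torch.no_grad()
def prompt_score(image, prompt, timestep=500, n_samples=4):
    """Average denoising error for one candidate prompt (lower = better match)."""
    # Encode the image into the VAE latent space used by the diffusion model.
    latents = pipe.vae.encode(image.unsqueeze(0).to(device)).latent_dist.mean
    latents = latents * pipe.vae.config.scaling_factor

    # Encode the candidate class prompt with the frozen text encoder.
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt")
    text_emb = pipe.text_encoder(tokens.input_ids.to(device))[0]

    errors = []
    for _ in range(n_samples):
        # Corrupt the latents at a fixed timestep and predict the added noise.
        noise = torch.randn_like(latents)
        t = torch.tensor([timestep], device=device)
        noisy = pipe.scheduler.add_noise(latents, noise, t)
        pred = pipe.unet(noisy, t, encoder_hidden_states=text_emb).sample
        errors.append(((pred - noise) ** 2).mean())
    return torch.stack(errors).mean().item()

# Usage: pick the class whose prompt explains the noised image best.
# image = ...  # (3, 512, 512) tensor in [-1, 1]
# prompts = ["a photo of a cat", "a photo of a dog"]
# scores = {p: prompt_score(image, p) for p in prompts}
# prediction = min(scores, key=scores.get)
```

The same frozen-weights philosophy underlies the zero-shot grounding method summarized at the top of this page: the generative model is never fine-tuned, only queried.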