Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions
- URL: http://arxiv.org/abs/2503.03278v1
- Date: Wed, 05 Mar 2025 09:02:33 GMT
- Title: Enhancing Abnormality Grounding for Vision Language Models with Knowledge Descriptions
- Authors: Jun Li, Che Liu, Wenjia Bai, Rossella Arcucci, Cosmin I. Bercea, Julia A. Schnabel
- Abstract summary: We introduce a novel approach to enhance VLM performance in medical abnormality detection and localization. We focus on breaking down medical concepts into fundamental attributes and common visual patterns. We evaluate our method on the 0.23B Florence-2 base model and demonstrate that it achieves comparable performance in abnormality grounding to significantly larger 7B LLaVA-based medical VLMs.
- Score: 11.503540826701807
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Language Models (VLMs) have demonstrated impressive capabilities in visual grounding tasks. However, their effectiveness in the medical domain, particularly for abnormality detection and localization within medical images, remains underexplored. A major challenge is the complex and abstract nature of medical terminology, which makes it difficult to directly associate pathological anomaly terms with their corresponding visual features. In this work, we introduce a novel approach to enhance VLM performance in medical abnormality detection and localization by leveraging decomposed medical knowledge. Instead of directly prompting models to recognize specific abnormalities, we focus on breaking down medical concepts into fundamental attributes and common visual patterns. This strategy promotes a stronger alignment between textual descriptions and visual features, improving both the recognition and localization of abnormalities in medical images. We evaluate our method on the 0.23B Florence-2 base model and demonstrate that it achieves comparable performance in abnormality grounding to significantly larger 7B LLaVA-based medical VLMs, despite being trained on only 1.5% of the data used for such models. Experimental results also demonstrate the effectiveness of our approach in both known and previously unseen abnormalities, suggesting its strong generalization capabilities.
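As a concrete illustration of the decomposition strategy, the following minimal sketch prompts the off-the-shelf Florence-2 base model with an attribute-level description instead of the raw abnormality term. The attribute text for "pneumothorax", the image path, and the use of the `<CAPTION_TO_PHRASE_GROUNDING>` task prompt (from the public Florence-2 model card) are illustrative assumptions, not the authors' released code or training setup:

```python
# Minimal sketch (assumed usage, following the public Florence-2 model card):
# ground a knowledge-decomposed description instead of the raw medical term.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Florence-2-base"  # the 0.23B base model
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True)

# Hypothetical decomposition of one abnormality into fundamental attributes
# and common visual patterns (illustrative text, not taken from the paper).
KNOWLEDGE = {
    "pneumothorax": "a dark, air-filled region without lung markings "
                    "along the chest wall, displacing the lung edge",
}
term = "pneumothorax"
prompt = "<CAPTION_TO_PHRASE_GROUNDING>" + KNOWLEDGE[term]

image = Image.open("chest_xray.png").convert("RGB")  # placeholder path
inputs = processor(text=prompt, images=image, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
    )
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
# Parse the output into phrases with bounding boxes over the description.
result = processor.post_process_generation(
    raw, task="<CAPTION_TO_PHRASE_GROUNDING>",
    image_size=(image.width, image.height),
)
print(result)
```

Because only the textual description changes per abnormality, the same grounding call would cover both known and unseen classes, which is consistent with the generalization claim in the abstract.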
Related papers
- Evaluating Visual Explanations of Attention Maps for Transformer-based Medical Imaging [2.6505619784178047]
We compare visual explanations of attention maps to other commonly used methods for medical imaging problems.
We find that attention maps show promise under certain conditions and generally surpass GradCAM in explainability.
Our findings indicate that the efficacy of attention maps as a method of interpretability is context-dependent and may be limited as they do not consistently provide the comprehensive insights required for robust medical decision-making.
arXiv Detail & Related papers (2025-03-12T16:52:52Z)
- Training Medical Large Vision-Language Models with Abnormal-Aware Feedback [57.98393950821579]
We propose UMed-LVLM, a novel model designed for unveiling medical abnormalities. We also propose a prompting method that uses GPT-4V to generate diagnoses based on identified abnormal areas in medical images. Experimental results demonstrate that our UMed-LVLM surpasses existing Med-LVLMs in identifying and understanding medical abnormalities.
arXiv Detail & Related papers (2025-01-02T17:37:20Z)
- Deep Generative Models Unveil Patterns in Medical Images Through Vision-Language Conditioning [3.4299097748670255]
Deep generative models have significantly advanced medical imaging analysis by enhancing dataset size and quality.
We employ a generative structure with hybrid conditions, combining clinical data and segmentation masks to guide the image synthesis process.
Our approach differs from, and poses a more challenging task than, traditional medical report-guided synthesis, since our clinical information has a weaker visual correlation with the images.
arXiv Detail & Related papers (2024-10-17T17:48:36Z)
- Potential of Multimodal Large Language Models for Data Mining of Medical Images and Free-text Reports [51.45762396192655]
Multimodal large language models (MLLMs) have recently transformed many domains, significantly affecting the medical field. Notably, Gemini-Vision-series (Gemini) and GPT-4-series (GPT-4) models have epitomized a paradigm shift in Artificial General Intelligence for computer vision.
This study exhaustively evaluated the performance of Gemini, GPT-4, and four other popular large models across 14 medical imaging datasets.
arXiv Detail & Related papers (2024-07-08T09:08:42Z)
- MediCLIP: Adapting CLIP for Few-shot Medical Image Anomaly Detection [6.812281925604158]
This paper first focuses on the task of medical image anomaly detection in the few-shot setting.
We propose an innovative approach, MediCLIP, which adapts the CLIP model to few-shot medical image anomaly detection through self-supervised fine-tuning.
arXiv Detail & Related papers (2024-05-18T15:24:58Z)
- Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images [68.42215385041114]
This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection.
Our approach integrates multiple residual adapters into the pre-trained visual encoder, enabling a stepwise enhancement of visual features across different levels (a hypothetical sketch of one such adapter follows this entry).
Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models.
arXiv Detail & Related papers (2024-03-19T09:28:19Z)
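For intuition, here is a hypothetical PyTorch sketch of the residual-adapter idea from the entry above; the module name, bottleneck width, and feature dimensions are assumptions, not the paper's implementation:

```python
# Hypothetical residual adapter attached to a frozen CLIP visual layer.
# All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Lightweight bottleneck whose output is added back to the input,
    so the frozen pre-trained features are refined, not replaced."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

# One adapter per selected encoder level; only the adapters are trained.
tokens = torch.randn(1, 197, 768)       # e.g. ViT-B/16 patch tokens
refined = ResidualAdapter(dim=768)(tokens)
```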
- Optimizing Skin Lesion Classification via Multimodal Data and Auxiliary Task Integration [54.76511683427566]
This research introduces a novel multimodal method for classifying skin lesions, integrating smartphone-captured images with essential clinical and demographic information.
A distinctive aspect of this method is the integration of an auxiliary task focused on super-resolution image prediction.
The experimental evaluations have been conducted using the PAD-UFES20 dataset, applying various deep-learning architectures.
arXiv Detail & Related papers (2024-02-16T05:16:20Z)
- Robust and Interpretable Medical Image Classifiers via Concept Bottleneck Models [49.95603725998561]
We propose a new paradigm to build robust and interpretable medical image classifiers with natural language concepts.
Specifically, we first query clinical concepts from GPT-4, then transform latent image features into explicit concepts with a vision-language model.
arXiv Detail & Related papers (2023-10-04T21:57:09Z)
- LVM-Med: Learning Large-Scale Self-Supervised Vision Models for Medical Imaging via Second-order Graph Matching [59.01894976615714]
We introduce LVM-Med, the first family of deep networks trained on large-scale medical datasets.
We have collected approximately 1.3 million medical images from 55 publicly available datasets.
LVM-Med empirically outperforms a number of state-of-the-art supervised, self-supervised, and foundation models.
arXiv Detail & Related papers (2023-06-20T22:21:34Z)
- DiffMIC: Dual-Guidance Diffusion Network for Medical Image Classification [32.67098520984195]
We propose the first diffusion-based model (named DiffMIC) to address general medical image classification.
Our experimental results demonstrate that DiffMIC outperforms state-of-the-art methods by a significant margin.
arXiv Detail & Related papers (2023-03-19T09:15:45Z)