Related papers: Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging

Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging

URL: http://arxiv.org/abs/2503.23618v1
Date: Sun, 30 Mar 2025 22:49:26 GMT
Title: Leveraging Vision-Language Foundation Models to Reveal Hidden Image-Attribute Relationships in Medical Imaging
Authors: Amar Kumar, Anita Kriz, Barak Pertzov, Tal Arbel,
Abstract summary: Vision-language foundation models (VLMs) have shown impressive performance in guiding image generation through text.<n>In this work, we are the first to investigate the question: 'Can fine-tuned foundation models help identify critical, and possibly unknown, data properties?'
Score: 0.768721532845575
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Vision-language foundation models (VLMs) have shown impressive performance in guiding image generation through text, with emerging applications in medical imaging. In this work, we are the first to investigate the question: 'Can fine-tuned foundation models help identify critical, and possibly unknown, data properties?' By evaluating our proposed method on a chest x-ray dataset, we show that these models can generate high-resolution, precisely edited images compared to methods that rely on Structural Causal Models (SCMs) according to numerous metrics. For the first time, we demonstrate that fine-tuned VLMs can reveal hidden data relationships that were previously obscured due to available metadata granularity and model capacity limitations. Our experiments demonstrate both the potential of these models to reveal underlying dataset properties while also exposing the limitations of fine-tuned VLMs for accurate image editing and susceptibility to biases and spurious correlations.

Related papers

ForenX: Towards Explainable AI-Generated Image Detection with Multimodal Large Language Models [82.04858317800097]
We present ForenX, a novel method that not only identifies the authenticity of images but also provides explanations that resonate with human thoughts.<n>ForenX employs the powerful multimodal large language models (MLLMs) to analyze and interpret forensic cues.<n>We introduce ForgReason, a dataset dedicated to descriptions of forgery evidences in AI-generated images.
arXiv Detail & Related papers (2025-08-02T15:21:26Z)
PRETI: Patient-Aware Retinal Foundation Model via Metadata-Guided Representation Learning [3.771396977579353]
PRETI is a retinal foundation model that integrates metadata-aware learning with robust self-supervised representation learning.<n>We construct patient-level data pairs, associating images from the same individual to improve robustness against non-clinical variations.<n>Experiments demonstrate PRETI achieves state-of-the-art results across diverse diseases and biomarker predictions.
arXiv Detail & Related papers (2025-05-18T04:59:03Z)
Mapping the Mind of an Instruction-based Image Editing using SMILE [8.773288793688998]
We introduce SMILE (Statistical Model-agnostic Interpretability with Local Explanations), a novel model-agnostic for localized interpretability. We show how our model can improve interpretability and reliability. These findings indicate the exciting potential of model-agnostic interpretability for reliability and trustworthiness in critical applications.
arXiv Detail & Related papers (2024-12-20T18:33:23Z)
Image Regeneration: Evaluating Text-to-Image Model via Generating Identical Image with Multimodal Large Language Models [54.052963634384945]
We introduce the Image Regeneration task to assess text-to-image models. We use GPT4V to bridge the gap between the reference image and the text input for the T2I model. We also present ImageRepainter framework to enhance the quality of generated images.
arXiv Detail & Related papers (2024-11-14T13:52:43Z)
Exploring Foundation Models for Synthetic Medical Imaging: A Study on Chest X-Rays and Fine-Tuning Techniques [0.49000940389224884]
Machine learning has significantly advanced healthcare by aiding in disease prevention and treatment identification. However, accessing patient data can be challenging due to privacy concerns and strict regulations. Recent studies suggest that fine-tuning foundation models can produce such data effectively.
arXiv Detail & Related papers (2024-09-06T17:36:08Z)
ARMADA: Attribute-Based Multimodal Data Augmentation [93.05614922383822]
Attribute-based Multimodal Data Augmentation (ARMADA) is a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes. ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation. This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
arXiv Detail & Related papers (2024-08-19T15:27:25Z)
Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using language-guided generated counterfactual images. We identify model weaknesses by testing the model using the counterfactual image dataset. We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
Sequential Modeling Enables Scalable Learning for Large Vision Models [120.91839619284431]
We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. We define a common format, "visual sentences", in which we can represent raw images and videos as well as annotated data sources.
arXiv Detail & Related papers (2023-12-01T18:59:57Z)
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models. Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
Evaluating Data Attribution for Text-to-Image Models [62.844382063780365]
We evaluate attribution through "customization" methods, which tune an existing large-scale model toward a given exemplar object or style. Our key insight is that this allows us to efficiently create synthetic images that are computationally influenced by the exemplar by construction. By taking into account the inherent uncertainty of the problem, we can assign soft attribution scores over a set of training images.
arXiv Detail & Related papers (2023-06-15T17:59:51Z)
Controllable Image Generation via Collage Representations [31.456445433105415]
"Mixing and matching scenes" (M&Ms) is an approach that consists of an adversarially trained generative image model conditioned on appearance features and spatial positions of the different elements in a collage. We show that M&Ms outperforms baselines in terms of fine-grained scene controllability while being very competitive in terms of image quality and sample diversity.
arXiv Detail & Related papers (2023-04-26T17:58:39Z)
Diagnosing and Rectifying Vision Models using Language [31.588965563961573]
Recent contrastive learning models have demonstrated the ability to learn an embedding space suitable for building strong vision classifiers. Our work highlights a distinct advantage of this multi-modal embedding space: the ability to diagnose vision classifiers through natural language. Our proposed method can discover high-error data slices, identify influential attributes and further rectify undesirable model behaviors.
arXiv Detail & Related papers (2023-02-08T18:59:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.