DILLEMA: Diffusion and Large Language Models for Multi-Modal Augmentation
- URL: http://arxiv.org/abs/2502.04378v1
- Date: Wed, 05 Feb 2025 16:35:42 GMT
- Title: DILLEMA: Diffusion and Large Language Models for Multi-Modal Augmentation
- Authors: Luciano Baresi, Davide Yi Xian Hu, Muhammad Irfan Mas'udi, Giovanni Quattrocchi
- Abstract summary: We present a novel framework for testing vision neural networks that leverages Large Language Models and control-conditioned Diffusion Models.
Our approach begins by translating images into detailed textual descriptions using a captioning model.
These descriptions are then used to produce new test images through a text-to-image diffusion process.
- Abstract: Ensuring the robustness of deep learning models requires comprehensive and diverse testing. Existing approaches, often based on simple data augmentation techniques or generative adversarial networks, are limited in producing realistic and varied test cases. To address these limitations, we present a novel framework for testing vision neural networks that leverages Large Language Models and control-conditioned Diffusion Models to generate synthetic, high-fidelity test cases. Our approach begins by translating images into detailed textual descriptions using a captioning model, allowing the language model to identify modifiable aspects of the image and generate counterfactual descriptions. These descriptions are then used to produce new test images through a text-to-image diffusion process that preserves spatial consistency and maintains the critical elements of the scene. We demonstrate the effectiveness of our method using two datasets: ImageNet1K for image classification and SHIFT for semantic segmentation in autonomous driving. The results show that our approach can generate significant test cases that reveal weaknesses and improve the robustness of the model through targeted retraining. We conducted a human assessment using Mechanical Turk to validate the generated images. The responses from the participants confirmed, with high agreement among the voters, that our approach produces valid and realistic images.
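As a rough illustration of the first two stages (image-to-text, then counterfactual rewriting), the sketch below uses BLIP as the captioning model and leaves the language-model call behind a placeholder; the checkpoints, the prompt template, and the `ask_llm` helper are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch of stages 1-2 of a DILLEMA-style pipeline.
# Assumptions: BLIP as the captioner; `ask_llm` is a hypothetical
# wrapper around any instruction-tuned chat model.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

def describe(image: Image.Image) -> str:
    """Stage 1: translate the image into a detailed textual description."""
    inputs = processor(image, return_tensors="pt")
    out = captioner.generate(**inputs, max_new_tokens=60)
    return processor.decode(out[0], skip_special_tokens=True)

# Stage 2: have an LLM alter one modifiable, task-irrelevant aspect of
# the scene (weather, lighting, season, ...) while preserving its objects.
COUNTERFACTUAL_PROMPT = (
    "Rewrite the following image description, changing exactly one "
    "contextual aspect (e.g. weather, time of day, season) while keeping "
    "every object and its position unchanged:\n{caption}"
)

def counterfactual(caption: str, ask_llm) -> str:
    """ask_llm: hypothetical callable str -> str wrapping a chat model."""
    return ask_llm(COUNTERFACTUAL_PROMPT.format(caption=caption))
```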
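The third stage regenerates the image from the counterfactual description while pinning down the spatial layout. A minimal sketch, assuming a Canny-edge ControlNet as the control condition (the abstract only says "control-conditioned diffusion"; the checkpoints and thresholds here are placeholders):

```python
# Minimal sketch of stage 3: spatially consistent text-to-image synthesis.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

def generate_test_image(original: Image.Image, caption: str) -> Image.Image:
    """Condition the diffusion process on the original's edge map so the
    scene structure survives while the caption drives the new context."""
    edges = cv2.Canny(np.array(original), 100, 200)            # H x W edge map
    control = Image.fromarray(np.stack([edges] * 3, axis=-1))  # 3-channel
    return pipe(caption, image=control, num_inference_steps=30).images[0]
```

Because the critical scene elements are preserved, the original label is expected to carry over to the generated image, so failures on the counterfactual set point to robustness gaps rather than label changes.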
Related papers
- Beyond-Labels: Advancing Open-Vocabulary Segmentation With Vision-Language Models [0.0]
Self-supervised learning can solve numerous image and language processing problems when effectively trained.
This study investigates simple yet efficient methods for adapting pretrained foundation models to semantic segmentation tasks.
It proposes "Beyond-Labels," a lightweight transformer-based fusion module that uses a small amount of image segmentation data to fuse frozen image representations with language concepts.
arXiv Detail & Related papers (2025-01-28T07:49:52Z)
- FUSE-ing Language Models: Zero-Shot Adapter Discovery for Prompt Optimization Across Tokenizers [55.2480439325792]
We propose FUSE, an approach to approximating an adapter layer that maps from one model's textual embedding space to another, even across different tokenizers. (A toy sketch of this embedding-space mapping appears after this list.)
We show the efficacy of our approach via multi-objective optimization over vision-language and causal language models for image captioning and sentiment-based image captioning.
arXiv Detail & Related papers (2024-08-09T02:16:37Z)
- Reinforcing Pre-trained Models Using Counterfactual Images [54.26310919385808]
This paper proposes a novel framework to reinforce classification models using counterfactual images generated under language guidance.
We identify model weaknesses by testing the model using the counterfactual image dataset.
We employ the counterfactual images as an augmented dataset to fine-tune and reinforce the classification model.
arXiv Detail & Related papers (2024-06-19T08:07:14Z)
- Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers [120.49126407479717]
This paper explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR).
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos.
arXiv Detail & Related papers (2024-03-12T00:02:03Z)
- Photorealistic and Identity-Preserving Image-Based Emotion Manipulation with Latent Diffusion Models [31.55798962786664]
We investigate the emotion manipulation capabilities of diffusion models with "in-the-wild" images.
We conduct extensive evaluations on AffectNet, demonstrating the superiority of our approach in terms of image quality and realism.
arXiv Detail & Related papers (2023-08-06T18:28:26Z)
- LANCE: Stress-testing Visual Models by Generating Language-guided Counterfactual Images [20.307968197151897]
We propose an automated algorithm to stress-test a trained visual model by generating language-guided counterfactual test images (LANCE).
Our method leverages recent progress in large language modeling and text-based image editing to augment an IID test set with a suite of diverse, realistic, and challenging test images without altering model weights.
arXiv Detail & Related papers (2023-05-30T16:09:16Z)
- SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models [56.88192537044364]
We propose a simple-yet-effective parameter-efficient fine-tuning approach called the Semantic Understanding and Reasoning adapter (SUR-adapter) for pre-trained diffusion models.
Our approach can make text-to-image diffusion models easier to use with better user experience.
arXiv Detail & Related papers (2023-05-09T05:48:38Z)
- Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models [55.04969603431266]
This paper proposes a method for generating images of customized objects specified by users.
The method is based on a general framework that bypasses the lengthy optimization required by previous approaches.
We demonstrate through experiments that our proposed method is able to synthesize images with compelling output quality, appearance diversity, and object fidelity.
arXiv Detail & Related papers (2023-04-05T17:59:32Z)
- Parents and Children: Distinguishing Multimodal DeepFakes from Natural Images [60.34381768479834]
Recent advancements in diffusion models have enabled the generation of realistic deepfakes from textual prompts in natural language.
We pioneer a systematic study of the detection of deepfakes generated by state-of-the-art diffusion models.
arXiv Detail & Related papers (2023-04-02T10:25:09Z)
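Referring back to the FUSE entry above: the sketch below is a toy, hypothetical illustration of mapping one text encoder's embedding space into another's. FUSE itself approximates such an adapter across tokenizers; the least-squares linear map, the dimensions, and the random stand-in data here are not the paper's method.

```python
# Toy illustration: fit a linear map between two embedding spaces
# from paired sentence embeddings. All shapes and data are stand-ins.
import torch

torch.manual_seed(0)
n_pairs = 2000
emb_a = torch.randn(n_pairs, 384)  # embeddings of the same captions from model A
emb_b = torch.randn(n_pairs, 768)  # ...and from model B (paired row-wise)

# Fit W minimizing ||emb_a @ W - emb_b||_F via least squares.
W = torch.linalg.lstsq(emb_a, emb_b).solution  # shape (384, 768)

def to_model_b_space(x: torch.Tensor) -> torch.Tensor:
    """Map a model-A embedding into model-B's space through the adapter."""
    return x @ W
```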