Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation
- URL: http://arxiv.org/abs/2412.15939v1
- Date: Fri, 20 Dec 2024 14:32:56 GMT
- Title: Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation
- Authors: Gautier Evennou, Antoine Chaffin, Vivien Chappelier, Ewa Kijak
- Abstract summary: We introduce BLIP2IDC, an adaptation of BLIP2 to the Image Difference Captioning (IDC) task at low computational cost.
We show that it outperforms two-stream approaches by a significant margin on real-world IDC datasets.
We also propose to use synthetic augmentation to improve the performance of IDC models in a model-agnostic fashion.
- Score: 5.887986127737718
- License:
- Abstract: The rise in the quality of generative models over the past years has enabled the generation of edited variations of images at scale. To counter the harmful effects of such technology, the Image Difference Captioning (IDC) task aims to describe the differences between two images. While this task is handled successfully for simple 3D-rendered images, it struggles on real-world images. The reason is twofold: the scarcity of training data, and the difficulty of capturing fine-grained differences between complex images. To address those issues, we propose in this paper a simple yet effective framework to both adapt existing image captioning models to the IDC task and augment IDC datasets. We introduce BLIP2IDC, an adaptation of BLIP2 to the IDC task at low computational cost, and show that it outperforms two-stream approaches by a significant margin on real-world IDC datasets. We also propose to use synthetic augmentation to improve the performance of IDC models in a model-agnostic fashion. We show that our synthetic augmentation strategy provides high-quality data, leading to a challenging new dataset well-suited for IDC, named Syned1.
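The abstract does not spell out how BLIP2 is adapted to take two images. As a hedged illustration only, the sketch below packs an image pair into a single input for a stock BLIP-2 captioner from Hugging Face transformers; the side-by-side concatenation, prompt, and checkpoint are illustrative assumptions rather than the paper's confirmed recipe.

```python
# Hedged sketch: packing an image pair into one input for a stock BLIP-2
# captioner. The concatenation scheme, prompt, and checkpoint are illustrative
# assumptions; the paper's exact adaptation may differ.
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

def caption_difference(before: Image.Image, after: Image.Image) -> str:
    # Place the two images side by side so a single-image captioner sees both.
    w, h = before.size
    pair = Image.new("RGB", (2 * w, h))
    pair.paste(before, (0, 0))
    pair.paste(after.resize((w, h)), (w, 0))
    prompt = "Question: what changed between the left and right image? Answer:"
    inputs = processor(images=pair, text=prompt, return_tensors="pt").to(device)
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.batch_decode(out, skip_special_tokens=True)[0].strip()
```

In practice the adapted model would be fine-tuned on IDC pairs rather than prompted zero-shot; the sketch only shows the input packing that lets a single-stream captioner see both images, in contrast to the two-stream approaches the paper compares against.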
Related papers
- BD-Diff: Generative Diffusion Model for Image Deblurring on Unknown Domains with Blur-Decoupled Learning [55.21345354747609]
BD-Diff is a generative-diffusion-based model designed to enhance deblurring performance on unknown domains.
We employ two Q-Formers as separate extractors of structural representations and blur patterns.
We introduce a reconstruction task to make the structural features and blur patterns complementary.
arXiv Detail & Related papers (2025-02-03T17:00:40Z)
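The BD-Diff summary above names two Q-Formers and a reconstruction task without further detail. As a minimal sketch under assumed module shapes (the class name and layer sizes are hypothetical), the code below shows how a reconstruction loss can force two decoupled feature streams, one for structure and one for blur, to jointly explain the input.

```python
# Hedged sketch of the decoupling idea: two feature extractors (stand-ins for
# the paper's Q-Formers) separately encode structure and blur, and a
# reconstruction loss forces the two codes to jointly explain the blurry input.
# All module shapes here are illustrative assumptions.
import torch
import torch.nn as nn

class DecoupledDeblur(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.structure_enc = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.ReLU())
        self.blur_enc = nn.Sequential(nn.Conv2d(3, dim, 3, padding=1), nn.ReLU())
        self.decoder = nn.Conv2d(2 * dim, 3, 3, padding=1)

    def forward(self, blurry: torch.Tensor):
        s, b = self.structure_enc(blurry), self.blur_enc(blurry)
        recon = self.decoder(torch.cat([s, b], dim=1))  # reconstruction head
        return s, b, recon

model = DecoupledDeblur()
x = torch.randn(2, 3, 64, 64)
s, b, recon = model(x)
recon_loss = nn.functional.l1_loss(recon, x)  # complementarity via reconstruction
```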
- Large-Scale Data-Free Knowledge Distillation for ImageNet via Multi-Resolution Data Generation [53.95204595640208]
Data-Free Knowledge Distillation (DFKD) is an advanced technique that enables knowledge transfer from a teacher model to a student model without relying on original training data.
Previous approaches have generated synthetic images at high resolutions without leveraging information from real images.
MUSE generates images at lower resolutions while using Class Activation Maps (CAMs) to ensure that the generated images retain critical, class-specific features.
arXiv Detail & Related papers (2024-11-26T02:23:31Z)
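The MUSE summary above relies on Class Activation Maps without recalling their construction. As a reminder, here is a hedged sketch of a classic CAM (Zhou et al., 2016) for a torchvision ResNet; how MUSE wires CAMs into its generation pipeline is not specified in the summary, and the hook placement and checkpoint below are illustrative.

```python
# Hedged sketch: a classic Class Activation Map for a torchvision ResNet.
# This only illustrates the CAM computation itself, not MUSE's use of it.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18, ResNet18_Weights

model = resnet18(weights=ResNet18_Weights.DEFAULT).eval()

feats = {}
def hook(_module, _inputs, output):
    feats["conv"] = output  # (B, 512, H, W) activations before global pooling

model.layer4.register_forward_hook(hook)

@torch.no_grad()
def class_activation_map(x: torch.Tensor, class_idx: int) -> torch.Tensor:
    """x: (1, 3, 224, 224) normalized image; returns an (H, W) map in [0, 1]."""
    model(x)
    fmap = feats["conv"][0]                   # (512, H, W)
    w = model.fc.weight[class_idx]            # (512,) classifier weights
    cam = torch.einsum("c,chw->hw", w, fmap)  # class-weighted channel sum
    cam = F.relu(cam)
    return cam / (cam.max() + 1e-8)
```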
- Diffusion Curriculum: Synthetic-to-Real Generative Curriculum Learning via Image-Guided Diffusion [19.54496184675988]
Low-quality or scarce data has posed significant challenges for training deep neural networks in practice.
Diffusion Curriculum (DisCL) adjusts the image guidance level of image synthesis for each training stage.
DisCL focuses on high-quality, lower-guidance images to learn features, as a warm-up before learning from higher-guidance images that may be weaker in diversity or quality.
arXiv Detail & Related papers (2024-10-17T15:33:35Z)
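The DisCL summary above names an "image guidance level" without defining it. Purely as an assumption for illustration, the sketch below maps that level to the `strength` parameter of a diffusers img2img pipeline and yields one variant per curriculum stage; the model ID and parameter values are placeholders.

```python
# Hedged sketch: synthesizing variants of a real image at several guidance
# levels for a staged curriculum. Equating the paper's "image guidance level"
# with img2img `strength` is an illustrative assumption.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def curriculum_variants(image: Image.Image, prompt: str):
    # Low strength stays close to the real image; high strength drifts toward
    # pure synthesis. A curriculum would train on one level per stage.
    for strength in (0.2, 0.4, 0.6, 0.8):
        variant = pipe(prompt=prompt, image=image, strength=strength).images[0]
        yield strength, variant
```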
- OneDiff: A Generalist Model for Image Difference Captioning [5.71214984158106]
Image Difference Captioning (IDC) is crucial for accurately describing variations between closely related images.
OneDiff is a novel generalist approach that utilizes a robust vision-language model architecture.
OneDiff consistently outperforms existing state-of-the-art models in accuracy and adaptability.
arXiv Detail & Related papers (2024-07-08T06:14:37Z)
- Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model [80.61157097223058]
A prevalent strategy for bolstering image classification performance is to augment the training set with synthetic images generated by text-to-image (T2I) models.
In this study, we scrutinize the shortcomings of both current generative and conventional data augmentation techniques.
We introduce an innovative inter-class data augmentation method known as Diff-Mix, which enriches the dataset by performing image translations between classes.
arXiv Detail & Related papers (2024-03-28T17:23:45Z)
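The Diff-Mix summary above describes translating images between classes. Purely as an assumption for illustration, the sketch below pairs an img2img translation (a pipeline like the one sketched earlier) with a linear label-softening rule; Diff-Mix's actual labeling scheme may differ.

```python
# Hedged sketch of inter-class augmentation in the spirit of Diff-Mix: an
# image of a source class is translated toward a target class via img2img,
# and its label is softened accordingly. The linear label-mixing rule is an
# illustrative assumption.
import torch

def diff_mix_sample(pipe, image, src_idx: int, tgt_idx: int,
                    tgt_prompt: str, strength: float, num_classes: int):
    translated = pipe(prompt=tgt_prompt, image=image, strength=strength).images[0]
    label = torch.zeros(num_classes)
    label[src_idx] = 1.0 - strength   # residual source-class content
    label[tgt_idx] = strength         # injected target-class content
    return translated, label
```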
- Exposure Bracketing Is All You Need For A High-Quality Image [50.822601495422916]
Multi-exposure images provide complementary information for denoising, deblurring, high dynamic range imaging, and super-resolution.
In this work, we propose to utilize exposure bracketing photography to obtain a high-quality image by combining these tasks.
In particular, a temporally modulated recurrent network (TMRNet) and a self-supervised adaptation method are proposed.
arXiv Detail & Related papers (2024-01-01T14:14:35Z)
- Unlocking Pre-trained Image Backbones for Semantic Image Synthesis [29.688029979801577]
We propose a new class of GAN discriminators for semantic image synthesis that enables generating highly realistic images.
Our model, which we dub DP-SIMS, achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K, COCO-Stuff, and Cityscapes.
arXiv Detail & Related papers (2023-12-20T09:39:19Z)
- DGNet: Dynamic Gradient-Guided Network for Water-Related Optics Image Enhancement [77.0360085530701]
Underwater image enhancement (UIE) is a challenging task due to the complex degradation caused by underwater environments.
Previous methods often idealize the degradation process, and neglect the impact of medium noise and object motion on the distribution of image features.
Our approach utilizes predicted images to dynamically update pseudo-labels, adding a dynamic gradient to optimize the network's gradient space.
arXiv Detail & Related papers (2023-12-12T06:07:21Z)
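The DGNet summary above describes predictions dynamically updating pseudo-labels. A minimal sketch of one such mechanism follows, assuming an EMA-style blending rule that is not stated in the summary.

```python
# Hedged sketch of dynamically updated pseudo-labels: the current prediction
# is blended into the stored pseudo-label each step, so the supervision signal
# (and hence the gradient) evolves with the model. The EMA blending rule is an
# illustrative assumption, not DGNet's exact mechanism.
import torch

def update_pseudo_label(pseudo: torch.Tensor, pred: torch.Tensor,
                        momentum: float = 0.9) -> torch.Tensor:
    # Keep most of the old pseudo-label, mix in the fresh prediction.
    return momentum * pseudo + (1.0 - momentum) * pred.detach()

def training_step(model, optimizer, degraded, pseudo):
    pred = model(degraded)
    loss = torch.nn.functional.l1_loss(pred, pseudo)  # supervise on pseudo-label
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return update_pseudo_label(pseudo, pred)  # refreshed target for next step
```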
- Image Difference Captioning with Pre-training and Contrastive Learning [45.59621065755761]
The Image Difference Captioning (IDC) task aims to describe the visual differences between two similar images with natural language.
The major challenges of this task lie in two aspects: 1) fine-grained visual differences, which require learning a stronger vision-language association, and 2) the high cost of manual annotation.
We propose a new modeling framework following the pre-training-finetuning paradigm to address these challenges.
arXiv Detail & Related papers (2022-02-09T06:14:22Z)
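The entry above motivates contrastive pre-training for IDC. As a hedged sketch of one standard instantiation, the symmetric InfoNCE loss below aligns difference-image embeddings with caption embeddings across a batch; the paper's exact objective and encoders may differ.

```python
# Hedged sketch: a symmetric InfoNCE loss aligning difference-image embeddings
# with caption embeddings, one common way to realize contrastive pre-training.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)    # (B, D) difference features
    txt = F.normalize(txt_emb, dim=-1)    # (B, D) caption features
    logits = img @ txt.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Matched pairs sit on the diagonal; contrast against in-batch negatives.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```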
- Learning Deformable Image Registration from Optimization: Perspective, Modules, Bilevel Training and Beyond [62.730497582218284]
We develop a new deep-learning-based framework to optimize a diffeomorphic model via multi-scale propagation.
We conduct two groups of image registration experiments on 3D volume datasets including image-to-atlas registration on brain MRI data and image-to-image registration on liver CT data.
arXiv Detail & Related papers (2020-04-30T03:23:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.