Cap2Aug: Caption guided Image to Image data Augmentation
- URL: http://arxiv.org/abs/2212.05404v2
- Date: Tue, 7 Nov 2023 04:04:10 GMT
- Title: Cap2Aug: Caption guided Image to Image data Augmentation
- Authors: Aniket Roy, Anshul Shah, Ketul Shah, Anirban Roy, Rama Chellappa
- Abstract summary: Cap2Aug is an image-to-image diffusion model-based data augmentation strategy using image captions as text prompts.
We generate captions from the limited training images and use these captions to edit the training images with an image-to-image stable diffusion model.
This strategy generates augmented versions of images similar to the training images yet provides semantic diversity across the samples.
- Score: 41.53127698828463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual recognition in a low-data regime is challenging and often prone to
overfitting. To mitigate this issue, several data augmentation strategies have
been proposed. However, standard transformations, e.g., rotation, cropping, and
flipping, provide limited semantic variations. To this end, we propose Cap2Aug,
an image-to-image diffusion model-based data augmentation strategy using image
captions as text prompts. We generate captions from the limited training images
and use these captions to edit the training images with an image-to-image
stable diffusion model, producing semantically meaningful augmentations. This
strategy generates augmented versions of images similar to the training images
yet provides semantic diversity across the samples. We show that the variations
within the class can be captured by the captions and then translated to
generate diverse samples using the image-to-image diffusion model guided by the
captions. However, naive learning on synthetic images is not adequate due to
the domain gap between real and synthetic images. Thus, we employ a maximum
mean discrepancy (MMD) loss to align the synthetic images to the real images
for minimizing the domain gap. We evaluate our method on few-shot and long-tail
classification tasks and obtain performance improvements over state-of-the-art,
especially in the low-data regimes.
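As a rough illustration of the pipeline the abstract describes, the sketch below captions each training image, feeds the caption back as the prompt for an image-to-image Stable Diffusion edit, and computes a Gaussian-kernel MMD between batches of real and synthetic features. The specific checkpoints (BLIP, SD v1.5), the edit strength, and the kernel choice are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal sketch of caption-guided image-to-image augmentation with an MMD term.
# Model names, strength, and the Gaussian kernel below are assumptions for
# illustration only, not the exact settings used in Cap2Aug.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionImg2ImgPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base").to(device)
img2img = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5").to(device)


def caption_image(image: Image.Image) -> str:
    # Caption one training image; the caption later serves as the edit prompt.
    inputs = processor(images=image, return_tensors="pt").to(device)
    out = captioner.generate(**inputs, max_new_tokens=30)
    return processor.decode(out[0], skip_special_tokens=True)


def augment_image(image: Image.Image, strength: float = 0.5) -> Image.Image:
    # Edit the image guided by its own caption; strength trades faithfulness
    # to the source image against semantic diversity of the augmentation.
    prompt = caption_image(image)
    return img2img(prompt=prompt, image=image,
                   strength=strength, guidance_scale=7.5).images[0]


def mmd_loss(real_feats: torch.Tensor, synth_feats: torch.Tensor,
             sigma: float = 1.0) -> torch.Tensor:
    # Biased MMD^2 estimate with a Gaussian kernel over (batch, dim) feature
    # tensors, used to pull synthetic features toward real ones and shrink the
    # real-synthetic domain gap (kernel and bandwidth are assumed choices).
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
    return (k(real_feats, real_feats).mean()
            + k(synth_feats, synth_feats).mean()
            - 2 * k(real_feats, synth_feats).mean())
```

In such a setup the MMD term would be added to the classifier's training loss on features extracted from real and augmented batches; a lower edit strength keeps augmentations close to the originals, while a higher strength yields more diverse but potentially off-distribution samples.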
Related papers
- Diversified in-domain synthesis with efficient fine-tuning for few-shot classification [64.86872227580866]
Few-shot image classification aims to learn an image classifier using only a small set of labeled examples per class.
We propose DISEF, a novel approach which addresses the generalization challenge in few-shot learning using synthetic data.
We validate our method in ten different benchmarks, consistently outperforming baselines and establishing a new state-of-the-art for few-shot classification.
arXiv Detail & Related papers (2023-12-05T17:18:09Z)
- Improving Diffusion-based Image Translation using Asymmetric Gradient Guidance [51.188396199083336]
We present an approach that guides the reverse process of diffusion sampling by applying asymmetric gradient guidance.
Our model's adaptability allows it to be implemented with both image-fusion and latent-diffusion models.
Experiments show that our method outperforms various state-of-the-art models in image translation tasks.
arXiv Detail & Related papers (2023-06-07T12:56:56Z)
- Wavelet-based Unsupervised Label-to-Image Translation [9.339522647331334]
We propose a new unsupervised paradigm for semantic image synthesis (USIS) that makes use of a self-supervised segmentation loss and whole-image wavelet-based discrimination.
We test our methodology on 3 challenging datasets and demonstrate its ability to bridge the performance gap between paired and unpaired models.
arXiv Detail & Related papers (2023-05-16T17:48:44Z)
- Discriminative Class Tokens for Text-to-Image Diffusion Models [107.98436819341592]
We propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text.
Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images.
We evaluate our method extensively, showing that the generated images: (i) are more accurate and of higher quality than those of standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier.
arXiv Detail & Related papers (2023-03-30T05:25:20Z)
- Masked and Adaptive Transformer for Exemplar Based Image Translation [16.93344592811513]
Cross-domain semantic matching is challenging.
We propose a masked and adaptive transformer (MAT) for learning accurate cross-domain correspondence.
We devise a novel contrastive style learning method to acquire quality-discriminative style representations.
arXiv Detail & Related papers (2023-03-30T03:21:14Z)
- Effective Data Augmentation With Diffusion Models [65.09758931804478]
We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models.
Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples.
We evaluate our approach on few-shot image classification tasks, and on a real-world weed recognition task, and observe an improvement in accuracy in tested domains.
arXiv Detail & Related papers (2023-02-07T20:42:28Z)
- Diffusion-based Image Translation using Disentangled Style and Content Representation [51.188396199083336]
Diffusion-based image translation guided by semantic texts or a single target image has enabled flexible style transfer.
It is often difficult to maintain the original content of the image during the reverse diffusion.
We present a novel diffusion-based unsupervised image translation method using disentangled style and content representation.
Our experimental results show that the proposed method outperforms state-of-the-art baseline models in both text-guided and image-guided translation tasks.
arXiv Detail & Related papers (2022-09-30T06:44:37Z)
- Hierarchical Text-Conditional Image Generation with CLIP Latents [20.476720970770128]
We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity.
Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style.
arXiv Detail & Related papers (2022-04-13T01:10:33Z)