CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning
- URL: http://arxiv.org/abs/2410.11963v1
- Date: Tue, 15 Oct 2024 18:06:41 GMT
- Title: CtrlSynth: Controllable Image Text Synthesis for Data-Efficient Multimodal Learning
- Authors: Qingqing Cao, Mahyar Najibi, Sachin Mehta
- Abstract summary: We design a controllable image-text synthesis pipeline, CtrlSynth, for data-efficient and robust learning.
CtrlSynth allows users to control data synthesis in a fine-grained manner by defining customized control policies.
We show that CtrlSynth substantially improves zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models.
- Score: 23.63386159778117
- Abstract: Pretraining robust vision or multimodal foundation models (e.g., CLIP) relies on large-scale datasets that may be noisy, potentially misaligned, and have long-tail distributions. Previous works have shown promising results in augmenting datasets by generating synthetic samples. However, they only support domain-specific ad hoc use cases (e.g., either image or text only, but not both), and are limited in data diversity due to a lack of fine-grained control over the synthesis process. In this paper, we design a controllable image-text synthesis pipeline, CtrlSynth, for data-efficient and robust multimodal learning. The key idea is to decompose the visual semantics of an image into basic elements, apply user-specified control policies (e.g., remove, add, or replace operations), and recompose them to synthesize images or texts. The decompose and recompose feature in CtrlSynth allows users to control data synthesis in a fine-grained manner by defining customized control policies to manipulate the basic elements. CtrlSynth leverages the capabilities of pretrained foundation models such as large language models or diffusion models to reason and recompose basic elements such that synthetic samples are natural and composed in diverse ways. CtrlSynth is a closed-loop, training-free, and modular framework, making it easy to support different pretrained models. With extensive experiments on 31 datasets spanning different vision and vision-language tasks, we show that CtrlSynth substantially improves zero-shot classification, image-text retrieval, and compositional reasoning performance of CLIP models.
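To make the pipeline concrete, below is a minimal Python sketch of the decompose, edit, and recompose loop the abstract describes. All names here (`ControlPolicy`, `apply_policy`, `ctrlsynth_step`, and the `tagger`/`llm`/`diffusion` callables) are illustrative assumptions, not the paper's actual interfaces.

```python
# Hedged sketch of a CtrlSynth-style decompose -> edit -> recompose step.
# The three model components are passed in as callables, reflecting the
# training-free, modular design the abstract describes.
from dataclasses import dataclass, field

@dataclass
class ControlPolicy:
    """A user-specified policy over an image's basic visual elements (assumed structure)."""
    remove: set = field(default_factory=set)     # elements to drop
    add: list = field(default_factory=list)      # elements to introduce
    replace: dict = field(default_factory=dict)  # old element -> new element

def apply_policy(elements, policy):
    """Manipulate the decomposed elements according to the control policy."""
    edited = [policy.replace.get(e, e) for e in elements if e not in policy.remove]
    return edited + policy.add

def ctrlsynth_step(image, tagger, llm, diffusion, policy):
    """One training-free synthesis step built from pretrained components.

    tagger    -- decomposes an image into basic elements (objects, attributes, relations)
    llm       -- recomposes the edited elements into a natural caption
    diffusion -- renders a synthetic image from the recomposed caption
    """
    elements = tagger(image)                 # e.g. ["dog", "red ball", "park"]
    edited = apply_policy(elements, policy)  # e.g. replace "park" -> "beach"
    caption = llm("Write a natural caption mentioning: " + ", ".join(edited))
    synthetic_image = diffusion(caption)
    return synthetic_image, caption
```

Under this sketch, a policy such as `ControlPolicy(replace={"park": "beach"})` would yield a caption and image that relocate the scene, illustrating the kind of fine-grained element manipulation the paper credits for increased data diversity.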
Related papers
- Diversity-Driven Synthesis: Enhancing Dataset Distillation through Directed Weight Adjustment [39.137060714048175]
We argue that enhancing diversity can improve the parallelizable yet isolated approach to synthesizing datasets.
We introduce a novel method that employs dynamic and directed weight adjustment techniques to modulate the synthesis process.
Our method ensures that each batch of synthetic data mirrors the characteristics of a large, varying subset of the original dataset.
arXiv Detail & Related papers (2024-09-26T08:03:19Z)
- SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor.
We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z)
- Synth$^2$: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings [16.28853186016663]
We create synthetic image-text pairs for efficient and effective training of Visual-Language Models (VLMs).
Our method employs a pretrained text-to-image model to synthesize image embeddings from captions generated by an LLM.
Our VLM, finetuned on synthetic data, achieves performance comparable to models trained solely on human-annotated data.
arXiv Detail & Related papers (2024-03-12T15:36:42Z)
- SynthCLIP: Are We Ready for a Fully Synthetic CLIP Training? [57.42016037768947]
We present SynthCLIP, a CLIP model trained on entirely synthetic text-image pairs.
We generate synthetic datasets of images and corresponding captions at scale, with no human intervention.
arXiv Detail & Related papers (2024-02-02T18:59:58Z)
- ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-real Novel View Synthesis via Contrastive Learning [102.46382882098847]
We first investigate the effects of synthetic data in synthetic-to-real novel view synthesis.
We propose to introduce geometry-aware contrastive learning to learn multi-view consistent features with geometric constraints.
Our method can render images with higher quality and better fine-grained details, outperforming existing generalizable novel view synthesis methods in terms of PSNR, SSIM, and LPIPS.
arXiv Detail & Related papers (2023-03-20T12:06:14Z)
- Composer: Creative and Controllable Image Synthesis with Composable Conditions [57.78533372393828]
Recent large-scale generative models learned on big data are capable of synthesizing incredible images yet suffer from limited controllability.
This work offers a new generation paradigm that allows flexible control of the output image, such as spatial layout and palette, while maintaining the synthesis quality and model creativity.
arXiv Detail & Related papers (2023-02-20T05:48:41Z)
- StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis [54.39789900854696]
StyleGAN-T addresses the specific requirements of large-scale text-to-image synthesis.
It significantly improves over previous GANs and outperforms distilled diffusion models in terms of sample quality and speed.
arXiv Detail & Related papers (2023-01-23T16:05:45Z)
- eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers [87.52504764677226]
Large-scale diffusion-based generative models have led to breakthroughs in text-conditioned high-resolution image synthesis.
We train an ensemble of text-to-image diffusion models specialized for different stages of synthesis.
Our ensemble of diffusion models, called eDiffi, results in improved text alignment while maintaining the same inference cost.
arXiv Detail & Related papers (2022-11-02T17:43:04Z)
- SynthTIGER: Synthetic Text Image GEneratoR Towards Better Text Recognition Models [9.934446907923725]
We introduce a new synthetic text image generator, SynthTIGER, by analyzing techniques used for text image synthesis and integrating effective ones under a single algorithm.
In our experiments, SynthTIGER achieves better scene text recognition (STR) performance than a combination of existing synthetic datasets.
arXiv Detail & Related papers (2021-07-20T08:03:45Z)