Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
- URL: http://arxiv.org/abs/2503.01167v2
- Date: Sat, 29 Mar 2025 09:39:11 GMT
- Title: Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
- Authors: Haoxin Li, Boyang Li
- Abstract summary: Paired image-text data with subtle variations in-between hold the promise of producing Vision-Language Models with proper compositional understanding. Synthesizing training images for compositional learning presents three challenges. We propose SPARCL (Synthetic Perturbations for Advancing Robust Compositional Learning), which integrates image feature injection into a fast text-to-image generative model.
- Score: 7.879286384561264
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Paired image-text data with subtle variations in-between (e.g., people holding surfboards vs. people holding shovels) hold the promise of producing Vision-Language Models with proper compositional understanding. Synthesizing such training data from generative models is a highly coveted prize due to the reduced cost of data collection. However, synthesizing training images for compositional learning presents three challenges: (1) efficiency in generating large quantities of images, (2) text alignment between the generated image and the caption in the exact place of the subtle change, and (3) image fidelity in ensuring sufficient similarity with the original real images in all other places. We propose SPARCL (Synthetic Perturbations for Advancing Robust Compositional Learning), which integrates image feature injection into a fast text-to-image generative model, followed by an image style transfer step, to meet the three challenges. Further, to cope with any residual issues of text alignment, we propose an adaptive margin loss to filter out potentially incorrect synthetic samples and focus the learning on informative hard samples. Evaluation on four compositional understanding benchmarks demonstrates that SPARCL significantly improves the compositionality of CLIP, boosting the average accuracy of the CLIP base model by over 8% across all benchmarks and outperforming state-of-the-art methods by 2% on three benchmarks.
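The abstract names the adaptive margin loss without defining it; as a rough illustration, here is a minimal PyTorch sketch of one plausible reading, where a per-sample margin filters likely-misaligned synthetic negatives and a hinge term concentrates learning on hard pairs (the function name, margin schedule, and defaults are assumptions, not the authors' implementation):
```python
import torch
import torch.nn.functional as F

def adaptive_margin_loss(sim_pos, sim_neg, base_margin=0.2, scale=0.5):
    """Hinge loss over (positive, synthetic-negative) similarity pairs.

    sim_pos: (B,) similarity between each image and its matching caption.
    sim_neg: (B,) similarity between the image and the perturbed caption.
    """
    gap = sim_pos - sim_neg
    # Pairs where the synthetic negative beats the positive by more than
    # the base margin are treated as misaligned synthetics and filtered out.
    valid = (gap > -base_margin).float()
    # The margin grows with the detached gap, so easy pairs vanish from the
    # loss and learning concentrates on informative hard samples.
    margin = base_margin + scale * gap.detach().clamp(min=0)
    return (F.relu(margin - gap) * valid).sum() / valid.sum().clamp(min=1)

# Example: the third pair looks misaligned and is excluded from the loss.
sim_pos = torch.tensor([0.32, 0.28, 0.10, 0.35])
sim_neg = torch.tensor([0.25, 0.30, 0.42, 0.20])
print(adaptive_margin_loss(sim_pos, sim_neg))
```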
Related papers
- From Mapping to Composing: A Two-Stage Framework for Zero-shot Composed Image Retrieval [30.33315985826623]
Composed Image Retrieval (CIR) is a challenging multimodal task that retrieves a target image based on a reference image and accompanying modification text.
We propose a two-stage framework in which training proceeds from mapping to composing.
In the first stage, we enhance image-to-pseudo-word token learning by introducing a visual semantic injection module.
In the second stage, we optimize the text encoder using a small amount of synthetic triplet data, enabling it to effectively extract compositional semantics.
arXiv Detail & Related papers (2025-04-25T00:18:23Z)
- Data-Efficient Generalization for Zero-shot Composed Image Retrieval [67.46975191141928]
ZS-CIR aims to retrieve the target image based on a reference image and a text description without requiring in-distribution triplets for training.
One prevalent approach follows the vision-language pretraining paradigm that employs a mapping network to transfer the image embedding to a pseudo-word token in the text embedding space.
We propose a Data-efficient Generalization (DeG) framework, including two novel designs, namely, Textual Supplement (TS) module and Semantic-Set (S-Set)
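The pseudo-word mapping paradigm described above can be sketched generically: a small network projects the image embedding into the text token-embedding space so it can be spliced into a caption template (a minimal illustration of the paradigm, not either paper's exact architecture; all dimensions and layer choices are assumptions):
```python
import torch
import torch.nn as nn

class PseudoWordMapper(nn.Module):
    """Projects an image embedding into the text token-embedding space."""

    def __init__(self, image_dim=512, token_dim=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(image_dim, token_dim),
            nn.GELU(),
            nn.Linear(token_dim, token_dim),
        )

    def forward(self, image_emb):          # (B, image_dim)
        return self.proj(image_emb)        # (B, token_dim) pseudo-word token

# The pseudo-word is spliced into the token sequence of a template such as
# "a photo of <pseudo> that <modification text>" before the text encoder.
mapper = PseudoWordMapper()
pseudo_token = mapper(torch.randn(4, 512))  # stand-in for CLIP image features
```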
arXiv Detail & Related papers (2025-03-07T07:49:31Z)
- QADM-Net: Multi-Level Quality-Adaptive Dynamic Network for Reliable Multimodal Classification [57.08108545219043]
Current multimodal classification methods lack dynamic networks for sample-specific depth and parameters to achieve reliable inference.
We propose Multi-Level Quality-Adaptive Dynamic Multimodal Network (QADM-Net).
Experiments conducted on four datasets demonstrate that QADM-Net significantly outperforms state-of-the-art methods in classification performance and reliability.
arXiv Detail & Related papers (2024-12-19T03:26:51Z)
- Boosting Semi-Supervised Scene Text Recognition via Viewing and Summarizing [71.29488677105127]
Existing scene text recognition (STR) methods struggle to recognize challenging texts, especially for artistic and severely distorted characters.
We propose a contrastive learning-based STR framework by leveraging synthetic and real unlabeled data without any human cost.
Our method achieves SOTA performance (94.7% and 70.9% average accuracy on common benchmarks and the Union14M-Benchmark, respectively).
arXiv Detail & Related papers (2024-11-23T15:24:47Z)
- EEG-SCMM: Soft Contrastive Masked Modeling for Cross-Corpus EEG-Based Emotion Recognition [0.862468061241377]
We propose a novel Soft Contrastive Masked Modeling (SCMM) framework for emotion recognition.
SCMM integrates soft contrastive learning with a new hybrid masking strategy to effectively mine the "short-term continuity" characteristics inherent in human emotions.
Experiments show that SCMM achieves state-of-the-art (SOTA) performance, outperforming the second-best method by 4.26% in average accuracy.
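Soft contrastive learning is typically realized by replacing binary positive labels with graded weights; a minimal sketch under that assumption (the weighting scheme here is illustrative, not SCMM's exact formulation):
```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(z, soft_labels, temperature=0.1):
    """Contrastive loss where positives are graded rather than binary.

    z:           (B, D) segment embeddings.
    soft_labels: (B, B) values in [0, 1]; e.g. higher for temporally
                 adjacent EEG segments ('short-term continuity').
    """
    z = F.normalize(z, dim=1)
    eye = torch.eye(z.size(0), dtype=torch.bool, device=z.device)
    logits = (z @ z.t() / temperature).masked_fill(eye, -1e9)  # no self-pairs
    weights = soft_labels.masked_fill(eye, 0.0)
    weights = weights / weights.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return -(weights * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```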
arXiv Detail & Related papers (2024-08-17T12:35:13Z)
- Img-Diff: Contrastive Data Synthesis for Multimodal Large Language Models [49.439311430360284]
We introduce a novel data synthesis method inspired by contrastive learning and image difference captioning.
Our key idea involves challenging the model to discern both matching and distinct elements.
We leverage this generated dataset to fine-tune state-of-the-art (SOTA) MLLMs.
arXiv Detail & Related papers (2024-08-08T17:10:16Z)
- Integrate the Essence and Eliminate the Dross: Fine-Grained Self-Consistency for Free-Form Language Generation [20.138831477848615]
We propose Fine-Grained Self-Consistency (FSC) to optimize output quality by effectively integrating fine-grained consensus knowledge from multiple samples.
The effectiveness of FSC is demonstrated through extensive experiments on various tasks, including summarization, code generation, and mathematical reasoning.
arXiv Detail & Related papers (2024-07-02T08:38:31Z)
- Auto Cherry-Picker: Learning from High-quality Generative Data Driven by Language [41.40908753726324]
Diffusion models can generate realistic and diverse images, potentially facilitating data availability for data-intensive perception tasks.
We present Auto Cherry-Picker (ACP), a novel framework that generates high-quality cross-modality training samples.
arXiv Detail & Related papers (2024-06-28T17:53:18Z)
- Grounded Compositional and Diverse Text-to-3D with Pretrained Multi-View Diffusion Model [65.58911408026748]
We propose Grounded-Dreamer to generate 3D assets that can accurately follow complex, compositional text prompts.
We first advocate leveraging text-guided 4-view images as the bottleneck in the text-to-3D pipeline.
We then introduce an attention refocusing mechanism to encourage text-aligned 4-view image generation.
arXiv Detail & Related papers (2024-04-28T04:05:10Z)
- Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization [62.157627519792946]
We introduce a novel framework called bridged transfer, which initially employs synthetic images for fine-tuning a pre-trained model to improve its transferability.
We propose a dataset style inversion strategy to improve the stylistic alignment between synthetic and real images.
Our proposed methods are evaluated across 10 different datasets and 5 distinct models, demonstrating consistent improvements.
arXiv Detail & Related papers (2024-03-28T22:25:05Z)
- Learning from Synthetic Data for Visual Grounding [55.21937116752679]
We show that SynGround can improve the localization capabilities of off-the-shelf vision-and-language models.
Data generated with SynGround improves the pointing game accuracy of pretrained ALBEF and BLIP models by 4.81% and 17.11% absolute percentage points, respectively.
arXiv Detail & Related papers (2024-03-20T17:59:43Z)
- The Right Losses for the Right Gains: Improving the Semantic Consistency of Deep Text-to-Image Generation with Distribution-Sensitive Losses [0.35898124827270983]
We propose a contrastive learning approach with a novel combination of two loss functions: fake-to-fake loss and fake-to-real loss.
We test this approach on two baseline models: SSAGAN and AttnGAN.
Results show that our approach improves the qualitative results on AttnGAN with style blocks on the CUB dataset.
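The snippet names the two losses without defining them; one plausible reading is that fake-to-fake ties together generations of the same caption while fake-to-real ties each generation to its matching real image, both via a standard InfoNCE term (an assumption, not the paper's stated formulation):
```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    """Standard InfoNCE: the i-th positive row matches the i-th anchor."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    logits = a @ p.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

def semantic_consistency_loss(fake_a, fake_b, real, w_ff=1.0, w_fr=1.0):
    """Hypothetical combination: a fake-to-fake plus a fake-to-real term."""
    return w_ff * info_nce(fake_a, fake_b) + w_fr * info_nce(fake_a, real)
```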
arXiv Detail & Related papers (2023-12-18T00:05:28Z)
- Self-Evolution Learning for Mixup: Enhance Data Augmentation on Few-Shot Text Classification Tasks [75.42002070547267]
We propose a self-evolution learning (SE) based mixup approach for data augmentation in text classification.
We introduce a novel instance-specific label smoothing approach, which linearly interpolates the model's output and the one-hot labels of the original samples to generate new soft labels for mixup.
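The interpolation step is explicit enough to sketch; a minimal PyTorch rendering (the fixed alpha and function names are assumptions, and an instance-specific weight is equally plausible):
```python
import torch.nn.functional as F

def se_soft_labels(logits, labels, num_classes, alpha=0.1):
    """Interpolate the model's own prediction with the one-hot label."""
    one_hot = F.one_hot(labels, num_classes).float()
    pred = F.softmax(logits.detach(), dim=1)
    return (1 - alpha) * one_hot + alpha * pred  # soft labels for mixup

def mixup(x_a, x_b, y_a, y_b, lam=0.7):
    """Standard mixup over inputs and the (already softened) labels."""
    return lam * x_a + (1 - lam) * x_b, lam * y_a + (1 - lam) * y_b
```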
arXiv Detail & Related papers (2023-05-22T23:43:23Z)
- Generative Image Inpainting with Segmentation Confusion Adversarial Training and Contrastive Learning [14.358417509144523]
We present a new adversarial training framework for image inpainting with segmentation confusion adversarial training (SCAT) and contrastive learning.
SCAT plays an adversarial game between an inpainting generator and a segmentation network, which provides pixel-level local training signals.
We conduct extensive experiments on two benchmark datasets, demonstrating our model's effectiveness and superiority both qualitatively and quantitatively.
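A minimal reading of the SCAT game: the segmentation network tries to recover the inpainted region, while the generator tries to make it predict no inpainted pixels (the targets and loss form are assumptions, not the paper's exact objectives):
```python
import torch
import torch.nn.functional as F

def scat_losses(seg_logits, hole_mask):
    """Pixel-level adversarial game between inpainter and segmentation net.

    seg_logits: (B, 1, H, W) per-pixel logits for 'this pixel was inpainted'.
    hole_mask:  (B, 1, H, W) ground-truth mask of the inpainted region.
    """
    # Segmentation network: recover the true inpainted region.
    d_loss = F.binary_cross_entropy_with_logits(seg_logits, hole_mask)
    # Generator: confuse the network into predicting no inpainted pixels.
    g_loss = F.binary_cross_entropy_with_logits(
        seg_logits, torch.zeros_like(hole_mask))
    return d_loss, g_loss
```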
arXiv Detail & Related papers (2023-03-23T09:34:17Z)
- Defect Transfer GAN: Diverse Defect Synthesis for Data Augmentation [4.559353193715442]
Defect Transfer GAN (DT-GAN) learns to represent defect types independent of and across various background products.
An empirical study on the MVTec AD and two additional datasets showcases that DT-GAN outperforms state-of-the-art image synthesis methods.
Results show that the augmented data from DT-GAN provides consistent gains even in the few samples regime.
arXiv Detail & Related papers (2023-02-16T15:35:21Z)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
The learning objective of the vision-language approach of CLIP does not effectively account for the noisy many-to-many correspondences found in web-harvested image captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
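Self-distillation with soft image-text alignments can be sketched as blending CLIP's one-hot targets with the model's own softened predictions, raising the blend weight progressively (a generic sketch; the schedule and teacher choice are assumptions):
```python
import torch
import torch.nn.functional as F

def soft_alignment_loss(logits, teacher_logits, beta):
    """CLIP-style cross-entropy with self-distilled soft targets.

    logits:         (B, B) student image-to-text similarity logits.
    teacher_logits: (B, B) logits from the model itself (e.g. an EMA copy).
    beta:           soft-target weight, raised progressively during training.
    """
    hard = torch.eye(logits.size(0), device=logits.device)
    soft = F.softmax(teacher_logits.detach(), dim=1)
    target = (1 - beta) * hard + beta * soft  # soft image-text alignments
    return -(target * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```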
arXiv Detail & Related papers (2022-04-10T03:28:18Z)
- From Synthetic to Real: Image Dehazing Collaborating with Unlabeled Real Data [58.50411487497146]
We propose a novel image dehazing framework collaborating with unlabeled real data.
First, we develop a disentangled image dehazing network (DID-Net), which disentangles the feature representations into three component maps.
Then a disentangled-consistency mean-teacher network (DMT-Net) is employed to collaborate with unlabeled real data for boosting single image dehazing.
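The snippet does not name the three component maps, but single-image dehazing conventionally follows the atmospheric scattering model I = J·t + A·(1−t), which would make them the haze-free scene J, the transmission map t, and the atmospheric light A (an inference, not the paper's stated decomposition):
```python
import torch

def compose_hazy(J, t, A):
    """Atmospheric scattering model: I = J * t + A * (1 - t).

    J: (B, 3, H, W) haze-free scene radiance.
    t: (B, 1, H, W) transmission map in (0, 1].
    A: (B, 3, 1, 1) global atmospheric light.
    """
    return J * t + A * (1 - t)

def dehaze(I, t, A, eps=1e-3):
    """Invert the model to recover J from predicted t and A."""
    return (I - A * (1 - t)) / t.clamp(min=eps)
```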
arXiv Detail & Related papers (2021-08-06T04:00:28Z)
- Improving Text-to-Image Synthesis Using Contrastive Learning [4.850820365312369]
We propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images.
We evaluate our approach over two popular text-to-image synthesis models, AttnGAN and DM-GAN, on datasets CUB and COCO.
arXiv Detail & Related papers (2021-07-06T06:43:31Z)
- IMAGINE: Image Synthesis by Image-Guided Model Inversion [79.4691654458141]
We introduce an inversion based method, denoted as IMAge-Guided model INvErsion (IMAGINE), to generate high-quality and diverse images.
We leverage the knowledge of image semantics from a pre-trained classifier to achieve plausible generations.
IMAGINE enables the synthesis procedure to simultaneously 1) enforce semantic specificity constraints during the synthesis, 2) produce realistic images without generator training, and 3) give users intuitive control over the generation process.
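Image-guided model inversion in its barest form optimizes pixels so a frozen classifier's features match those of a guide image; IMAGINE's semantic-specificity constraints and regularizers go beyond this minimal sketch (all names and defaults here are assumptions):
```python
import torch
import torch.nn.functional as F

def invert(feature_extractor, target_feats, steps=200, lr=0.05,
           shape=(1, 3, 224, 224)):
    """Optimize an image until the classifier's features match the guide's.

    feature_extractor: frozen pre-trained classifier returning feature vectors.
    target_feats:      features of the guide image, extracted once beforehand.
    """
    x = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.mse_loss(feature_extractor(x), target_feats)
        loss.backward()
        opt.step()
    return x.detach()
```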
arXiv Detail & Related papers (2021-04-13T02:00:24Z)