TiMix: Text-aware Image Mixing for Effective Vision-Language
Pre-training
- URL: http://arxiv.org/abs/2312.08846v4
- Date: Sat, 24 Feb 2024 03:30:56 GMT
- Title: TiMix: Text-aware Image Mixing for Effective Vision-Language
Pre-training
- Authors: Chaoya Jiang, Wei ye, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang,
Shikun Zhang
- Abstract summary: Mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss.
TiMix exhibits a comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods.
- Score: 42.142924806184425
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances
modern Vision-Language Pre-training (VLP) models by aligning visual and
linguistic modalities. Due to noises in web-harvested text-image pairs,
however, scaling up training data volume in SMCL presents considerable
obstacles in terms of computational cost and data inefficiency. To improve data
efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates
mix-based data augmentation techniques into SMCL, yielding significant
performance improvements without significantly increasing computational
overhead. We provide a theoretical analysis of TiMixfrom a mutual information
(MI) perspective, showing that mixed data samples for cross-modal contrastive
learning implicitly serve as a regularizer for the contrastive loss. The
experimental results demonstrate that TiMix exhibits a comparable performance
on downstream tasks, even with a reduced amount of training data and shorter
training time, when benchmarked against existing methods. This work empirically
and theoretically demonstrates the potential of data mixing for data-efficient
and computationally viable VLP, benefiting broader VLP model adoption in
practical scenarios.
Related papers
- DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks [40.91931801667421]
This paper presents a novel global-to-local algorithm called DUET that can exploit the feedback loop by interleaving a data selection method with Bayesian optimization.
As a result, DUET can efficiently refine the training data mixture from a pool of data domains to maximize the model's performance on the unseen evaluation task.
arXiv Detail & Related papers (2025-02-01T01:52:32Z) - Optimizing Pretraining Data Mixtures with LLM-Estimated Utility [52.08428597962423]
Large Language Models improve with increasing amounts of high-quality training data.
We find token-counts outperform manual and learned mixes, indicating that simple approaches for dataset size and diversity are surprisingly effective.
We propose two complementary approaches: UtiliMax, which extends token-based $200s by incorporating utility estimates from reduced-scale ablations, achieving up to a 10.6x speedup over manual baselines; and Model Estimated Data Utility (MEDU), which leverages LLMs to estimate data utility from small samples, matching ablation-based performance while reducing computational requirements by $simx.
arXiv Detail & Related papers (2025-01-20T21:10:22Z) - BiMix: A Bivariate Data Mixing Law for Language Model Pretraining [47.77701041534746]
The impact of pretraining data composition on model performance remains poorly understood.
$textbfBiMix$ provides a systematic framework for understanding and optimizing data mixtures.
Our work contributes both theoretical insights into data mixing dynamics and practical tools for enhancing LLM training efficiency.
arXiv Detail & Related papers (2024-05-23T09:44:02Z) - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance regarding the mixture proportions in function forms.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimize the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z) - Understanding Multimodal Contrastive Learning and Incorporating Unpaired
Data [19.72282903349282]
We show a general class of nonlinear loss functions for multimodal contrastive learning (MMCL)
We quantitatively show that the feature learning ability of MMCL can be better than that of unimodal contrastive learning applied to each modality.
When we have access to additional unpaired data, we propose a new MMCL loss that incorporates additional unpaired datasets.
arXiv Detail & Related papers (2023-02-13T10:11:05Z) - VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix [59.25846149124199]
This paper proposes a data augmentation method, namely cross-modal CutMix.
CMC transforms natural sentences from the textual view into a multi-modal view.
By attaching cross-modal noise on uni-modal data, it guides models to learn token-level interactions across modalities for better denoising.
arXiv Detail & Related papers (2022-06-17T17:56:47Z) - Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is the latest data augmentation technique that linearly interpolates input examples and the corresponding labels.
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup to transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks.
arXiv Detail & Related papers (2020-10-05T23:37:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.