TiMix: Text-aware Image Mixing for Effective Vision-Language
Pre-training
- URL: http://arxiv.org/abs/2312.08846v4
- Date: Sat, 24 Feb 2024 03:30:56 GMT
- Title: TiMix: Text-aware Image Mixing for Effective Vision-Language
Pre-training
- Authors: Chaoya Jiang, Wei ye, Haiyang Xu, Qinghao Ye, Ming Yan, Ji Zhang,
Shikun Zhang
- Abstract summary: Mixed data samples for cross-modal contrastive learning implicitly serve as a regularizer for the contrastive loss.
TiMix exhibits a comparable performance on downstream tasks, even with a reduced amount of training data and shorter training time, when benchmarked against existing methods.
- Score: 42.142924806184425
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Self-supervised Multi-modal Contrastive Learning (SMCL) remarkably advances
modern Vision-Language Pre-training (VLP) models by aligning visual and
linguistic modalities. Due to noises in web-harvested text-image pairs,
however, scaling up training data volume in SMCL presents considerable
obstacles in terms of computational cost and data inefficiency. To improve data
efficiency in VLP, we propose Text-aware Image Mixing (TiMix), which integrates
mix-based data augmentation techniques into SMCL, yielding significant
performance improvements without significantly increasing computational
overhead. We provide a theoretical analysis of TiMixfrom a mutual information
(MI) perspective, showing that mixed data samples for cross-modal contrastive
learning implicitly serve as a regularizer for the contrastive loss. The
experimental results demonstrate that TiMix exhibits a comparable performance
on downstream tasks, even with a reduced amount of training data and shorter
training time, when benchmarked against existing methods. This work empirically
and theoretically demonstrates the potential of data mixing for data-efficient
and computationally viable VLP, benefiting broader VLP model adoption in
practical scenarios.
Related papers
- Entropy Law: The Story Behind Data Compression and LLM Performance [115.70395740286422]
We find that model performance is negatively correlated to the compression ratio of training data, which usually yields a lower training loss.
Based on the findings of the entropy law, we propose a quite efficient and universal data selection method.
We also present an interesting application of entropy law that can detect potential performance risks at the beginning of model training.
arXiv Detail & Related papers (2024-07-09T08:14:29Z) - Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance [55.872926690722714]
We study the predictability of model performance regarding the mixture proportions in function forms.
We propose nested use of the scaling laws of training steps, model sizes, and our data mixing law.
Our method effectively optimize the training mixture of a 1B model trained for 100B tokens in RedPajama.
arXiv Detail & Related papers (2024-03-25T17:14:00Z) - Single-channel speech enhancement using learnable loss mixup [23.434378634735676]
Generalization remains a major problem in supervised learning of single-channel speech enhancement.
We propose learnable loss mixup (LLM), a simple and effortless training diagram, to improve the generalization of deep learning-based speech enhancement models.
Our experimental results on the VCTK benchmark show that learnable loss mixup 3.26 PESQ, achieves outperforming the state-of-the-art.
arXiv Detail & Related papers (2023-12-20T00:25:55Z) - Understanding Multimodal Contrastive Learning and Incorporating Unpaired
Data [19.72282903349282]
We show a general class of nonlinear loss functions for multimodal contrastive learning (MMCL)
We quantitatively show that the feature learning ability of MMCL can be better than that of unimodal contrastive learning applied to each modality.
When we have access to additional unpaired data, we propose a new MMCL loss that incorporates additional unpaired datasets.
arXiv Detail & Related papers (2023-02-13T10:11:05Z) - VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix [59.25846149124199]
This paper proposes a data augmentation method, namely cross-modal CutMix.
CMC transforms natural sentences from the textual view into a multi-modal view.
By attaching cross-modal noise on uni-modal data, it guides models to learn token-level interactions across modalities for better denoising.
arXiv Detail & Related papers (2022-06-17T17:56:47Z) - Dense Contrastive Visual-Linguistic Pretraining [53.61233531733243]
Several multimodal representation learning approaches have been proposed that jointly represent image and text.
These approaches achieve superior performance by capturing high-level semantic information from large-scale multimodal pretraining.
We propose unbiased Dense Contrastive Visual-Linguistic Pretraining to replace the region regression and classification with cross-modality region contrastive learning.
arXiv Detail & Related papers (2021-09-24T07:20:13Z) - Mixup-Transformer: Dynamic Data Augmentation for NLP Tasks [75.69896269357005]
Mixup is the latest data augmentation technique that linearly interpolates input examples and the corresponding labels.
In this paper, we explore how to apply mixup to natural language processing tasks.
We incorporate mixup to transformer-based pre-trained architecture, named "mixup-transformer", for a wide range of NLP tasks.
arXiv Detail & Related papers (2020-10-05T23:37:30Z) - Improving accuracy and speeding up Document Image Classification through
parallel systems [4.102028235659611]
We show in the RVL-CDIP dataset that we can improve previous results with a much lighter model.
We present an ensemble pipeline which is able to boost solely image input.
Lastly, we expose the training performance differences between PyTorch and Deep Learning frameworks.
arXiv Detail & Related papers (2020-06-16T13:36:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.