UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot
Vision-Language Tasks
- URL: http://arxiv.org/abs/2306.04715v1
- Date: Wed, 7 Jun 2023 18:26:22 GMT
- Title: UniBoost: Unsupervised Unimodal Pre-training for Boosting Zero-shot
Vision-Language Tasks
- Authors: Yanan Sun and Zihan Zhong and Qi Fan and Chi-Keung Tang and Yu-Wing
Tai
- Abstract summary: Using large-scale unsupervised unimodal models as pre-training can enhance the zero-shot performance of image-text pair models.
Our experiments show that unimodal pre-training outperforms state-of-the-art CLIP-based models.
- Score: 60.46473247205654
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale joint training of multimodal models, e.g., CLIP, has
demonstrated great performance in many vision-language tasks. However,
image-text pairs for pre-training are restricted to the intersection of images
and texts, limiting their coverage of the broad distribution of real-world
data, and noise can be introduced through misaligned pairs during
pre-processing. Conversely, unimodal models trained on text or image data alone
through unsupervised techniques can achieve broader coverage of diverse
real-world data and are not constrained by the requirement of simultaneous
presence of image and text. In this paper, we demonstrate that using
large-scale unsupervised unimodal models as pre-training can enhance the
zero-shot performance of image-text pair models. Our thorough studies validate
that models pre-trained as such can learn rich representations of both
modalities, improving their ability to understand how images and text relate to
each other. Our experiments show that unimodal pre-training outperforms
state-of-the-art CLIP-based models by 6.5% (52.3% $\rightarrow$ 58.8%) on
PASCAL-5$^i$ and by 6.2% (27.2% $\rightarrow$ 33.4%) on COCO-20$^i$ semantic
segmentation under the zero-shot setting, respectively. By learning representations
of both modalities, unimodal pre-training offers broader coverage, reduced
misalignment errors, and the ability to capture more complex features and
patterns in real-world data, resulting in better performance, especially for
zero-shot vision-language tasks.
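The abstract describes the recipe only at a high level: take image and text encoders that were each pre-trained without paired supervision, then align them on image-text pairs with a CLIP-style contrastive objective. A minimal sketch of that setup follows; the toy encoders, projection heads, and hyperparameters are illustrative assumptions rather than the paper's implementation, and in practice the two towers would be initialized from unsupervised unimodal checkpoints (e.g., masked image modeling for vision, masked or causal language modeling for text).

```python
# Hedged sketch: align two unimodally pre-trained encoders with a CLIP-style
# symmetric InfoNCE loss. Encoder modules and dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoTowerVL(nn.Module):
    def __init__(self, vision_encoder, text_encoder, vision_dim, text_dim, embed_dim=512):
        super().__init__()
        # Both towers are assumed to be pre-trained WITHOUT paired supervision
        # and loaded from unimodal checkpoints before this stage.
        self.vision_encoder = vision_encoder
        self.text_encoder = text_encoder
        self.vision_proj = nn.Linear(vision_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # ~ln(1/0.07)

    def forward(self, images, token_ids):
        v = F.normalize(self.vision_proj(self.vision_encoder(images)), dim=-1)
        t = F.normalize(self.text_proj(self.text_encoder(token_ids)), dim=-1)
        return v, t

def clip_style_loss(v, t, logit_scale):
    # Symmetric cross-entropy over the in-batch image-text pairs.
    logits = logit_scale.exp() * v @ t.t()
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    # Toy stand-in encoders; real use would load unsupervised unimodal weights,
    # which is the key difference from training a CLIP-style model from scratch.
    vision = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256))
    text = nn.Sequential(nn.Embedding(1000, 128), nn.Flatten(), nn.Linear(16 * 128, 256))
    model = TwoTowerVL(vision, text, vision_dim=256, text_dim=256)
    v, t = model(torch.randn(8, 3, 32, 32), torch.randint(0, 1000, (8, 16)))
    print(clip_style_loss(v, t, model.logit_scale).item())
```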
Related papers
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension [131.14381425260706]
We introduce Self-Training on Image Comprehension (STIC), a self-training approach specifically for image comprehension.
First, the model self-constructs a preference dataset for image descriptions using unlabeled images.
To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data.
arXiv Detail & Related papers (2024-05-30T05:53:49Z)
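The STIC summary above describes a two-stage recipe: self-construct preference data over image descriptions from unlabeled images, then reuse a small slice of instruction-tuning data. A heavily hedged sketch of one plausible reading of the first stage follows; `describe`, `corrupt_image`, and `BAD_PROMPTS` are hypothetical stand-ins for the model's generation call, an image corruption, and misleading prompts, not names from the paper.

```python
# Hedged sketch: self-constructed preference pairs for image descriptions.
# The callables are placeholders for an LVLM's generation interface.
import random
from dataclasses import dataclass
from typing import Callable, List

GOOD_PROMPT = "Describe the image in detail."
BAD_PROMPTS = [  # hypothetical "misleading" prompts
    "Describe objects that are NOT in the image.",
    "Give a description that contradicts the image.",
]

@dataclass
class PreferencePair:
    image_id: str
    prompt: str
    chosen: str     # description of the clean image with a good prompt
    rejected: str   # description from a corrupted image or misleading prompt

def build_preference_pairs(
    image_ids: List[str],
    describe: Callable[[str, str], str],   # (image_id, prompt) -> text
    corrupt_image: Callable[[str], str],   # returns id of a corrupted copy
) -> List[PreferencePair]:
    pairs = []
    for img in image_ids:
        chosen = describe(img, GOOD_PROMPT)
        if random.random() < 0.5:
            rejected = describe(corrupt_image(img), GOOD_PROMPT)
        else:
            rejected = describe(img, random.choice(BAD_PROMPTS))
        pairs.append(PreferencePair(img, GOOD_PROMPT, chosen, rejected))
    return pairs

# The resulting pairs would feed a preference-optimization objective; the second
# stage then mixes in a small portion of existing instruction-tuning data.
```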
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
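The summary above only says that support and query samples are split into patches and related through intra-task mutual attention; the sketch below shows one generic way to wire that up with cross-attention in both directions. The module layout is an assumption, not the paper's architecture.

```python
# Hedged sketch: mutual (bidirectional) cross-attention between the patch
# tokens of a support image and a query image in a few-shot episode.
import torch
import torch.nn as nn

class MutualPatchAttention(nn.Module):
    def __init__(self, dim=384, num_heads=6):
        super().__init__()
        self.q_from_s = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.s_from_q = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, support_tokens, query_tokens):
        # support_tokens, query_tokens: (batch, num_patches, dim),
        # e.g. ViT patch embeddings of the support and query images.
        q_enriched, _ = self.q_from_s(query_tokens, support_tokens, support_tokens)
        s_enriched, _ = self.s_from_q(support_tokens, query_tokens, query_tokens)
        # Episode-level similarity from the mutually attended tokens.
        q_vec = q_enriched.mean(dim=1)
        s_vec = s_enriched.mean(dim=1)
        return nn.functional.cosine_similarity(q_vec, s_vec, dim=-1)

if __name__ == "__main__":
    m = MutualPatchAttention()
    sim = m(torch.randn(2, 196, 384), torch.randn(2, 196, 384))
    print(sim.shape)  # torch.Size([2])
```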
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching [49.730741713652435]
In this paper, we propose a method that can effectively transfer the representations of a large pre-trained multimodal model into a small target model.
For unsupervised transfer, we introduce cross-modal similarity matching (CSM) that enables a student model to learn the representations of a teacher model.
To better encode the text prompts, we design context-based prompt augmentation (CPA) that can alleviate the lexical ambiguity of input text prompts.
arXiv Detail & Related papers (2023-01-07T17:24:11Z)
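Cross-modal similarity matching, as summarized above, lets a small student mimic how a large multimodal teacher relates images to a bank of text prompts. A minimal sketch of that idea follows; the KL formulation, the fixed prompt bank, and the temperature are assumptions for illustration, and CPA (prompt augmentation) is only hinted at in a comment.

```python
# Hedged sketch: distill a CLIP-like teacher into a small image encoder by
# matching the student's image-to-prompt similarity distribution to the teacher's.
import torch
import torch.nn.functional as F

def csm_loss(student_img_emb, teacher_img_emb, teacher_txt_emb, tau=0.04):
    """All embeddings are L2-normalized.
    student_img_emb: (B, D) from the small target model (with a projection
                     head so that D matches the teacher's embedding size).
    teacher_img_emb: (B, D) from the frozen multimodal teacher.
    teacher_txt_emb: (K, D) teacher embeddings of K text prompts; these prompts
                     could further be context-augmented to reduce lexical ambiguity.
    """
    p_teacher = F.softmax(teacher_img_emb @ teacher_txt_emb.t() / tau, dim=-1)
    log_p_student = F.log_softmax(student_img_emb @ teacher_txt_emb.t() / tau, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean")

# Toy shapes only; real use would plug in frozen teacher embeddings.
s = F.normalize(torch.randn(8, 512), dim=-1)
t_img = F.normalize(torch.randn(8, 512), dim=-1)
t_txt = F.normalize(torch.randn(100, 512), dim=-1)
print(csm_loss(s, t_img, t_txt).item())
```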
- Prefix Language Models are Unified Modal Learners [30.666873206462295]
We show that a unified modal model can be learned with a prefix language modeling objective on text and image sequences.
Thanks to this simple but powerful pre-training paradigm, our proposed model, DaVinci, is simple to train, scalable to huge data, and adaptable to a variety of downstream tasks.
arXiv Detail & Related papers (2022-06-15T17:49:38Z)
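The DaVinci summary above hinges on a prefix language-modeling objective over image-plus-text sequences: the prefix (e.g., image tokens) attends bidirectionally, while the remainder is predicted autoregressively. The mask construction below illustrates that objective in isolation; it is a generic prefix-LM sketch, not the paper's model.

```python
# Hedged sketch: attention mask for prefix language modeling, where the first
# `prefix_len` tokens (e.g. discretized image tokens) see each other fully and
# the remaining text tokens are generated left-to-right.
import torch

def prefix_lm_mask(seq_len, prefix_len):
    """Returns a (seq_len, seq_len) boolean mask; True = may attend."""
    mask = torch.ones(seq_len, seq_len).tril().bool()  # causal base
    # Every position may attend to the whole prefix, so prefix positions
    # attend bidirectionally among themselves.
    mask[:, :prefix_len] = True
    return mask

print(prefix_lm_mask(seq_len=8, prefix_len=3).int())
# The training loss would be next-token cross-entropy computed only on the
# suffix (text) positions, i.e. the model learns to continue given the prefix.
```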
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful for learning a stronger pre-trained model.
arXiv Detail & Related papers (2022-03-01T05:34:01Z)
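The entry above builds its training corpus by retrieving, for each image, the most similar texts from a non-parallel text corpus, and then running multi-granular alignment tasks on the weak pairs. Only the retrieval step is sketched below, using plain cosine similarity over precomputed features; the feature extractors and k are assumptions.

```python
# Hedged sketch: build a weakly aligned image-text corpus by retrieving the
# top-k nearest texts for each image in a shared feature space (e.g. features
# derived from detected object tags or unimodal encoders mapped to one space).
import torch
import torch.nn.functional as F

def retrieve_weak_pairs(image_feats, text_feats, k=3):
    """image_feats: (N_img, D), text_feats: (N_txt, D);
    returns (N_img, k) indices of the retrieved texts for each image."""
    sims = F.normalize(image_feats, dim=-1) @ F.normalize(text_feats, dim=-1).t()
    return sims.topk(k, dim=-1).indices

img = torch.randn(4, 256)
txt = torch.randn(100, 256)
print(retrieve_weak_pairs(img, txt).shape)  # torch.Size([4, 3])
# The weak (image, retrieved-text) pairs then feed multi-granular alignment
# pre-training tasks, from region/phrase level up to image/sentence level.
```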
- Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach performs close to a model pre-trained with aligned data on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z)
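The "mask-and-predict" pre-training on text-only and image-only corpora described in the last entry can be pictured as one shared transformer alternating between masked-token prediction on text batches and masked-patch prediction on image batches. The sketch below is a generic rendering of that idea under those assumptions, not the paper's exact model.

```python
# Hedged sketch: a single shared transformer trained with masked prediction on
# unimodal batches -- masked language modeling for text, masked patch
# regression for images. Dimensions and mask ratios are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedMaskAndPredict(nn.Module):
    def __init__(self, vocab=30522, dim=256, patch_dim=768, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.tok_emb = nn.Embedding(vocab, dim)
        self.patch_emb = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)  # shared
        self.text_head = nn.Linear(dim, vocab)       # predict masked tokens
        self.image_head = nn.Linear(dim, patch_dim)  # regress masked patches

    def text_step(self, token_ids):                  # (B, L) token ids
        mask = torch.rand(token_ids.shape) < self.mask_ratio
        x = self.tok_emb(token_ids)
        x[mask] = 0.0                                # crude [MASK] stand-in
        logits = self.text_head(self.backbone(x))
        return F.cross_entropy(logits[mask], token_ids[mask])

    def image_step(self, patches):                   # (B, N, patch_dim)
        mask = torch.rand(patches.shape[:2]) < self.mask_ratio
        x = self.patch_emb(patches)
        x[mask] = 0.0
        pred = self.image_head(self.backbone(x))
        return F.mse_loss(pred[mask], patches[mask])

model = SharedMaskAndPredict()
print(model.text_step(torch.randint(0, 30522, (2, 32))).item())
print(model.image_step(torch.randn(2, 49, 768)).item())
```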