Unified Contrastive Learning in Image-Text-Label Space
- URL: http://arxiv.org/abs/2204.03610v1
- Date: Thu, 7 Apr 2022 17:34:51 GMT
- Title: Unified Contrastive Learning in Image-Text-Label Space
- Authors: Jianwei Yang, Chunyuan Li, Pengchuan Zhang, Bin Xiao, Ce Liu, Lu Yuan,
Jianfeng Gao
- Abstract summary: Unified Contrastive Learning (UniCL) is an effective way of learning semantically rich yet discriminative representations.
UniCL stand-alone is also a good learner on pure image-label data, rivaling supervised learning methods across three image classification datasets.
- Score: 130.31947133453406
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual recognition is recently learned via either supervised learning on
human-annotated image-label data or language-image contrastive learning with
webly-crawled image-text pairs. While supervised learning may result in a more
discriminative representation, language-image pretraining shows unprecedented
zero-shot recognition capability, largely due to the different properties of
data sources and learning objectives. In this work, we introduce a new
formulation by combining the two data sources into a common image-text-label
space. In this space, we propose a new learning paradigm, called Unified
Contrastive Learning (UniCL) with a single learning objective to seamlessly
prompt the synergy of two data types. Extensive experiments show that our UniCL
is an effective way of learning semantically rich yet discriminative
representations, universally for image recognition in zero-shot, linear-probe,
fully finetuning and transfer learning scenarios. Particularly, it attains
gains up to 9.2% and 14.5% in average on zero-shot recognition benchmarks over
the language-image contrastive learning and supervised learning methods,
respectively. In linear probe setting, it also boosts the performance over the
two methods by 7.3% and 3.4%, respectively. Our study also indicates that UniCL
stand-alone is a good learner on pure image-label data, rivaling the supervised
learning methods across three image classification datasets and two types of
vision backbones, ResNet and Swin Transformer. Code is available at
https://github.com/microsoft/UniCL.
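For illustration, the single objective described in the abstract can be sketched as a bidirectional contrastive loss with label-aware targets. The snippet below is a minimal, assumed PyTorch approximation; identifiers such as `unified_contrastive_loss`, `image_feats`, `text_feats`, and `labels` are placeholders and the released code at the repository above may differ in details such as target normalization and temperature handling.

```python
# Minimal sketch of a UniCL-style unified contrastive loss in image-text-label
# space (assumed PyTorch implementation, not the authors' released code).
import torch
import torch.nn.functional as F

def unified_contrastive_loss(image_feats, text_feats, labels, temperature=0.07):
    """Bidirectional contrastive loss where every in-batch pair sharing a
    label is treated as a positive.

    image_feats: (N, D) image embeddings
    text_feats:  (N, D) text embeddings (captions, or prompted class names)
    labels:      (N,) integer labels; per the paper's formulation, web
                 image-text pairs get unique labels, while image-label data
                 reuse their class ids
    """
    # Normalize features and compute the image-text similarity matrix.
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature            # (N, N)

    # Positive mask: entry (i, j) is 1 if samples i and j share a label.
    mask = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    targets = mask / mask.sum(dim=1, keepdim=True)                 # row-normalized

    # Image-to-text and text-to-image cross entropy against soft targets
    # (the mask is symmetric, so the same target matrix serves both directions).
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

When every label in a batch is unique, the target matrix reduces to the identity and the loss falls back to standard language-image contrastive learning, which is how the single objective covers both data types.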
Related papers
- AddressCLIP: Empowering Vision-Language Models for City-wide Image Address Localization [57.34659640776723]
We propose an end-to-end framework named AddressCLIP to solve the image address localization (IAL) problem with more semantics.
We have built three datasets from Pittsburgh and San Francisco on different scales specifically for the IAL problem.
arXiv Detail & Related papers (2024-07-11T03:18:53Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- Align before Fuse: Vision and Language Representation Learning with Momentum Distillation [52.40490994871753]
We introduce a contrastive loss to ALign the image and text representations BEfore Fusing (ALBEF) them through cross-modal attention.
We propose momentum distillation, a self-training method which learns from pseudo-targets produced by a momentum model (a minimal sketch follows the related-papers list below).
ALBEF achieves state-of-the-art performance on multiple downstream vision-language tasks.
arXiv Detail & Related papers (2021-07-16T00:19:22Z)
- AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation [3.6790362352712873]
We propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures.
Our experiments demonstrate that the method is able to represent images in a low-dimensional space.
Unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets.
arXiv Detail & Related papers (2021-06-11T09:02:30Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
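The momentum-distillation idea mentioned in the ALBEF entry above can also be sketched briefly: a momentum (exponential-moving-average) copy of the model produces soft pseudo-targets that are mixed with the ordinary hard targets. The snippet below is an assumed, illustrative PyTorch version; names such as `ema_update`, `distilled_loss`, and the mixing weight `alpha` are placeholders, not ALBEF's released implementation.

```python
# Minimal sketch of momentum distillation: train against a blend of one-hot
# targets and soft pseudo-targets from a momentum model (assumed PyTorch code).
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(model, momentum_model, m=0.995):
    # The momentum model (typically a deep copy of the online model) tracks it
    # with an exponential moving average, updated once per training step.
    for p, p_m in zip(model.parameters(), momentum_model.parameters()):
        p_m.data.mul_(m).add_(p.data, alpha=1.0 - m)

def distilled_loss(logits, momentum_logits, hard_targets, alpha=0.4):
    """Cross entropy against a mix of one-hot targets and the momentum
    model's soft predictions (pseudo-targets)."""
    soft_targets = F.softmax(momentum_logits.detach(), dim=-1)
    one_hot = F.one_hot(hard_targets, num_classes=logits.size(-1)).float()
    targets = (1.0 - alpha) * one_hot + alpha * soft_targets
    return -(targets * F.log_softmax(logits, dim=-1)).sum(dim=-1).mean()
```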