Mutual Contrastive Learning for Visual Representation Learning
- URL: http://arxiv.org/abs/2104.12565v1
- Date: Mon, 26 Apr 2021 13:32:33 GMT
- Title: Mutual Contrastive Learning for Visual Representation Learning
- Authors: Chuanguang Yang, Zhulin An, Linhang Cai, Yongjun Xu
- Abstract summary: We present a collaborative learning method called Mutual Contrastive Learning (MCL) for general visual representation learning.
Benefiting from MCL, each model can learn extra contrastive knowledge from others, leading to more meaningful feature representations.
Experimental results on supervised and self-supervised image classification, transfer learning and few-shot learning show that MCL can lead to consistent performance gains.
- Score: 1.9355744690301404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a collaborative learning method called Mutual Contrastive Learning (MCL) for general visual representation learning. The core idea of MCL is to perform mutual interaction and transfer of contrastive distributions among a cohort of models. Benefiting from MCL, each model can learn extra contrastive knowledge from others, leading to more meaningful feature representations for visual recognition tasks. We emphasize that MCL is conceptually simple yet empirically powerful. It is a generic framework that can be applied to both supervised and self-supervised representation learning. Experimental results on supervised and self-supervised image classification, transfer learning and few-shot learning show that MCL can lead to consistent performance gains, demonstrating that MCL can guide the network to generate better feature representations.
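As a rough, illustrative sketch only (not the authors' released implementation), the snippet below shows one way the "mutual transfer of contrastive distributions" between two models could be set up in PyTorch. The function names, the temperature value, and the use of a symmetric KL imitation term with detached peer targets are assumptions made for illustration.

```python
# Minimal sketch of the mutual-contrastive idea between two models.
# Each model forms a contrastive similarity distribution over the batch and is
# trained both to separate positives from negatives (InfoNCE) and to match the
# peer model's distribution (KL). Names and hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def contrastive_logits(anchor, candidates, temperature=0.1):
    """Cosine-similarity logits of each anchor against all candidates."""
    anchor = F.normalize(anchor, dim=1)
    candidates = F.normalize(candidates, dim=1)
    return anchor @ candidates.t() / temperature

def mutual_contrastive_loss(z1_a, z2_a, z1_b, z2_b):
    """z1_*, z2_*: embeddings of two augmented views from models a and b."""
    n = z1_a.size(0)
    targets = torch.arange(n, device=z1_a.device)  # positives on the diagonal

    logits_a = contrastive_logits(z1_a, z2_a)
    logits_b = contrastive_logits(z1_b, z2_b)

    # Standard InfoNCE term for each model.
    nce = F.cross_entropy(logits_a, targets) + F.cross_entropy(logits_b, targets)

    # Mutual transfer: each model's contrastive distribution imitates the
    # peer's distribution, which is detached so it acts as a soft target.
    kl_a = F.kl_div(F.log_softmax(logits_a, dim=1),
                    F.softmax(logits_b.detach(), dim=1), reduction="batchmean")
    kl_b = F.kl_div(F.log_softmax(logits_b, dim=1),
                    F.softmax(logits_a.detach(), dim=1), reduction="batchmean")

    return nce + kl_a + kl_b
```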
Related papers
- X-Former: Unifying Contrastive and Reconstruction Learning for MLLMs [49.30255148577368]
X-Former is a lightweight transformer module designed to exploit the complementary strengths of contrastive learning (CL) and masked image modeling (MIM).
X-Former first bootstraps vision-language representation learning and multimodal-to-multimodal generative learning from two frozen vision encoders.
It further bootstraps vision-to-language generative learning from a frozen LLM to ensure visual features from X-Former can be interpreted by the LLM.
arXiv Detail & Related papers (2024-07-18T18:39:54Z) - Harmony: A Joint Self-Supervised and Weakly-Supervised Framework for Learning General Purpose Visual Representations [6.990891188823598]
We present Harmony, a framework that combines vision-language training with discriminative and generative self-supervision to learn visual features.
Our framework is specifically designed to work on web-scraped data by not relying on negative examples and addressing the one-to-one correspondence issue.
arXiv Detail & Related papers (2024-05-23T07:18:08Z) - A Probabilistic Model Behind Self-Supervised Learning [53.64989127914936]
In self-supervised learning (SSL), representations are learned via an auxiliary task without annotated labels.
We present a generative latent variable model for self-supervised learning.
We show that several families of discriminative SSL, including contrastive methods, induce a comparable distribution over representations.
arXiv Detail & Related papers (2024-02-02T13:31:17Z) - Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [50.77984109941538]
Our research reveals that the visual capabilities in recent multimodal LLMs still exhibit systematic shortcomings.
We identify "CLIP-blind pairs" - images that CLIP perceives as similar despite their clear visual differences.
We evaluate various CLIP-based vision-and-language models and find a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs.
arXiv Detail & Related papers (2024-01-11T18:58:36Z) - Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z) - Semantically Consistent Multi-view Representation Learning [11.145085584637744]
We propose a novel Semantically Consistent Multi-view Representation Learning (SCMRL) method.
SCMRL excavates the underlying multi-view semantic consensus information and utilizes it to guide unified feature representation learning.
Extensive experiments demonstrate its superiority over several state-of-the-art algorithms.
arXiv Detail & Related papers (2023-03-08T04:27:46Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - Online Knowledge Distillation via Mutual Contrastive Learning for Visual Recognition [27.326420185846327]
We present a Mutual Contrastive Learning (MCL) framework for online Knowledge Distillation (KD).
Our MCL aggregates cross-network embedding information and maximizes a lower bound on the mutual information between two networks.
Experiments on image classification and transfer learning to visual recognition tasks show that layer-wise MCL can lead to consistent performance gains.
arXiv Detail & Related papers (2022-07-23T13:39:01Z) - Representation Learning via Consistent Assignment of Views to Clusters [0.7614628596146599]
Consistent Assignment for Representation Learning (CARL) is an unsupervised learning method to learn visual representations.
By viewing contrastive learning from a clustering perspective, CARL learns unsupervised representations by learning a set of general prototypes.
Unlike contemporary work on contrastive learning with deep clustering, CARL proposes to learn the set of general prototypes in an online fashion.
arXiv Detail & Related papers (2021-12-31T12:59:23Z) - 3D Human Action Representation Learning via Cross-View Consistency Pursuit [52.19199260960558]
We propose a Cross-view Contrastive Learning framework for unsupervised 3D skeleton-based action Representation (CrosSCLR).
CrosSCLR consists of both single-view contrastive learning (SkeletonCLR) and cross-view consistent knowledge mining (CVC-KM) modules, integrated in a collaborative learning manner.
arXiv Detail & Related papers (2021-04-29T16:29:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.