A Multi-view Perspective of Self-supervised Learning
- URL: http://arxiv.org/abs/2003.00877v2
- Date: Fri, 15 May 2020 04:20:00 GMT
- Title: A Multi-view Perspective of Self-supervised Learning
- Authors: Chuanxing Geng, Zhenghao Tan, Songcan Chen
- Abstract summary: Self-supervised learning (SSL) has recently gained widespread attention; it usually introduces a pretext task that requires no manual annotation of data.
In this paper, we borrow a multi-view perspective to decouple a class of popular pretext tasks into a combination of view data augmentation (VDA) and view label classification (VLC).
Specifically, a simple multi-view learning framework (SSL-MV) is designed, which assists the feature learning of the downstream task (original view) through the same task on the augmented views.
- Score: 24.14738533504335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a newly emerging unsupervised learning paradigm, self-supervised
learning (SSL) has recently gained widespread attention. SSL usually introduces
a pretext task that requires no manual annotation of data; with its help, SSL
effectively learns feature representations beneficial for downstream tasks,
so the pretext task plays a key role. However, the study of its design, and
especially of its essence, is still open. In this paper, we borrow a
multi-view perspective to decouple a class of popular pretext tasks into a
combination of view data augmentation (VDA) and view label classification
(VLC), attempting to explore the essence of such pretext tasks while providing
some insights into their design. Specifically, we design a simple multi-view
learning framework (SSL-MV) that assists the feature learning of the
downstream task (original view) through the same task on the augmented views.
SSL-MV focuses on VDA while abandoning VLC, empirically uncovering that it is
VDA, rather than the generally credited VLC, that dominates the performance of
such SSL. Additionally, because VLC is replaced with downstream tasks on the
VDA views, SSL-MV also enables an integrated inference that combines the
predictions from the augmented views, further improving the performance.
Experiments on several benchmark datasets demonstrate its advantages.
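To make the decoupling concrete, consider rotation-based self-supervision: VDA generates rotated views of each image, and VLC is the auxiliary head that predicts which rotation was applied. The sketch below contrasts that classic VLC objective with an SSL-MV-style objective that instead runs the downstream classification task on every augmented view, plus the integrated inference that averages predictions over the views. This is a minimal PyTorch illustration under assumed names (RotationVDA, sslmv_loss, sslmv_integrated_inference are hypothetical), not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

class RotationVDA:
    """View data augmentation (VDA): produce four rotated views of a batch.

    The rotation index (0-3, i.e. k * 90 degrees) doubles as the view label
    that classic rotation-prediction SSL asks a VLC head to recover."""
    def __call__(self, x):
        # x: (B, C, H, W) -> (list of 4 rotated batches, list of view labels)
        views = [torch.rot90(x, k, dims=(2, 3)) for k in range(4)]
        labels = [torch.full((x.size(0),), k, dtype=torch.long, device=x.device)
                  for k in range(4)]
        return views, labels

def vlc_loss(backbone, vlc_head, views, view_labels):
    """Classic pretext objective: VDA + VLC (predict the applied rotation)."""
    losses = [F.cross_entropy(vlc_head(backbone(v)), y)
              for v, y in zip(views, view_labels)]
    return torch.stack(losses).mean()

def sslmv_loss(backbone, classifier, views, y):
    """SSL-MV-style objective: keep VDA, drop VLC, and run the downstream
    task (here ordinary classification with ground-truth labels y) on every
    augmented view as well as the original one."""
    losses = [F.cross_entropy(classifier(backbone(v)), y) for v in views]
    return torch.stack(losses).mean()

@torch.no_grad()
def sslmv_integrated_inference(backbone, classifier, x):
    """Integrated inference: average downstream predictions over all views."""
    views, _ = RotationVDA()(x)
    probs = torch.stack([F.softmax(classifier(backbone(v)), dim=1)
                         for v in views])
    return probs.mean(dim=0)  # (B, num_classes)
```

Averaging softmax outputs over the rotated views is just one simple way to realize the integrated inference described in the abstract; the paper may combine views differently, but the key point is that the same downstream head is reused on every view.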
Related papers
- Making Large Vision Language Models to be Good Few-shot Learners [11.204701216476815]
Few-shot classification (FSC) is a fundamental yet challenging task in computer vision.
LVLMs risk learning specific response formats rather than effectively extracting useful information from support data.
In this paper, we investigate LVLMs' performance in FSC and identify key issues such as insufficient learning and the presence of severe positional biases.
arXiv Detail & Related papers (2024-08-21T03:01:11Z)
- Response Wide Shut: Surprising Observations in Basic Vision Language Model Capabilities [30.176918208200604]
Vision-Language Models (VLMs) have emerged as general-purpose tools for addressing a variety of complex computer vision problems.
These models have been shown to be highly capable, but also lacking some basic visual understanding skills.
This paper sets out to understand the limitations of SoTA VLMs on fundamental visual tasks.
arXiv Detail & Related papers (2024-08-13T08:26:32Z)
- Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models [56.76307866160105]
We propose a contrastive learning framework, termed Document Object COntrastive learning (DoCo).
DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and aligns them to the visual features generated by the vision encoder of Large Visual-Language Models (LVLMs).
We demonstrate that the proposed DoCo serves as a plug-and-play pre-training method, which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process.
arXiv Detail & Related papers (2024-02-29T10:17:27Z)
- SignVTCL: Multi-Modal Continuous Sign Language Recognition Enhanced by Visual-Textual Contrastive Learning [51.800031281177105]
SignVTCL is a continuous sign language recognition framework enhanced by visual-textual contrastive learning.
It integrates multi-modal data (video, keypoints, and optical flow) simultaneously to train a unified visual backbone.
It achieves state-of-the-art results compared with previous methods.
arXiv Detail & Related papers (2024-01-22T11:04:55Z)
- Multi-View Class Incremental Learning [57.14644913531313]
Multi-view learning (MVL) has gained great success in integrating information from multiple perspectives of a dataset to improve downstream task performance.
This paper investigates a novel paradigm called multi-view class incremental learning (MVCIL), where a single model incrementally classifies new classes from a continual stream of views.
arXiv Detail & Related papers (2023-06-16T08:13:41Z)
- Uncovering the Hidden Dynamics of Video Self-supervised Learning under Distribution Shifts [39.080610060557476]
We study the behavior of six popular self-supervised methods (v-SimCLR, v-MoCo, v-BYOL, v-SimSiam, v-DINO, v-MAE) in response to various forms of natural distribution shift.
Our study uncovers a series of intriguing findings and interesting behaviors of VSSL methods.
arXiv Detail & Related papers (2023-06-03T06:10:20Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- A Survey on Masked Autoencoder for Self-supervised Learning in Vision and Beyond [64.85076239939336]
Self-supervised learning (SSL) in vision might follow a trajectory similar to that in NLP, where generative pretext tasks with masked prediction (e.g., BERT) have become a de facto standard SSL practice.
The success of masked image modeling has revived the masked autoencoder.
arXiv Detail & Related papers (2022-07-30T09:59:28Z)
- Exploit Clues from Views: Self-Supervised and Regularized Learning for Multiview Object Recognition [66.87417785210772]
This work investigates the problem of multiview self-supervised learning (MV-SSL).
A novel surrogate task for self-supervised learning is proposed by pursuing an "object-invariant" representation.
Experiments show that the recognition and retrieval results using view-invariant prototype embedding (VISPE) outperform other self-supervised learning methods.
arXiv Detail & Related papers (2020-03-28T07:06:06Z)