Exploring the Diversity and Invariance in Yourself for Visual
Pre-Training Task
- URL: http://arxiv.org/abs/2106.00537v1
- Date: Tue, 1 Jun 2021 14:52:36 GMT
- Title: Exploring the Diversity and Invariance in Yourself for Visual
Pre-Training Task
- Authors: Longhui Wei, Lingxi Xie, Wengang Zhou, Houqiang Li, Qi Tian
- Abstract summary: Self-supervised learning methods have achieved remarkable success in visual pre-training tasks.
However, these methods either focus only on limited regions, or the features they extract from totally different regions inside each image are nearly the same.
This paper introduces Exploring the Diversity and Invariance in Yourself (E-DIY).
- Score: 192.74445148376037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, self-supervised learning methods have achieved remarkable success
in visual pre-training tasks. By simply pulling the different augmented views of
each image together, or through other novel mechanisms, they can learn rich unsupervised
knowledge and significantly improve the transfer performance of pre-training
models. However, these works still cannot avoid the representation collapse
problem, i.e., they either focus only on limited regions, or the features extracted from
totally different regions inside each image are nearly the same. Generally,
this problem prevents the pre-training models from sufficiently describing the
multi-grained information inside images, which further limits the upper bound
of their transfer performance. To alleviate this issue, this paper introduces a
simple but effective mechanism, called Exploring the Diversity and Invariance
in Yourself (E-DIY). By pushing the most different regions inside each
augmented view away from each other, E-DIY preserves the diversity of extracted region-level
features. By pulling the most similar regions from different augmented views of
the same image together, E-DIY ensures the robustness of region-level
features. Benefiting from the above diversity- and invariance-exploring
mechanism, E-DIY maximally extracts the multi-grained visual information inside
each image. Extensive experiments on downstream tasks demonstrate the
superiority of our proposed approach, e.g., a 2.1% improvement
over the strong baseline BYOL on COCO when fine-tuning Mask R-CNN
with the R50-C4 backbone and a 1x learning schedule.
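To make the diversity/invariance idea concrete, below is a minimal, illustrative sketch of region-level losses written in the spirit of E-DIY. The selection rules (nearest cross-view region for invariance, most dissimilar within-view region for diversity) and the exact loss forms are assumptions made for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def ediy_style_losses(regions_v1: torch.Tensor, regions_v2: torch.Tensor):
    """Sketch of region-level invariance/diversity terms in the spirit of E-DIY.

    regions_v1, regions_v2: (N, D) region features extracted from two augmented
    views of the same image. The matching rules and loss forms below are
    illustrative assumptions, not the paper's exact objective.
    """
    z1 = F.normalize(regions_v1, dim=-1)
    z2 = F.normalize(regions_v2, dim=-1)

    # Invariance: pull each region's most similar cross-view region closer.
    cross_sim = z1 @ z2.t()                    # (N, N) cosine similarities across views
    best_match, _ = cross_sim.max(dim=1)       # most similar region in the other view
    invariance_loss = (1.0 - best_match).mean()

    # Diversity: push each region's most different within-view region further away.
    def diversity(z):
        mask = torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
        sim = (z @ z.t()).masked_fill(mask, float("inf"))  # ignore self-similarity
        most_different, _ = sim.min(dim=1)     # least similar region per row
        return F.relu(most_different).mean()   # penalize any residual similarity

    diversity_loss = 0.5 * (diversity(z1) + diversity(z2))
    return invariance_loss, diversity_loss
```

In a full pipeline these two region-level terms would be combined with a BYOL-style view-level objective; the sketch only illustrates the selection logic described in the abstract.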
Related papers
- CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z) - Mutual Distillation Learning For Person Re-Identification [27.350415735863184]
We propose a novel approach, Mutual Distillation Learning for Person Re-identification (termed MDPR).
Our approach encompasses two branches: a hard content branch to extract local features via a uniform horizontal partitioning strategy, and a soft content branch to dynamically distinguish between foreground and background.
Our method achieves an impressive 88.7%/94.4% in mAP/Rank-1 on the DukeMTMC-reID dataset, surpassing the current state-of-the-art results.
arXiv Detail & Related papers (2024-01-12T07:49:02Z) - Patch-Wise Self-Supervised Visual Representation Learning: A Fine-Grained Approach [4.9204263448542465]
This study introduces an innovative, fine-grained dimension by integrating patch-level discrimination into self-supervised visual representation learning.
We employ a distinctive photometric patch-level augmentation, where each patch is individually augmented, independent from other patches within the same view.
We present a simple yet effective patch-matching algorithm to find the corresponding patches across the augmented views.
arXiv Detail & Related papers (2023-10-28T09:35:30Z) - Extending global-local view alignment for self-supervised learning with remote sensing imagery [1.5192294544599656]
Self-supervised models acquire general feature representations by formulating a pretext task that generates pseudo-labels for massive unlabeled data.
Inspired by DINO, we formulate two pretext tasks for self-supervised learning on remote sensing imagery (SSLRS).
We extend DINO and propose DINO-MC, which uses local views of various-sized crops instead of a single fixed size.
arXiv Detail & Related papers (2023-03-12T14:24:10Z) - Mugs: A Multi-Granular Self-Supervised Learning Framework [114.34858365121725]
We propose an effective MUlti-Granular Self-supervised learning (Mugs) framework to explicitly learn multi-granular visual features.
Mugs has three complementary granular supervisions: 1) an instance discrimination supervision (IDS), 2) a novel local-group discrimination supervision (LGDS), and 3) a group discrimination supervision (GDS).
arXiv Detail & Related papers (2022-03-27T23:42:05Z) - X-Learner: Learning Cross Sources and Tasks for Universal Visual
Representation [71.51719469058666]
We propose a representation learning framework called X-Learner.
X-Learner learns the universal feature of multiple vision tasks supervised by various sources.
X-Learner achieves strong performance on different tasks without extra annotations, modalities and computational costs.
arXiv Detail & Related papers (2022-03-16T17:23:26Z) - RegionCL: Can Simple Region Swapping Contribute to Contrastive Learning? [76.16156833138038]
We propose a simple yet effective pretext task called Region Contrastive Learning (RegionCL)
Specifically, given two different images, we randomly crop a same-sized region from each image and swap the two regions to compose two new images together with the remaining regions.
RegionCL exploits those abundant pairs and helps the model distinguish region features from both the canvas and the paste views (a minimal illustrative sketch of this swap is given after this list).
arXiv Detail & Related papers (2021-11-24T07:19:46Z) - Progressive Multi-stage Feature Mix for Person Re-Identification [11.161336369536818]
CNNs suffer from paying too much attention to the most salient local areas.
BDB proposes to randomly drop one block in a batch to enlarge the high-response areas.
We propose a Progressive Multi-stage feature Mix network (PMM), which enables the model to find out the more precise and diverse features in a progressive manner.
arXiv Detail & Related papers (2020-07-17T06:59:39Z) - Diversity Helps: Unsupervised Few-shot Learning via Distribution
Shift-based Data Augmentation [21.16237189370515]
Few-shot learning aims to learn a new concept when only a few training examples are available.
In this paper, we develop a novel framework called Unsupervised Few-shot Learning via Distribution Shift-based Data Augmentation (ULDA).
In experiments, few-shot models learned by ULDA can achieve superior generalization performance.
arXiv Detail & Related papers (2020-04-13T07:41:56Z) - Attentive CutMix: An Enhanced Data Augmentation Approach for Deep
Learning Based Image Classification [58.20132466198622]
We propose Attentive CutMix, a naturally enhanced augmentation strategy based on CutMix.
In each training iteration, we choose the most descriptive regions based on the intermediate attention maps from a feature extractor.
Our proposed method is simple yet effective, easy to implement and can boost the baseline significantly.
arXiv Detail & Related papers (2020-03-29T15:01:05Z)
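For intuition on the RegionCL entry above, here is a minimal sketch of the described region-swap augmentation. The fixed patch size, uniform corner sampling, and NumPy array layout are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

def region_swap(img_a: np.ndarray, img_b: np.ndarray, region_size: int = 96, rng=None):
    """Cut a same-sized patch from each image and paste it into the other,
    producing two composite views (canvas + pasted region).

    Assumes both images are HxWxC arrays of the same shape; the patch size
    and sampling scheme are illustrative choices, not the paper's settings.
    """
    rng = rng or np.random.default_rng()
    h, w = img_a.shape[:2]
    # Sample one top-left corner per image so the patch fits inside it.
    ya, xa = rng.integers(0, h - region_size + 1), rng.integers(0, w - region_size + 1)
    yb, xb = rng.integers(0, h - region_size + 1), rng.integers(0, w - region_size + 1)
    patch_a = img_a[ya:ya + region_size, xa:xa + region_size].copy()
    patch_b = img_b[yb:yb + region_size, xb:xb + region_size].copy()
    out_a, out_b = img_a.copy(), img_b.copy()
    out_a[ya:ya + region_size, xa:xa + region_size] = patch_b  # canvas A with a region from B
    out_b[yb:yb + region_size, xb:xb + region_size] = patch_a  # canvas B with a region from A
    return out_a, out_b
```

Each composite image then serves simultaneously as a positive for its canvas source and as a positive for its pasted region's source, which is what supplies the abundant region-level pairs mentioned in the summary.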