VICRegL: Self-Supervised Learning of Local Visual Features
- URL: http://arxiv.org/abs/2210.01571v1
- Date: Tue, 4 Oct 2022 12:54:25 GMT
- Title: VICRegL: Self-Supervised Learning of Local Visual Features
- Authors: Adrien Bardes and Jean Ponce and Yann LeCun
- Abstract summary: This paper explores the fundamental trade-off between learning local and global features.
A new method called VICRegL is proposed that learns good global and local features simultaneously.
We demonstrate strong performance on linear classification and segmentation transfer tasks.
- Score: 34.92750644059916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most recent self-supervised methods for learning image representations focus
on either producing a global feature with invariance properties, or producing a
set of local features. The former works best for classification tasks while the
latter is best for detection and segmentation tasks. This paper explores the
fundamental trade-off between learning local and global features. A new method
called VICRegL is proposed that learns good global and local features
simultaneously, yielding excellent performance on detection and segmentation
tasks while maintaining good performance on classification tasks. Concretely,
two identical branches of a standard convolutional net architecture are fed two
differently distorted versions of the same image. The VICReg criterion is
applied to pairs of global feature vectors. Simultaneously, the VICReg
criterion is applied to pairs of local feature vectors occurring before the
last pooling layer. Two local feature vectors are attracted to each other if
their l2-distance is below a threshold or if their relative locations are
consistent with a known geometric transformation between the two input images.
We demonstrate strong performance on linear classification and segmentation
transfer tasks. Code and pretrained models are publicly available at:
https://github.com/facebookresearch/VICRegL
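The abstract above describes two ingredients: the VICReg criterion applied to pairs of feature vectors, and a rule that attracts local feature vectors whose l2-distance is below a threshold (or whose locations correspond under the known transformation). The sketch below is a minimal NumPy illustration of the VICReg loss terms and the feature-distance half of the matching rule; function names, coefficients, and thresholds are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import numpy as np

def vicreg_loss(za, zb, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    """VICReg criterion on two batches of feature vectors, shape (N, D).

    Three terms: invariance (MSE between the branches), variance (a hinge
    keeping each dimension's std above 1), and covariance (decorrelating
    feature dimensions). Coefficients are commonly reported defaults and
    are illustrative here."""
    n, d = za.shape
    inv = np.mean((za - zb) ** 2)  # invariance: MSE between branches

    def var_term(z):  # variance: hinge on per-dimension std
        std = np.sqrt(z.var(axis=0) + eps)
        return np.mean(np.maximum(0.0, 1.0 - std))

    def cov_term(z):  # covariance: off-diagonal energy of the cov matrix
        zc = z - z.mean(axis=0)
        cov = (zc.T @ zc) / (n - 1)
        off = cov - np.diag(np.diag(cov))
        return np.sum(off ** 2) / d

    return (sim_w * inv
            + var_w * (var_term(za) + var_term(zb))
            + cov_w * (cov_term(za) + cov_term(zb)))

def l2_matched_pairs(fa, fb, threshold):
    """Feature-based half of the local matching rule: pair each local
    vector in fa (M, D) with its nearest neighbor in fb (K, D) when
    their l2-distance is below `threshold`. (VICRegL additionally
    matches by spatial location under the known transformation between
    the two views; that half is omitted here.)"""
    dists = np.linalg.norm(fa[:, None, :] - fb[None, :, :], axis=-1)
    nn = dists.argmin(axis=1)
    keep = dists[np.arange(len(fa)), nn] < threshold
    return [(i, int(j)) for i, j in enumerate(nn) if keep[i]]
```

On matched local pairs, the same VICReg criterion is applied as on the global vectors, so the global and local objectives share one loss form.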
Related papers
- Siamese Transformer Networks for Few-shot Image Classification [9.55588609556447]
Humans exhibit remarkable proficiency in visual classification tasks, accurately recognizing and classifying new images with minimal examples.
Existing few-shot image classification methods often emphasize either global features or local features, with few studies considering the integration of both.
We propose a novel approach based on the Siamese Transformer Network (STN).
Our strategy effectively harnesses the potential of global and local features in few-shot image classification, circumventing the need for complex feature adaptation modules.
arXiv Detail & Related papers (2024-07-16T14:27:23Z)
- AANet: Aggregation and Alignment Network with Semi-hard Positive Sample Mining for Hierarchical Place Recognition [48.043749855085025]
Visual place recognition (VPR) is one of the research hotspots in robotics, which uses visual information to locate robots.
We present a unified network capable of extracting global features for retrieving candidates via an aggregation module.
We also propose a Semi-hard Positive Sample Mining (ShPSM) strategy to select appropriate hard positive images for training more robust VPR networks.
arXiv Detail & Related papers (2023-10-08T14:46:11Z)
- High-fidelity Pseudo-labels for Boosting Weakly-Supervised Segmentation [17.804090651425955]
Image-level weakly-supervised segmentation (WSSS) reduces the usually vast data annotation cost by using surrogate segmentation masks during training.
Our work is based on two techniques for improving CAMs: importance sampling, a substitute for GAP, and a feature similarity loss.
We reformulate both techniques based on binomial posteriors of multiple independent binary problems.
This has two benefits: their performance is improved, and they become more general, resulting in an add-on method that can boost virtually any WSSS method.
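One plausible reading of importance sampling as a substitute for global average pooling (GAP) is to draw a single spatial location with probability given by a softmax over the class activation map and use that activation in place of the spatial mean. The sketch below illustrates that reading only; the function name is hypothetical, and the paper's actual formulation (binomial posteriors over independent binary problems) differs in detail.

```python
import numpy as np

def importance_sample_pool(cam, rng):
    """Single-sample substitute for GAP: draw one spatial location of
    the class activation map `cam` with probability proportional to a
    softmax over its activations, and return that activation instead
    of the spatial mean. Illustrative reading, not the paper's exact
    estimator."""
    flat = cam.ravel().astype(float)
    p = np.exp(flat - flat.max())  # numerically stable softmax
    p /= p.sum()
    idx = rng.choice(flat.size, p=p)
    return flat[idx]
```

Unlike GAP, this estimator concentrates the gradient on high-activation locations, which is one intuition for why it can sharpen the resulting masks.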
arXiv Detail & Related papers (2023-04-05T17:43:57Z)
- Learning Implicit Feature Alignment Function for Semantic Segmentation [51.36809814890326]
Implicit Feature Alignment function (IFA) is inspired by the rapidly expanding topic of implicit neural representations.
We show that IFA implicitly aligns the feature maps at different levels and is capable of producing segmentation maps in arbitrary resolutions.
Our method can be combined with various architectures for further improvement, and it achieves a state-of-the-art accuracy trade-off on common benchmarks.
arXiv Detail & Related papers (2022-06-17T09:40:14Z)
- A Hierarchical Dual Model of Environment- and Place-Specific Utility for Visual Place Recognition [26.845945347572446]
We present a novel approach to deduce two key types of utility for Visual Place Recognition (VPR).
We employ contrastive learning principles to estimate both the environment- and place-specific utility of Vector of Locally Aggregated Descriptors (VLAD) clusters.
By combining these two utility measures, our approach achieves state-of-the-art performance on three challenging benchmark datasets.
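The entry above estimates utilities for VLAD clusters. VLAD itself is a standard aggregation: each local descriptor is assigned to its nearest cluster center, the per-cluster residual sums are concatenated, and the result is l2-normalized. A minimal NumPy sketch (cluster count and shapes are illustrative):

```python
import numpy as np

def vlad(descriptors, centers):
    """Vector of Locally Aggregated Descriptors: assign each local
    descriptor (shape (n, d)) to its nearest of k cluster centers
    (shape (k, d)), accumulate the residuals per cluster, and
    l2-normalize the concatenated result (length k * d)."""
    k, d = centers.shape
    assign = np.linalg.norm(
        descriptors[:, None, :] - centers[None, :, :], axis=-1).argmin(axis=1)
    v = np.zeros((k, d))
    for desc, c in zip(descriptors, assign):
        v[c] += desc - centers[c]  # residual to the assigned center
    v = v.ravel()
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

Per-cluster utility weighting, as in the paper above, would scale each cluster's residual block before normalization.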
arXiv Detail & Related papers (2021-07-06T07:38:47Z)
- Conformer: Local Features Coupling Global Representations for Visual Recognition [72.9550481476101]
We propose a hybrid network structure, termed Conformer, to take advantage of convolutional operations and self-attention mechanisms for enhanced representation learning.
Experiments show that Conformer, under comparable parameter complexity, outperforms the visual transformer (DeiT-B) by 2.3% on ImageNet.
arXiv Detail & Related papers (2021-05-09T10:00:03Z)
- Region Similarity Representation Learning [94.88055458257081]
Region Similarity Representation Learning (ReSim) is a new approach to self-supervised representation learning for localization-based tasks.
ReSim learns both regional representations for localization as well as semantic image-level representations.
We show how ReSim learns representations which significantly improve the localization and classification performance compared to a competitive MoCo-v2 baseline.
arXiv Detail & Related papers (2021-03-24T00:42:37Z)
- Inter-Image Communication for Weakly Supervised Localization [77.2171924626778]
Weakly supervised localization aims at finding target object regions using only image-level supervision.
We propose to leverage pixel-level similarities across different objects for learning more accurate object locations.
Our method achieves a Top-1 localization error rate of 45.17% on the ILSVRC validation set.
arXiv Detail & Related papers (2020-08-12T04:14:11Z)
- Fine-Grained Visual Classification with Efficient End-to-end Localization [49.9887676289364]
We present an efficient localization module that can be fused with a classification network in an end-to-end setup.
We evaluate the new model on the three benchmark datasets CUB200-2011, Stanford Cars and FGVC-Aircraft.
arXiv Detail & Related papers (2020-05-11T14:07:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.