Learning Representations by Predicting Bags of Visual Words
- URL: http://arxiv.org/abs/2002.12247v1
- Date: Thu, 27 Feb 2020 16:45:25 GMT
- Title: Learning Representations by Predicting Bags of Visual Words
- Authors: Spyros Gidaris, Andrei Bursuc, Nikos Komodakis, Patrick Pérez, Matthieu Cord
- Abstract summary: Self-supervised representation learning aims to learn convnet-based image representations from unlabeled data.
Inspired by the success of NLP methods in this area, in this work we propose a self-supervised approach based on spatially dense image descriptions.
- Score: 55.332200948110895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised representation learning aims to learn convnet-based image
representations from unlabeled data. Inspired by the success of NLP methods in
this area, in this work we propose a self-supervised approach based on
spatially dense image descriptions that encode discrete visual concepts, here
called visual words. To build such discrete representations, we quantize the
feature maps of a first pre-trained self-supervised convnet, over a k-means
based vocabulary. Then, as a self-supervised task, we train another convnet to
predict the histogram of visual words of an image (i.e., its Bag-of-Words
representation) given as input a perturbed version of that image. The proposed
task forces the convnet to learn perturbation-invariant and context-aware image
features, useful for downstream image understanding tasks. We extensively
evaluate our method and demonstrate very strong empirical results; e.g., compared
to supervised pre-training, our self-supervised representations transfer better on
the detection task and perform similarly on classification over classes "unseen"
during pre-training.
This also shows that the process of image discretization into visual words
can provide the basis for very powerful self-supervised approaches in the image
domain, thus allowing further connections to be made to related methods from
the NLP domain that have been extremely successful so far.
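To make the two-stage pipeline concrete, the following is a minimal PyTorch sketch of the BoW prediction task described in the abstract. It assumes a frozen pre-trained `teacher` convnet that outputs spatial feature maps, a trainable student `backbone` that outputs pooled features, and a `perturb` augmentation function; all names and dimensions are illustrative stand-ins, not the authors' released implementation.

```python
# Minimal sketch of self-supervised Bag-of-Words (BoW) prediction.
# Assumptions (not from the paper's code): `teacher` maps images to
# (B, C, H, W) feature maps; `backbone` maps images to (B, C) pooled
# features; the k-means vocabulary (K, C) is built offline, e.g. with
# sklearn.cluster.MiniBatchKMeans over teacher feature vectors.
import torch
import torch.nn as nn
import torch.nn.functional as F


def bow_targets(feature_map: torch.Tensor, vocabulary: torch.Tensor) -> torch.Tensor:
    """Quantize a (B, C, H, W) feature map over a (K, C) vocabulary and
    return the normalized visual-word histogram (B, K)."""
    B, C, H, W = feature_map.shape
    feats = feature_map.permute(0, 2, 3, 1).reshape(-1, C)   # (B*H*W, C)
    words = torch.cdist(feats, vocabulary).argmin(dim=1)     # nearest centroid
    hist = F.one_hot(words, vocabulary.size(0)).float()
    hist = hist.view(B, H * W, -1).sum(dim=1)                # per-image counts
    return hist / hist.sum(dim=1, keepdim=True)              # BoW distribution


class BoWPredictor(nn.Module):
    """Student convnet plus a linear head predicting the BoW distribution."""

    def __init__(self, backbone: nn.Module, feat_dim: int, vocab_size: int):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(feat_dim, vocab_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.log_softmax(self.head(self.backbone(x)), dim=1)


def training_step(student, teacher, vocabulary, images, perturb):
    """One step: the student sees the perturbed image; the soft BoW target
    comes from the frozen teacher applied to the original image."""
    with torch.no_grad():
        target = bow_targets(teacher(images), vocabulary)
    log_pred = student(perturb(images))
    # Cross-entropy against the soft target (equals KL divergence up to a constant).
    return F.kl_div(log_pred, target, reduction="batchmean")
```

Predicting the full histogram, rather than a single label, is what forces the student to encode context from the whole image while remaining invariant to the perturbation.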
Related papers
- Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves superb zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z)
- TExplain: Explaining Learned Visual Features via Pre-trained (Frozen) Language Models [14.019349267520541]
We propose a novel method that leverages the capabilities of language models to interpret the learned features of pre-trained image classifiers.
Our approach generates a vast number of sentences to explain the features learned by the classifier for a given image.
Our method is the first to use the most frequent words across these sentences, tied to a visual representation, to provide insights into the classifier's decision-making process.
arXiv Detail & Related papers (2023-09-01T20:59:46Z)
- Semantic-Aware Fine-Grained Correspondence [8.29030327276322]
We propose to learn semantic-aware fine-grained correspondence using image-level self-supervised methods.
We design a pixel-level self-supervised learning objective which specifically targets fine-grained correspondence.
Our method surpasses previous state-of-the-art self-supervised methods using convolutional networks on a variety of visual correspondence tasks.
arXiv Detail & Related papers (2022-07-21T12:51:41Z)
- LEAD: Self-Supervised Landmark Estimation by Aligning Distributions of Feature Similarity [49.84167231111667]
Existing works in self-supervised landmark detection are based on learning dense (pixel-level) feature representations from an image.
We introduce an approach to enhance the learning of dense equivariant representations in a self-supervised fashion.
We show that having such a prior in the feature extractor helps in landmark detection, even with a drastically limited number of annotations.
arXiv Detail & Related papers (2022-04-06T17:48:18Z)
- Semantic-Aware Generation for Self-Supervised Visual Representation Learning [116.5814634936371]
We advocate Semantic-aware Generation (SaGe), which encourages richer semantics, rather than low-level details, to be preserved in the generated image.
SaGe complements the target network with view-specific features and thus alleviates the semantic degradation brought by intensive data augmentations.
We execute SaGe on ImageNet-1K and evaluate the pre-trained models on five downstream tasks including nearest neighbor test, linear classification, and fine-grained image recognition.
arXiv Detail & Related papers (2021-11-25T16:46:13Z)
- Self-supervised Product Quantization for Deep Unsupervised Image Retrieval [21.99902461562925]
Supervised deep learning-based hashing and vector quantization enable fast and large-scale image retrieval systems.
We propose the first deep unsupervised image retrieval method dubbed Self-supervised Product Quantization (SPQ) network, which is label-free and trained in a self-supervised manner.
Our method analyzes the image contents to extract descriptive features, allowing us to understand image representations for accurate retrieval.
arXiv Detail & Related papers (2021-09-06T05:02:34Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss (a minimal sketch of this objective follows the list below).
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
- Distilling Localization for Self-Supervised Representation Learning [82.79808902674282]
Contrastive learning has revolutionized unsupervised representation learning.
Current contrastive models are ineffective at localizing the foreground object.
We propose a data-driven approach for learning invariance to backgrounds.
arXiv Detail & Related papers (2020-04-14T16:29:42Z)
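For the dual-encoder entry above (Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision), the contrastive alignment it describes is typically a symmetric InfoNCE objective over a batch of paired embeddings. The sketch below is a standard formulation with illustrative names and a hypothetical temperature value, not that paper's code.

```python
# Standard symmetric contrastive (InfoNCE) loss for a dual encoder:
# matched image/text pairs sit on the diagonal of the similarity matrix.
import torch
import torch.nn.functional as F


def dual_encoder_contrastive_loss(image_emb, text_emb, temperature=0.07):
    image_emb = F.normalize(image_emb, dim=1)                # L2-normalize embeddings
    text_emb = F.normalize(text_emb, dim=1)
    logits = image_emb @ text_emb.t() / temperature          # (B, B) cosine similarities
    labels = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, labels)               # image -> text retrieval
    loss_t2i = F.cross_entropy(logits.t(), labels)           # text -> image retrieval
    return 0.5 * (loss_i2t + loss_t2i)
```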
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.