Efficient Visual Pretraining with Contrastive Detection
- URL: http://arxiv.org/abs/2103.10957v1
- Date: Fri, 19 Mar 2021 14:05:12 GMT
- Title: Efficient Visual Pretraining with Contrastive Detection
- Authors: Olivier J. Hénaff, Skanda Koppula, Jean-Baptiste Alayrac, Aaron van den Oord, Oriol Vinyals, João Carreira
- Abstract summary: We introduce a new self-supervised objective, contrastive detection, which tasks representations with identifying object-level features across augmentations.
This objective extracts a rich learning signal per image, leading to state-of-the-art transfer performance from ImageNet to COCO.
In particular, our strongest ImageNet-pretrained model performs on par with SEER, one of the largest self-supervised systems to date.
- Score: 31.444554574326283
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-supervised pretraining has been shown to yield powerful representations
for transfer learning. These performance gains come at a large computational
cost however, with state-of-the-art methods requiring an order of magnitude
more computation than supervised pretraining. We tackle this computational
bottleneck by introducing a new self-supervised objective, contrastive
detection, which tasks representations with identifying object-level features
across augmentations. This objective extracts a rich learning signal per image,
leading to state-of-the-art transfer performance from ImageNet to COCO, while
requiring up to 5x less pretraining. In particular, our strongest
ImageNet-pretrained model performs on par with SEER, one of the largest
self-supervised systems to date, which uses 1000x more pretraining data.
Finally, our objective seamlessly handles pretraining on more complex images
such as those in COCO, closing the gap with supervised transfer learning from
COCO to PASCAL.
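To make the objective concrete, below is a minimal PyTorch sketch of one plausible object-level contrastive (InfoNCE) formulation: features from two augmented views are pooled under approximate object masks, and each mask is contrasted against its counterpart in the other view. The mask source, the pooling, and the assumption that masks are index-aligned across views are illustrative choices, not the authors' implementation.

```python
# A minimal sketch (not the authors' code) of an object-level contrastive loss.
# Assumptions for illustration: approximate object masks are available for each
# image, features are average-pooled under each mask, and masks are index-aligned
# across the two augmented views of the same image.
import torch
import torch.nn.functional as F


def mask_pool(features, masks):
    """Average-pool a feature map under each object mask.

    features: (B, C, H, W) backbone feature map for one augmented view.
    masks:    (B, K, H, W) K approximate object masks per image
              (e.g. from an unsupervised segmentation heuristic).
    Returns:  (B, K, C), one pooled vector per mask.
    """
    masks = masks.float()
    pooled = torch.einsum('bkhw,bchw->bkc', masks, features)
    area = masks.sum(dim=(2, 3)).clamp(min=1.0).unsqueeze(-1)
    return pooled / area


def contrastive_detection_loss(feats_a, feats_b, masks_a, masks_b, temperature=0.1):
    """InfoNCE over mask-pooled features: the same mask seen in the two views
    is a positive pair; every other mask in the batch is a negative."""
    za = F.normalize(mask_pool(feats_a, masks_a), dim=-1)  # (B, K, C)
    zb = F.normalize(mask_pool(feats_b, masks_b), dim=-1)  # (B, K, C)
    b, k, c = za.shape
    za, zb = za.reshape(b * k, c), zb.reshape(b * k, c)
    logits = za @ zb.t() / temperature                     # (B*K, B*K)
    targets = torch.arange(b * k, device=za.device)
    # Symmetrised cross-entropy: mask i in view A should match mask i in view B.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```

Because every mask in the batch, from the same image and from other images, serves as a negative, a single image contributes many contrastive examples, which is consistent with the abstract's claim of a richer learning signal per image.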
Related papers
- Self-Supervised Pretraining for 2D Medical Image Segmentation [0.0]
Self-supervised learning offers a way to lower the need for manually annotated data by pretraining models for a specific domain on unlabelled data.
We find that self-supervised pretraining on natural images and target-domain-specific images leads to the fastest and most stable downstream convergence.
In low-data scenarios, supervised ImageNet pretraining achieves the best accuracy, requiring fewer than 100 annotated samples to come close to the minimal error.
arXiv Detail & Related papers (2022-09-01T09:25:22Z) - Are Large-scale Datasets Necessary for Self-Supervised Pre-training? [29.49873710927313]
We consider a self-supervised pre-training scenario that only leverages the target task data.
Our study shows that denoising autoencoders, such as BEiT, are more robust to the type and size of the pre-training data.
On COCO, when pre-training solely using COCO images, the detection and instance segmentation performance surpasses the supervised ImageNet pre-training in a comparable setting.
arXiv Detail & Related papers (2021-12-20T18:41:32Z) - On Efficient Transformer and Image Pre-training for Low-level Vision [74.22436001426517]
Pre-training has set numerous state-of-the-art results in high-level computer vision.
We present an in-depth study of image pre-training.
We find pre-training plays strikingly different roles in low-level tasks.
arXiv Detail & Related papers (2021-12-19T15:50:48Z) - Unsupervised Object-Level Representation Learning from Scene Images [97.07686358706397]
Object-level Representation Learning (ORL) is a new self-supervised learning framework for scene images.
Our key insight is to leverage image-level self-supervised pre-training as the prior to discover object-level semantic correspondence.
ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks.
arXiv Detail & Related papers (2021-06-22T17:51:24Z) - Self-Supervised Pretraining Improves Self-Supervised Pretraining [83.1423204498361]
Self-supervised pretraining requires expensive and lengthy computation and large amounts of data, and is sensitive to data augmentation.
This paper explores Hierarchical PreTraining (HPT), which decreases convergence time and improves accuracy by initializing the pretraining process with an existing pretrained model.
We show HPT converges up to 80x faster, improves accuracy across tasks, and improves the robustness of the self-supervised pretraining process to changes in the image augmentation policy or amount of pretraining data.
arXiv Detail & Related papers (2021-03-23T17:37:51Z) - Efficient Conditional Pre-training for Transfer Learning [71.01129334495553]
We propose efficient filtering methods to select relevant subsets from the pre-training dataset.
We validate our techniques by pre-training on ImageNet in both the unsupervised and supervised settings.
We improve standard ImageNet pre-training by 1-3% by tuning available models on our subsets and pre-training on a dataset filtered from a larger-scale dataset.
arXiv Detail & Related papers (2020-11-20T06:16:15Z) - Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions [92.47566804182338]
We investigate if a strong V&L representation model can be learned through unsupervised pre-training without image-caption corpora.
In particular, we propose to conduct "mask-and-predict" pre-training on text-only and image-only corpora.
We find that such a simple approach achieves performance close to that of a model pre-trained with aligned data on four English V&L benchmarks.
arXiv Detail & Related papers (2020-10-24T08:17:54Z) - Supervision Accelerates Pre-training in Contrastive Semi-Supervised Learning of Visual Representations [12.755943669814236]
We propose a semi-supervised loss, SuNCEt, that aims to distinguish examples of different classes in addition to self-supervised instance-wise pretext tasks.
On ImageNet, we find that SuNCEt can be used to match the semi-supervised learning accuracy of previous contrastive approaches.
Our main insight is that leveraging even a small amount of labeled data during pre-training, and not only during fine-tuning, provides an important signal (an illustrative sketch of such a combined loss appears after this list).
arXiv Detail & Related papers (2020-06-18T18:44:13Z) - Cheaper Pre-training Lunch: An Efficient Paradigm for Object Detection [86.0580214485104]
We propose a general and efficient pre-training paradigm, Montage pre-training, for object detection.
Montage pre-training needs only the target detection dataset while taking only 1/4 computational resources compared to the widely adopted ImageNet pre-training.
The efficiency and effectiveness of Montage pre-training are validated by extensive experiments on the MS-COCO dataset.
arXiv Detail & Related papers (2020-04-25T16:09:46Z)
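The SuNCEt entry above combines an instance-wise contrastive pretext task with a class-level term on labelled examples. The following is an illustrative sketch of that general idea under assumed function names, loss weighting, and batch conventions; it is not the paper's implementation.

```python
# An illustrative sketch (assumed names and weighting, not the paper's code) of
# adding a class-level contrastive term on the labelled subset to a standard
# instance-wise InfoNCE loss.
import torch
import torch.nn.functional as F


def instance_nce(z1, z2, temperature=0.1):
    """Instance-discrimination InfoNCE between two augmented views."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)


def class_nce(z, labels, temperature=0.1):
    """Contrast labelled embeddings so same-class pairs score higher than
    different-class pairs."""
    z = F.normalize(z, dim=-1)
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    logp = (z @ z.t() / temperature).masked_fill(eye, float('-inf')).log_softmax(dim=1)
    denom = pos.sum(dim=1).clamp(min=1)
    return -(logp.masked_fill(~pos, 0.0).sum(dim=1) / denom).mean()


def semi_supervised_loss(z1, z2, labels, labelled, weight=1.0):
    """Instance-wise term on all examples plus a class-level term on the
    labelled subset (`labelled` is a boolean mask over the batch)."""
    loss = instance_nce(z1, z2)
    if labelled.sum() >= 2:
        loss = loss + weight * class_nce(z1[labelled], labels[labelled])
    return loss
```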