Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks
- URL: http://arxiv.org/abs/2205.15173v1
- Date: Mon, 30 May 2022 15:25:37 GMT
- Title: Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks
- Authors: Jaonary Rabarisoa, Valentin Belissen, Florian Chabot, Quoc-Cuong Pham
- Abstract summary: We present a new self-supervised pre-training strategy for Vision Transformers aimed at dense prediction tasks.
Our strategy produces local features better suited to dense prediction tasks than contrastive pre-training based only on a global image representation.
- Score: 2.160196691362033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a new self-supervised pre-training strategy for Vision
Transformers aimed at dense prediction tasks. It is based on a contrastive loss
across views that compares pixel-level representations to global image
representations. This strategy produces local features better suited to dense
prediction tasks than contrastive pre-training based only on a global image
representation. Furthermore, our approach does not suffer from reduced batch
sizes, since the number of negative examples needed in the contrastive loss is
on the order of the number of local features. We demonstrate the effectiveness
of our pre-training strategy on two dense prediction tasks: semantic
segmentation and monocular depth estimation.
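To make the idea concrete, here is a minimal PyTorch sketch (not the authors' released code) of a cross-view contrastive loss in which the global representation of one view is contrasted against pixel/patch-level features of another view; the multi-positive InfoNCE form, tensor shapes, and temperature are assumptions.

```python
# Hedged sketch of a pixel-to-global cross-view contrastive loss.
import torch
import torch.nn.functional as F

def global_to_pixel_contrastive(global_feats, local_feats, temperature=0.1):
    """global_feats: (B, D)    image-level embeddings of view 1
       local_feats:  (B, N, D) patch-level embeddings of view 2"""
    B, N, D = local_feats.shape
    g = F.normalize(global_feats, dim=-1)                    # (B, D)
    loc = F.normalize(local_feats, dim=-1).reshape(B * N, D) # (B*N, D)

    logits = g @ loc.t() / temperature                       # (B, B*N)
    log_prob = F.log_softmax(logits, dim=1)

    # Positives: the N local features from the same image; the remaining
    # (B-1)*N local features in the batch act as negatives, so the negative
    # pool grows with the number of local features rather than the batch size.
    pos_mask = torch.zeros(B, B * N, dtype=torch.bool, device=logits.device)
    for i in range(B):
        pos_mask[i, i * N:(i + 1) * N] = True

    return -(log_prob[pos_mask].reshape(B, N)).mean()

# Random features standing in for ViT outputs of two augmented views.
loss = global_to_pixel_contrastive(torch.randn(4, 256), torch.randn(4, 196, 256))
```

Because every local feature of every other image in the batch can serve as a negative, a usable negative pool is available even with small batches, which is the property the abstract highlights.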
Related papers
- Enhancing 3D Transformer Segmentation Model for Medical Image with Token-level Representation Learning [9.896550384001348]
This work proposes a token-level representation learning loss that maximizes agreement between token embeddings from different augmented views individually.
We also invent a simple "rotate-and-restore" mechanism, which rotates and flips one augmented view of input volume, and later restores the order of tokens in the feature maps.
We test our pre-training scheme on two public medical segmentation datasets, and the results on the downstream segmentation task show that our method brings larger improvements than other state-of-the-art pre-training methods.
arXiv Detail & Related papers (2024-08-12T01:49:13Z)
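A hedged sketch of the token-level agreement idea above, with a simplified 2D "rotate-and-restore" step (the paper operates on 3D medical volumes); the grid size and negative-cosine objective are illustrative assumptions.

```python
# Token-level agreement between two augmented views, restoring token order
# of the rotated view before comparing tokens position-by-position.
import torch
import torch.nn.functional as F

def restore_token_order(tokens, grid, k):
    """Undo a 90-degree*k rotation applied to the input before encoding.
       tokens: (B, H*W, D) token embeddings of the rotated view."""
    B, _, D = tokens.shape
    H, W = grid
    maps = tokens.transpose(1, 2).reshape(B, D, H, W)
    maps = torch.rot90(maps, k=-k, dims=(2, 3))   # inverse rotation
    return maps.reshape(B, D, H * W).transpose(1, 2)

def token_agreement_loss(tokens_a, tokens_b_rotated, grid=(14, 14), k=1):
    tokens_b = restore_token_order(tokens_b_rotated, grid, k)
    a = F.normalize(tokens_a, dim=-1)
    b = F.normalize(tokens_b, dim=-1)
    # Maximize agreement between corresponding tokens of the two views.
    return -(a * b).sum(dim=-1).mean()

loss = token_agreement_loss(torch.randn(2, 196, 128), torch.randn(2, 196, 128))
```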
- Rethinking Visual Content Refinement in Low-Shot CLIP Adaptation [31.023236232633213]
Recent adaptations can boost the low-shot capability of Contrastive Vision-Language Pre-training.
We propose a Visual Content Refinement (VCR) before the adaptation calculation during the test stage.
We apply our method to 3 popular low-shot benchmark tasks with 13 datasets and achieve a significant improvement over state-of-the-art methods.
arXiv Detail & Related papers (2024-07-19T08:34:23Z)
- Exploiting Diffusion Prior for Generalizable Dense Prediction [85.4563592053464]
The content generated by recent advanced Text-to-Image (T2I) diffusion models is sometimes too imaginative for existing off-the-shelf dense predictors to handle.
We introduce DMP, a pipeline utilizing pre-trained T2I models as a prior for dense prediction tasks.
Despite limited-domain training data, the approach yields faithful estimations for arbitrary images, surpassing existing state-of-the-art algorithms.
arXiv Detail & Related papers (2023-11-30T18:59:44Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
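As an illustration of the location-aware pretext task above, the following sketch (assumed details, not the paper's implementation) has each query patch predict its position in a reference patch grid while a random subset of reference features is hidden to control difficulty.

```python
# Relative-location pretext head: queries attend to partially masked reference
# tokens and classify their own position in the reference grid.
import torch
import torch.nn as nn

class RelativeLocationHead(nn.Module):
    def __init__(self, dim=128, num_positions=196, mask_ratio=0.5):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.classifier = nn.Linear(dim, num_positions)

    def forward(self, query_tokens, reference_tokens):
        # Hide a random subset of reference features from the queries.
        B, N, _ = reference_tokens.shape
        masked = torch.rand(B, N, device=reference_tokens.device) < self.mask_ratio
        ctx, _ = self.attn(query_tokens, reference_tokens, reference_tokens,
                           key_padding_mask=masked)
        return self.classifier(ctx)              # (B, N_q, num_positions)

head = RelativeLocationHead()
logits = head(torch.randn(2, 196, 128), torch.randn(2, 196, 128))
# True grid position of each query patch (identity mapping here, for illustration).
targets = torch.arange(196).expand(2, 196)
loss = nn.functional.cross_entropy(logits.reshape(-1, 196), targets.reshape(-1))
```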
- Patch-level Gaze Distribution Prediction for Gaze Following [49.93340533068501]
We introduce the patch distribution prediction (PDP) method for gaze following training.
We show that our model regularizes the MSE loss by predicting better heatmap distributions on images with larger annotation variances.
Experiments show that our model bridges the gap between the target prediction and in/out prediction subtasks, yielding a significant improvement on both subtasks on public gaze following datasets.
arXiv Detail & Related papers (2022-11-20T19:25:15Z)
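One plausible reading of the patch-level distribution idea above, sketched below with assumed grid size, losses, and weighting: the dense gaze heatmap target is pooled into a coarse patch distribution and supervised with a KL term alongside the usual MSE.

```python
# Patch-distribution regularizer on top of heatmap regression (illustrative).
import torch
import torch.nn.functional as F

def pdp_loss(pred_heatmap, pred_patch_logits, target_heatmap, grid=7, alpha=1.0):
    """pred_heatmap:      (B, 1, H, W) regressed gaze heatmap
       pred_patch_logits: (B, grid*grid) patch-level prediction
       target_heatmap:    (B, 1, H, W) ground-truth heatmap"""
    mse = F.mse_loss(pred_heatmap, target_heatmap)

    # Aggregate the target heatmap into a patch-level probability distribution.
    target_patches = F.adaptive_avg_pool2d(target_heatmap, grid).flatten(1)
    target_dist = target_patches / target_patches.sum(dim=1, keepdim=True).clamp_min(1e-8)

    kl = F.kl_div(F.log_softmax(pred_patch_logits, dim=1), target_dist,
                  reduction='batchmean')
    return mse + alpha * kl

loss = pdp_loss(torch.rand(2, 1, 64, 64), torch.randn(2, 49), torch.rand(2, 1, 64, 64))
```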
- Generalizing Interactive Backpropagating Refinement for Dense Prediction [0.0]
We introduce a set of G-BRS layers that enable both global and localized refinement for a range of dense prediction tasks.
Our method can successfully generalize and significantly improve performance of existing pretrained state-of-the-art models with only a few clicks.
arXiv Detail & Related papers (2021-12-21T03:52:08Z)
- On Efficient Transformer and Image Pre-training for Low-level Vision [74.22436001426517]
Pre-training has produced numerous state-of-the-art results in high-level computer vision.
We present an in-depth study of image pre-training.
We find pre-training plays strikingly different roles in low-level tasks.
arXiv Detail & Related papers (2021-12-19T15:50:48Z)
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z)
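The set-prediction ingredient of the POET-style loss above can be sketched as follows; the plain L1 matching cost is a simplification of the full loss, which also includes visibility, center, and class terms.

```python
# Hungarian matching between predicted and ground-truth poses before the loss.
import torch
from scipy.optimize import linear_sum_assignment

def matched_keypoint_loss(pred_poses, gt_poses):
    """pred_poses: (Q, K, 2) predicted keypoints for Q queries
       gt_poses:   (G, K, 2) ground-truth keypoints for G people (G <= Q)"""
    # Pairwise L1 cost between every prediction and every ground-truth pose.
    cost = (pred_poses.unsqueeze(1) - gt_poses.unsqueeze(0)).abs().sum(dim=(2, 3))
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    matched_pred = pred_poses[row]
    matched_gt = gt_poses[col]
    return torch.nn.functional.l1_loss(matched_pred, matched_gt)

loss = matched_keypoint_loss(torch.randn(20, 17, 2), torch.randn(3, 17, 2))
```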
- Adversarial Semantic Data Augmentation for Human Pose Estimation [96.75411357541438]
We propose Semantic Data Augmentation (SDA), a method that augments images by pasting segmented body parts with various semantic granularity.
We also propose Adversarial Semantic Data Augmentation (ASDA), which exploits a generative network to dynamically predict tailored pasting configurations.
State-of-the-art results are achieved on challenging benchmarks.
arXiv Detail & Related papers (2020-08-03T07:56:04Z)
- Supervision Accelerates Pre-training in Contrastive Semi-Supervised Learning of Visual Representations [12.755943669814236]
We propose a semi-supervised loss, SuNCEt, that aims to distinguish examples of different classes in addition to self-supervised instance-wise pretext tasks.
On ImageNet, we find that SuNCEt can be used to match the semi-supervised learning accuracy of previous contrastive approaches.
Our main insight is that leveraging even a small amount of labeled data during pre-training, and not only during fine-tuning, provides an important signal.
arXiv Detail & Related papers (2020-06-18T18:44:13Z)
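A rough sketch of adding a class-discrimination term for the labeled subset on top of an instance-wise contrastive objective, in the spirit of SuNCEt; the exact formulation and weighting in the paper may differ.

```python
# Supervised contrastive term over the labeled examples in a batch.
import torch
import torch.nn.functional as F

def class_contrastive_loss(feats, labels, temperature=0.1):
    """feats: (B, D) embeddings of labeled examples; labels: (B,) class ids."""
    z = F.normalize(feats, dim=-1)
    logits = z @ z.t() / temperature
    logits.fill_diagonal_(float('-inf'))            # exclude self-similarity
    same_class = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    same_class.fill_diagonal_(0.0)
    log_prob = F.log_softmax(logits, dim=1)
    # Pull together examples of the same class, push apart the rest.
    per_anchor = (log_prob * same_class).sum(1) / same_class.sum(1).clamp_min(1.0)
    return -per_anchor.mean()

feats, labels = torch.randn(8, 128), torch.randint(0, 3, (8,))
suncet_term = class_contrastive_loss(feats, labels)  # added to the usual
                                                      # instance-wise InfoNCE loss
```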
This list is automatically generated from the titles and abstracts of the papers in this site.