DropPos: Pre-Training Vision Transformers by Reconstructing Dropped
Positions
- URL: http://arxiv.org/abs/2309.03576v2
- Date: Fri, 22 Sep 2023 00:54:47 GMT
- Title: DropPos: Pre-Training Vision Transformers by Reconstructing Dropped
Positions
- Authors: Haochen Wang, Junsong Fan, Yuxi Wang, Kaiyou Song, Tong Wang,
Zhaoxiang Zhang
- Abstract summary: We present DropPos, a novel pretext task designed to reconstruct Dropped Positions.
The code is publicly available at https://github.com/Haochen-Wang409/DropPos.
- Score: 63.61970125369834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As it is empirically observed that Vision Transformers (ViTs) are quite
insensitive to the order of input tokens, the need for an appropriate
self-supervised pretext task that enhances the location awareness of ViTs is
becoming evident. To address this, we present DropPos, a novel pretext task
designed to reconstruct Dropped Positions. The formulation of DropPos is
simple: we first drop a large random subset of positional embeddings and then
the model classifies the actual position of each non-overlapping patch among
all possible positions, based solely on its visual appearance. To avoid
trivial solutions, we increase the difficulty of this task by keeping only a
subset of patches visible. Additionally, considering there may be different
patches with similar visual appearances, we propose position smoothing and
attentive reconstruction strategies to relax this classification problem, since
it is not necessary to reconstruct their exact positions in these cases.
Empirical evaluations of DropPos show strong capabilities. DropPos outperforms
supervised pre-training and achieves competitive results compared with
state-of-the-art self-supervised alternatives on a wide range of downstream
benchmarks. This suggests that explicitly encouraging spatial reasoning
abilities, as DropPos does, indeed contributes to the improved location
awareness of ViTs. The code is publicly available at
https://github.com/Haochen-Wang409/DropPos.
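To make the pretext task concrete, the sketch below shows one way the DropPos objective could be set up: positional embeddings are withheld, a linear head predicts each visible patch's index on the patch grid, and the target distribution is softened over spatially nearby cells as a stand-in for the paper's position-smoothing strategy. The names DropPosHead, droppos_loss, and sigma, as well as the exact smoothing form, are illustrative assumptions rather than the authors' implementation; the released code at https://github.com/Haochen-Wang409/DropPos is the reference.

```python
# A minimal sketch of the DropPos objective, assuming a generic ViT encoder
# whose positional embeddings have been dropped for the visible patches.
# DropPosHead, droppos_loss, and sigma are illustrative, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DropPosHead(nn.Module):
    """Classify the true grid index of each visible patch from its features."""

    def __init__(self, dim: int, num_positions: int):
        super().__init__()
        self.classifier = nn.Linear(dim, num_positions)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (B, N_visible, dim) -> logits over all grid positions
        return self.classifier(patch_features)


def droppos_loss(logits, true_positions, grid_size, sigma=1.0):
    """Cross-entropy against targets smoothed over spatially nearby positions,
    a simplified stand-in for the paper's position-smoothing strategy."""
    device = logits.device
    ys, xs = torch.meshgrid(
        torch.arange(grid_size, device=device),
        torch.arange(grid_size, device=device),
        indexing="ij",
    )
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1).float()  # (P, 2)
    true_xy = coords[true_positions]                                    # (B, N, 2)
    dist2 = ((coords[None, None] - true_xy[:, :, None]) ** 2).sum(-1)   # (B, N, P)
    targets = F.softmax(-dist2 / (2 * sigma ** 2), dim=-1)              # soft labels
    return -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()


# Toy usage: a 14x14 patch grid, 49 visible patches per image, 768-d features.
B, grid, dim = 2, 14, 768
feats = torch.randn(B, 49, dim)                    # encoder outputs, no pos. embed.
true_pos = torch.randint(0, grid * grid, (B, 49))  # ground-truth grid indices
head = DropPosHead(dim, grid * grid)
loss = droppos_loss(head(feats), true_pos, grid)
```

The sketch treats every visible patch as having a dropped position and omits the paper's attentive reconstruction, which down-weights visually ambiguous patches; consult the released code for the full recipe.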
Related papers
- Activating Self-Attention for Multi-Scene Absolute Pose Regression [21.164101507575186]
Multi-scene absolute pose regression addresses the demand for fast and memory-efficient camera pose estimation.
Transformer encoders are underutilized due to the collapsed self-attention map.
We present an auxiliary loss that aligns queries and keys, preventing the distortion of query-key space.
arXiv Detail & Related papers (2024-11-03T06:00:36Z)
- Breaking the Frame: Visual Place Recognition by Overlap Prediction [53.17564423756082]
We propose a novel visual place recognition approach based on overlap prediction, called VOP.
VOP processes co-visible image sections by obtaining patch-level embeddings using a Vision Transformer backbone.
Our approach uses a voting mechanism to assess overlap scores for potential database images.
arXiv Detail & Related papers (2024-06-23T20:00:20Z)
- LoCUS: Learning Multiscale 3D-consistent Features from Posed Images [18.648772607057175]
We train a versatile neural representation without supervision.
We find that it is possible to balance retrieval and reusability by constructing a retrieval set carefully.
We show results creating sparse, multi-scale, semantic spatial maps.
arXiv Detail & Related papers (2023-10-02T11:11:23Z)
- A Frustratingly Easy Improvement for Position Embeddings via Random Padding [68.75670223005716]
In this paper, we propose a simple but effective strategy, Random Padding, without any modifications to existing pre-trained language models.
Experiments show that Random Padding can significantly improve model performance on the instances whose answers are located at rear positions.
arXiv Detail & Related papers (2023-05-08T17:08:14Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features that are visible to the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
- DropKey [9.846606347586906]
We focus on analyzing and improving the dropout technique for self-attention layers of Vision Transformer.
We propose to move the dropout operation ahead of the attention matrix calculation and to use the Key as the dropout unit.
We experimentally validate that the proposed schedule avoids overfitting to low-level features and the loss of high-level semantics.
arXiv Detail & Related papers (2022-08-04T13:24:04Z)
- SHAPE: Shifted Absolute Position Embedding for Transformers [59.03597635990196]
Existing position representations suffer from a lack of generalization to test data with unseen lengths or high computational cost.
We investigate shifted absolute position embedding (SHAPE) to address both issues.
arXiv Detail & Related papers (2021-09-13T00:10:02Z)
- Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation [85.96410825961966]
We argue that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries.
To facilitate inference, we propose to instead perform regression from a set of points placed at more advantageous positions.
We apply this proposed framework, called Point-Set Anchors, to object detection, instance segmentation, and human pose estimation.
arXiv Detail & Related papers (2020-07-06T15:59:56Z)