Unleashing Vanilla Vision Transformer with Masked Image Modeling for
Object Detection
- URL: http://arxiv.org/abs/2204.02964v1
- Date: Wed, 6 Apr 2022 17:59:04 GMT
- Title: Unleashing Vanilla Vision Transformer with Masked Image Modeling for
Object Detection
- Authors: Yuxin Fang, Shusheng Yang, Shijie Wang, Yixiao Ge, Ying Shan, Xinggang
Wang
- Abstract summary: A MIM pre-trained vanilla ViT can work surprisingly well in the challenging object-level recognition scenario.
A randomly initialized compact convolutional stem supplants the pre-trained large-kernel patchify stem.
The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform the hierarchical Swin Transformer by 2.3 box AP and 2.5 mask AP on COCO.
- Score: 39.37861288287621
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present an approach to efficiently and effectively adapt a masked
image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object
detection, based on two novel observations: (i) a MIM pre-trained vanilla ViT
can work surprisingly well in the challenging object-level recognition scenario
even with randomly sampled partial observations, e.g., only 25%~50% of the
input sequence; (ii) to construct multi-scale representations for object
detection, a randomly initialized compact convolutional stem supplants the
pre-trained large-kernel patchify stem, and its intermediate features can
naturally serve as the higher-resolution inputs of a feature pyramid without
upsampling. The pre-trained ViT thus acts only as the third stage of the
detector's backbone rather than as the whole feature extractor, resulting in a
ConvNet-ViT hybrid architecture. The proposed detector, named MIMDet, enables a
MIM pre-trained vanilla ViT to outperform the hierarchical Swin Transformer by
2.3 box AP and 2.5 mask AP on COCO, and to achieve even better results than
other adapted vanilla ViTs under a more modest fine-tuning recipe while
converging 2.8x faster. Code and pre-trained models are available at
\url{https://github.com/hustvl/MIMDet}.
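
To make the ConvNet-ViT hybrid design more concrete, the sketch below illustrates the two observations in PyTorch: a randomly initialized compact convolutional stem whose stride-4/8 intermediate features feed the feature pyramid directly, and a MIM pre-trained vanilla ViT used only as the third backbone stage, optionally run on a random subset of tokens. The module names, channel schedule, and the handling of positions not routed through the ViT are illustrative assumptions, not the released MIMDet implementation (see the repository linked above).

```python
# Minimal sketch of a MIMDet-style ConvNet-ViT hybrid backbone.
# Assumptions: PyTorch; channel schedule, module names, and the treatment of
# unsampled token positions are illustrative, not the authors' released code.
import torch
import torch.nn as nn


class CompactConvStem(nn.Module):
    """Randomly initialized conv stem replacing the ViT's large-kernel patchify stem.

    Its stride-4 and stride-8 intermediate feature maps are reused directly as the
    higher-resolution pyramid levels, so no upsampling is needed.
    """

    def __init__(self, embed_dim=768):
        super().__init__()
        dims = (embed_dim // 4, embed_dim // 2, embed_dim)  # assumed channel schedule
        self.to_s4 = nn.Sequential(  # stride 4
            nn.Conv2d(3, dims[0], 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dims[0], dims[0], 3, stride=2, padding=1), nn.GELU(),
        )
        self.to_s8 = nn.Sequential(  # stride 8
            nn.Conv2d(dims[0], dims[1], 3, stride=2, padding=1), nn.GELU(),
        )
        self.to_s16 = nn.Conv2d(dims[1], dims[2], 3, stride=2, padding=1)  # stride 16 -> ViT tokens

    def forward(self, x):
        c4 = self.to_s4(x)     # (B, C/4, H/4,  W/4)
        c8 = self.to_s8(c4)    # (B, C/2, H/8,  W/8)
        c16 = self.to_s16(c8)  # (B, C,   H/16, W/16)
        return c4, c8, c16


class HybridBackbone(nn.Module):
    """ConvNet-ViT hybrid: the MIM pre-trained vanilla ViT acts only as 'stage 3'."""

    def __init__(self, vit_blocks, embed_dim=768, sample_ratio=0.5):
        super().__init__()
        self.stem = CompactConvStem(embed_dim)
        self.vit_blocks = nn.ModuleList(vit_blocks)  # pre-trained ViT encoder blocks
        self.sample_ratio = sample_ratio             # e.g., 25%-50% of tokens during training
        # Positional embeddings are omitted here for brevity.

    def forward(self, x):
        c4, c8, c16 = self.stem(x)
        B, C, H, W = c16.shape
        tokens = c16.flatten(2).transpose(1, 2)  # (B, N, C), N = H * W

        if self.training and self.sample_ratio < 1.0:
            # Observation (i): feed only a random subset of tokens through the ViT.
            # Simplification (assumption): positions not sent through the ViT keep
            # their stem features instead of being reconstructed.
            n_tokens = tokens.shape[1]
            keep = max(1, int(n_tokens * self.sample_ratio))
            idx = torch.rand(B, n_tokens, device=tokens.device).argsort(dim=1)[:, :keep]
            idx_c = idx.unsqueeze(-1).expand(-1, -1, C)
            sampled = tokens.gather(1, idx_c)
            for blk in self.vit_blocks:
                sampled = blk(sampled)
            tokens = tokens.scatter(1, idx_c, sampled)
        else:
            for blk in self.vit_blocks:
                tokens = blk(tokens)

        p16 = tokens.transpose(1, 2).reshape(B, C, H, W)
        p32 = nn.functional.max_pool2d(p16, 2)  # extra stride-32 level
        # Multi-scale features for a standard FPN / detection head.
        return {"p4": c4, "p8": c8, "p16": p16, "p32": p32}
```

The key design choice reflected above is that the ViT never sees the raw image patches: the compact conv stem produces the stride-16 tokens, so the pre-trained patchify stem can be discarded and the stem's earlier feature maps double as the finer pyramid levels.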
Related papers
- MoE-FFD: Mixture of Experts for Generalized and Parameter-Efficient Face Forgery Detection (2024-04-12) [54.545054873239295]
  Deepfakes have recently raised significant trust issues and security concerns among the public.
  ViT-based methods take advantage of the expressivity of transformers, achieving superior detection performance.
  This work introduces Mixture-of-Experts modules for Face Forgery Detection (MoE-FFD), a generalized yet parameter-efficient ViT-based approach.
- Denoising Vision Transformers (2024-01-05) [43.03068202384091]
  We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT).
  In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
  In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
- Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer (2023-09-20) [54.32283739486781]
  We present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm.
  FA-ViT achieves 93.83% and 78.32% AUC scores on the Celeb-DF and DFDC datasets in the cross-dataset evaluation.
- Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection (2022-05-19) [78.2325219839805]
  imTED improves the state of the art of few-shot object detection by up to 7.6% AP.
  Experiments on the MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by 2.8%.
- BTranspose: Bottleneck Transformers for Human Pose Estimation with Self-Supervised Pre-Training (2022-04-21) [0.304585143845864]
  In this paper, we consider the recently proposed Bottleneck Transformers, which combine CNN and multi-head self-attention (MHSA) layers effectively.
  We consider different backbone architectures and pre-train them using the DINO self-supervised learning method.
  Experiments show that our model achieves an AP of 76.4, which is competitive with other methods such as [1] and has fewer network parameters.
- BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning (2022-04-04) [88.82371069668147]
  BatchFormerV2 is a more general batch Transformer module, which enables exploring sample relationships for dense representation learning.
  BatchFormerV2 consistently improves current DETR-based detection methods by over 1.3%.
- Visual Saliency Transformer (2021-04-25) [127.33678448761599]
  We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
  It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
  Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
This list is automatically generated from the titles and abstracts of the papers in this site.