ViDT: An Efficient and Effective Fully Transformer-based Object Detector
- URL: http://arxiv.org/abs/2110.03921v1
- Date: Fri, 8 Oct 2021 06:32:05 GMT
- Title: ViDT: An Efficient and Effective Fully Transformer-based Object Detector
- Authors: Hwanjun Song, Deqing Sun, Sanghyuk Chun, Varun Jampani, Dongyoon Han,
Byeongho Heo, Wonjae Kim, Ming-Hsuan Yang
- Abstract summary: Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architecture for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
- Score: 97.71746903042968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers are transforming the landscape of computer vision, especially
for recognition tasks. Detection transformers are the first fully end-to-end
learning systems for object detection, while vision transformers are the first
fully transformer-based architecture for image classification. In this paper,
we integrate Vision and Detection Transformers (ViDT) to build an effective and
efficient object detector. ViDT introduces a reconfigured attention module to
extend the recent Swin Transformer to be a standalone object detector, followed
by a computationally efficient transformer decoder that exploits multi-scale
features and auxiliary techniques essential to boost the detection performance
without much increase in computational load. Extensive evaluation results on
the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP
and latency trade-off among existing fully transformer-based object detectors,
and achieves 49.2 AP owing to its high scalability for large models. We will
release the code and trained models at https://github.com/naver-ai/vidt.
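To make the described layout concrete, here is a minimal, hypothetical PyTorch sketch of the two pieces named in the abstract: a hierarchical backbone whose attention jointly processes patch tokens and learnable [DET] tokens (standing in for the reconfigured attention module), followed by a small transformer decoder. All names (`ViDTSketch`, `ReconfiguredStage`) are illustrative, plain multi-head attention replaces the paper's efficient attention scheme, and the decoder here sees only the last scale rather than true multi-scale features; the released implementation lives at the repository above.

```python
import torch
import torch.nn as nn

class ReconfiguredStage(nn.Module):
    """One backbone stage in which patch tokens and [DET] tokens attend jointly.

    The paper's reconfigured attention restricts which token groups interact
    for efficiency; this sketch substitutes plain multi-head self-attention
    over the concatenated sequence.
    """
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, patches, det):
        x = torch.cat([patches, det], dim=1)       # joint patch + [DET] sequence
        out, _ = self.attn(x, x, x)
        out = self.norm(x + out)                   # residual + norm
        n = patches.shape[1]
        return out[:, :n], out[:, n:]              # split the groups back apart

class ViDTSketch(nn.Module):
    def __init__(self, dims=(64, 128, 256), num_det=100, num_classes=91):
        super().__init__()
        self.det = nn.Parameter(torch.randn(1, num_det, dims[0]))
        self.stages = nn.ModuleList(ReconfiguredStage(d) for d in dims)
        self.proj = nn.ModuleList(
            nn.Linear(a, b) for a, b in zip(dims, dims[1:]))
        layer = nn.TransformerDecoderLayer(dims[-1], nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.cls_head = nn.Linear(dims[-1], num_classes)
        self.box_head = nn.Linear(dims[-1], 4)

    def forward(self, patches):                    # (B, N, dims[0]) patch tokens
        det = self.det.expand(patches.shape[0], -1, -1)
        for i, stage in enumerate(self.stages):
            patches, det = stage(patches, det)
            if i < len(self.proj):                 # crude stand-in for patch merging
                patches = self.proj[i](patches[:, ::2])
                det = self.proj[i](det)
        queries = self.decoder(det, patches)       # decode [DET] tokens vs. features
        return self.cls_head(queries), self.box_head(queries).sigmoid()

# e.g.: logits, boxes = ViDTSketch()(torch.randn(2, 196, 64))
```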
Related papers
- Hierarchical Point Attention for Indoor 3D Object Detection [111.04397308495618]
This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors.
First, we propose Multi-Scale Attention (MS-A) that builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning.
Second, we propose Size-Adaptive Local Attention (Local-A) with adaptive attention regions for localized feature aggregation within bounding box proposals.
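As a rough illustration of the multi-scale token idea, the sketch below (the class name and the strided mean-pooling downsampler are my assumptions, not the paper's design) pools single-scale tokens into coarser banks and lets each fine-scale query attend over the fine-plus-coarse concatenation.

```python
import torch
import torch.nn as nn

class MultiScaleAttentionSketch(nn.Module):
    """Build coarser token banks from single-scale tokens and let each
    query attend over the fine + coarse concatenation."""
    def __init__(self, dim, heads=4, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):                   # (B, N, C), N divisible by max scale
        banks = [tokens]                         # scale 1: the input itself
        for s in self.scales[1:]:
            # strided mean-pooling is a simple stand-in for learned downsampling
            banks.append(tokens.unfold(1, s, s).mean(dim=-1))   # (B, N // s, C)
        keys = torch.cat(banks, dim=1)           # multi-scale key/value bank
        out, _ = self.attn(tokens, keys, keys)   # queries stay at the fine scale
        return out

# e.g.: out = MultiScaleAttentionSketch(dim=64)(torch.randn(2, 128, 64))
```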
arXiv Detail & Related papers (2023-01-06T18:52:12Z)
- HiViT: Hierarchical Vision Transformer Meets Masked Image Modeling [126.89573619301953]
We propose a new design of hierarchical vision transformers named HiViT (short for Hierarchical ViT)
HiViT enjoys both high efficiency and good performance in MIM.
When running MAE on ImageNet-1K, HiViT-B reports a +0.6% accuracy gain over ViT-B and a 1.9$\times$ speed-up over Swin-B.
arXiv Detail & Related papers (2022-05-30T09:34:44Z)
- Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection [78.2325219839805]
imTED improves the state of the art of few-shot object detection by up to 7.6% AP.
Experiments on the MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by 2.8%.
arXiv Detail & Related papers (2022-05-19T15:11:20Z)
- An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
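A hedged sketch of what joint-task learning can look like at the head level (hypothetical module; ViDT+'s actual design differs): the same decoder queries feed a class/box head for detection and a mask head for instance segmentation, so both tasks train from shared features.

```python
import torch
import torch.nn as nn

class JointTaskHeads(nn.Module):
    """Shared decoder queries feed both a detection head and a mask head."""
    def __init__(self, dim=256, num_classes=91, mask_size=28):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)
        self.box_head = nn.Linear(dim, 4)
        self.mask_head = nn.Sequential(            # per-query low-res mask logits
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, mask_size * mask_size))
        self.mask_size = mask_size

    def forward(self, queries):                    # (B, Q, dim) decoder outputs
        b, q, _ = queries.shape
        logits = self.cls_head(queries)            # detection: class scores
        boxes = self.box_head(queries).sigmoid()   # detection: normalized boxes
        masks = self.mask_head(queries).view(b, q, self.mask_size, self.mask_size)
        return logits, boxes, masks                # one joint forward pass

# e.g.: logits, boxes, masks = JointTaskHeads()(torch.randn(2, 100, 256))
```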
arXiv Detail & Related papers (2022-04-17T09:27:45Z)
- Searching Intrinsic Dimensions of Vision Transformers [6.004704152622424]
We propose SiDT, a method for pruning vision transformer backbones for more complex vision tasks such as object detection.
Experiments on the CIFAR-100 and COCO datasets show that backbones with 20% or 40% of their dimensions/parameters pruned can perform on par with, or even better than, the unpruned models.
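To make the pruning idea concrete, here is a sketch under stated assumptions (the helper name and the magnitude-based importance score are placeholders, not SiDT's actual search procedure) that shrinks the hidden width of a transformer MLP block to a kept fraction of its dimensions.

```python
import torch
import torch.nn as nn

def prune_hidden_dims(fc1: nn.Linear, fc2: nn.Linear, keep: float = 0.8):
    """Prune an MLP block fc2(act(fc1(x))) to `keep` of its hidden width."""
    importance = fc1.weight.abs().sum(dim=1)       # one score per hidden dim
    k = max(1, int(keep * fc1.out_features))
    idx = importance.topk(k).indices               # dimensions worth keeping
    new_fc1 = nn.Linear(fc1.in_features, k)
    new_fc2 = nn.Linear(k, fc2.out_features)
    with torch.no_grad():                          # copy the surviving weights
        new_fc1.weight.copy_(fc1.weight[idx])
        new_fc1.bias.copy_(fc1.bias[idx])
        new_fc2.weight.copy_(fc2.weight[:, idx])
        new_fc2.bias.copy_(fc2.bias)
    return new_fc1, new_fc2

# e.g. pruning a ViT MLP block to 80% of its hidden width:
# fc1, fc2 = nn.Linear(384, 1536), nn.Linear(1536, 384)
# fc1_p, fc2_p = prune_hidden_dims(fc1, fc2, keep=0.8)
```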
arXiv Detail & Related papers (2022-04-16T05:16:35Z)
- Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows [57.00864538284686]
Iwin Transformer is a hierarchical Transformer which progressively performs token representation learning and token agglomeration within irregular windows.
The effectiveness and efficiency of Iwin Transformer are verified on two standard HOI detection benchmark datasets.
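As a loose illustration of token agglomeration (this stand-in mean-pools tokens sharing a window assignment; Iwin's agglomeration and its irregular-window construction are learned and considerably more involved):

```python
import torch

def agglomerate_tokens(tokens, window_id):
    """Mean-pool tokens that share a window id.

    tokens:    (N, C) token features
    window_id: (N,) integer window assignment per token
    """
    num_windows = int(window_id.max()) + 1
    pooled = torch.zeros(num_windows, tokens.shape[1], dtype=tokens.dtype)
    counts = torch.zeros(num_windows, dtype=tokens.dtype)
    pooled.index_add_(0, window_id, tokens)        # sum features per window
    counts.index_add_(0, window_id, torch.ones_like(window_id, dtype=tokens.dtype))
    return pooled / counts.unsqueeze(1)            # one merged token per window

# e.g.: merged = agglomerate_tokens(torch.randn(6, 32), torch.tensor([0, 0, 1, 1, 1, 2]))
```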
arXiv Detail & Related papers (2022-03-20T12:04:50Z)
- CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector [15.656374849760734]
We present a novel object detection architecture, named Convolutional vision Transformer Based Attentive Single Shot MultiBox Detector (CvT-ASSD).
CvT-ASSD achieves good system efficiency and performance when pretrained on large-scale detection datasets such as PASCAL VOC and MS COCO.
arXiv Detail & Related papers (2021-10-24T06:45:33Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can serve as the backbone for a common detection task head, producing competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
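A rough sketch of this recipe (a toy module, not the paper's code): encode patch tokens with a ViT, reshape them back into a 2D feature map, and expose that map to a conventional detection head.

```python
import torch
import torch.nn as nn

class ViTBackboneSketch(nn.Module):
    """Toy ViT encoder whose output doubles as a 2D detection feature map."""
    def __init__(self, dim=256, patch=16, depth=4):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.out_channels = dim                    # torchvision heads read this

    def forward(self, images):                     # (B, 3, H, W)
        feat = self.patchify(images)               # (B, C, H/16, W/16)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)   # (B, h*w, C) patch tokens
        tokens = self.encoder(tokens)
        return tokens.transpose(1, 2).view(b, c, h, w)  # back to a 2D map
```

Since the module exposes `out_channels`, it can be paired with a region-proposal head such as torchvision's `FasterRCNN`, given an anchor generator matched to the single feature map.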
arXiv Detail & Related papers (2020-12-17T22:33:14Z)