An Extendable, Efficient and Effective Transformer-based Object Detector
- URL: http://arxiv.org/abs/2204.07962v1
- Date: Sun, 17 Apr 2022 09:27:45 GMT
- Title: An Extendable, Efficient and Effective Transformer-based Object Detector
- Authors: Hwanjun Song, Deqing Sun, Sanghyuk Chun, Varun Jampani, Dongyoon Han,
Byeongho Heo, Wonjae Kim, Ming-Hsuan Yang
- Abstract summary: We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
- Score: 95.06044204961009
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have been widely used in numerous vision problems especially for
visual recognition and detection. Detection transformers are the first fully
end-to-end learning systems for object detection, while vision transformers are
the first fully transformer-based architectures for image classification. In
this paper, we integrate Vision and Detection Transformers (ViDT) to construct
an effective and efficient object detector. ViDT introduces a reconfigured
attention module to extend the recent Swin Transformer to be a standalone
object detector, followed by a computationally efficient transformer decoder
that exploits multi-scale features and auxiliary techniques essential to boost
the detection performance without much increase in computational load. In
addition, we extend it to ViDT+ to support joint-task learning for object
detection and instance segmentation. Specifically, we attach an efficient
multi-scale feature fusion layer and utilize two more auxiliary training
losses, IoU-aware loss and token labeling loss. Extensive evaluation results on
the Microsoft COCO benchmark dataset demonstrate that ViDT obtains the best AP
and latency trade-off among existing fully transformer-based object detectors,
and its extended ViDT+ achieves 53.2 AP owing to its high scalability for large
models. The source code and trained models are available at
https://github.com/naver-ai/vidt.
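The two auxiliary losses named in the abstract are, in their general form, simple to state. Below is a minimal PyTorch sketch, assuming already-matched prediction/ground-truth box pairs in (x1, y1, x2, y2) format and dense per-token class labels; the function names, shapes, and any weighting are illustrative, not taken from the ViDT+ implementation.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou

def iou_aware_loss(pred_boxes, pred_iou_logits, gt_boxes):
    """IoU-aware loss: supervise a predicted IoU score with the actual IoU
    between each predicted box and its matched ground-truth box.
    pred_boxes, gt_boxes: (N, 4) in (x1, y1, x2, y2); pred_iou_logits: (N,)."""
    with torch.no_grad():
        # box_iou returns an (N, N) matrix; the diagonal holds matched pairs.
        target_iou = box_iou(pred_boxes, gt_boxes).diagonal()
    return F.binary_cross_entropy_with_logits(pred_iou_logits, target_iou)

def token_labeling_loss(token_logits, token_labels):
    """Token labeling loss: classify every backbone patch token individually.
    token_logits: (B, N, K) per-token class scores; token_labels: (B, N) ids."""
    return F.cross_entropy(token_logits.flatten(0, 1), token_labels.flatten())
```

In practice these terms would be added, with appropriate weights, to the main classification and box-regression losses; the actual matching and weighting follow the paper rather than this sketch.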
Related papers
- Hierarchical Point Attention for Indoor 3D Object Detection [111.04397308495618]
This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors.
First, we propose Multi-Scale Attention (MS-A) that builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning.
Second, we propose Size-Adaptive Local Attention (Local-A) with adaptive attention regions for localized feature aggregation within bounding box proposals.
arXiv Detail & Related papers (2023-01-06T18:52:12Z)
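The Multi-Scale Attention (MS-A) operation in the entry above builds multi-scale tokens from a single-scale input feature. MS-A targets point-based 3D detectors, but the analogous construction for 2D feature maps is easy to sketch: pool the map at several strides and flatten every level into one joint token sequence. This is a loose illustration of the idea, not MS-A's actual operation.

```python
import torch
import torch.nn.functional as F

def build_multiscale_tokens(feat, strides=(1, 2, 4)):
    """Turn one (B, C, H, W) feature map into tokens at several scales;
    coarser levels are average-pooled copies of the same feature."""
    tokens = []
    for s in strides:
        level = F.avg_pool2d(feat, kernel_size=s) if s > 1 else feat
        tokens.append(level.flatten(2).transpose(1, 2))  # (B, H*W / s^2, C)
    return torch.cat(tokens, dim=1)  # joint token sequence across scales

tokens = build_multiscale_tokens(torch.randn(2, 256, 32, 32))
print(tokens.shape)  # torch.Size([2, 1344, 256]) = 1024 + 256 + 64 tokens
```

Attention over the concatenated sequence then mixes information across scales, which is the sense in which such tokens enable finer-grained feature learning.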
- Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors [49.83396285177385]
Multi-scale features have been proven highly effective for object detection but often come with substantial, even prohibitive, extra computation costs.
We propose Iterative Multi-scale Feature Aggregation (IMFA) -- a generic paradigm that enables efficient use of multi-scale features in Transformer-based object detectors.
arXiv Detail & Related papers (2022-08-24T08:09:25Z)
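Per the IMFA summary above, the key to efficiency is aggregating multi-scale features only at sparse, promising locations instead of processing every scale densely. The sketch below shows a generic version of that sampling step, assuming a feature pyramid and normalized point coordinates; IMFA's iterative selection of which points to sample is not shown, and the function name is illustrative.

```python
import torch
import torch.nn.functional as F

def sample_multiscale_at_points(pyramid, points):
    """Gather features from every pyramid level at sparse query points.
    pyramid: list of (B, C, H_l, W_l) maps; points: (B, N, 2) in [-1, 1].
    Returns (B, N, C * num_levels)."""
    grid = points.unsqueeze(2)  # (B, N, 1, 2) sampling grid
    per_level = [
        F.grid_sample(f, grid, align_corners=False)  # (B, C, N, 1)
        .squeeze(-1)
        .transpose(1, 2)  # (B, N, C)
        for f in pyramid
    ]
    return torch.cat(per_level, dim=-1)
```

Because only N points are sampled per level, the cost scales with the number of queries rather than with the full multi-scale feature volume.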
- Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z)
- Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection [78.2325219839805]
imTED improves the state-of-the-art of few-shot object detection by up to 7.6% AP.
Experiments on the MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by 2.8%.
arXiv Detail & Related papers (2022-05-19T15:11:20Z)
- Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer [41.44769642537572]
The Unary-Pairwise Transformer is a two-stage detector that exploits unary and pairwise representations for human-object interactions (HOIs).
We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches.
arXiv Detail & Related papers (2021-12-03T10:52:06Z)
- CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector [15.656374849760734]
We present a novel object detection architecture, named Convolutional vision Transformer Based Attentive Single Shot MultiBox Detector (CvT-ASSD).
CvT-ASSD achieves good system efficiency and performance when pretrained on large-scale detection datasets such as PASCAL VOC and MS COCO.
arXiv Detail & Related papers (2021-10-24T06:45:33Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architectures for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can serve as the backbone for a common detection task head, producing competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
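The ViT-FRCNN recipe above amounts to reinterpreting the backbone's output patch tokens as a spatial feature map that a standard convolutional detection head can consume. A minimal sketch of that reshaping step, assuming a plain ViT whose output excludes the [CLS] token; the helper name is illustrative.

```python
import torch

def tokens_to_feature_map(tokens, grid_size):
    """Reshape ViT patch tokens (B, N, C) into a (B, C, H, W) feature map
    so a detection head (e.g., an RPN plus RoI heads) can run on it."""
    b, n, c = tokens.shape
    h, w = grid_size
    assert n == h * w, "token count must match the patch grid"
    return tokens.transpose(1, 2).reshape(b, c, h, w)

fmap = tokens_to_feature_map(torch.randn(2, 196, 768), (14, 14))
print(fmap.shape)  # torch.Size([2, 768, 14, 14])
```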
This list is automatically generated from the titles and abstracts of the papers on this site.