Integral Migrating Pre-trained Transformer Encoder-decoders for Visual
Object Detection
- URL: http://arxiv.org/abs/2205.09613v1
- Date: Thu, 19 May 2022 15:11:20 GMT
- Title: Integral Migrating Pre-trained Transformer Encoder-decoders for Visual
Object Detection
- Authors: Xiaosong Zhang, Feng Liu, Zhiliang Peng, Zonghao Guo, Fang Wan,
Xiangyang Ji, Qixiang Ye
- Abstract summary: imTED improves the state-of-the-art of few-shot object detection by up to 7.6% AP.
Experiments on the MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by ~2.8% AP.
- Score: 78.2325219839805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern object detectors have taken advantage of pre-trained vision
transformers by using them as backbone networks. However, except for the
backbone networks, other detector components, such as the detector head and the
feature pyramid network, remain randomly initialized, which hinders the
consistency between detectors and pre-trained models. In this study, we propose
to integrally migrate the pre-trained transformer encoder-decoders (imTED) for
object detection, constructing a feature extraction-operation path that is not
only "fully pre-trained" but also consistent with pre-trained models. The
essential improvements of imTED over existing transformer-based detectors are
twofold: (1) it embeds the pre-trained transformer decoder into the detector
head; and (2) it removes the feature pyramid network from the feature
extraction path. Such improvements significantly reduce the proportion of
randomly initialized parameters and enhance the generalization capability of
detectors. Experiments on the MS COCO dataset demonstrate that imTED consistently
outperforms its counterparts by ~2.8% AP. Without bells and whistles, imTED
improves the state-of-the-art of few-shot object detection by up to 7.6% AP,
demonstrating significantly higher generalization capability. Code will be made
publicly available.
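A minimal sketch of the idea described in the abstract follows. This is not the authors' released code: the module sizes, the use of learned proposal queries in place of an unspecified proposal mechanism, and all names are illustrative assumptions, intended only to show a pre-trained encoder feeding a pre-trained decoder used as the detection head, with no feature pyramid network on the feature-extraction path.

```python
# Hedged sketch of an imTED-style detector (illustrative, not the paper's code).
import torch
import torch.nn as nn


class ImTEDStyleDetector(nn.Module):
    def __init__(self, embed_dim=256, num_classes=80, num_proposals=100):
        super().__init__()
        # Backbone: stands in for a pre-trained ViT/MAE transformer encoder.
        enc_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # Detector head: stands in for the pre-trained transformer decoder that
        # imTED migrates into the head (rather than a randomly initialized head).
        dec_layer = nn.TransformerDecoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=6)
        # Learned queries are an assumption made to keep the sketch self-contained.
        self.proposal_queries = nn.Embedding(num_proposals, embed_dim)
        # Only the small task-specific projections remain randomly initialized.
        self.cls_head = nn.Linear(embed_dim, num_classes + 1)
        self.box_head = nn.Linear(embed_dim, 4)

    def forward(self, patch_tokens):
        # patch_tokens: (batch, num_patches, embed_dim), e.g. flattened ViT patches.
        memory = self.encoder(patch_tokens)            # single-scale features, no FPN
        queries = self.proposal_queries.weight.unsqueeze(0).expand(memory.size(0), -1, -1)
        decoded = self.decoder(queries, memory)        # pre-trained decoder acts as the head
        return self.cls_head(decoded), self.box_head(decoded)


if __name__ == "__main__":
    tokens = torch.randn(2, 196, 256)                  # 14x14 patch grid, illustrative
    logits, boxes = ImTEDStyleDetector()(tokens)
    print(logits.shape, boxes.shape)                   # (2, 100, 81) and (2, 100, 4)
```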
Related papers
- Towards Efficient Use of Multi-Scale Features in Transformer-Based
Object Detectors [49.83396285177385]
Multi-scale features have been proven highly effective for object detection but often come with huge and even prohibitive extra computation costs.
We propose Iterative Multi-scale Feature Aggregation (IMFA) -- a generic paradigm that enables efficient use of multi-scale features in Transformer-based object detectors.
arXiv Detail & Related papers (2022-08-24T08:09:25Z) - Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z) - BTranspose: Bottleneck Transformers for Human Pose Estimation with
Self-Supervised Pre-Training [0.304585143845864]
In this paper, we consider the recently proposed Bottleneck Transformers, which combine CNN and multi-head self-attention (MHSA) layers effectively.
We consider different backbone architectures and pre-train them using the DINO self-supervised learning method.
Experiments show that our model achieves an AP of 76.4, which is competitive with other methods such as [1] and has fewer network parameters.
arXiv Detail & Related papers (2022-04-21T15:45:05Z) - An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z) - ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architectures for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z) - End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z)