Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity
- URL: http://arxiv.org/abs/2111.14330v1
- Date: Mon, 29 Nov 2021 05:22:46 GMT
- Title: Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity
- Authors: Byungseok Roh, JaeWoong Shin, Wuhyun Shin, Saehoon Kim
- Abstract summary: We show that Sparse DETR achieves better performance than Deformable DETR even with only 10% encoder tokens on the COCO dataset.
Although only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR.
- Score: 10.098578160958946
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: DETR is the first end-to-end object detector using a transformer
encoder-decoder architecture and demonstrates competitive performance but low
computational efficiency on high-resolution feature maps. The subsequent work,
Deformable DETR, enhances the efficiency of DETR by replacing dense attention
with deformable attention, which achieves 10x faster convergence and improved
performance. Deformable DETR uses multi-scale features to improve performance; however, the number of encoder tokens increases by 20x compared to DETR, and the computation cost of the encoder attention remains a bottleneck.
In a preliminary experiment, we observe that detection performance hardly deteriorates even when only a portion of the encoder tokens is updated. Inspired by this observation, we propose Sparse DETR, which selectively updates only the tokens expected to be referenced by the decoder, thus helping the model detect objects effectively. In addition, we show that applying an auxiliary
detection loss on the selected tokens in the encoder improves the performance
while minimizing computational overhead. We validate that Sparse DETR achieves
better performance than Deformable DETR even with only 10% encoder tokens on
the COCO dataset. Although only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR.
Code is available at https://github.com/kakaobrain/sparse-detr
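To make the sparsification mechanism concrete, below is a minimal PyTorch sketch of an encoder layer that scores tokens and refines only the top-rho fraction, letting the rest pass through unchanged. This is a sketch under simplifying assumptions, not the official implementation: the released code uses deformable attention and supervises the scoring network with the decoder cross-attention map, and all names here (SparseEncoderLayer, scorer, rho) are illustrative.

```python
# Minimal sketch of encoder token sparsification in the spirit of
# Sparse DETR. Illustrative only: the official kakaobrain code uses
# deformable attention and a decoder-supervised scorer; names below
# (SparseEncoderLayer, scorer, rho) are hypothetical.
import torch
import torch.nn as nn

class SparseEncoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8, rho=0.1):
        super().__init__()
        self.rho = rho                       # fraction of tokens to update
        self.scorer = nn.Linear(d_model, 1)  # predicts per-token saliency
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, tokens):               # tokens: (batch, num_tokens, d_model)
        b, n, d = tokens.shape
        k = max(1, int(n * self.rho))
        scores = self.scorer(tokens).squeeze(-1)    # (b, n) saliency scores
        topk = scores.topk(k, dim=1).indices        # indices of salient tokens
        idx = topk.unsqueeze(-1).expand(-1, -1, d)  # (b, k, d)
        selected = tokens.gather(1, idx)            # gather the top-rho tokens
        refined = self.layer(selected)              # refine only those tokens
        return tokens.scatter(1, idx, refined)      # others pass through unchanged

x = torch.randn(2, 1000, 256)
print(SparseEncoderLayer()(x).shape)  # torch.Size([2, 1000, 256])
```

Even this naive version shows where the savings come from: the quadratic self-attention cost is paid only over the k selected tokens rather than all n.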
Related papers
- LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection [63.780355815743135]
We present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection.
The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder.
arXiv Detail & Related papers (2024-06-05T17:07:24Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former uses 50% of its compute only on the transformer encoder.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
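The frame-reduction mechanism described in the entry above can be pictured with a short sketch: stack r consecutive encoder frames and project the concatenation down to a single output frame, shrinking the encoder output length by r. This is a hedged illustration assuming a plain linear projection; the paper's actual layer design may differ, and FrameReduction and r are hypothetical names.

```python
# Hedged sketch of an encoder-side frame reduction layer: r consecutive
# frames are concatenated and projected to one output frame, so the
# output is r times shorter than the input. Illustrative only.
import torch
import torch.nn as nn

class FrameReduction(nn.Module):
    def __init__(self, d_model=512, r=4):
        super().__init__()
        self.r = r
        self.proj = nn.Linear(d_model * r, d_model)  # fuse r frames into one

    def forward(self, frames):  # frames: (batch, time, d_model)
        b, t, d = frames.shape
        t = (t // self.r) * self.r                   # drop any ragged tail frames
        grouped = frames[:, :t].reshape(b, t // self.r, d * self.r)
        return self.proj(grouped)                    # (batch, time // r, d_model)

x = torch.randn(2, 100, 512)
print(FrameReduction()(x).shape)  # torch.Size([2, 25, 512])
```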
- Less is More: Focus Attention for Efficient DETR [23.81282650112188]
We propose Focus-DETR, which focuses attention on more informative tokens for a better trade-off between computation efficiency and model accuracy.
Specifically, we reconstruct the encoder with dual attention, which includes a token scoring mechanism.
Compared with the state-of-the-art sparse DETR-like detectors under the same setting, our Focus-DETR achieves comparable complexity while reaching 50.4 AP (+2.2) on COCO.
arXiv Detail & Related papers (2023-07-24T08:39:11Z)
- Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR [27.120786736090842]
We present Lite DETR, a simple yet efficient end-to-end object detection framework.
We design an efficient encoder block to update high-level features and low-level features.
To better fuse cross-scale features, we develop a key-aware deformable attention to predict more reliable attention weights.
arXiv Detail & Related papers (2023-03-13T17:57:59Z)
- Pair DETR: Contrastive Learning Speeds Up DETR Training [0.6491645162078056]
We present a simple approach to address the main problem of DETR, the slow convergence.
We detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders.
Experiments show that Pair DETR can converge at least 10x faster than original DETR and 1.5x faster than Conditional DETR during training.
arXiv Detail & Related papers (2022-10-29T03:02:49Z)
- Conditional DETR V2: Efficient Detection Transformer with Box Queries [58.9706842210695]
We are interested in an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing, such as NMS.
Inspired by Conditional DETR, an improved DETR with fast training convergence, we reformulate the object query as a box query: a composition of embeddings of the reference point.
We learn the box queries from the image content, further improving the detection quality of Conditional DETR still with fast training convergence.
arXiv Detail & Related papers (2022-07-18T20:08:55Z)
- Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z)
- D^2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention [27.354159713970322]
We propose a decoder-only detector called D2ETR.
Without an encoder, the decoder directly attends to the fine-fused feature maps generated by the Transformer backbone.
D2ETR demonstrates low computational complexity and high detection accuracy in evaluations on the COCO benchmark.
arXiv Detail & Related papers (2022-03-02T04:21:12Z)
- Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [54.82369261350497]
We propose a CTC-enhanced NAR transformer, which generates the target sequence by refining the predictions of the CTC module.
Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding than a strong AR baseline, with only 0.0 to 0.3 absolute CER degradation on the Aishell-1 and Aishell-2 datasets.
arXiv Detail & Related papers (2020-10-28T15:00:09Z)
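The CTC-enhanced decoder input in the last entry above can also be sketched: a CTC head over the encoder output yields a greedy framewise path, which is collapsed (repeats and blanks removed) and fed as the seed sequence to a non-autoregressive decoder for one-pass refinement. This is an illustrative approximation; the paper's exact architecture, losses, and decoding procedure are not reproduced, and all module names are hypothetical.

```python
# Hedged sketch of a CTC-enhanced decoder input for non-autoregressive
# ASR: greedy CTC predictions seed the decoder, which refines them in a
# single parallel pass. Illustrative only.
import torch
import torch.nn as nn

BLANK = 0

def collapse_ctc(ids):  # ids: 1-D LongTensor of framewise argmax labels
    out, prev = [], BLANK
    for i in ids.tolist():
        if i != prev and i != BLANK:  # drop repeats and blank symbols
            out.append(i)
        prev = i
    return torch.tensor(out or [BLANK + 1], dtype=torch.long)  # guard all-blank path

class CTCEnhancedNAR(nn.Module):
    def __init__(self, d_model=256, vocab=1000, nhead=4):
        super().__init__()
        self.ctc_head = nn.Linear(d_model, vocab)  # framewise CTC logits
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, enc):  # enc: (1, time, d_model) encoder output
        frame_ids = self.ctc_head(enc).argmax(-1)[0]  # greedy CTC path
        seed = collapse_ctc(frame_ids).unsqueeze(0)   # (1, len) decoder input
        dec = self.decoder(self.embed(seed), enc)     # refine in one parallel pass
        return self.out(dec).argmax(-1)               # refined token ids

model = CTCEnhancedNAR()
print(model(torch.randn(1, 50, 256)).shape)
```

Because the decoder consumes the whole seed at once instead of generating token by token, decoding cost no longer grows with output length, which is where the reported speedup over autoregressive baselines comes from.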