Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity
- URL: http://arxiv.org/abs/2111.14330v1
- Date: Mon, 29 Nov 2021 05:22:46 GMT
- Title: Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity
- Authors: Byungseok Roh, JaeWoong Shin, Wuhyun Shin, Saehoon Kim
- Abstract summary: We show that Sparse DETR achieves better performance than Deformable DETR even with only 10% encoder tokens on the COCO dataset.
Although only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR.
- Score: 10.098578160958946
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: DETR is the first end-to-end object detector using a transformer
encoder-decoder architecture and demonstrates competitive performance but low
computational efficiency on high-resolution feature maps. The subsequent work,
Deformable DETR, enhances the efficiency of DETR by replacing dense attention
with deformable attention, which achieves 10x faster convergence and improved
performance. Deformable DETR uses multi-scale features to improve performance; however, the number of encoder tokens increases by 20x compared to DETR, and the computation cost of the encoder attention remains a bottleneck.
In a preliminary experiment, we observe that detection performance hardly deteriorates even when only a portion of the encoder tokens is updated. Inspired by this observation, we propose Sparse DETR, which selectively updates only the tokens expected to be referenced by the decoder, thus helping the model detect objects effectively. In addition, we show that applying an auxiliary
detection loss on the selected tokens in the encoder improves the performance
while minimizing computational overhead. We validate that Sparse DETR achieves
better performance than Deformable DETR even with only 10% encoder tokens on
the COCO dataset. Although only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR.
Code is available at https://github.com/kakaobrain/sparse-detr
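To make the sparsification mechanism concrete, below is a minimal PyTorch sketch of an encoder layer that scores tokens and refines only the top-rho fraction, letting the rest pass through unchanged. This is a sketch under simplifying assumptions, not the official implementation: the released code uses deformable attention and supervises the scoring network with the decoder cross-attention map, and all names here (SparseEncoderLayer, scorer, rho) are illustrative.

```python
# Minimal sketch of encoder token sparsification in the spirit of
# Sparse DETR. Illustrative only: the official kakaobrain code uses
# deformable attention and a decoder-supervised scorer; names below
# (SparseEncoderLayer, scorer, rho) are hypothetical.
import torch
import torch.nn as nn

class SparseEncoderLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8, rho=0.1):
        super().__init__()
        self.rho = rho                       # fraction of tokens to update
        self.scorer = nn.Linear(d_model, 1)  # predicts per-token saliency
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, tokens):               # tokens: (batch, num_tokens, d_model)
        b, n, d = tokens.shape
        k = max(1, int(n * self.rho))
        scores = self.scorer(tokens).squeeze(-1)    # (b, n) saliency scores
        topk = scores.topk(k, dim=1).indices        # indices of salient tokens
        idx = topk.unsqueeze(-1).expand(-1, -1, d)  # (b, k, d)
        selected = tokens.gather(1, idx)            # gather the top-rho tokens
        refined = self.layer(selected)              # refine only those tokens
        return tokens.scatter(1, idx, refined)      # others pass through unchanged

x = torch.randn(2, 1000, 256)
print(SparseEncoderLayer()(x).shape)  # torch.Size([2, 1000, 256])
```

Even this naive version shows where the savings come from: the quadratic self-attention cost is paid only over the k selected tokens rather than all n.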
Related papers
- LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection [63.780355815743135]
We present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection.
The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder.
arXiv Detail & Related papers (2024-06-05T17:07:24Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former uses 50% of its compute only on the transformer encoder.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
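The frame-reduction mechanism described in the entry above can be pictured with a short sketch: stack r consecutive encoder frames and project the concatenation down to a single output frame, shrinking the encoder output length by r. This is a hedged illustration assuming a plain linear projection; the paper's actual layer design may differ, and FrameReduction and r are hypothetical names.

```python
# Hedged sketch of an encoder-side frame reduction layer: r consecutive
# frames are concatenated and projected to one output frame, so the
# output is r times shorter than the input. Illustrative only.
import torch
import torch.nn as nn

class FrameReduction(nn.Module):
    def __init__(self, d_model=512, r=4):
        super().__init__()
        self.r = r
        self.proj = nn.Linear(d_model * r, d_model)  # fuse r frames into one

    def forward(self, frames):  # frames: (batch, time, d_model)
        b, t, d = frames.shape
        t = (t // self.r) * self.r                   # drop any ragged tail frames
        grouped = frames[:, :t].reshape(b, t // self.r, d * self.r)
        return self.proj(grouped)                    # (batch, time // r, d_model)

x = torch.randn(2, 100, 512)
print(FrameReduction()(x).shape)  # torch.Size([2, 25, 512])
```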
- Less is More: Focus Attention for Efficient DETR [23.81282650112188]
We propose Focus-DETR, which focuses attention on more informative tokens for a better trade-off between computation efficiency and model accuracy.
Specifically, we reconstruct the encoder with dual attention, which includes a token scoring mechanism.
Compared with the state-of-the-art sparse DETR-like detectors under the same setting, our Focus-DETR achieves comparable complexity while reaching 50.4 AP (+2.2) on COCO.
arXiv Detail & Related papers (2023-07-24T08:39:11Z)
- Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR [27.120786736090842]
We present Lite DETR, a simple yet efficient end-to-end object detection framework.
We design an efficient encoder block to update high-level features and low-level features.
To better fuse cross-scale features, we develop a key-aware deformable attention to predict more reliable attention weights.
arXiv Detail & Related papers (2023-03-13T17:57:59Z)
- Pair DETR: Contrastive Learning Speeds Up DETR Training [0.6491645162078056]
We present a simple approach to address the main problem of DETR, the slow convergence.
We detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders.
Experiments show that Pair DETR can converge at least 10x faster than original DETR and 1.5x faster than Conditional DETR during training.
arXiv Detail & Related papers (2022-10-29T03:02:49Z)
- Conditional DETR V2: Efficient Detection Transformer with Box Queries [58.9706842210695]
We are interested in an end-to-end object detection approach based on a transformer encoder-decoder architecture without hand-crafted postprocessing, such as NMS.
Inspired by Conditional DETR, an improved DETR with fast training convergence, we reformulate the object query as a box query: a composition of embeddings of the reference point.
We learn the box queries from the image content, further improving the detection quality of Conditional DETR still with fast training convergence.
arXiv Detail & Related papers (2022-07-18T20:08:55Z)
- Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z)
- D^2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention [27.354159713970322]
We propose a decoder-only detector called D2ETR.
Without an encoder, the decoder directly attends to the fine-fused feature maps generated by the Transformer backbone.
D2ETR demonstrates low computational complexity and high detection accuracy in evaluations on the COCO benchmark.
arXiv Detail & Related papers (2022-03-02T04:21:12Z)
- Non-Autoregressive Transformer ASR with CTC-Enhanced Decoder Input [54.82369261350497]
We propose a CTC-enhanced NAR transformer, which generates the target sequence by refining the predictions of the CTC module.
Experimental results show that our method outperforms all previous NAR counterparts and achieves 50x faster decoding than a strong AR baseline, with only 0.0 to 0.3 absolute CER degradation on the Aishell-1 and Aishell-2 datasets.
arXiv Detail & Related papers (2020-10-28T15:00:09Z)
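The CTC-enhanced decoder input in the last entry above can also be sketched: a CTC head over the encoder output yields a greedy framewise path, which is collapsed (repeats and blanks removed) and fed as the seed sequence to a non-autoregressive decoder for one-pass refinement. This is an illustrative approximation; the paper's exact architecture, losses, and decoding procedure are not reproduced, and all module names are hypothetical.

```python
# Hedged sketch of a CTC-enhanced decoder input for non-autoregressive
# ASR: greedy CTC predictions seed the decoder, which refines them in a
# single parallel pass. Illustrative only.
import torch
import torch.nn as nn

BLANK = 0

def collapse_ctc(ids):  # ids: 1-D LongTensor of framewise argmax labels
    out, prev = [], BLANK
    for i in ids.tolist():
        if i != prev and i != BLANK:  # drop repeats and blank symbols
            out.append(i)
        prev = i
    return torch.tensor(out or [BLANK + 1], dtype=torch.long)  # guard all-blank path

class CTCEnhancedNAR(nn.Module):
    def __init__(self, d_model=256, vocab=1000, nhead=4):
        super().__init__()
        self.ctc_head = nn.Linear(d_model, vocab)  # framewise CTC logits
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.out = nn.Linear(d_model, vocab)

    def forward(self, enc):  # enc: (1, time, d_model) encoder output
        frame_ids = self.ctc_head(enc).argmax(-1)[0]  # greedy CTC path
        seed = collapse_ctc(frame_ids).unsqueeze(0)   # (1, len) decoder input
        dec = self.decoder(self.embed(seed), enc)     # refine in one parallel pass
        return self.out(dec).argmax(-1)               # refined token ids

model = CTCEnhancedNAR()
print(model(torch.randn(1, 50, 256)).shape)
```

Because the decoder consumes the whole seed at once instead of generating token by token, decoding cost no longer grows with output length, which is where the reported speedup over autoregressive baselines comes from.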