Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR
- URL: http://arxiv.org/abs/2303.07335v1
- Date: Mon, 13 Mar 2023 17:57:59 GMT
- Title: Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR
- Authors: Feng Li, Ailing Zeng, Shilong Liu, Hao Zhang, Hongyang Li, Lei Zhang,
Lionel M. Ni
- Abstract summary: We present Lite DETR, a simple yet efficient end-to-end object detection framework.
We design an efficient encoder block to update high-level features and low-level features.
To better fuse cross-scale features, we develop a key-aware deformable attention to predict more reliable attention weights.
- Score: 27.120786736090842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent DEtection TRansformer-based (DETR) models have obtained remarkable
performance. Their success relies on the re-introduction of multi-scale
feature fusion in the encoder. However, the greatly increased number of tokens
in multi-scale features, about 75% of which come from low-level features, is
computationally inefficient and hinders real-world applications of DETR
models. In this paper, we present Lite DETR, a simple yet
efficient end-to-end object detection framework that can effectively reduce the
GFLOPs of the detection head by 60% while keeping 99% of the original
performance. Specifically, we design an efficient encoder block to update
high-level features (corresponding to small-resolution feature maps) and
low-level features (corresponding to large-resolution feature maps) in an
interleaved way. In addition, to better fuse cross-scale features, we develop a
key-aware deformable attention to predict more reliable attention weights.
Comprehensive experiments validate the effectiveness and efficiency of the
proposed Lite DETR, and the efficient encoder strategy can generalize well
across existing DETR-based models. The code will be available at
https://github.com/IDEA-Research/Lite-DETR.
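As a rough illustration of the key-aware idea, the sketch below scores each sampled key against the query before normalizing, instead of predicting attention weights from the query alone as in standard deformable attention. This is a minimal single-head sketch over plain Python lists; all names are illustrative and it is not the paper's implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def key_aware_deformable_attention(query, sampled_keys, sampled_values):
    """Minimal sketch: weight each sampled value by a scaled dot product
    between the query and its corresponding sampled key, rather than by
    weights predicted from the query alone."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in sampled_keys]
    weights = softmax(scores)          # weights sum to 1 over the samples
    dim = len(sampled_values[0])
    return [sum(w * v[j] for w, v in zip(weights, sampled_values))
            for j in range(dim)]
```

The output is a convex combination of the sampled values, so keys that match the query better contribute more.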
Related papers
- Cross Resolution Encoding-Decoding For Detection Transformers [33.248031676529635]
Cross-Resolution Encoding-Decoding (CRED) is designed to fuse multiscale detection mechanisms.
CRED delivers accuracy similar to its high-resolution DETR counterpart with roughly 50% fewer FLOPs.
We plan to release pretrained CRED-DETRs for use by the community.
arXiv Detail & Related papers (2024-10-05T09:01:59Z)
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long been a central focus of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection [63.780355815743135]
We present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection.
The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder.
arXiv Detail & Related papers (2024-06-05T17:07:24Z)
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
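To make the frame-rate reduction concrete, here is a hedged sketch that mean-pools every `factor` consecutive encoder output frames into one; the paper uses learned frame reduction layers inside the encoder, so this only illustrates the rate change, and all names are hypothetical.

```python
def reduce_encoder_frames(frames, factor):
    """Sketch: collapse every `factor` consecutive encoder output frames
    into a single frame by element-wise mean. `frames` is a list of
    equal-length feature vectors (plain lists of floats)."""
    out = []
    for i in range(0, len(frames), factor):
        chunk = frames[i:i + factor]          # may be shorter at the tail
        dim = len(chunk[0])
        out.append([sum(f[j] for f in chunk) / len(chunk) for j in range(dim)])
    return out
```

With input frames every 10 ms, a large enough `factor` yields one output frame per several seconds of speech, which is the regime the paper targets.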
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
- Less is More: Focus Attention for Efficient DETR [23.81282650112188]
We propose Focus-DETR, which focuses attention on more informative tokens for a better trade-off between computation efficiency and model accuracy.
Specifically, we reconstruct the encoder with dual attention, which includes a token scoring mechanism.
Compared with state-of-the-art sparse DETR-like detectors under the same setting, our Focus-DETR achieves comparable complexity while reaching 50.4 AP (+2.2) on COCO.
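A token scoring mechanism of this kind can be sketched as follows: keep only the top-scoring fraction of encoder tokens, preserving their original order so downstream attention still sees them in sequence. This is an illustrative stand-in, not Focus-DETR's actual scoring network.

```python
def select_informative_tokens(tokens, scores, keep_ratio=0.3):
    """Sketch of foreground token selection: keep the highest-scoring
    `keep_ratio` fraction of tokens, in their original order.
    `tokens` and `scores` are illustrative stand-ins for encoder tokens
    and their predicted informativeness scores."""
    k = max(1, int(len(tokens) * keep_ratio))
    # Indices of the k best-scoring tokens, then restore original order.
    keep = sorted(sorted(range(len(tokens)),
                         key=lambda i: scores[i], reverse=True)[:k])
    return [tokens[i] for i in keep]
```

Attention is then computed only over the kept tokens, which is where the compute savings come from.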
arXiv Detail & Related papers (2023-07-24T08:39:11Z)
- High-level Feature Guided Decoding for Semantic Segmentation [54.424062794490254]
We propose to use powerful pre-trained high-level features as guidance (HFG) for the upsampler to produce robust results.
Specifically, the high-level features from the backbone are used to train the class tokens, which are then reused by the upsampler for classification.
To push the upper limit of HFG, we introduce a context augmentation encoder (CAE) that can efficiently and effectively operate on the low-resolution high-level feature.
arXiv Detail & Related papers (2023-03-15T14:23:07Z)
- A Faster, Lighter and Stronger Deep Learning-Based Approach for Place Recognition [7.9400442516053475]
We propose a faster, lighter and stronger approach that can generate models with fewer parameters and can spend less time in the inference stage.
We design RepVGG-lite as the backbone network in our architecture; it is more discriminative than other general networks in the place recognition task.
Our system has 14 times fewer parameters than Patch-NetVLAD, 6.8 times lower theoretical FLOPs, and runs 21 and 33 times faster in feature extraction and feature matching, respectively.
arXiv Detail & Related papers (2022-11-27T15:46:53Z)
- ETAD: A Unified Framework for Efficient Temporal Action Detection [70.21104995731085]
Untrimmed video understanding tasks such as temporal action detection (TAD) often suffer from a huge demand for computing resources.
We build a unified framework for efficient end-to-end temporal action detection (ETAD).
ETAD achieves state-of-the-art performance on both THUMOS-14 and ActivityNet-1.3.
arXiv Detail & Related papers (2022-05-14T21:16:21Z) - Sparse DETR: Efficient End-to-End Object Detection with Learnable
Sparsity [10.098578160958946]
We show that Sparse DETR achieves better performance than Deformable DETR even with only 10% encoder tokens on the COCO dataset.
Albeit only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR.
arXiv Detail & Related papers (2021-11-29T05:22:46Z) - Lightweight Single-Image Super-Resolution Network with Attentive
Auxiliary Feature Learning [73.75457731689858]
We develop a computation-efficient yet accurate network based on the proposed attentive auxiliary features (A^2F) for SISR.
Experimental results on large-scale datasets demonstrate the effectiveness of the proposed model against state-of-the-art (SOTA) SR methods.
arXiv Detail & Related papers (2020-11-13T06:01:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.