Less is More: Focus Attention for Efficient DETR
- URL: http://arxiv.org/abs/2307.12612v1
- Date: Mon, 24 Jul 2023 08:39:11 GMT
- Title: Less is More: Focus Attention for Efficient DETR
- Authors: Dehua Zheng, Wenhui Dong, Hailin Hu, Xinghao Chen, Yunhe Wang
- Abstract summary: We propose Focus-DETR, which focuses attention on more informative tokens for a better trade-off between computation efficiency and model accuracy.
Specifically, we reconstruct the encoder with dual attention, which includes a token scoring mechanism.
Compared with state-of-the-art sparse DETR-like detectors under the same setting, our Focus-DETR achieves comparable complexity while reaching 50.4 AP (+2.2) on COCO.
- Score: 23.81282650112188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DETR-like models have significantly boosted the performance of detectors and
even outperformed classical convolutional models. However, treating all tokens
equally without discrimination introduces a redundant computational burden in
the traditional encoder structure. Recent sparsification strategies exploit a
subset of informative tokens to reduce attention complexity while maintaining
performance through a sparse encoder. But these methods tend to rely on
unreliable model statistics. Moreover, simply reducing the token population
hinders detection performance to a large extent, limiting the
application of these sparse models. We propose Focus-DETR, which focuses
attention on more informative tokens for a better trade-off between computation
efficiency and model accuracy. Specifically, we reconstruct the encoder with
dual attention, which includes a token scoring mechanism that considers both
localization and category semantic information of the objects from multi-scale
feature maps. We efficiently abandon the background queries and enhance the
semantic interaction of the fine-grained object queries based on the scores.
Compared with state-of-the-art sparse DETR-like detectors under the same
setting, our Focus-DETR achieves comparable complexity while reaching 50.4 AP
(+2.2) on COCO. The code is available at
https://github.com/huawei-noah/noah-research/tree/master/Focus-DETR and
https://gitee.com/mindspore/models/tree/master/research/cv/Focus-DETR.
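As a rough illustration of the score-based token selection described in the abstract, the sketch below scores each token, keeps only the top-scoring tokens for self-attention, and passes the rest through unchanged. The module names, shapes, and keep ratio are assumptions for illustration, not the released implementation.

```python
# Illustrative sketch of score-based token selection for a sparse encoder
# (loosely following the Focus-DETR idea; not the official implementation).
import torch
import torch.nn as nn

class ScoredSparseEncoderLayer(nn.Module):
    def __init__(self, dim=256, num_heads=8, keep_ratio=0.3):
        super().__init__()
        self.score_head = nn.Linear(dim, 1)  # foreground scoring head (assumed form)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.keep_ratio = keep_ratio

    def forward(self, tokens):
        # tokens: (B, N, C), flattened multi-scale feature-map tokens
        scores = self.score_head(tokens).squeeze(-1)           # (B, N)
        k = max(1, int(tokens.size(1) * self.keep_ratio))
        idx = scores.topk(k, dim=1).indices                    # informative-token indices
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        fg = tokens.gather(1, idx)                             # fine-grained foreground tokens
        fg, _ = self.attn(fg, fg, fg)                          # attention only among kept tokens
        return tokens.scatter(1, idx, fg), scores              # background tokens pass through

tokens = torch.randn(2, 1000, 256)
out, scores = ScoredSparseEncoderLayer()(tokens)
print(out.shape)  # torch.Size([2, 1000, 256])
```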
Related papers
- SpirDet: Towards Efficient, Accurate and Lightweight Infrared Small Target Detector [60.42293239557962]
We propose SpirDet, a novel approach for efficient detection of infrared small targets.
We employ a new dual-branch sparse decoder to restore the feature map.
Extensive experiments show that the proposed SpirDet significantly outperforms state-of-the-art models.
arXiv Detail & Related papers (2024-02-08T05:06:14Z)
- Accelerating the Global Aggregation of Local Explanations [43.787092409977724]
We devise techniques for accelerating the global aggregation of the Anchor algorithm.
We show that for a very mild loss of quality, we are able to accelerate the computation by up to 30×, reducing the computation from hours to minutes.
arXiv Detail & Related papers (2023-12-13T09:03:01Z)
- Unsupervised Keypoints from Pretrained Diffusion Models [31.147785019795347]
We leverage the emergent knowledge within text-to-image diffusion models to obtain more robust unsupervised keypoints.
Our core idea is to find text embeddings that would cause the generative model to consistently attend to compact regions in images.
We validate our performance on multiple datasets: CelebA, CUB-200-2011, Tai-Chi-HD, DeepFashion, and Human3.6m.
arXiv Detail & Related papers (2023-11-29T19:43:38Z)
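The compactness objective behind the keypoint method above can be sketched in isolation: optimize an embedding so a softmax attention map concentrates around its soft-argmax center. The linear `proj` stand-in below replaces the diffusion model's cross-attention and is purely an assumption for self-containedness.

```python
# Toy sketch: optimize an embedding so an attention map becomes spatially
# compact. `proj` is a random stand-in for the frozen diffusion model's
# cross-attention (an assumption; the real method backpropagates through it).
import torch

H = W = 16
ys, xs = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(), indexing="ij")
proj = torch.randn(64, H * W)

def attention_map(emb):
    return torch.softmax(emb @ proj, dim=-1).view(H, W)

emb = torch.randn(64, requires_grad=True)
opt = torch.optim.Adam([emb], lr=0.1)
for _ in range(200):
    a = attention_map(emb)
    cy, cx = (a * ys).sum(), (a * xs).sum()                # soft-argmax center
    spread = (a * ((ys - cy) ** 2 + (xs - cx) ** 2)).sum() # spatial variance
    opt.zero_grad()
    spread.backward()
    opt.step()                                             # minimize the spread
```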
- Knowledge Combination to Learn Rotated Detection Without Rotated Annotation [53.439096583978504]
Rotated bounding boxes drastically reduce the output ambiguity of elongated objects.
Despite their effectiveness, rotated detectors are not widely employed.
We propose a framework that allows the model to predict precise rotated boxes.
arXiv Detail & Related papers (2023-04-05T03:07:36Z)
- CITADEL: Conditional Token Interaction via Dynamic Lexical Routing for Efficient and Effective Multi-Vector Retrieval [72.90850213615427]
Multi-vector retrieval methods combine the merits of sparse (e.g. BM25) and dense (e.g. DPR) retrievers.
However, these methods are orders of magnitude slower and need much more space to store their indices than their single-vector counterparts.
We propose conditional token interaction via dynamic lexical routing, namely CITADEL, for efficient and effective multi-vector retrieval.
arXiv Detail & Related papers (2022-11-18T18:27:35Z)
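A minimal sketch of the dynamic lexical routing idea summarized above: each token vector is routed to a few lexical keys, and query-document scoring only touches keys the two sides share. The router, top-k choice, and max-over-pairs scoring rule here are illustrative assumptions, not the official CITADEL design.

```python
# Minimal sketch of routing token vectors to lexical keys and scoring only on
# shared keys (illustrative; not the official CITADEL implementation).
import torch

VOCAB, DIM, TOPK = 100, 32, 2
router = torch.randn(DIM, VOCAB)  # assumed learned routing matrix

def route(token_vecs):
    """Map each token to its TOPK lexical keys, storing weighted vectors per key."""
    weights, keys = torch.relu(token_vecs @ router).topk(TOPK, dim=-1)
    index = {}
    for t in range(token_vecs.size(0)):
        for w, k in zip(weights[t], keys[t]):
            index.setdefault(int(k), []).append(w * token_vecs[t])
    return index

def score(q_index, d_index):
    """Sum, over keys both sides share, the best query-document dot product."""
    total = 0.0
    for key, q_vecs in q_index.items():
        if key in d_index:
            sims = [q @ d for q in q_vecs for d in d_index[key]]
            total += max(sims).item()
    return total

q_index = route(torch.randn(4, DIM))   # query tokens
d_index = route(torch.randn(50, DIM))  # document tokens
print(score(q_index, d_index))
```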
- Pair DETR: Contrastive Learning Speeds Up DETR Training [0.6491645162078056]
We present a simple approach to address the main problem of DETR, its slow convergence.
We detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders.
Experiments show that Pair DETR can converge at least 10x faster than original DETR and 1.5x faster than Conditional DETR during training.
arXiv Detail & Related papers (2022-10-29T03:02:49Z)
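Because the center of a box is the midpoint of its corners, the (top-left, center) keypoint pair used above determines the box in closed form: the bottom-right corner is the reflection of the top-left about the center. A minimal decoding sketch (the two-decoder pairing itself is omitted):

```python
# Recover (x1, y1, x2, y2) from a (top-left, center) keypoint pair: the
# bottom-right corner is the reflection of the top-left about the center.
import torch

def box_from_pair(top_left, center):
    x1, y1 = top_left.unbind(-1)
    cx, cy = center.unbind(-1)
    return torch.stack([x1, y1, 2 * cx - x1, 2 * cy - y1], dim=-1)

tl = torch.tensor([[10.0, 20.0]])
c = torch.tensor([[30.0, 50.0]])
print(box_from_pair(tl, c))  # tensor([[10., 20., 50., 80.]])
```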
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
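The token-count reduction above can be sketched directly: cluster the key tokens, average keys and values within each cluster, then let queries attend over the short clustered sequence, dropping the cost from O(N²) to O(N·K). The k-means step below is a simple stand-in for whatever clustering ClusTR actually learns.

```python
# Sketch of content-based sparse attention: queries attend to clustered
# key/value tokens instead of all N tokens (illustrative stand-in for ClusTR).
import torch

def kmeans_assign(x, centers, iters=5):
    for _ in range(iters):
        assign = torch.cdist(x, centers).argmin(dim=-1)  # nearest center per token
        for c in range(centers.size(0)):
            mask = assign == c
            if mask.any():
                centers[c] = x[mask].mean(dim=0)
    return assign

def clustered_attention(q, k, v, num_clusters=16):
    centers = k[torch.randperm(k.size(0))[:num_clusters]].clone()
    assign = kmeans_assign(k, centers)
    ck = torch.stack([k[assign == c].mean(0) if (assign == c).any() else centers[c]
                      for c in range(num_clusters)])  # clustered keys
    cv = torch.stack([v[assign == c].mean(0) if (assign == c).any() else torch.zeros_like(v[0])
                      for c in range(num_clusters)])  # aggregated values
    attn = torch.softmax(q @ ck.T / ck.size(-1) ** 0.5, dim=-1)
    return attn @ cv                                   # (N, C) at O(N * num_clusters)

q = k = v = torch.randn(1024, 64)
print(clustered_attention(q, k, v).shape)  # torch.Size([1024, 64])
```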
- Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity [10.098578160958946]
We show that Sparse DETR achieves better performance than Deformable DETR even with only 10% of the encoder tokens on the COCO dataset.
Although only the encoder tokens are sparsified, the total computation cost decreases by 38% and the frames per second (FPS) increases by 42% compared to Deformable DETR.
arXiv Detail & Related papers (2021-11-29T05:22:46Z)
- Efficient Person Search: An Anchor-Free Approach [86.45858994806471]
Person search aims to simultaneously localize and identify a query person from realistic, uncropped images.
To achieve this goal, state-of-the-art models typically add a re-id branch upon two-stage detectors like Faster R-CNN.
In this work, we present an anchor-free approach to efficiently tackle this challenging task by introducing several dedicated designs.
arXiv Detail & Related papers (2021-09-01T07:01:33Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer (DETR), are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
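The set-based global loss works by first matching predictions to ground-truth objects one-to-one via bipartite (Hungarian) matching, then penalizing only the matched pairs. A bare-bones sketch of that matching step, using a plain L1 box cost as a stand-in for DETR's full matching cost:

```python
# Bipartite matching between predictions and ground truth: the core of DETR's
# set-based global loss. Real DETR's matching cost also includes class
# probabilities and generalized IoU; plain L1 is used here for brevity.
import torch
from scipy.optimize import linear_sum_assignment

pred_boxes = torch.rand(100, 4)  # 100 query predictions, (cx, cy, w, h)
gt_boxes = torch.rand(5, 4)      # 5 ground-truth objects

cost = torch.cdist(pred_boxes, gt_boxes, p=1)           # (100, 5) pairwise L1 cost
pred_idx, gt_idx = linear_sum_assignment(cost.numpy())  # optimal one-to-one match
loss = cost[torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)].sum()
print(loss)  # loss is computed only on matched (prediction, object) pairs
```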