LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection
- URL: http://arxiv.org/abs/2406.03459v1
- Date: Wed, 5 Jun 2024 17:07:24 GMT
- Title: LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection
- Authors: Qiang Chen, Xiangbo Su, Xinyu Zhang, Jian Wang, Jiahui Chen, Yunpeng Shen, Chuchu Han, Ziliang Chen, Weixiang Xu, Fanrong Li, Shan Zhang, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang
- Abstract summary: We present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection.
The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder.
- Score: 63.780355815743135
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advanced techniques, such as training-effective techniques, e.g., improved loss and pretraining, and interleaved window and global attentions for reducing the ViT encoder complexity. We improve the ViT encoder by aggregating multi-level feature maps, and the intermediate and final feature maps in the ViT encoder, forming richer feature maps, and introduce window-major feature map organization for improving the efficiency of interleaved attention computation. Experimental results demonstrate that the proposed approach is superior over existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at (https://github.com/Atten4Vis/LW-DETR).
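As a reading aid, the following is a minimal PyTorch sketch of the stack named in the abstract: a ViT encoder with interleaved window and global attention, a projector over intermediate and final feature maps, and a shallow DETR decoder. All layer counts, dimensions, the simplified 1-D windowing, and the single-intermediate fusion are illustrative assumptions, not the released LW-DETR configuration (see the linked repository for that).

```python
# Minimal sketch of the LW-DETR layout: ViT encoder (interleaved window/global
# attention), projector over multi-level features, shallow DETR decoder.
# Everything below is an assumption-level illustration, not the paper's model.
import torch
import torch.nn as nn


class ViTBlock(nn.Module):
    """One encoder block; `window=True` restricts attention to local windows."""

    def __init__(self, dim, heads, window=False, window_size=4):
        super().__init__()
        self.window, self.window_size = window, window_size
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                      # x: (B, N, C) patch tokens
        h = self.norm1(x)
        if self.window:
            # window-major organization (simplified to 1-D here): regroup tokens
            # into fixed-size windows and attend only within each window
            B, N, C = h.shape
            w = self.window_size
            h = h.reshape(B * (N // w), w, C)
            h, _ = self.attn(h, h, h)
            h = h.reshape(B, N, C)
        else:
            h, _ = self.attn(h, h, h)          # global attention over all tokens
        x = x + h
        return x + self.mlp(self.norm2(x))


class LWDETRSketch(nn.Module):
    def __init__(self, dim=256, heads=8, depth=6, num_queries=100, num_classes=80):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        # interleave window and global attention blocks to cut encoder cost
        self.encoder = nn.ModuleList(
            ViTBlock(dim, heads, window=(i % 2 == 0)) for i in range(depth)
        )
        # projector fuses an intermediate and the final encoder feature map
        self.projector = nn.Linear(2 * dim, dim)
        self.queries = nn.Embedding(num_queries, dim)
        layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=3)   # shallow decoder
        self.class_head = nn.Linear(dim, num_classes + 1)
        self.box_head = nn.Linear(dim, 4)

    def forward(self, images):                 # images: (B, 3, H, W), H, W % 16 == 0
        x = self.patch_embed(images).flatten(2).transpose(1, 2)     # (B, N, C)
        intermediate = None
        for i, block in enumerate(self.encoder):
            x = block(x)
            if i == len(self.encoder) // 2 - 1:
                intermediate = x               # keep a mid-depth feature map
        feats = self.projector(torch.cat([intermediate, x], dim=-1))
        q = self.queries.weight.unsqueeze(0).expand(images.size(0), -1, -1)
        hs = self.decoder(q, feats)
        return self.class_head(hs), self.box_head(hs).sigmoid()


logits, boxes = LWDETRSketch()(torch.randn(1, 3, 256, 256))  # (1, 100, 81), (1, 100, 4)
```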
Related papers
- What is YOLOv9: An In-Depth Exploration of the Internal Features of the Next-Generation Object Detector [0.0]
This study examines the YOLOv9 object detection model, covering its architectural innovations, training methodologies, and performance improvements.
Key advancements, such as the Generalized Efficient Layer Aggregation Network (GELAN) and Programmable Gradient Information (PGI), significantly enhance feature extraction and gradient flow.
This paper provides the first in-depth exploration of YOLOv9's internal features and their real-world applicability, establishing it as a state-of-the-art solution for real-time object detection.
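As a rough illustration only, the sketch below shows a generic split-process-aggregate block in the spirit of the GELAN component named above: split channels, push one half through stacked convolutions, and concatenate every intermediate output so shallow layers get direct gradient paths. The block composition, channel counts, and depth are assumptions; this is not YOLOv9's actual GELAN.

```python
# Generic layer-aggregation block sketch (assumption-level, not YOLOv9's GELAN).
import torch
import torch.nn as nn


class ConvBNAct(nn.Module):
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class AggregationBlock(nn.Module):
    def __init__(self, channels, depth=2):
        super().__init__()
        half = channels // 2
        self.stem = ConvBNAct(channels, channels, k=1)
        self.blocks = nn.ModuleList(ConvBNAct(half, half) for _ in range(depth))
        # fuse both halves plus every intermediate output back to `channels`
        self.fuse = ConvBNAct(channels + depth * half, channels, k=1)

    def forward(self, x):
        a, b = self.stem(x).chunk(2, dim=1)    # split channels into two branches
        outs = [a, b]
        for block in self.blocks:
            b = block(b)                       # deepen one branch step by step
            outs.append(b)                     # keep every intermediate output
        return self.fuse(torch.cat(outs, dim=1))


y = AggregationBlock(64)(torch.randn(1, 64, 32, 32))  # -> (1, 64, 32, 32)
```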
arXiv Detail & Related papers (2024-09-12T07:46:58Z)
- DecoderTracker: Decoder-Only Method for Multiple-Object Tracking [10.819349280398363]
This paper attempts to construct a lightweight Decoder-only model: DecoderTracker for end-to-end multi-object tracking.
Specifically, we have developed an image feature extraction network which can efficiently extract features from images to replace the encoder structure.
On the DanceTrack dataset, without any bells and whistles, DecoderTracker's tracking performance slightly surpasses that of MOTR, with approximately twice the inference speed.
arXiv Detail & Related papers (2023-10-26T05:49:44Z)
- ViT-ReciproCAM: Gradient and Attention-Free Visual Explanations for Vision Transformer [0.0]
Vision Transformers (ViT) have demonstrated superior performance in various computer vision tasks such as image classification and object detection.
Current state-of-the-art solutions for ViT rely on class-agnostic Attention-Rollout and Relevance techniques.
We propose a new gradient-free visual explanation method for ViT, called ViT-ReciproCAM, which requires neither attention matrices nor gradient information.
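For context on the "gradient-free, attention-free" idea, here is a hedged sketch of the simplest method in that family: plain occlusion sensitivity, which scores each input patch by how much zeroing it lowers the target-class probability. It is not the ViT-ReciproCAM algorithm itself; `model` and the patch size are placeholders.

```python
# Plain occlusion-sensitivity saliency (assumption-level stand-in, not ReciproCAM).
import torch


@torch.no_grad()
def occlusion_saliency(model, image, target_class, patch=16):
    """image: (1, 3, H, W) with H, W divisible by `patch`; returns a saliency grid."""
    model.eval()
    base = model(image).softmax(dim=-1)[0, target_class].item()
    _, _, H, W = image.shape
    grid = torch.zeros(H // patch, W // patch)
    for i in range(H // patch):
        for j in range(W // patch):
            occluded = image.clone()
            occluded[:, :, i * patch:(i + 1) * patch, j * patch:(j + 1) * patch] = 0
            score = model(occluded).softmax(dim=-1)[0, target_class].item()
            grid[i, j] = max(base - score, 0.0)   # larger drop => more important patch
    return grid / max(grid.max().item(), 1e-8)    # normalize to [0, 1]
```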
arXiv Detail & Related papers (2023-10-04T05:09:50Z)
- Spatiotemporal Attention-based Semantic Compression for Real-time Video Recognition [117.98023585449808]
We propose a spatiotemporal attention-based autoencoder (STAE) architecture to evaluate the importance of frames and of pixels within each frame.
We develop a lightweight decoder that leverages a combined 3D-2D CNN to reconstruct the missing information.
Experimental results show that ViT_STAE can compress the video dataset H51 by 104x with only 5% accuracy loss.
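As a loose illustration of the frame-importance idea above, the sketch below ranks frames with a small learned scorer and keeps only the top-k for transmission, leaving the receiver to reconstruct the rest. The scorer, feature shapes, and k are assumptions, not the STAE architecture.

```python
# Frame selection by learned importance scores (assumption-level illustration).
import torch
import torch.nn as nn


class FrameScorer(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.score = nn.Linear(feat_dim, 1)

    def forward(self, frame_feats):                       # (T, C, H, W) per-frame features
        pooled = self.pool(frame_feats).flatten(1)        # (T, C)
        return self.score(pooled).squeeze(-1).softmax(0)  # (T,) importance weights


def select_frames(frames, frame_feats, scorer, k):
    """Keep the k highest-scoring frames (sorted back into temporal order)."""
    weights = scorer(frame_feats)
    keep = weights.topk(k).indices.sort().values
    return frames[keep], keep


frames = torch.randn(16, 3, 64, 64)      # 16 video frames
feats = torch.randn(16, 512, 4, 4)       # placeholder per-frame feature maps
kept, idx = select_frames(frames, feats, FrameScorer(), k=4)   # kept: (4, 3, 64, 64)
```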
arXiv Detail & Related papers (2023-05-22T07:47:27Z)
- Lite DETR : An Interleaved Multi-Scale Encoder for Efficient DETR [27.120786736090842]
We present Lite DETR, a simple yet efficient end-to-end object detection framework.
We design an efficient encoder block to update high-level features and low-level features.
To better fuse cross-scale features, we develop a key-aware deformable attention mechanism to predict more reliable attention weights.
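A hedged sketch of the interleaved update schedule this summary suggests: refine the few high-level tokens every iteration and touch the many low-level tokens only occasionally. Plain multi-head attention stands in for the paper's key-aware deformable attention; the dimensions and the schedule are illustrative assumptions.

```python
# Interleaved multi-scale update schedule (assumption-level, plain attention).
import torch
import torch.nn as nn


class InterleavedEncoder(nn.Module):
    def __init__(self, dim=256, heads=8, iterations=6, low_level_every=3):
        super().__init__()
        self.iterations = iterations
        self.low_level_every = low_level_every
        self.high_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.low_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, high, low):   # high: (B, Nh, C) few tokens; low: (B, Nl, C) many
        for i in range(self.iterations):
            memory = torch.cat([high, low], dim=1)
            upd, _ = self.high_attn(high, memory, memory)    # cheap: every iteration
            high = high + upd
            if (i + 1) % self.low_level_every == 0:
                upd, _ = self.low_attn(low, memory, memory)  # costly: only occasionally
                low = low + upd
        return high, low


high = torch.randn(1, 25, 256)    # e.g. a 5x5 high-level feature map
low = torch.randn(1, 400, 256)    # e.g. a 20x20 low-level feature map
high, low = InterleavedEncoder()(high, low)
```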
arXiv Detail & Related papers (2023-03-13T17:57:59Z)
- RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving [80.14669385741202]
Vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks.
ViTs are notoriously hard to train and require a lot of training data to learn powerful representations.
We show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI.
arXiv Detail & Related papers (2023-01-24T18:50:48Z)
- Pair DETR: Contrastive Learning Speeds Up DETR Training [0.6491645162078056]
We present a simple approach that addresses the main problem of DETR: slow convergence.
We detect an object bounding box as a pair of keypoints, the top-left corner and the center, using two decoders.
Experiments show that Pair DETR can converge at least 10x faster than the original DETR and 1.5x faster than Conditional DETR during training.
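The corner-plus-center parameterization above has a simple geometric consequence: the full box can be recovered by reflecting the top-left corner through the center. A small worked example follows (how Pair DETR matches the two decoders' outputs is not covered here).

```python
# Recover a full box from a top-left corner and the box center (pure geometry).
def box_from_corner_and_center(x1, y1, cx, cy):
    """Return (x1, y1, x2, y2) given a top-left corner and the box center."""
    x2 = 2.0 * cx - x1   # bottom-right is the corner mirrored through the center
    y2 = 2.0 * cy - y1
    return x1, y1, x2, y2


assert box_from_corner_and_center(10.0, 20.0, 30.0, 50.0) == (10.0, 20.0, 50.0, 80.0)
```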
arXiv Detail & Related papers (2022-10-29T03:02:49Z)
- RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
It achieves a better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- Learning Spatio-Temporal Transformer for Visual Tracking [108.11680070733598]
We present a new tracking architecture with an encoder-decoder transformer as the key component.
The whole method is end-to-end and does not need any postprocessing steps such as cosine windowing or bounding box smoothing.
The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks while running at real-time speed, 6x faster than Siam R-CNN.
arXiv Detail & Related papers (2021-03-31T15:19:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.