YOLOSA: Object detection based on 2D local feature superimposed self-attention
- URL: http://arxiv.org/abs/2206.11825v1
- Date: Thu, 23 Jun 2022 16:49:21 GMT
- Title: YOLOSA: Object detection based on 2D local feature superimposed self-attention
- Authors: Weisheng Li and Lin Huang
- Abstract summary: We propose a novel self-attention module, called 2D local feature superimposed self-attention, for the feature concatenation stage of the neck network.
Average precisions of 49.0% (66.2 FPS), 46.1% (80.6 FPS), and 39.1% (100 FPS) were obtained for large, medium, and small-scale models built using our proposed improvements.
- Score: 13.307581544820248
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We analyzed the network structure of real-time object detection models and found that the features in the feature concatenation stage are very rich. Applying an attention module here can effectively improve the detection accuracy of the model. However, commonly used attention and self-attention modules show poor performance in both detection accuracy and inference efficiency. Therefore, we propose a novel self-attention module, called 2D local feature superimposed self-attention, for the feature concatenation stage of the neck network. This self-attention module reflects global features through local features and local receptive fields. We also propose and optimize an efficient decoupled head and AB-OTA, and achieve SOTA results. Average precisions of 49.0% (66.2 FPS), 46.1% (80.6 FPS), and 39.1% (100 FPS) were obtained for large, medium, and small-scale models built using our proposed improvements. Our models exceeded YOLOv5 by 0.8% to 3.1% in average precision.
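The abstract describes the mechanism only at a high level. As a minimal sketch in a PyTorch setting, the snippet below shows one plausible way a local-window self-attention block could sit at the neck's feature concatenation stage, with global context reflected only through local receptive fields. The window size, channel counts, head count, and placement are illustrative assumptions, not the authors' exact 2D local feature superimposed design, and the decoupled head and AB-OTA are not covered here.

```python
# Hedged sketch only: a generic local-window self-attention block at the
# neck concatenation stage. All hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class LocalWindowSelfAttention(nn.Module):
    def __init__(self, channels: int, window: int = 7, heads: int = 4):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); H and W assumed divisible by the window size.
        b, c, h, w = x.shape
        s = self.window
        # Partition the map into non-overlapping s x s windows
        # (local receptive fields), one attention sequence per window.
        xw = x.view(b, c, h // s, s, w // s, s)
        xw = xw.permute(0, 2, 4, 3, 5, 1).reshape(-1, s * s, c)
        # Attention stays local; global features are reflected only through
        # stacking such blocks, as the abstract's wording suggests.
        q = self.norm(xw)
        y, _ = self.attn(q, q, q, need_weights=False)
        xw = xw + y  # residual connection
        # Reverse the window partition back to (B, C, H, W).
        xw = xw.view(b, h // s, w // s, s, s, c)
        return xw.permute(0, 5, 1, 3, 2, 4).reshape(b, c, h, w)

# Illustrative use at the neck's feature concatenation stage:
p4 = torch.randn(1, 128, 28, 28)    # upsampled deep feature (assumed shape)
c4 = torch.randn(1, 128, 28, 28)    # lateral backbone feature (assumed shape)
fused = torch.cat([p4, c4], dim=1)  # (1, 256, 28, 28)
out = LocalWindowSelfAttention(256, window=7)(fused)
```

Restricting attention to s x s windows makes the cost scale with the number of windows rather than quadratically in H*W, which is consistent with the paper's emphasis on inference efficiency.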
Related papers
- Efficient Feature Aggregation and Scale-Aware Regression for Monocular 3D Object Detection [40.14197775884804]
MonoASRH is a novel monocular 3D detection framework composed of Efficient Hybrid Feature Aggregation Module (EH-FAM) and Adaptive Scale-Aware 3D Regression Head (ASRH)
EH-FAM employs multi-head attention with a global receptive field to extract semantic features for small-scale objects.
ASRH encodes 2D bounding box dimensions and then fuses scale features with the semantic features aggregated by EH-FAM.
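As a rough illustration of "multi-head attention with a global receptive field", the sketch below flattens a feature map into tokens so that every spatial location attends to every other; the class name, shapes, and head count are assumptions, not EH-FAM's actual design.

```python
# Hypothetical sketch: global multi-head attention over a flattened
# feature map; every location attends to every other location.
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)       # (B, H*W, C)
        out, _ = self.attn(tokens, tokens, tokens)  # global receptive field
        return out.transpose(1, 2).reshape(b, c, h, w)

feat = torch.randn(2, 64, 20, 20)      # e.g. a small-object feature map
print(GlobalAttention(64)(feat).shape)  # torch.Size([2, 64, 20, 20])
```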
arXiv Detail & Related papers (2024-11-05T02:33:25Z) - Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models [64.67721492968941]
We propose a Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR) framework.
Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness.
Our method yields a 9.58% enhancement in zero-shot robust accuracy over the current state-of-the-art techniques.
arXiv Detail & Related papers (2024-10-29T07:15:09Z) - Stanceformer: Target-Aware Transformer for Stance Detection [59.69858080492586]
Stance Detection involves discerning the stance expressed in a text towards a specific subject or target.
Prior works have relied on existing transformer models that lack the capability to prioritize targets effectively.
We introduce Stanceformer, a target-aware transformer model that incorporates enhanced attention towards the targets during both training and inference.
arXiv Detail & Related papers (2024-10-09T17:24:28Z) - PVAFN: Point-Voxel Attention Fusion Network with Multi-Pooling Enhancing for 3D Object Detection [59.355022416218624]
The integration of point and voxel representations is becoming more common in LiDAR-based 3D object detection.
We propose a novel two-stage 3D object detector, called Point-Voxel Attention Fusion Network (PVAFN)
PVAFN uses a multi-pooling strategy to integrate both multi-scale and region-specific information effectively.
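The summary does not spell out the pooling scheme; as a hedged sketch, multi-scale context pooling might look like the following, where the scales and fusion by concatenation are assumptions rather than PVAFN's actual strategy.

```python
# Hypothetical sketch of a multi-pooling step: pool a feature map at
# several scales and fuse the results. Scales and shapes are assumptions.
import torch
import torch.nn.functional as F

def multi_pool(x: torch.Tensor, scales=(1, 2, 4)) -> torch.Tensor:
    b, c, h, w = x.shape
    pooled = [
        F.interpolate(F.adaptive_avg_pool2d(x, s), size=(h, w), mode="nearest")
        for s in scales
    ]  # each (B, C, H, W): coarse-to-fine, region-level context
    return torch.cat([x] + pooled, dim=1)  # (B, C*(1+len(scales)), H, W)

feat = torch.randn(1, 32, 16, 16)
print(multi_pool(feat).shape)  # torch.Size([1, 128, 16, 16])
```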
arXiv Detail & Related papers (2024-08-26T19:43:01Z) - YOLO-TLA: An Efficient and Lightweight Small Object Detection Model based on YOLOv5 [19.388112026410045]
YOLO-TLA is an advanced object detection model building on YOLOv5.
We first introduce an additional detection layer for small objects in the neck network pyramid architecture.
A further module uses sliding-window feature extraction, which effectively minimizes both computational demand and the number of parameters.
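"Sliding-window feature extraction" is not detailed in the summary; the minimal sketch below shows the generic operation via torch's unfold, with the kernel size and stride as illustrative assumptions rather than YOLO-TLA's actual module.

```python
# Minimal sketch of sliding-window feature extraction using unfold.
import torch
import torch.nn.functional as F

def sliding_window_features(x: torch.Tensor, k: int = 3, stride: int = 1):
    # x: (B, C, H, W) -> (B, C*k*k, L), where L is the number of window
    # positions; each column holds one k x k window's features.
    return F.unfold(x, kernel_size=k, stride=stride, padding=k // 2)

x = torch.randn(1, 16, 8, 8)
print(sliding_window_features(x).shape)  # torch.Size([1, 144, 64])
```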
arXiv Detail & Related papers (2024-02-22T05:55:17Z) - AGO-Net: Association-Guided 3D Point Cloud Object Detection Network [86.10213302724085]
We propose a novel 3D detection framework that associates intact features for objects via domain adaptation.
We achieve new state-of-the-art performance on the KITTI 3D detection benchmark in both accuracy and speed.
arXiv Detail & Related papers (2022-08-24T16:54:38Z) - The Devil is in the Task: Exploiting Reciprocal Appearance-Localization
Features for Monocular 3D Object Detection [62.1185839286255]
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving.
We introduce a Dynamic Feature Reflecting Network, named DFR-Net.
We rank 1st among all the monocular 3D object detectors in the KITTI test set.
arXiv Detail & Related papers (2021-12-28T07:31:18Z) - AGSFCOS: Based on attention mechanism and Scale-Equalizing pyramid
network of object detection [10.824032219531095]
Our model improves accuracy over current popular detection models on the COCO dataset.
Our best model achieves 39.5% COCO AP with a ResNet50 backbone.
arXiv Detail & Related papers (2021-05-20T08:41:02Z) - SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection [9.924083358178239]
We propose two variants of self-attention for contextual modeling in 3D object detection.
We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors.
Next, we propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations.
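As an informal sketch of the two variants described above (full pairwise attention versus attention over a sampled subset), ignoring the learned deformations and all projection layers:

```python
# Hedged sketch of the two self-attention variants; names, scaling, and
# the random sampling are illustrative, not SA-Det3D's exact formulation.
import torch
import torch.nn.functional as F

def pairwise_self_attention(feats: torch.Tensor) -> torch.Tensor:
    # feats: (N, C) per-point/voxel features; each attends to all others.
    scores = feats @ feats.t() / feats.shape[1] ** 0.5  # (N, N) affinities
    return F.softmax(scores, dim=-1) @ feats

def sampled_self_attention(feats: torch.Tensor, k: int = 32) -> torch.Tensor:
    # Attend only to a subset of k features (cheaper than full N x N);
    # SA-Det3D additionally learns deformations over the sampled locations.
    keys = feats[torch.randperm(feats.shape[0])[:k]]    # (k, C)
    scores = feats @ keys.t() / feats.shape[1] ** 0.5   # (N, k)
    return F.softmax(scores, dim=-1) @ keys

x = torch.randn(1024, 64)            # e.g. 1024 voxel features of dim 64
y_full = pairwise_self_attention(x)  # (1024, 64)
y_fast = sampled_self_attention(x)   # (1024, 64)
```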
arXiv Detail & Related papers (2021-01-07T18:30:32Z) - InfoFocus: 3D Object Detection for Autonomous Driving with Dynamic
Information Modeling [65.47126868838836]
We propose a novel 3D object detection framework with dynamic information modeling.
Coarse predictions are generated in the first stage via a voxel-based region proposal network.
Experiments are conducted on the large-scale nuScenes 3D detection benchmark.
arXiv Detail & Related papers (2020-07-16T18:27:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.