HMPE:HeatMap Embedding for Efficient Transformer-Based Small Object Detection
- URL: http://arxiv.org/abs/2504.13469v1
- Date: Fri, 18 Apr 2025 05:24:08 GMT
- Title: HMPE:HeatMap Embedding for Efficient Transformer-Based Small Object Detection
- Authors: YangChen Zeng,
- Abstract summary: This paper introduces HeatMap Position Embedding (HMPE), a novel Transformer Optimization technique.<n>HMPE dynamically integrates positional encoding with semantic detection information through heatmap-guided adaptive learning.<n>By employing HMPE-enhanced embedding, we are able to reduce the number of decoder layers from eight to a minimum of three, significantly decreasing both inference and training costs.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current Transformer-based methods for small object detection continue emerging, yet they have still exhibited significant shortcomings. This paper introduces HeatMap Position Embedding (HMPE), a novel Transformer Optimization technique that enhances object detection performance by dynamically integrating positional encoding with semantic detection information through heatmap-guided adaptive learning.We also innovatively visualize the HMPE method, offering clear visualization of embedded information for parameter fine-tuning.We then create Multi-Scale ObjectBox-Heatmap Fusion Encoder (MOHFE) and HeatMap Induced High-Quality Queries for Decoder (HIDQ) modules. These are designed for the encoder and decoder, respectively, to generate high-quality queries and reduce background noise queries.Using both heatmap embedding and Linear-Snake Conv(LSConv) feature engineering, we enhance the embedding of massively diverse small object categories and reduced the decoder multihead layers, thereby accelerating both inference and training.In the generalization experiments, our approach outperforme the baseline mAP by 1.9% on the small object dataset (NWPU VHR-10) and by 1.2% on the general dataset (PASCAL VOC). By employing HMPE-enhanced embedding, we are able to reduce the number of decoder layers from eight to a minimum of three, significantly decreasing both inference and training costs.
Related papers
- Purifying, Labeling, and Utilizing: A High-Quality Pipeline for Small Object Detection [83.90563802153707]
PLUSNet is a high-quality Small object detection framework.
It comprises three components: the Hierarchical Feature (HFP) framework for purifying upstream features, the Multiple Criteria Label Assignment (MCLA) for improving the quality of midstream training samples, and the Frequency Decoupled Head (FDHead) for more effectively exploiting information to accomplish downstream tasks.
arXiv Detail & Related papers (2025-04-29T10:11:03Z) - DOEI: Dual Optimization of Embedding Information for Attention-Enhanced Class Activation Maps [30.53564087005569]
Weakly supervised semantic segmentation (WSSS) typically utilizes limited semantic annotations to obtain initial Class Activation Maps (CAMs)<n>Due to the inadequate coupling between class activation responses and semantic information in high-dimensional space, the CAM is prone to object co-occurrence or under-activation.<n>We propose DOEI, Dual Optimization of Embedding Information, a novel approach that reconstructs embedding representations through semantic-aware attention weight matrices.
arXiv Detail & Related papers (2025-02-21T19:06:01Z) - LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection [63.780355815743135]
We present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection.
The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder.
arXiv Detail & Related papers (2024-06-05T17:07:24Z) - Introducing a microstructure-embedded autoencoder approach for reconstructing high-resolution solution field data from a reduced parametric space [0.0]
We develop a novel multi-fidelity deep learning approach that transforms low-fidelity solution maps into high-fidelity ones by incorporating parametric space information into a standard autoencoder architecture.
This method's integration of parametric space information significantly reduces the need for training data to effectively predict high-fidelity solutions from low-fidelity ones.
arXiv Detail & Related papers (2024-05-03T10:00:36Z) - Minimalist and High-Performance Semantic Segmentation with Plain Vision
Transformers [10.72362704573323]
We introduce the PlainSeg, a model comprising only three 3$times$3 convolutions in addition to the transformer layers.
We also present the PlainSeg-Hier, which allows for the utilization of hierarchical features.
arXiv Detail & Related papers (2023-10-19T14:01:40Z) - Hierarchical Point Attention for Indoor 3D Object Detection [111.04397308495618]
This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors.
First, we propose Multi-Scale Attention (MS-A) that builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning.
Second, we propose Size-Adaptive Local Attention (Local-A) with adaptive attention regions for localized feature aggregation within bounding box proposals.
arXiv Detail & Related papers (2023-01-06T18:52:12Z) - Ultra-High-Definition Low-Light Image Enhancement: A Benchmark and
Transformer-Based Method [51.30748775681917]
We consider the task of low-light image enhancement (LLIE) and introduce a large-scale database consisting of images at 4K and 8K resolution.
We conduct systematic benchmarking studies and provide a comparison of current LLIE algorithms.
As a second contribution, we introduce LLFormer, a transformer-based low-light enhancement method.
arXiv Detail & Related papers (2022-12-22T09:05:07Z) - Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z) - SALISA: Saliency-based Input Sampling for Efficient Video Object
Detection [58.22508131162269]
We propose SALISA, a novel non-uniform SALiency-based Input SAmpling technique for video object detection.
We show that SALISA significantly improves the detection of small objects.
arXiv Detail & Related papers (2022-04-05T17:59:51Z) - TRACER: Extreme Attention Guided Salient Object Tracing Network [3.2434811678562676]
We propose TRACER, which detects salient objects with explicit edges by incorporating attention guided tracing modules.
A comparison with 13 existing methods reveals that TRACER achieves state-of-the-art performance on five benchmark datasets.
arXiv Detail & Related papers (2021-12-14T13:20:07Z) - A Holistically-Guided Decoder for Deep Representation Learning with
Applications to Semantic Segmentation and Object Detection [74.88284082187462]
One common strategy is to adopt dilated convolutions in the backbone networks to extract high-resolution feature maps.
We propose one novel holistically-guided decoder which is introduced to obtain the high-resolution semantic-rich feature maps.
arXiv Detail & Related papers (2020-12-18T10:51:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.