DFIR-DETR: Frequency Domain Enhancement and Dynamic Feature Aggregation for Cross-Scene Small Object Detection
- URL: http://arxiv.org/abs/2512.07078v1
- Date: Mon, 08 Dec 2025 01:25:10 GMT
- Title: DFIR-DETR: Frequency Domain Enhancement and Dynamic Feature Aggregation for Cross-Scene Small Object Detection
- Authors: Bo Gao, Jingcheng Tong, Xingsheng Chen, Han Yu, Zichen Li,
- Abstract summary: Small object detection in UAV remote sensing images is difficult.<n>Current transformer-based detectors struggle with three critical issues.<n>We introduce DFIR-DETR to tackle these problems through dynamic feature aggregation combined with frequency-domain processing.
- Score: 16.16000521213211
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Detecting small objects in UAV remote sensing images and identifying surface defects in industrial inspection remain difficult tasks. These applications face common obstacles: features are sparse and weak, backgrounds are cluttered, and object scales vary dramatically. Current transformer-based detectors, while powerful, struggle with three critical issues. First, features degrade severely as networks downsample progressively. Second, spatial convolutions cannot capture long-range dependencies effectively. Third, standard upsampling methods inflate feature maps unnecessarily. We introduce DFIR-DETR to tackle these problems through dynamic feature aggregation combined with frequency-domain processing. Our architecture builds on three novel components. The DCFA module uses dynamic K-sparse attention, cutting complexity from O(N2) down to O(NK), and employs spatial gated linear units for better nonlinear modeling. The DFPN module applies amplitude-normalized upsampling to prevent feature inflation and uses dual-path shuffle convolution to retain spatial details across scales. The FIRC3 module operates in the frequency domain, achieving global receptive fields without sacrificing efficiency. We tested our method extensively on NEU-DET and VisDrone datasets. Results show mAP50 scores of 92.9% and 51.6% respectively-both state-of-the-art. The model stays lightweight with just 11.7M parameters and 41.2 GFLOPs. Strong performance across two very different domains confirms that DFIR-DETR generalizes well and works effectively in resource-limited settings for cross-scene small object detection.
Related papers
- TransBridge: Boost 3D Object Detection by Scene-Level Completion with Transformer Decoder [66.22997415145467]
This paper presents a joint completion and detection framework that improves the detection feature in sparse areas.<n> Specifically, we propose TransBridge, a novel transformer-based up-sampling block that fuses the features from the detection and completion networks.<n>The results show that our framework consistently improves end-to-end 3D object detection, with the mean average precision (mAP) ranging from 0.7 to 1.5 across multiple methods.
arXiv Detail & Related papers (2025-12-12T00:08:03Z) - FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection [18.023418423273082]
We propose FMC-DETR, a novel framework with frequency-decoupled fusion for aerial-view object detection.<n>First, we introduce the Wavelet Kolmogorov-Arnold Transformer (WeKat) backbone, which applies cascaded wavelet transforms to enhance global low-frequency context perception.<n>Next, a lightweight Cross-stage Partial Fusion (CPF) module reduces redundancy and improves multi-scale feature interaction.<n>Finally, we introduce the Multi-Domain Feature Coordination (MDFC) module, which unifies spatial, frequency, and structural priors to balance detail preservation and global enhancement.
arXiv Detail & Related papers (2025-09-27T02:28:22Z) - Wavelet-Guided Dual-Frequency Encoding for Remote Sensing Change Detection [67.84730634802204]
Change detection in remote sensing imagery plays a vital role in various engineering applications, such as natural disaster monitoring, urban expansion tracking, and infrastructure management.<n>Most existing methods still rely on spatial-domain modeling, where the limited diversity of feature representations hinders the detection of subtle change regions.<n>We observe that frequency-domain feature modeling particularly in the wavelet domain amplify fine-grained differences in frequency components, enhancing the perception of edge changes that are challenging to capture in the spatial domain.
arXiv Detail & Related papers (2025-08-07T11:14:16Z) - ARFC-WAHNet: Adaptive Receptive Field Convolution and Wavelet-Attentive Hierarchical Network for Infrared Small Target Detection [2.643590634429843]
ARFC-WAHNet is an adaptive receptive field convolution and wavelet-attentive hierarchical network for infrared small target detection.<n>ARFC-WAHNet outperforms recent state-of-the-art methods in both detection accuracy and robustness.
arXiv Detail & Related papers (2025-05-15T09:44:23Z) - An Efficient Aerial Image Detection with Variable Receptive Fields [0.0]
We propose a transformer-based detector incorporating three key components.<n>VRF-DETR achieves 51.4% mAPtextsubscript50 and 31.8% mAPtextsubscript50:95 with only 13.5M parameters.
arXiv Detail & Related papers (2025-04-21T15:16:13Z) - SFMNet: Sparse Focal Modulation for 3D Object Detection [11.19540223578237]
SFMNet is a novel 3D sparse detector that combines the efficiency of sparse convolutions with the ability to model long-range dependencies.<n>We show that our detector achieves state-of-the-art performance on autonomous driving datasets.
arXiv Detail & Related papers (2025-03-15T11:40:58Z) - Learning Dynamic Local Context Representations for Infrared Small Target Detection [5.897465234102489]
Infrared small target detection (ISTD) is challenging due to complex backgrounds, low signal-to-clutter ratios, and varying target sizes and shapes.<n>We propose LCRNet, a novel method that learns dynamic local context representations for ISTD.<n>With only 1.65M parameters, LCRNet achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-12-23T09:06:27Z) - Frequency Perception Network for Camouflaged Object Detection [51.26386921922031]
We propose a novel learnable and separable frequency perception mechanism driven by the semantic hierarchy in the frequency domain.<n>Our entire network adopts a two-stage model, including a frequency-guided coarse localization stage and a detail-preserving fine localization stage.<n>Compared with the currently existing models, our proposed method achieves competitive performance in three popular benchmark datasets.
arXiv Detail & Related papers (2023-08-17T11:30:46Z) - DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets [95.84755169585492]
We present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception.
Our model achieves state-of-the-art performance with a broad range of 3D perception tasks.
arXiv Detail & Related papers (2023-01-15T09:31:58Z) - Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in
Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed as Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves the state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z) - CAINNFlow: Convolutional block Attention modules and Invertible Neural
Networks Flow for anomaly detection and localization tasks [28.835943674247346]
In this study, we design a complex function model with alternating CBAM embedded in a stacked $3times3$ full convolution, which is able to retain and effectively extract spatial structure information.
Experiments show that CAINNFlow achieves advanced levels of accuracy and inference efficiency based on CNN and Transformer backbone networks as feature extractors.
arXiv Detail & Related papers (2022-06-04T13:45:08Z) - SALISA: Saliency-based Input Sampling for Efficient Video Object
Detection [58.22508131162269]
We propose SALISA, a novel non-uniform SALiency-based Input SAmpling technique for video object detection.
We show that SALISA significantly improves the detection of small objects.
arXiv Detail & Related papers (2022-04-05T17:59:51Z) - SCRDet++: Detecting Small, Cluttered and Rotated Objects via
Instance-Level Feature Denoising and Rotation Loss Smoothing [131.04304632759033]
Small and cluttered objects are common in real-world which are challenging for detection.
In this paper, we first innovatively introduce the idea of denoising to object detection.
Instance-level denoising on the feature map is performed to enhance the detection to small and cluttered objects.
arXiv Detail & Related papers (2020-04-28T06:03:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.