Cross-DINO: Cross the Deep MLP and Transformer for Small Object Detection
- URL: http://arxiv.org/abs/2505.21868v1
- Date: Wed, 28 May 2025 01:33:23 GMT
- Title: Cross-DINO: Cross the Deep MLP and Transformer for Small Object Detection
- Authors: Guiping Cao, Wenjian Huang, Xiangyuan Lan, Jianguo Zhang, Dongmei Jiang, Yaowei Wang
- Abstract summary: Small Object Detection (SOD) poses significant challenges due to limited information and the model's low class prediction score. We introduce a novel approach called Cross-DINO to address these challenges. We show that Cross-DINO efficiently improves the performance of DETR-like models on SOD.
- Score: 39.56089737473775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Small Object Detection (SOD) poses significant challenges due to limited information and the model's low class prediction score. While Transformer-based detectors have shown promising performance, their potential for SOD remains largely unexplored. In typical DETR-like frameworks, the CNN backbone network, specialized in aggregating local information, struggles to capture the necessary contextual information for SOD. The multiple attention layers in the Transformer Encoder face difficulties in effectively attending to small objects and can also lead to blurring of features. Furthermore, the model's lower class prediction score for small objects compared to large objects further increases the difficulty of SOD. To address these challenges, we introduce a novel approach called Cross-DINO. This approach incorporates a deep MLP network to aggregate initial feature representations with both short- and long-range information for SOD. Then, a new Cross Coding Twice Module (CCTM) is applied to integrate these initial representations into the Transformer Encoder feature, enhancing the details of small objects. Additionally, we introduce a new kind of soft label named Category-Size (CS), integrating the Category and Size of objects. By treating CS as new ground truth, we propose a new loss function called Boost Loss to improve the class prediction score of the model. Extensive experimental results on COCO, WiderPerson, VisDrone, AI-TOD, and SODA-D datasets demonstrate that Cross-DINO efficiently improves the performance of DETR-like models on SOD. Specifically, our model achieves 36.4% AP_S on COCO for SOD with only 45M parameters, outperforming DINO by +4.4% AP_S (36.4% vs. 32.0%) with fewer parameters and FLOPs, under a 12-epoch training setting. The source code will be available at https://github.com/Med-Process/Cross-DINO.
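The small-object AP figure quoted in the abstract follows the COCO convention of grouping objects by pixel area. As a minimal sketch of that grouping (not code from the paper, and not the paper's Category-Size label), the helper below assigns a box to a size bucket using the thresholds from the COCO evaluation documentation (32^2 and 96^2 pixels). The function name `coco_size_bucket` is hypothetical, box area stands in for the annotated object area, and the handling of values exactly on a threshold is an assumption.

```python
def coco_size_bucket(width: float, height: float) -> str:
    """Return the COCO-style size bucket for a box of the given pixel size.

    Assumptions: box area (width * height) approximates the annotated object
    area, and an area exactly on a threshold falls into the larger bucket.
    """
    area = width * height
    if area < 32 ** 2:   # small objects: area below 32 x 32 pixels
        return "small"
    if area < 96 ** 2:   # medium objects: between 32 x 32 and 96 x 96 pixels
        return "medium"
    return "large"       # large objects: 96 x 96 pixels and above


# Example boxes given as (width, height) in pixels.
boxes = [(10, 12), (50, 40), (200, 150)]
buckets = [coco_size_bucket(w, h) for w, h in boxes]
print(buckets)
```

Only objects in the "small" bucket contribute to the small-object AP, which is why a detector can score well on overall AP while still lagging on this metric.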
Related papers
- RRCANet: Recurrent Reusable-Convolution Attention Network for Infrared Small Target Detection [23.54800619558163]
Infrared small target detection is a challenging task due to its unique characteristics. Recent CNN-based methods have achieved promising performance with heavy feature extraction and fusion modules. We propose a recurrent reusable-convolution attention network (RRCA-Net) for infrared small target detection.
arXiv Detail & Related papers (2025-06-03T03:18:17Z)
- Learning Dynamic Local Context Representations for Infrared Small Target Detection [5.897465234102489]
Infrared small target detection (ISTD) is challenging due to complex backgrounds, low signal-to-clutter ratios, and varying target sizes and shapes. We propose LCRNet, a novel method that learns dynamic local context representations for ISTD. With only 1.65M parameters, LCRNet achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-12-23T09:06:27Z)
- Better Sampling, towards Better End-to-end Small Object Detection [7.7473020808686694]
Small object detection remains unsatisfactory due to the limited characteristics, high density, and mutual overlap of small objects.
We propose methods enhancing sampling within an end-to-end framework.
Our model demonstrates a significant enhancement, achieving a 2.9% increase in average precision (AP) over the state-of-the-art (SOTA) on the VisDrone dataset.
arXiv Detail & Related papers (2024-05-17T04:37:44Z)
- FILP-3D: Enhancing 3D Few-shot Class-incremental Learning with Pre-trained Vision-Language Models [59.13757801286343]
Few-shot class-incremental learning aims to mitigate the catastrophic forgetting issue when a model is incrementally trained on limited data. We introduce the FILP-3D framework with two novel components: the Redundant Feature Eliminator (RFE) for feature space misalignment and the Spatial Noise Compensator (SNC) for significant noise.
arXiv Detail & Related papers (2023-12-28T14:52:07Z)
- Innovative Horizons in Aerial Imagery: LSKNet Meets DiffusionDet for Advanced Object Detection [55.2480439325792]
We present an in-depth evaluation of an object detection model that integrates the LSKNet backbone with the DiffusionDet head.
The proposed model achieves a mean average precision (mAP) of approximately 45.7%, which is a significant improvement.
This advancement underscores the effectiveness of the proposed modifications and sets a new benchmark in aerial image analysis.
arXiv Detail & Related papers (2023-11-21T19:49:13Z)
- Decoupled DETR For Few-shot Object Detection [4.520231308678286]
We improve the FSOD model to address the severe issues of sample imbalance and weak feature propagation.
We build a unified decoder module that could dynamically fuse the decoder layers as the output feature.
Our results indicate that our proposed module could achieve stable improvements of 5% to 10% in both fine-tuning and meta-learning paradigms.
arXiv Detail & Related papers (2023-11-20T07:10:39Z)
- Small Object Detection via Coarse-to-fine Proposal Generation and Imitation Learning [52.06176253457522]
We propose a two-stage framework tailored for small object detection based on the Coarse-to-fine pipeline and Feature Imitation learning.
CFINet achieves state-of-the-art performance on the large-scale small object detection benchmarks, SODA-D and SODA-A.
arXiv Detail & Related papers (2023-08-18T13:13:09Z)
- Chosen methods of improving object recognition of small objects with weak recognizable features [0.0]
Using a proper GAN model would enable augmenting low-precision data, increasing its amount and diversity.
In this work, a GAN-based method with augmentation is presented to improve small object detection on the VOC Pascal dataset.
arXiv Detail & Related papers (2022-08-29T13:39:02Z)
- DecAug: Augmenting HOI Detection via Decomposition [54.65572599920679]
Current algorithms suffer from insufficient training samples and category imbalance within datasets.
We propose an efficient and effective data augmentation method called DecAug for HOI detection.
Experiments show that our method brings up to 3.3 mAP and 1.6 mAP improvements on the V-COCO and HICO-DET datasets.
arXiv Detail & Related papers (2020-10-02T13:59:05Z)
- Reinforced Axial Refinement Network for Monocular 3D Object Detection [160.34246529816085]
Monocular 3D object detection aims to extract the 3D position and properties of objects from a 2D input image.
Conventional approaches sample 3D bounding boxes from the space and infer the relationship between the target object and each of them; however, the probability of effective samples is relatively small in the 3D space.
We propose to start with an initial prediction and refine it gradually towards the ground truth, with only one 3D parameter changed in each step.
This requires designing a policy which gets a reward after several steps, and thus we adopt reinforcement learning to optimize it.
arXiv Detail & Related papers (2020-08-31T17:10:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.