Low-Level Matters: An Efficient Hybrid Architecture for Robust Multi-frame Infrared Small Target Detection
- URL: http://arxiv.org/abs/2503.02220v1
- Date: Tue, 04 Mar 2025 02:53:25 GMT
- Title: Low-Level Matters: An Efficient Hybrid Architecture for Robust Multi-frame Infrared Small Target Detection
- Authors: Zhihua Shen, Siyang Chen, Han Wang, Tongsu Zhang, Xiaohu Zhang, Xiangpeng Xu, Xia Yang
- Abstract summary: Multi-frame infrared small target detection plays a crucial role in low-altitude and maritime surveillance. The hybrid architecture combining CNNs and Transformers shows great promise for enhancing multi-frame IRSTD. We propose LVNet, a simple yet powerful hybrid architecture that redefines low-level feature learning in hybrid frameworks.
- Score: 5.048364655933007
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-frame infrared small target detection (IRSTD) plays a crucial role in low-altitude and maritime surveillance. The hybrid architecture combining CNNs and Transformers shows great promise for enhancing multi-frame IRSTD performance. In this paper, we propose LVNet, a simple yet powerful hybrid architecture that redefines low-level feature learning in hybrid frameworks for multi-frame IRSTD. Our key insight is that the standard linear patch embeddings in Vision Transformers are insufficient for capturing the scale-sensitive local features critical to infrared small targets. To address this limitation, we introduce a multi-scale CNN frontend that explicitly models local features by leveraging the local spatial bias of convolution. Additionally, we design a U-shaped video Transformer for multi-frame spatiotemporal context modeling, effectively capturing the motion characteristics of targets. Experiments on the publicly available datasets IRDST and NUDT-MIRSDT demonstrate that LVNet outperforms existing state-of-the-art methods. Notably, compared to the current best-performing method, LMAFormer, LVNet achieves an improvement of 5.63% / 18.36% in nIoU, while using only 1/221 of the parameters and 1/92 / 1/21 of the computational cost. Ablation studies further validate the importance of low-level representation learning in hybrid architectures. Our code and trained models are available at https://github.com/ZhihuaShen/LVNet.
Related papers
- An Efficient and Mixed Heterogeneous Model for Image Restoration [71.85124734060665]
Current mainstream approaches are based on three architectural paradigms: CNNs, Transformers, and Mambas.
We propose RestorMixer, an efficient and general-purpose IR model based on mixed-architecture fusion.
arXiv Detail & Related papers (2025-04-15T08:19:12Z)
- VELoRA: A Low-Rank Adaptation Approach for Efficient RGB-Event based Recognition [54.27379947727035]
This paper proposes a novel PEFT strategy to adapt the pre-trained foundation vision models for the RGB-Event-based classification. The frame difference of the dual modalities is also considered to capture the motion cues via the frame difference backbone network. The source code and pre-trained models will be released at https://github.com/Event-AHU/VELoRA.
arXiv Detail & Related papers (2024-12-28T07:38:23Z)
- Learning Dynamic Local Context Representations for Infrared Small Target Detection [5.897465234102489]
Infrared small target detection (ISTD) is challenging due to complex backgrounds, low signal-to-clutter ratios, and varying target sizes and shapes. We propose LCRNet, a novel method that learns dynamic local context representations for ISTD. With only 1.65M parameters, LCRNet achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-12-23T09:06:27Z)
- CTA-Net: A CNN-Transformer Aggregation Network for Improving Multi-Scale Feature Extraction [14.377544481394013]
CTA-Net combines CNNs and ViTs, with transformers capturing long-range dependencies and CNNs extracting localized features.
This integration enables efficient processing of detailed local and broader contextual information.
Experiments on small-scale datasets with fewer than 100,000 samples show that CTA-Net achieves superior performance.
arXiv Detail & Related papers (2024-10-15T09:27:26Z)
- SIGMA: Sinkhorn-Guided Masked Video Modeling [69.31715194419091]
Sinkhorn-guided Masked Video Modelling (SIGMA) is a novel video pretraining method.
We distribute features of space-time tubes evenly across a limited number of learnable clusters.
Experimental results on ten datasets validate the effectiveness of SIGMA in learning more performant, temporally-aware, and robust video representations.
arXiv Detail & Related papers (2024-07-22T08:04:09Z)
- Multiscale Low-Frequency Memory Network for Improved Feature Extraction in Convolutional Neural Networks [13.815116154370834]
We introduce a novel framework, the Multiscale Low-Frequency Memory (MLFM) Network.
The MLFM efficiently preserves low-frequency information, enhancing performance in targeted computer vision tasks.
Our work builds upon the existing CNN foundations and paves the way for future advancements in computer vision.
arXiv Detail & Related papers (2024-03-13T00:48:41Z)
- Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets [91.25055890980084]
There still remains an extreme performance gap between Vision Transformers (ViTs) and Convolutional Neural Networks (CNNs) when training from scratch on small datasets.
We propose Dynamic Hybrid Vision Transformer (DHVT) as the solution to enhance the two inductive biases.
Our DHVT achieves state-of-the-art performance with a lightweight model: 85.68% on CIFAR-100 with 22.8M parameters and 82.3% on ImageNet-1K with 24.0M parameters.
arXiv Detail & Related papers (2022-10-12T06:54:39Z)
- Magic ELF: Image Deraining Meets Association Learning and Transformer [63.761812092934576]
This paper aims to unify CNN and Transformer to take advantage of their learning merits for image deraining.
A novel multi-input attention module (MAM) is proposed to associate rain removal and background recovery.
Our proposed method (dubbed ELF) outperforms the state-of-the-art approach (MPRNet) by 0.25 dB on average.
arXiv Detail & Related papers (2022-07-21T12:50:54Z)
- SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation [11.513054537848227]
We propose a pure transformer architecture called SideRT that can attain excellent predictions in real-time.
This is the first work to show that transformer-based networks can attain state-of-the-art performance in real-time in the single image depth estimation field.
arXiv Detail & Related papers (2022-04-29T05:46:20Z)
- Global Filter Networks for Image Classification [90.81352483076323]
We present a conceptually simple yet computationally efficient architecture that learns long-term spatial dependencies in the frequency domain with log-linear complexity.
Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness.
arXiv Detail & Related papers (2021-07-01T17:58:16Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scaled pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
- Volterra Neural Networks (VNNs) [24.12314339259243]
We propose a Volterra filter-inspired Network architecture to reduce the complexity of Convolutional Neural Networks.
We show an efficient parallel implementation of this Volterra Neural Network (VNN) along with its remarkable performance.
The proposed approach is evaluated on the UCF-101 and HMDB-51 datasets for action recognition and is shown to outperform state-of-the-art CNN approaches.
arXiv Detail & Related papers (2019-10-21T19:22:38Z)