SPIRIT: Adapting Vision Foundation Models for Unified Single- and Multi-Frame Infrared Small Target Detection
- URL: http://arxiv.org/abs/2602.01843v1
- Date: Mon, 02 Feb 2026 09:15:29 GMT
- Title: SPIRIT: Adapting Vision Foundation Models for Unified Single- and Multi-Frame Infrared Small Target Detection
- Authors: Qian Xu, Xi Li, Fei Gao, Jie Guo, Haojuan Yuan, Shuaipeng Fan, Mingjin Zhang
- Abstract summary: Infrared small target detection (IRSTD) is crucial for surveillance and early-warning, with deployments spanning both single-frame analysis and video-mode tracking. We propose SPIRIT, a unified and VFM-compatible framework that adapts VFMs to IRSTD via lightweight physics-informed plug-ins.
- Score: 18.86422994684341
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Infrared small target detection (IRSTD) is crucial for surveillance and early-warning, with deployments spanning both single-frame analysis and video-mode tracking. A practical solution should leverage vision foundation models (VFMs) to mitigate infrared data scarcity, while adopting a memory-attention-based temporal propagation framework that unifies single- and multi-frame inference. However, infrared small targets exhibit weak radiometric signals and limited semantic cues, which differ markedly from visible-spectrum imagery. This modality gap makes direct use of semantics-oriented VFMs and appearance-driven cross-frame association unreliable for IRSTD: hierarchical feature aggregation can submerge localized target peaks, and appearance-only memory attention becomes ambiguous, leading to spurious clutter associations. To address these challenges, we propose SPIRIT, a unified and VFM-compatible framework that adapts VFMs to IRSTD via lightweight physics-informed plug-ins. Spatially, PIFR refines features by approximating rank-sparsity decomposition to suppress structured background components and enhance sparse target-like signals. Temporally, PGMA injects history-derived soft spatial priors into memory cross-attention to constrain cross-frame association, enabling robust video detection while naturally reverting to single-frame inference when temporal context is absent. Experiments on multiple IRSTD benchmarks show consistent gains over VFM-based baselines and SOTA performance.
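The rank-sparsity prior behind PIFR can be illustrated with a minimal RPCA-style sketch. This is not the authors' implementation: the function name `rank_sparsity_split`, the thresholds, and the toy scene are illustrative assumptions. It alternates singular-value thresholding (fitting the structured, low-rank background) with element-wise soft-thresholding (isolating sparse, target-like residuals), which is the decomposition the abstract says PIFR approximates.

```python
import numpy as np

def rank_sparsity_split(patch, svt_tau=5.0, soft_tau=0.5, iters=20):
    """Split an infrared patch D into a low-rank background L (structured
    clutter such as sky gradients) plus a sparse residual S (candidate
    small targets), by alternating two shrinkage steps."""
    S = np.zeros_like(patch)
    L = np.zeros_like(patch)
    for _ in range(iters):
        # Low-rank step: shrink the singular values of D - S,
        # keeping only strong structured background components.
        U, sig, Vt = np.linalg.svd(patch - S, full_matrices=False)
        L = (U * np.maximum(sig - svt_tau, 0.0)) @ Vt
        # Sparse step: soft-threshold the residual D - L, so only
        # isolated bright deviations survive as target candidates.
        R = patch - L
        S = np.sign(R) * np.maximum(np.abs(R) - soft_tau, 0.0)
    return L, S

# Toy scene: a smooth rank-1 background gradient plus one bright point target.
bg = 10.0 * np.outer(np.linspace(0.0, 1.0, 32), np.linspace(0.0, 1.0, 32))
scene = bg.copy()
scene[8, 20] += 20.0  # the small target
L, S = rank_sparsity_split(scene)
print(np.unravel_index(np.argmax(S), S.shape))  # the target location dominates S
```

On this toy scene the background gradient is absorbed into L while the point target survives in S, which is the behavior PIFR aims for: suppressing structured clutter that would otherwise submerge localized target peaks in hierarchical VFM features.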
Related papers
- Breaking Self-Attention Failure: Rethinking Query Initialization for Infrared Small Target Detection [22.128797773091403]
Infrared small target detection (IRSTD) faces significant challenges due to the low signal-to-noise ratio (SNR), small target size, and complex cluttered backgrounds. Recent DETR-based detectors exhibit notable performance degradation on IRSTD.
arXiv Detail & Related papers (2026-01-06T09:14:01Z) - Rethinking Infrared Small Target Detection: A Foundation-Driven Efficient Paradigm [17.63632082331749]
Large-scale visual foundation models (VFMs) exhibit strong generalization across diverse visual domains, but their potential for single-frame infrared small target (SIRST) detection remains largely unexplored. We propose a Foundation-Driven Efficient Paradigm (FDEP) which can seamlessly adapt to existing encoder-decoder-based methods and significantly improve accuracy without additional inference overhead.
arXiv Detail & Related papers (2025-12-05T08:12:35Z) - IrisNet: Infrared Image Status Awareness Meta Decoder for Infrared Small Targets Detection [92.56025546608699]
IrisNet is a novel meta-learned framework that adapts detection strategies to the input infrared image status. Our approach establishes a dynamic mapping between infrared image features and entire decoder parameters. Experiments on NUDT-SIRST, NUAA-SIRST, and IRSTD-1K datasets demonstrate the superiority of our IrisNet.
arXiv Detail & Related papers (2025-11-25T13:53:54Z) - Rethinking Diffusion Model-Based Video Super-Resolution: Leveraging Dense Guidance from Aligned Features [51.5076190312734]
Video Super-Resolution approaches suffer from error accumulation, spatial artifacts, and a trade-off between perceptual quality and fidelity. We propose a novel Densely Guided diffusion model with Aligned Features for Video Super-Resolution (DGAF-VSR). Experiments on synthetic and real-world datasets demonstrate that DGAF-VSR surpasses state-of-the-art methods in key aspects of VSR.
arXiv Detail & Related papers (2025-11-21T03:40:45Z) - Source-Free Object Detection with Detection Transformer [59.33653163035064]
Source-Free Object Detection (SFOD) enables knowledge transfer from a source domain to an unsupervised target domain for object detection without access to source data. Most existing SFOD approaches are either confined to conventional object detection (OD) models like Faster R-CNN or designed as general solutions without tailored adaptations for novel OD architectures, especially Detection Transformer (DETR). In this paper, we introduce Feature Reweighting ANd Contrastive Learning NetworK (FRANCK), a novel SFOD framework specifically designed to perform query-centric feature enhancement for DETRs.
arXiv Detail & Related papers (2025-10-13T07:35:04Z) - VFM-Guided Semi-Supervised Detection Transformer under Source-Free Constraints for Remote Sensing Object Detection [9.029534000674388]
VG-DETR integrates a Vision Foundation Model (VFM) into the training pipeline in a "free lunch" manner. We introduce a VFM-guided pseudo-label mining strategy that leverages the VFM's semantic priors to assess the reliability of the generated pseudo-labels. In addition, a dual-level VFM-guided alignment method is proposed, which aligns detector features with VFM embeddings at both the instance and image levels.
arXiv Detail & Related papers (2025-08-15T02:35:56Z) - It's Not the Target, It's the Background: Rethinking Infrared Small Target Detection via Deep Patch-Free Low-Rank Representations [5.326302374594885]
In this paper, we propose a novel end-to-end IRSTD framework, termed LRRNet. Inspired by the physical compressibility of cluttered scenes, our approach adopts a compression-reconstruction-subtraction paradigm. Experiments on multiple public datasets demonstrate that LRRNet outperforms 38 state-of-the-art methods in terms of detection accuracy, robustness, and computational efficiency.
arXiv Detail & Related papers (2025-06-12T07:24:45Z) - AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection [49.81255045696323]
We present the Auxiliary Metadata Driven Infrared Small Target Detector (AuxDet). AuxDet integrates metadata semantics with visual features, guiding adaptive representation learning for each sample. Experiments on the challenging WideIRSTD-Full benchmark demonstrate that AuxDet consistently outperforms state-of-the-art methods.
arXiv Detail & Related papers (2025-05-21T07:02:05Z) - Multi-Domain Biometric Recognition using Body Embeddings [51.36007967653781]
We show that body embeddings perform better than face embeddings in medium-wave infrared (MWIR) and long-wave infrared (LWIR) domains. We leverage a vision transformer architecture to establish benchmark results on the IJB-MDF dataset. We also show that finetuning a body model, pretrained exclusively on VIS data, with a simple combination of cross-entropy and triplet losses achieves state-of-the-art mAP scores.
arXiv Detail & Related papers (2025-03-13T22:38:18Z) - Frequency Domain Modality-invariant Feature Learning for
Visible-infrared Person Re-Identification [79.9402521412239]
We propose a novel Frequency Domain modality-invariant feature learning framework (FDMNet) to reduce modality discrepancy from the frequency domain perspective.
Our framework introduces two novel modules, namely the Instance-Adaptive Amplitude Filter (IAF) and the Phase-Preserving Normalization (PPNorm).
arXiv Detail & Related papers (2024-01-03T17:11:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.