Dual-Granularity Semantic Prompting for Language Guidance Infrared Small Target Detection
- URL: http://arxiv.org/abs/2511.19306v1
- Date: Mon, 24 Nov 2025 16:58:23 GMT
- Title: Dual-Granularity Semantic Prompting for Language Guidance Infrared Small Target Detection
- Authors: Zixuan Wang, Haoran Sun, Jiaming Lu, Wenxuan Wang, Zhongling Huang, Dingwen Zhang, Xuelin Qian, Junwei Han
- Abstract summary: Infrared small target detection remains challenging due to limited feature representation and severe background interference. We propose DGSPNet, an end-to-end language prompt-driven framework. Our method significantly improves detection accuracy and achieves state-of-the-art performance on three benchmark datasets.
- Score: 102.1314414263959
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Infrared small target detection remains challenging due to limited feature representation and severe background interference, resulting in sub-optimal performance. While recent CLIP-inspired methods attempt to leverage textual guidance for detection, they are hindered by inaccurate text descriptions and reliance on manual annotations. To overcome these limitations, we propose DGSPNet, an end-to-end language prompt-driven framework. Our approach integrates dual-granularity semantic prompts: coarse-grained textual priors (e.g., 'infrared image', 'small target') and fine-grained personalized semantic descriptions derived through visual-to-textual mapping within the image space. This design not only facilitates learning fine-grained semantic information but also allows the model to leverage language prompts during inference without any annotation requirements. By fully leveraging the precision and conciseness of text descriptions, we further introduce a text-guided channel attention (TGCA) mechanism and a text-guided spatial attention (TGSA) mechanism that enhance the model's sensitivity to potential targets across both low- and high-level feature spaces. Extensive experiments demonstrate that our method significantly improves detection accuracy and achieves state-of-the-art performance on three benchmark datasets.
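As a minimal sketch of the idea behind text-guided attention (the abstract does not give implementation details), the PyTorch snippet below illustrates one plausible way a pooled prompt embedding could gate an infrared feature map along the channel and spatial dimensions. All module names, dimensions, and gating choices here are assumptions for illustration, not the authors' DGSPNet code.

```python
# Illustrative sketch only: hypothetical text-guided channel/spatial attention,
# assuming a pooled text embedding (e.g., from a CLIP-style text encoder)
# conditions an infrared feature map. Shapes and designs are assumptions.
import torch
import torch.nn as nn


class TextGuidedChannelAttention(nn.Module):
    """Reweights feature channels with a gate predicted from the text prompt."""

    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(text_dim, channels), nn.Sigmoid())

    def forward(self, feat: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W), text: (B, text_dim)
        w = self.gate(text).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return feat * w


class TextGuidedSpatialAttention(nn.Module):
    """Highlights spatial locations whose features align with the text prompt."""

    def __init__(self, channels: int, text_dim: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, text_dim, kernel_size=1)

    def forward(self, feat: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # Project features into the text space and score each location
        # by its similarity to the prompt embedding.
        f = self.proj(feat)                          # (B, D, H, W)
        sim = torch.einsum("bdhw,bd->bhw", f, text)  # (B, H, W)
        attn = torch.sigmoid(sim).unsqueeze(1)       # (B, 1, H, W)
        return feat * attn


if __name__ == "__main__":
    feat = torch.randn(2, 64, 128, 128)  # infrared feature map
    text = torch.randn(2, 512)           # pooled prompt embedding
    tgca = TextGuidedChannelAttention(64, 512)
    tgsa = TextGuidedSpatialAttention(64, 512)
    out = tgsa(tgca(feat, text), text)
    print(out.shape)  # torch.Size([2, 64, 128, 128])
```

In this sketch, the channel gate rescales feature maps by prompt-conditioned weights, while the spatial gate scores each location by its similarity to the prompt in a shared embedding space, loosely mirroring the low- and high-level sensitivity described in the abstract.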
Related papers
- Beyond Magic Words: Sharpness-Aware Prompt Evolving for Robust Large Language Models with TARE [57.67972800581953]
We introduce TARE, a derivative-free framework that alternates between an inner, sampling-based adversarial search that stresses a prompt with hard paraphrases. We also propose ATARE, which learns anisotropic weights to shape the semantic neighborhood and adapts its radius over time to balance exploration and fidelity.
arXiv Detail & Related papers (2025-09-28T23:57:05Z) - AttriPrompt: Dynamic Prompt Composition Learning for CLIP [41.37140060183439]
AttriPrompt is a novel framework that enhances and refines textual semantic representations. We introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features. Experiments demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting.
arXiv Detail & Related papers (2025-09-07T07:07:59Z) - Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z) - DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation [16.64056234334767]
Open-vocabulary semantic segmentation aims to segment images into distinct semantic regions at the pixel level. Current methods utilize text embeddings from pre-trained vision-language models like CLIP. We propose a dual prompting framework, DPSeg, for this task.
arXiv Detail & Related papers (2025-05-16T20:25:42Z) - Text-IRSTD: Leveraging Semantic Text to Promote Infrared Small Target Detection in Complex Scenes [3.399048100638418]
We introduce a novel approach leveraging semantic text to guide infrared small target detection, called Text-IRSTD. We propose a progressive cross-modal semantic interaction decoder (PCSID) to facilitate information fusion between texts and images. In addition, we construct a new benchmark consisting of 2,755 infrared images of different scenarios with fuzzy semantic textual annotations, called FZDT.
arXiv Detail & Related papers (2025-03-10T12:33:07Z) - Empowering Sparse-Input Neural Radiance Fields with Dual-Level Semantic Guidance from Dense Novel Views [66.1245505423179]
We show that rendered semantics can be treated as a more robust form of augmented data than rendered RGB. Our method enhances NeRF's performance by incorporating guidance derived from the rendered semantics.
arXiv Detail & Related papers (2025-03-04T03:13:44Z) - T2ICount: Enhancing Cross-modal Understanding for Zero-Shot Counting [30.004769932953952]
Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models.
arXiv Detail & Related papers (2025-02-28T01:09:18Z) - TextSleuth: Towards Explainable Tampered Text Detection [49.88698441048043]
We propose to explain the basis of tampered text detection with natural language via large multimodal models. To fill the data gap for this task, we propose a large-scale, comprehensive dataset, ETTD. Elaborate queries are introduced to generate high-quality anomaly descriptions with GPT4o. To automatically filter out low-quality annotations, we also propose to prompt GPT4o to recognize tampered texts.
arXiv Detail & Related papers (2024-12-19T13:10:03Z) - TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z) - Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer [21.479222207347238]
We introduce TextTranSpotter (TTS), a transformer-based approach for text spotting.
TTS is trained with both fully- and weakly-supervised settings.
When trained in a fully-supervised manner, TextTranSpotter shows state-of-the-art results on multiple benchmarks.
arXiv Detail & Related papers (2022-02-11T08:50:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.