Related papers: SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection

SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection

URL: http://arxiv.org/abs/2601.02249v1
Date: Mon, 05 Jan 2026 16:31:41 GMT
Title: SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection
Authors: Xiantai Xiang, Guangyao Zhou, Zixiao Wen, Wenshuai Li, Ben Niu, Feng Wang, Lijia Huang, Qiantong Wang, Yuhan Liu, Zongxu Pan, Yuxin Hu,
Abstract summary: We propose SLGNet, a framework that synergizes hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT)-based foundation model.<n>SLGNet achieves an mAP of 66.1, while reducing trainable parameters by approximately 87% compared to traditional full fine-tuning.
Score: 28.779870703756668
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal object detection leveraging RGB and Infrared (IR) images is pivotal for robust perception in all-weather scenarios. While recent adapter-based approaches efficiently transfer RGB-pretrained foundation models to this task, they often prioritize model efficiency at the expense of cross-modal structural consistency. Consequently, critical structural cues are frequently lost when significant domain gaps arise, such as in high-contrast or nighttime environments. Moreover, conventional static multimodal fusion mechanisms typically lack environmental awareness, resulting in suboptimal adaptation and constrained detection performance under complex, dynamic scene variations. To address these limitations, we propose SLGNet, a parameter-efficient framework that synergizes hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT)-based foundation model. Specifically, we design a Structure-Aware Adapter to extract hierarchical structural representations from both modalities and dynamically inject them into the ViT to compensate for structural degradation inherent in ViT-based backbones. Furthermore, we propose a Language-Guided Modulation module that exploits VLM-driven structured captions to dynamically recalibrate visual features, thereby endowing the model with robust environmental awareness. Extensive experiments on the LLVIP, FLIR, KAIST, and DroneVehicle datasets demonstrate that SLGNet establishes new state-of-the-art performance. Notably, on the LLVIP benchmark, our method achieves an mAP of 66.1, while reducing trainable parameters by approximately 87% compared to traditional full fine-tuning. This confirms SLGNet as a robust and efficient solution for multimodal perception.

Related papers

StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models.<n>We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information.<n>To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
arXiv Detail & Related papers (2026-03-02T11:35:05Z)
Zero-Reference Joint Low-Light Enhancement and Deblurring via Visual Autoregressive Modeling with VLM-Derived Modulation [18.67176370944511]
Real-world dark images commonly exhibit not only low visibility and contrast but also complex noise and blur, posing significant restoration challenges.<n>We propose a generative framework based on visual autoregressive ( VAR) modeling, guided by perceptual priors from the vision-language model (VLM)<n>Our framework is fully unsupervised and achieves state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2025-11-23T19:08:45Z)
RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models [48.91205564876609]
We propose a cost-effective and highly adaptable distillation framework to enhance lightweight object detectors.<n>Our approach painlessly delivers striking and consistent performance gains across diverse DETR-based models.<n>Our new model family, RT-DETRv4, achieves state-of-the-art results on COCO, attaining AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS.
arXiv Detail & Related papers (2025-10-29T08:13:17Z)
Towards Efficient General Feature Prediction in Masked Skeleton Modeling [59.46799426434277]
We propose a novel General Feature Prediction framework (GFP) for efficient mask skeleton modeling.<n>Our key innovation is replacing conventional low-level reconstruction with high-level feature prediction that spans from local motion patterns to global semantic representations.
arXiv Detail & Related papers (2025-09-03T18:05:02Z)
Feature Complementation Architecture for Visual Place Recognition [19.779780157790423]
Visual place recognition (VPR) plays a crucial role in robotic localization and navigation.<n>Existing methods typically adopt convolutional neural networks (CNNs) or vision Transformers (ViTs) as feature extractors.<n>We propose a local-global feature complementation network (LGCN) for VPR which integrates a parallel CNN-ViT hybrid architecture with a dynamic feature fusion module (DFM)
arXiv Detail & Related papers (2025-06-14T08:32:55Z)
Structural Similarity-Inspired Unfolding for Lightweight Image Super-Resolution [88.20464308588889]
We propose a Structural Similarity-Inspired Unfolding (SSIU) method for efficient image SR.<n>This method is designed through unfolding an SR optimization function constrained by structural similarity.<n>Our model outperforms current state-of-the-art models, boasting lower parameter counts and reduced memory consumption.
arXiv Detail & Related papers (2025-06-13T14:29:40Z)
VRS-UIE: Value-Driven Reordering Scanning for Underwater Image Enhancement [104.78586859995333]
State Space Models (SSMs) have emerged as a promising backbone for vision tasks due to their linear complexity and global receptive field.<n>The predominance of large-portion, homogeneous but useless oceanic backgrounds can dilute the feature representation responses of sparse yet valuable targets.<n>We propose a novel Value-Driven Reordering Scanning framework for Underwater Image Enhancement (UIE)<n>Our framework sets a new state-of-the-art, delivering superior enhancement performance (surpassing WMamba by 0.89 dB on average) by effectively suppressing water bias and preserving structural and color fidelity.
arXiv Detail & Related papers (2025-05-02T12:21:44Z)
LSP-ST: Ladder Shape-Biased Side-Tuning for Robust Infrared Small Target Detection [4.5138645285711165]
We propose Ladder Shape-Biased Side-Tuning (LSP-ST), a novel approach that introduces a shape-aware inductive bias to facilitate effective adaptation beyond texture cues.<n>With only 4.72M learnable parameters, LSP-ST achieves state-of-the-art performance on multiple infrared small target detection benchmarks.
arXiv Detail & Related papers (2025-04-20T04:12:38Z)
Fine-Tuning Florence2 for Enhanced Object Detection in Un-constructed Environments: Vision-Language Model Approach [0.0]
We fine-tuned the Florence2 model for object detection tasks in non-constructed, complex environments.<n> optimized Florence2 models exhibited significant improvements in object detection accuracy.
arXiv Detail & Related papers (2025-03-06T19:31:51Z)
VELoRA: A Low-Rank Adaptation Approach for Efficient RGB-Event based Recognition [54.27379947727035]
This paper proposes a novel PEFT strategy to adapt the pre-trained foundation vision models for the RGB-Event-based classification.<n>The frame difference of the dual modalities is also considered to capture the motion cues via the frame difference backbone network.<n>The source code and pre-trained models will be released on urlhttps://github.com/Event-AHU/VELoRA.
arXiv Detail & Related papers (2024-12-28T07:38:23Z)
UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning [34.727262809777095]
We propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks.<n>Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (SFI) module, and a Supplementary Feature (SFI) module.<n> Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-04-26T12:21:57Z)
Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution [91.3781512926942]
Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures. This work investigates the potential of network pruning for super-resolution iteration to take advantage of off-the-shelf network designs and reduce the underlying computational overhead. We propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method by optimizing the sparse structure of a randomly network at each and tweaking unimportant weights with a small amount proportional to the magnitude scale on-the-fly.
arXiv Detail & Related papers (2023-03-16T21:06:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.