IV-tuning: Parameter-Efficient Transfer Learning for Infrared-Visible Tasks
- URL: http://arxiv.org/abs/2412.16654v3
- Date: Sat, 02 Aug 2025 09:33:14 GMT
- Title: IV-tuning: Parameter-Efficient Transfer Learning for Infrared-Visible Tasks
- Authors: Yaming Zhang, Chenqiang Gao, Fangcen Liu, Junjie Guo, Lan Wang, Xinggan Peng, Deyu Meng
- Abstract summary: Under the full fine-tuning paradigm, the feature space becomes highly constrained and low-ranked, which has been proven to seriously impair generalization. We propose IV-tuning to parameter-efficiently harness PVMs for various IR-VIS downstream tasks, including salient object detection, semantic segmentation, and object detection. Compared with the full fine-tuning baselines and existing IR-VIS methods, IV-tuning facilitates the learning of complementary information between infrared and visible modalities with less than 3% of the backbone parameters.
- Score: 47.08388430506686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing infrared and visible (IR-VIS) methods inherit the general representations of Pre-trained Visual Models (PVMs) to facilitate complementary learning. However, our analysis indicates that under the full fine-tuning paradigm, the feature space becomes highly constrained and low-ranked, which has been proven to seriously impair generalization. One solution is freezing parameters to preserve pre-trained knowledge and thus maintain diversity of the feature space. To this end, we propose IV-tuning to parameter-efficiently harness PVMs for various IR-VIS downstream tasks, including salient object detection, semantic segmentation, and object detection. Compared with the full fine-tuning baselines and existing IR-VIS methods, IV-tuning facilitates the learning of complementary information between infrared and visible modalities with less than 3% of the backbone parameters, and effectively alleviates the overfitting problem. The code is available at https://github.com/Yummy198913/IV-tuning.
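As a rough illustration of the parameter-efficient recipe described in the abstract (a frozen pre-trained backbone plus a small set of trainable modules that fuse the infrared and visible streams), here is a minimal PyTorch sketch. The adapter design, module names, and sizes are illustrative assumptions, not the paper's actual architecture.

```python
# Minimal sketch (not the authors' implementation): freeze a pre-trained
# backbone and train only small per-block adapters that mix infrared (IR)
# and visible (VIS) tokens. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class ModalMixAdapter(nn.Module):
    """Bottleneck adapter that lets VIS tokens absorb information from IR tokens."""

    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(2 * dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        # vis, ir: (B, N, dim). Concatenate the modalities and add a small residual.
        fused = torch.cat([vis, ir], dim=-1)
        return vis + self.up(self.act(self.down(fused)))


def freeze_backbone(backbone: nn.Module) -> None:
    """Preserve pre-trained knowledge by freezing every backbone parameter."""
    for p in backbone.parameters():
        p.requires_grad = False


def trainable_fraction(model: nn.Module) -> float:
    """Fraction of parameters that will actually be updated during tuning."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total
```

With a bottleneck of 16 and a hidden size of 768, one such adapter per block adds roughly half a percent of a ViT-B backbone in total, comfortably inside the sub-3% budget quoted above.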
Related papers
- One RL to See Them All: Visual Triple Unified Reinforcement Learning [92.90120580989839]
We propose V-Triune, a Visual Triple Unified Reinforcement Learning system that enables visual reasoning and perception tasks within a single training pipeline. V-Triune comprises three complementary components, among them a Sample-Level Datashelf (to unify diverse task inputs) and a Verifier-Level Reward (to deliver custom rewards via specialized verifiers). We introduce a novel Dynamic IoU reward, which provides adaptive, progressive, and definite feedback for perception tasks handled by V-Triune.
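The Dynamic IoU reward is only named above, so the sketch below is a guess at its spirit rather than the actual V-Triune implementation: an IoU-based reward whose acceptance threshold tightens as training progresses. The threshold schedule and function names are assumptions.

```python
# Hedged sketch of an IoU-based reward with a training-step-dependent threshold.
# The real Dynamic IoU schedule is not specified in the summary above; the
# progressive threshold below is an illustrative assumption.
import torch


def box_iou(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """IoU for boxes given as (x1, y1, x2, y2), both of shape (N, 4)."""
    x1 = torch.maximum(pred[:, 0], gt[:, 0])
    y1 = torch.maximum(pred[:, 1], gt[:, 1])
    x2 = torch.minimum(pred[:, 2], gt[:, 2])
    y2 = torch.minimum(pred[:, 3], gt[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_g = (gt[:, 2] - gt[:, 0]) * (gt[:, 3] - gt[:, 1])
    return inter / (area_p + area_g - inter + 1e-6)


def dynamic_iou_reward(pred: torch.Tensor, gt: torch.Tensor,
                       step: int, total_steps: int) -> torch.Tensor:
    # Assumed schedule: the acceptance threshold tightens from 0.5 to 0.95
    # over training, so feedback becomes progressively stricter.
    thr = 0.5 + 0.45 * (step / max(total_steps, 1))
    iou = box_iou(pred, gt)
    return torch.where(iou >= thr, iou, torch.zeros_like(iou))
```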
arXiv Detail & Related papers (2025-05-23T17:41:14Z) - AuxDet: Auxiliary Metadata Matters for Omni-Domain Infrared Small Target Detection [58.67129770371016]
We propose a novel IRSTD framework that reimagines the IRSTD paradigm by incorporating textual metadata for scene-aware optimization. AuxDet consistently outperforms state-of-the-art methods, validating the critical role of auxiliary information in improving robustness and accuracy.
arXiv Detail & Related papers (2025-05-21T07:02:05Z) - Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation [5.326302374594885]
Foreground segmentation is crucial for scene understanding, yet parameter-efficient fine-tuning (PEFT) of vision foundation models (VFMs) often fails in complex scenarios.
We propose Ladder Shape-bias Representation Side-tuning (LSR-ST), a lightweight PEFT framework that enhances model robustness by introducing shape-biased inductive priors.
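To make the side-tuning idea concrete, here is a hedged sketch of a ladder-style side branch: the vision foundation model stays frozen and only a lightweight parallel branch that consumes its intermediate features is trained. The shape-bias mechanism of LSR-ST is not modelled here, and all module choices are assumptions.

```python
# Hedged sketch of ladder-style side-tuning: a frozen backbone emits
# intermediate feature maps, and a small trainable side branch merges them
# from deep to shallow. Channel counts and layers are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LadderSideBranch(nn.Module):
    def __init__(self, stage_channels: list[int], out_dim: int = 64):
        super().__init__()
        # One 1x1 projection per backbone stage plus a tiny refinement conv.
        self.proj = nn.ModuleList(nn.Conv2d(c, out_dim, 1) for c in stage_channels)
        self.refine = nn.Conv2d(out_dim, out_dim, 3, padding=1)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats[i]: (B, stage_channels[i], H_i, W_i), ordered shallow -> deep.
        x = self.proj[-1](feats[-1])
        for proj, feat in zip(list(self.proj)[-2::-1], feats[-2::-1]):
            x = F.interpolate(x, size=feat.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = x + proj(feat)  # ladder connection from the frozen backbone
        return self.refine(x)
```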
arXiv Detail & Related papers (2025-04-20T04:12:38Z) - DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding [43.85632218045282]
We introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM).
PLM features an adaptive diffusion model architecture that leverages multi-stage knowledge learning to transition from full-range to target-wavelength infrared imagery.
VLUM incorporates unified Vision-Language Understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images covering various scenes and objects under diverse environmental conditions.
arXiv Detail & Related papers (2025-03-24T17:58:09Z) - BRIGHT-VO: Brightness-Guided Hybrid Transformer for Visual Odometry with Multi-modality Refinement Module [11.898515581215708]
Visual odometry (VO) plays a crucial role in autonomous driving, robotic navigation, and other related tasks.
We introduce BrightVO, a novel VO model based on Transformer architecture, which performs front-end visual feature extraction.
Using pose graph optimization, this module iteratively refines pose estimates to reduce errors and improve both accuracy and robustness.
arXiv Detail & Related papers (2025-01-15T08:50:52Z) - Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
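Since the summary above centres on injecting the Fast Fourier Transform into prompt embeddings, the following hedged sketch shows one way that could look: a bank of learnable prompts in which a fraction is mapped through a 2D FFT so that both spatial- and frequency-domain information enter the prompt sequence. The split ratio and the use of the real part are assumptions, not the VFPT design.

```python
# Hedged sketch: a fraction of the learnable prompt tokens is mapped to the
# frequency domain with a 2D FFT before being used alongside the untouched
# ("spatial") prompts. Ratio and real-part choice are assumptions.
import torch
import torch.nn as nn


class FourierPrompts(nn.Module):
    def __init__(self, num_prompts: int = 10, dim: int = 768,
                 fourier_ratio: float = 0.5):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.num_fourier = int(num_prompts * fourier_ratio)

    def forward(self) -> torch.Tensor:
        plain = self.prompts[self.num_fourier:]
        # 2D FFT over the (token, channel) axes; taking the real part keeps the
        # result real-valued and shape-compatible with the plain prompts.
        freq = torch.fft.fft2(self.prompts[: self.num_fourier]).real
        return torch.cat([plain, freq], dim=0)  # (num_prompts, dim)
```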
arXiv Detail & Related papers (2024-11-02T18:18:35Z) - Parameter Hierarchical Optimization for Visible-Infrared Person Re-Identification [0.6675805308519986]
Visible-infrared person re-identification (VI-ReID) aims at matching cross-modality pedestrian images captured by disjoint visible or infrared cameras.
We propose a novel parameter optimizing paradigm, the parameter hierarchical optimization (PHO) method, for the task of VI-ReID.
It allows part of the parameters to be directly optimized without any training, which narrows the parameter search space and makes the whole network easier to train.
arXiv Detail & Related papers (2024-04-11T17:27:39Z) - HAPNet: Toward Superior RGB-Thermal Scene Parsing via Hybrid, Asymmetric, and Progressive Heterogeneous Feature Fusion [15.538174593176166]
In this study, we explore a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing.
Specifically, we design a hybrid, asymmetric encoder that incorporates both a VFM and a convolutional neural network.
This design allows for more effective extraction of complementary heterogeneous features, which are subsequently fused in a dual-path, progressive manner.
arXiv Detail & Related papers (2024-04-04T15:31:11Z) - ViTGaze: Gaze Following with Interaction Features in Vision Transformers [42.08842391756614]
We introduce ViTGaze, a novel single-modality gaze following framework.
In contrast to previous methods, it is built mainly on powerful encoders.
Our method achieves state-of-the-art (SOTA) performance among all single-modality methods.
arXiv Detail & Related papers (2024-03-19T14:45:17Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x less training batches to reach the target performance compared to full finetuning.
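As a simplified stand-in for the objective-aware selection described above (not VeCAF's actual parametric procedure, which also uses natural language annotations), the sketch below scores candidate samples by the current fine-tuning loss and keeps the top-k for the next round.

```python
# Hedged sketch of objective-aware data selection: rank candidates by their
# per-sample fine-tuning loss and keep the k hardest ones. This is a
# simplification, not VeCAF's actual selection procedure.
import torch
import torch.nn as nn
import torch.nn.functional as F


@torch.no_grad()
def select_topk_by_loss(model: nn.Module, images: torch.Tensor,
                        labels: torch.Tensor, k: int) -> torch.Tensor:
    """Return indices of the k samples with the highest classification loss."""
    model.eval()
    logits = model(images)
    losses = F.cross_entropy(logits, labels, reduction="none")
    return torch.topk(losses, k).indices
```

Under this proxy, high-loss samples are the ones expected to move the model fastest toward the fine-tuning objective.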
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - Transferring Modality-Aware Pedestrian Attentive Learning for Visible-Infrared Person Re-identification [43.05147831905626]
We propose a novel Transferring Modality-Aware Pedestrian Attentive Learning (TMPA) model.
TMPA focuses on the pedestrian regions to efficiently compensate for missing modality-specific features.
Experiments conducted on the benchmark SYSU-MM01 and RegDB datasets demonstrate the effectiveness of our proposed TMPA model.
arXiv Detail & Related papers (2023-12-12T07:15:17Z) - Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing [19.142582966452935]
We investigate three key factors (i.e., inputs, pre-training, and finetuning) in ViT for multimodal FAS with RGB, Infrared (IR), and Depth.
We propose the modality-asymmetric masked autoencoder (M$^2$A$^2$E) for multimodal FAS self-supervised pre-training without costly annotated labels.
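The term "modality-asymmetric" is not unpacked above, so the sketch below is only one plausible reading: per sample, some modalities are hidden entirely so the autoencoder must reconstruct them from the ones that remain. This simplification is an assumption, not the paper's exact masking strategy.

```python
# Hedged sketch of modality-asymmetric masking: for each sample, hide a random
# subset of modalities entirely (on top of any patch-level masking) so the
# autoencoder must reconstruct them from the remaining modalities.
import torch


def asymmetric_modality_mask(batch: dict, keep_at_least: int = 1) -> dict:
    """batch maps modality name (e.g. 'rgb', 'ir', 'depth') -> (B, C, H, W) tensor."""
    names = list(batch.keys())
    b = next(iter(batch.values())).shape[0]
    masked = {m: t.clone() for m, t in batch.items()}
    for i in range(b):
        # Drop a random number of modalities, always keeping at least `keep_at_least`.
        n_drop = int(torch.randint(0, len(names) - keep_at_least + 1, (1,)))
        drop = torch.randperm(len(names))[:n_drop].tolist()
        for j in drop:
            masked[names[j]][i].zero_()  # hide this modality for sample i
    return masked
```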
arXiv Detail & Related papers (2023-02-11T17:02:34Z) - Towards a Unified View on Visual Parameter-Efficient Transfer Learning [96.99924127527002]
We propose a framework with a unified view called visual-PETL (V-PETL) to investigate the different aspects affecting the trade-off.
An effective scheme, Swin-BAPAT, derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin.
arXiv Detail & Related papers (2022-10-03T09:54:39Z) - CycleTrans: Learning Neutral yet Discriminative Features for Visible-Infrared Person Re-Identification [79.84912525821255]
Visible-infrared person re-identification (VI-ReID) is a task of matching the same individuals across the visible and infrared modalities.
Existing VI-ReID methods mainly focus on learning general features across modalities, often at the expense of feature discriminability.
We present a novel cycle-construction-based network for neutral yet discriminative feature learning, termed CycleTrans.
arXiv Detail & Related papers (2022-08-21T08:41:40Z) - Inertial Hallucinations -- When Wearable Inertial Devices Start Seeing Things [82.15959827765325]
We propose a novel approach to multimodal sensor fusion for Ambient Assisted Living (AAL).
We address two major shortcomings of standard multimodal approaches: limited area coverage and reduced reliability.
Our new framework fuses the concept of modality hallucination with triplet learning to train a model with different modalities to handle missing sensors at inference time.
arXiv Detail & Related papers (2022-07-14T10:04:18Z) - Visual Prompt Tuning [74.5309408185523]
This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision.
VPT introduces only a small number of trainable parameters (less than 1% of model parameters) in the input space while keeping the model backbone frozen.
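The two sentences above contain the essential mechanism, so a short sketch is enough to make it concrete: learnable tokens are prepended to the patch embeddings while the Transformer backbone is kept frozen. Token count, initialisation, and the shallow (input-only) variant shown here are illustrative choices.

```python
# Minimal sketch of shallow visual prompt tuning: learnable tokens are
# prepended to the patch embeddings and only they (plus any task head) are
# trained; the pre-trained backbone stays frozen.
import torch
import torch.nn as nn


class ShallowVisualPrompt(nn.Module):
    def __init__(self, backbone: nn.Module, dim: int, num_prompts: int = 50):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False  # keep the pre-trained weights intact
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # patch_embeddings: (B, N, dim); the backbone must accept token sequences.
        tokens = torch.cat(
            [self.prompts.expand(patch_embeddings.shape[0], -1, -1), patch_embeddings],
            dim=1,
        )
        return self.backbone(tokens)
```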
arXiv Detail & Related papers (2022-03-23T01:17:16Z) - On Exploring Pose Estimation as an Auxiliary Learning Task for Visible-Infrared Person Re-identification [66.58450185833479]
In this paper, we exploit Pose Estimation as an auxiliary learning task to assist the VI-ReID task in an end-to-end framework.
By jointly training these two tasks in a mutually beneficial manner, our model learns higher quality modality-shared and ID-related features.
Experimental results on two benchmark VI-ReID datasets show that the proposed method consistently improves state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-01-11T09:44:00Z) - Fully Differentiable and Interpretable Model for VIO with 4 Trainable Parameters [16.347927939872488]
Monocular visual-inertial odometry is a critical problem in robotics and autonomous driving.
In this paper, we propose a fully differentiable, interpretable, and lightweight monocular VIO model that contains only 4 trainable parameters.
Experimental results on synthetic and real-world datasets demonstrate that our simple approach is competitive with state-of-the-art methods.
arXiv Detail & Related papers (2021-09-25T06:54:09Z) - CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models [101.5066760592534]
We present Cross-modal Prompt Tuning (CPT), a novel paradigm for tuning Vision-Language Pre-trained Models (VL-PTMs).
CPT reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap.
Comprehensive experimental results show that prompt tuned VL-PTMs outperform their fine-tuned counterparts by a large margin.
arXiv Detail & Related papers (2021-09-24T08:07:29Z) - Neural Feature Search for RGB-Infrared Person Re-Identification [3.499870393443268]
We study a general paradigm, termed Neural Feature Search (NFS), to automate the process of feature selection.
NFS combines a dual-level feature search space and a differentiable search strategy to jointly select identity-related cues in coarse-grained channels and fine-grained spatial pixels.
Our method outperforms state-of-the-art methods on mainstream benchmarks.
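For intuition about selecting identity-related cues in coarse-grained channels and fine-grained spatial pixels, here is a hedged sketch in which learnable gates over channels and positions are trained jointly with the network. This simple sigmoid gating is a stand-in for, not a reproduction of, NFS's dual-level differentiable search strategy.

```python
# Hedged sketch of dual-level differentiable feature selection: learnable
# logits gate channels (coarse-grained) and spatial positions (fine-grained),
# and both gates are trained end-to-end with the backbone.
import torch
import torch.nn as nn


class DualLevelGate(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.channel_logits = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.spatial_conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W). Soft masks keep the selection fully differentiable.
        channel_mask = torch.sigmoid(self.channel_logits)
        spatial_mask = torch.sigmoid(self.spatial_conv(x))
        return x * channel_mask * spatial_mask
```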
arXiv Detail & Related papers (2021-04-06T08:40:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.