IV-tuning: Parameter-Efficient Transfer Learning for Infrared-Visible Tasks
- URL: http://arxiv.org/abs/2412.16654v2
- Date: Tue, 18 Mar 2025 07:52:24 GMT
- Title: IV-tuning: Parameter-Efficient Transfer Learning for Infrared-Visible Tasks
- Authors: Yaming Zhang, Chenqiang Gao, Fangcen Liu, Junjie Guo, Lan Wang, Xinggan Peng, Deyu Meng
- Abstract summary: "IV-tuning" is a novel and general fine-tuning approach to parameter-efficiently harness PVMs for infrared-visible tasks. At its core, IV-tuning freezes pre-trained visible-based PVMs and integrates infrared flow into modal prompts to interact with adapters. By fine-tuning approximately 3% of the backbone parameters, IV-tuning outperforms full fine-tuning and previous state-of-the-art methods.
- Score: 47.08388430506686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Various infrared-visible (IR-VIS) tasks benefit greatly from combining the infrared and visible modalities. Motivated by streamlining the infrared flow and harnessing pre-trained vision models (PVMs) with fewer parameters for superior performance, we propose "IV-tuning", a novel and general fine-tuning approach that parameter-efficiently harnesses PVMs for various infrared-visible downstream tasks. At its core, IV-tuning freezes pre-trained visible-based PVMs and integrates the infrared flow into modal prompts to interact with adapters, which achieves a more efficient and general modal interaction paradigm. By fine-tuning approximately 3% of the backbone parameters, IV-tuning outperforms full fine-tuning and previous state-of-the-art methods across multiple baselines and tasks, including IR-VIS salient object detection, semantic segmentation, and object detection. Extensive experiments demonstrate that IV-tuning achieves superior performance with fewer trainable parameters, providing a good alternative to full fine-tuning and a novel way of extending visible-based models to infrared-visible tasks. The code will be provided in the supplementary material.
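The paper's code is not included here; as a rough orientation, the following PyTorch-style sketch shows the generic recipe the abstract describes: a visible-pre-trained backbone kept frozen, infrared tokens condensed into modal prompts, and small bottleneck adapters as the only trainable pieces. All class names (Adapter, ModalPrompter, IVTuningBlock), shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.
```python
# Illustrative sketch only -- not the authors' released code. It mirrors the
# recipe the abstract describes: the visible-pre-trained backbone stays frozen,
# infrared features become modal prompts, and small adapters are trainable.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, non-linearity, up-project, residual."""

    def __init__(self, dim, reduction=8):
        super().__init__()
        self.down = nn.Linear(dim, dim // reduction)
        self.up = nn.Linear(dim // reduction, dim)
        self.act = nn.GELU()

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))


class ModalPrompter(nn.Module):
    """Condenses infrared tokens into a few prompt tokens (assumed design)."""

    def __init__(self, dim, num_prompts=8, num_heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.zeros(1, num_prompts, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, ir_tokens):
        q = self.query.expand(ir_tokens.size(0), -1, -1)
        prompts, _ = self.attn(q, ir_tokens, ir_tokens)
        return prompts


class IVTuningBlock(nn.Module):
    """Wraps one frozen transformer block with a prompter and an adapter."""

    def __init__(self, frozen_block, dim):
        super().__init__()
        self.block = frozen_block           # pre-trained on visible data, frozen
        self.prompter = ModalPrompter(dim)  # trainable
        self.adapter = Adapter(dim)         # trainable
        for p in self.block.parameters():
            p.requires_grad = False

    def forward(self, vis_tokens, ir_tokens):
        prompts = self.prompter(ir_tokens)           # infrared flow -> modal prompts
        x = torch.cat([prompts, vis_tokens], dim=1)  # prompts interact inside the block
        x = self.adapter(self.block(x))              # frozen block + light adapter
        return x[:, prompts.size(1):], ir_tokens     # strip prompts for the next stage
```
In such a setup only the prompter and adapter parameters would be optimized; with typical bottleneck and prompt sizes this stays in the low single-digit percentage of the backbone, consistent with the roughly 3% figure quoted above.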
Related papers
- Vision-Centric Representation-Efficient Fine-Tuning for Robust Universal Foreground Segmentation [5.326302374594885]
Foreground segmentation is crucial for scene understanding, yet parameter-efficient fine-tuning (PEFT) of vision foundation models (VFMs) often fails in complex scenarios.
We propose Ladder Shape-bias Representation Side-tuning (LSR-ST), a lightweight PEFT framework that enhances model robustness by introducing shape-biased inductive priors.
arXiv Detail & Related papers (2025-04-20T04:12:38Z) - DiffV2IR: Visible-to-Infrared Diffusion Model via Vision-Language Understanding [43.85632218045282]
We introduce DiffV2IR, a novel framework for image translation comprising two key elements: a Progressive Learning Module (PLM) and a Vision-Language Understanding Module (VLUM).
PLM features an adaptive diffusion model architecture that leverages multi-stage knowledge learning for the infrared transition from full-range to target wavelengths.
VLUM incorporates unified vision-language understanding. We also collected a large infrared dataset, IR-500K, which includes 500,000 infrared images covering various scenes and objects under various environmental conditions.
arXiv Detail & Related papers (2025-03-24T17:58:09Z) - BRIGHT-VO: Brightness-Guided Hybrid Transformer for Visual Odometry with Multi-modality Refinement Module [11.898515581215708]
Visual odometry (VO) plays a crucial role in autonomous driving, robotic navigation, and other related tasks.
We introduce BrightVO, a novel VO model based on a Transformer architecture, which performs front-end visual feature extraction.
Using pose graph optimization, this module iteratively refines pose estimates to reduce errors and improve both accuracy and robustness.
arXiv Detail & Related papers (2025-01-15T08:50:52Z) - Visual Fourier Prompt Tuning [63.66866445034855]
We propose the Visual Fourier Prompt Tuning (VFPT) method as a general and effective solution for adapting large-scale transformer-based models.
Our approach incorporates the Fast Fourier Transform into prompt embeddings and harmoniously considers both spatial and frequency domain information.
Our results demonstrate that our approach outperforms current state-of-the-art baselines on two benchmarks.
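As a rough illustration of the mechanism summarized above (prompt embeddings that carry frequency-domain information via a Fast Fourier Transform alongside the usual spatial view), a minimal sketch under assumed details might look like this; the class name FourierPrompt, the split ratio, and the prepending usage are assumptions, not the VFPT implementation.
```python
# Illustrative sketch of Fourier-augmented visual prompts; assumed details only.
import torch
import torch.nn as nn


class FourierPrompt(nn.Module):
    """Learnable prompt tokens in which part of each token is replaced by its FFT
    magnitude, so the prompt carries frequency- as well as spatial-domain cues."""

    def __init__(self, num_prompts=10, dim=768, fourier_ratio=0.5):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.split = int(dim * fourier_ratio)  # channels routed through the FFT

    def forward(self, batch_size):
        p = self.prompts.expand(batch_size, -1, -1)
        freq = torch.fft.fft(p[..., :self.split], dim=-1).abs()  # frequency view
        return torch.cat([freq, p[..., self.split:]], dim=-1)

# Usage (assumed): prepend to the frozen backbone's patch tokens, e.g.
# tokens = torch.cat([prompt_module(x.size(0)), patch_tokens], dim=1)
```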
arXiv Detail & Related papers (2024-11-02T18:18:35Z) - HAPNet: Toward Superior RGB-Thermal Scene Parsing via Hybrid, Asymmetric, and Progressive Heterogeneous Feature Fusion [15.538174593176166]
In this study, we explore a feasible strategy to fully exploit VFM features for RGB-thermal scene parsing.
Specifically, we design a hybrid, asymmetric encoder that incorporates both a VFM and a convolutional neural network.
This design allows for more effective extraction of complementary heterogeneous features, which are subsequently fused in a dual-path, progressive manner.
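For readers who want a concrete picture of the dual-path, progressive fusion idea summarized above, here is a minimal sketch of one fusion stage under assumed shapes; ProgressiveFusion and all layer choices are illustrative, not the HAPNet architecture.
```python
# Rough sketch of one fusion stage for a hybrid, asymmetric two-branch encoder
# (VFM features + CNN features, merged progressively). Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProgressiveFusion(nn.Module):
    def __init__(self, vfm_dim, cnn_dim, out_dim):
        super().__init__()
        self.proj_vfm = nn.Conv2d(vfm_dim, out_dim, kernel_size=1)  # align VFM channels
        self.proj_cnn = nn.Conv2d(cnn_dim, out_dim, kernel_size=1)  # align CNN channels
        self.mix = nn.Sequential(
            nn.Conv2d(out_dim * 2, out_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, vfm_feat, cnn_feat, prev=None):
        x = torch.cat([self.proj_vfm(vfm_feat), self.proj_cnn(cnn_feat)], dim=1)
        x = self.mix(x)
        if prev is not None:  # progressive: fold in the previous stage's fused map
            x = x + F.interpolate(prev, size=x.shape[-2:], mode="bilinear",
                                  align_corners=False)
        return x
```
The sketch assumes the two branches have already been resampled to the same spatial resolution before fusion.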
arXiv Detail & Related papers (2024-04-04T15:31:11Z) - ViTGaze: Gaze Following with Interaction Features in Vision Transformers [42.08842391756614]
We introduce a novel single-modality gaze following framework called ViTGaze.
In contrast to previous methods, it builds its gaze following framework mainly on powerful encoders.
Our method achieves state-of-the-art (SOTA) performance among all single-modality methods.
arXiv Detail & Related papers (2024-03-19T14:45:17Z) - VeCAF: Vision-language Collaborative Active Finetuning with Training Objective Awareness [56.87603097348203]
VeCAF uses labels and natural language annotations to perform parametric data selection for PVM finetuning.
VeCAF incorporates the finetuning objective to select significant data points that effectively guide the PVM towards faster convergence.
On ImageNet, VeCAF uses up to 3.3x fewer training batches to reach the target performance compared to full finetuning.
arXiv Detail & Related papers (2024-01-15T17:28:37Z) - Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
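A lightweight bi-directional adapter of the kind summarized above can be sketched as two tiny bottlenecks, one per transfer direction, applied as residual hints between otherwise frozen modality branches; the class below (BiDirectionalAdapter) and its hidden size are assumptions for illustration, not the paper's design.
```python
# Sketch of a lightweight bi-directional adapter that exchanges cues between two
# modality branches (e.g. RGB and an auxiliary modality); illustrative only.
import torch.nn as nn


class BiDirectionalAdapter(nn.Module):
    def __init__(self, dim, hidden=32):
        super().__init__()
        # one tiny bottleneck per transfer direction
        self.aux_to_rgb = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
        self.rgb_to_aux = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))

    def forward(self, rgb_tokens, aux_tokens):
        # each branch receives a residual hint from the other; backbones stay frozen
        return (rgb_tokens + self.aux_to_rgb(aux_tokens),
                aux_tokens + self.rgb_to_aux(rgb_tokens))
```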
arXiv Detail & Related papers (2023-12-17T05:27:31Z) - Transferring Modality-Aware Pedestrian Attentive Learning for Visible-Infrared Person Re-identification [43.05147831905626]
We propose a novel Transferring Modality-Aware Pedestrian Attentive Learning (TMPA) model.
TMPA focuses on the pedestrian regions to efficiently compensate for missing modality-specific features.
Experiments conducted on the benchmark SYSU-MM01 and RegDB datasets demonstrate the effectiveness of our proposed TMPA model.
arXiv Detail & Related papers (2023-12-12T07:15:17Z) - Rethinking Vision Transformer and Masked Autoencoder in Multimodal Face Anti-Spoofing [19.142582966452935]
We investigate three key factors (i.e., inputs, pre-training, and finetuning) in ViT for multimodal FAS with RGB, Infrared (IR), and Depth.
We propose the modality-asymmetric masked autoencoder (M$2$A$2$E) for multimodal FAS self-supervised pre-training without costly annotated labels.
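One plausible reading of "modality-asymmetric" masking, sketched below purely for illustration, is to leave a randomly chosen modality unmasked in each step while the remaining modalities are heavily masked for reconstruction; the function name and ratios are assumptions, not the authors' recipe.
```python
# Purely illustrative: one reading of modality-asymmetric masking for
# self-supervised pre-training. A randomly chosen modality keeps all of its
# patch tokens; the other modalities are masked and would be reconstructed.
import random
import torch


def asymmetric_mask(tokens_per_modality, mask_ratio=0.75):
    """tokens_per_modality: dict such as {'rgb': (B, N, D), 'ir': (B, N, D), 'depth': (B, N, D)}."""
    visible = random.choice(list(tokens_per_modality))
    kept = {}
    for name, tok in tokens_per_modality.items():
        if name == visible:
            kept[name] = tok                            # this modality stays intact
        else:
            n_keep = max(1, int(tok.size(1) * (1 - mask_ratio)))
            idx = torch.randperm(tok.size(1))[:n_keep]  # random visible patches
            kept[name] = tok[:, idx]
    return visible, kept
```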
arXiv Detail & Related papers (2023-02-11T17:02:34Z) - Towards a Unified View on Visual Parameter-Efficient Transfer Learning [96.99924127527002]
We propose a framework with a unified view called visual-PETL (V-PETL) to investigate the different aspects affecting the trade-off.
An effective scheme, Swin-BAPAT, derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin.
arXiv Detail & Related papers (2022-10-03T09:54:39Z) - CycleTrans: Learning Neutral yet Discriminative Features for Visible-Infrared Person Re-Identification [79.84912525821255]
Visible-infrared person re-identification (VI-ReID) is a task of matching the same individuals across the visible and infrared modalities.
Existing VI-ReID methods mainly focus on learning general features across modalities, often at the expense of feature discriminability.
We present a novel cycle-construction-based network for neutral yet discriminative feature learning, termed CycleTrans.
arXiv Detail & Related papers (2022-08-21T08:41:40Z) - Visual Prompt Tuning [74.5309408185523]
This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision.
VPT introduces only a small amount (less than 1% of model parameters) of trainable parameters in the input space while keeping the model backbone frozen.
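The frozen-backbone-plus-input-prompts recipe summarized above can be sketched in a few lines; the wrapper below (PromptedViT) assumes a generic transformer that accepts token sequences and is illustrative rather than the official implementation.
```python
# Minimal sketch of visual prompt tuning: learnable tokens are prepended to the
# patch embeddings while the transformer backbone stays frozen. Illustrative only.
import torch
import torch.nn as nn


class PromptedViT(nn.Module):
    def __init__(self, frozen_vit, dim=768, num_prompts=10):
        super().__init__()
        self.vit = frozen_vit
        for p in self.vit.parameters():
            p.requires_grad = False                    # backbone frozen
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)  # only these (and a head) train

    def forward(self, patch_tokens):                   # patch_tokens: (B, N, dim)
        p = self.prompts.expand(patch_tokens.size(0), -1, -1)
        return self.vit(torch.cat([p, patch_tokens], dim=1))
```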
arXiv Detail & Related papers (2022-03-23T01:17:16Z) - CPT: Colorful Prompt Tuning for Pre-trained Vision-Language Models [101.5066760592534]
We present Cross-modal Prompt Tuning (CPT), a novel paradigm for tuning pre-trained Vision-Language Models (VL-PTMs).
CPT reformulates visual grounding into a fill-in-the-blank problem with color-based co-referential markers in image and text, maximally mitigating the gap between pre-training and fine-tuning.
Comprehensive experimental results show that prompt tuned VL-PTMs outperform their fine-tuned counterparts by a large margin.
arXiv Detail & Related papers (2021-09-24T08:07:29Z) - Neural Feature Search for RGB-Infrared Person Re-Identification [3.499870393443268]
We study a general paradigm, termed Neural Feature Search (NFS), to automate the process of feature selection.
NFS combines a dual-level feature search space and a differentiable search strategy to jointly select identity-related cues in coarse-grained channels and fine-grained spatial pixels.
Our method outperforms state-of-the-art methods on mainstream benchmarks.
arXiv Detail & Related papers (2021-04-06T08:40:44Z)