Related papers: UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning

UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning

URL: http://arxiv.org/abs/2404.17360v3
Date: Tue, 15 Apr 2025 03:17:46 GMT
Title: UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning
Authors: Maoxun Yuan, Bo Cui, Tianyi Zhao, Jiayi Wang, Shan Fu, Xue Yang, Xingxing Wei,
Abstract summary: We propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks.<n>Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (SFI) module, and a Supplementary Feature (SFI) module.<n> Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance.
Score: 19.510261890672165
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Semantic analysis on visible (RGB) and infrared (IR) images has gained significant attention due to their enhanced accuracy and robustness under challenging conditions including low-illumination and adverse weather. However, due to the lack of pre-trained foundation models on the large-scale infrared image datasets, existing methods prefer to design task-specific frameworks and directly fine-tune them with pre-trained foundation models on their RGB-IR semantic relevance datasets, which results in poor scalability and limited generalization. To address these limitations, we propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks that introduces a novel adapter mechanism to effectively incorporate rich multi-modal features into pre-trained RGB-based foundation models. Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module. The MFP and SFI modules cooperate with each other as an adpater to effectively complement the ViT features with the contextual multi-scale features. During training process, we freeze the entire foundation model to inherit prior knowledge and only optimize the MFP and SFI modules. Furthermore, to verify the effectiveness of our framework, we utilize the ViT-Base as the pre-trained foundation model to perform extensive experiments. Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance. The source code and results are available at https://github.com/PoTsui99/UniRGB-IR.git.

Related papers

RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding [69.98331019544166]
Multimodal Large Language Models (MLLM) are primarily pre-trained on the RGB modality.<n>We propose RGBX-R1, a framework to enhance MLLM's perception and reasoning capacities across various X visual modalities.
arXiv Detail & Related papers (2026-01-31T04:13:57Z)
SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection [28.779870703756668]
We propose SLGNet, a framework that synergizes hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT)-based foundation model.<n>SLGNet achieves an mAP of 66.1, while reducing trainable parameters by approximately 87% compared to traditional full fine-tuning.
arXiv Detail & Related papers (2026-01-05T16:31:41Z)
HyPSAM: Hybrid Prompt-driven Segment Anything Model for RGB-Thermal Salient Object Detection [75.406055413928]
We propose a novel prompt-driven segment anything model (HyPSAM) for RGB-T SOD.<n> DFNet employs dynamic convolution and multi-branch decoding to facilitate adaptive cross-modality interaction.<n>Experiments on three public datasets demonstrate that our method achieves state-of-the-art performance.
arXiv Detail & Related papers (2025-09-23T07:32:11Z)
VELoRA: A Low-Rank Adaptation Approach for Efficient RGB-Event based Recognition [54.27379947727035]
This paper proposes a novel PEFT strategy to adapt the pre-trained foundation vision models for the RGB-Event-based classification. The frame difference of the dual modalities is also considered to capture the motion cues via the frame difference backbone network. The source code and pre-trained models will be released on urlhttps://github.com/Event-AHU/VELoRA.
arXiv Detail & Related papers (2024-12-28T07:38:23Z)
Modality Translation for Object Detection Adaptation Without Forgetting Prior Knowledge [11.905387325966311]
This paper focuses on adapting a large object detection model trained on RGB images to new data extracted from IR images. We propose Modality Translator (ModTr) as an alternative to the common approach of fine-tuning a large model to the new modality.
arXiv Detail & Related papers (2024-04-01T21:28:50Z)
Tensor Factorization for Leveraging Cross-Modal Knowledge in Data-Constrained Infrared Object Detection [22.60228799622782]
Key bottleneck in object detection in IR images is lack of sufficient labeled training data. We seek to leverage cues from the RGB modality to scale object detectors to the IR modality, while preserving model performance in the RGB modality. We first pretrain these factor matrices on the RGB modality, for which plenty of training data are assumed to exist and then augment only a few trainable parameters for training on the IR modality to avoid over-fitting.
arXiv Detail & Related papers (2023-09-28T16:55:52Z)
Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation [19.41334573257174]
Traditional methods mostly use RGB images which are heavily affected by lighting conditions, eg, darkness. Recent studies show thermal images are robust to the night scenario as a compensating modality for segmentation. This work proposes a Residual Spatial Fusion Network (RSFNet) for RGB-T semantic segmentation.
arXiv Detail & Related papers (2023-06-17T14:28:08Z)
CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets [50.6643933702394]
We present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed as CoMAE. Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
arXiv Detail & Related papers (2023-02-13T07:09:45Z)
Students taught by multimodal teachers are superior action recognizers [41.821485757189656]
The focal point of egocentric video understanding is modelling hand-object interactions. Standard models -- CNNs, Vision Transformers, etc. -- which receive RGB frames as input perform well, however, their performance improves further by employing additional modalities such as object detections, optical flow, audio, etc. The goal of this work is to retain the performance of such multimodal approaches, while using only the RGB images as input at inference time.
arXiv Detail & Related papers (2022-10-09T19:37:17Z)
A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers [0.0]
We propose a recipe for transferring pretrained ViTs in RGB-D domains for single-view 3D object recognition. We show that our adapted ViTs score up to 95.1% top-1 accuracy in Washington, achieving new state-of-the-art results in this benchmark.
arXiv Detail & Related papers (2022-10-03T12:08:09Z)
Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection [16.64781797503128]
RGB-thermal object detection (RGB-T SOD) aims to locate the common prominent objects of an aligned visible and thermal infrared image pair. In this paper, we propose a novel mirror complementary Transformer network (MCNet) for RGB-T SOD. Experiments on benchmark and VT723 datasets show that the proposed method outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2022-07-07T20:26:09Z)
Self-Supervised Modality-Aware Multiple Granularity Pre-Training for RGB-Infrared Person Re-Identification [9.624510941236837]
Modality-Aware Multiple Granularity Learning (MMGL) is a self-supervised pre-training alternative to ImageNet pre-training. MMGL learns better representations (+6.47% Rank-1) with faster training speed (converge in few hours) and solider data efficiency (5% data size) than ImageNet pre-training. Results suggest it generalizes well to various existing models, losses and has promising transferability across datasets.
arXiv Detail & Related papers (2021-12-12T04:40:33Z)
Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities. We show that transformer is a uniform operation which presents great efficacy in both feature fusion and feature enhancement. Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
Middle-level Fusion for Lightweight RGB-D Salient Object Detection [81.43951906434175]
A novel lightweight RGB-D SOD model is presented in this paper. With IMFF and L modules incorporated in the middle-level fusion structure, our proposed model has only 3.9M parameters and runs at 33 FPS. The experimental results on several benchmark datasets verify the effectiveness and superiority of the proposed method over some state-of-the-art methods.
arXiv Detail & Related papers (2021-04-23T11:37:15Z)
Self-Supervised Representation Learning for RGB-D Salient Object Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation. Our pretext tasks require only a few and un RGB-D datasets to perform pre-training, which make the network capture rich semantic contexts. For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z)
Siamese Network for RGB-D Salient Object Detection and Beyond [113.30063105890041]
A novel framework is proposed to learn from both RGB and depth inputs through a shared network backbone. Comprehensive experiments using five popular metrics show that the designed framework yields a robust RGB-D saliency detector. We also link JL-DCF to the RGB-D semantic segmentation field, showing its capability of outperforming several semantic segmentation models.
arXiv Detail & Related papers (2020-08-26T06:01:05Z)
Data-Level Recombination and Lightweight Fusion Scheme for RGB-D Salient Object Detection [73.31632581915201]
We propose a novel data-level recombination strategy to fuse RGB with D (depth) before deep feature extraction. A newly lightweight designed triple-stream network is applied over these novel formulated data to achieve an optimal channel-wise complementary fusion status between the RGB and D.
arXiv Detail & Related papers (2020-08-07T10:13:05Z)
Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation. Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and models the problem as a cross-modal feature fusion. In this paper, we propose a unified and efficient Crossmodality Guided to not only effectively recalibrate RGB feature responses, but also to distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternatively.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
Synergistic saliency and depth prediction for RGB-D saliency detection [76.27406945671379]
Existing RGB-D saliency datasets are small, which may lead to overfitting and limited generalization for diverse scenarios. We propose a semi-supervised system for RGB-D saliency detection that can be trained on smaller RGB-D saliency datasets without saliency ground truth.
arXiv Detail & Related papers (2020-07-03T14:24:41Z)

This list is automatically generated from the titles and abstracts of the papers in this site.