UniRGB-IR: A Unified Framework for Visible-Infrared Downstream Tasks via Adapter Tuning
- URL: http://arxiv.org/abs/2404.17360v1
- Date: Fri, 26 Apr 2024 12:21:57 GMT
- Title: UniRGB-IR: A Unified Framework for Visible-Infrared Downstream Tasks via Adapter Tuning
- Authors: Maoxun Yuan, Bo Cui, Tianyi Zhao, Xingxing Wei
- Abstract summary: We propose a scalable and efficient framework called UniRGB-IR to unify RGB-IR downstream tasks.
Our framework consists of a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module and a Supplementary Feature Injector (SFI) module.
Experimental results on various RGB-IR downstream tasks demonstrate that our method can achieve state-of-the-art performance.
- Score: 17.22733823085519
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Semantic analysis on visible (RGB) and infrared (IR) images has gained attention for its ability to be more accurate and robust under low-illumination and complex weather conditions. Due to the lack of foundation models pre-trained on large-scale infrared image datasets, existing methods prefer to design task-specific frameworks and directly fine-tune them with pre-trained foundation models on their RGB-IR semantic relevance datasets, which results in poor scalability and limited generalization. In this work, we propose a scalable and efficient framework called UniRGB-IR to unify RGB-IR downstream tasks, in which a novel adapter is developed to efficiently introduce richer RGB-IR features into the pre-trained RGB-based foundation model. Specifically, our framework consists of a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module and a Supplementary Feature Injector (SFI) module. The MFP and SFI modules cooperate with each other as an adapter to effectively complement the ViT features with contextual multi-scale features. During training, we freeze the entire foundation model to inherit prior knowledge and only optimize the MFP and SFI modules. Furthermore, to verify the effectiveness of our framework, we utilize ViT-Base as the pre-trained foundation model to perform extensive experiments. Experimental results on various RGB-IR downstream tasks demonstrate that our method can achieve state-of-the-art performance. The source code and results are available at https://github.com/PoTsui99/UniRGB-IR.git.
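The abstract's adapter-tuning recipe (freeze the ViT foundation model, optimize only the MFP and SFI modules) can be sketched in PyTorch as follows. The abstract does not describe the internals of MFP or SFI, so the module bodies, the timm backbone name, and all shapes below are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import timm  # assumed available; supplies a pre-trained ViT-Base backbone


class SFISketch(nn.Module):
    """Placeholder for the Supplementary Feature Injector (SFI): adds
    projected multi-modal features onto the frozen ViT token sequence."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.zeros(1))  # zero-init: start at identity

    def forward(self, tokens: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
        return tokens + self.gate * self.proj(extra)


class UniRGBIRSketch(nn.Module):
    def __init__(self, dim: int = 768):
        super().__init__()
        self.backbone = timm.create_model("vit_base_patch16_224", pretrained=True)
        for p in self.backbone.parameters():  # freeze the foundation model
            p.requires_grad = False
        # Placeholder MFP: pools concatenated RGB+IR patches into tokens.
        self.mfp = nn.Conv2d(6, dim, kernel_size=16, stride=16)
        self.sfi = SFISketch(dim)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        extra = self.mfp(torch.cat([rgb, ir], dim=1)).flatten(2).transpose(1, 2)
        tokens = self.backbone.patch_embed(rgb)   # (B, 196, 768)
        tokens = self.sfi(tokens, extra)          # inject RGB-IR context
        for blk in self.backbone.blocks:          # cls/pos-embed omitted for brevity
            tokens = blk(tokens)
        return self.backbone.norm(tokens)


model = UniRGBIRSketch()
# Only the adapter (MFP + SFI) parameters receive gradients, as in the paper.
adapter_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(adapter_params, lr=1e-4)
```

Because the injector's gate is zero-initialized, training starts exactly from the frozen RGB backbone and only gradually mixes in IR context, a common convention for adapters attached to frozen models.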
Related papers
- Tensor Factorization for Leveraging Cross-Modal Knowledge in Data-Constrained Infrared Object Detection [22.60228799622782]
A key bottleneck for object detection in IR images is the lack of sufficient labeled training data.
We seek to leverage cues from the RGB modality to scale object detectors to the IR modality, while preserving model performance in the RGB modality.
We first pretrain these factor matrices on the RGB modality, for which plenty of training data is assumed to exist, and then add only a few trainable parameters for training on the IR modality to avoid over-fitting (a sketch of this two-stage idea follows this entry).
arXiv Detail & Related papers (2023-09-28T16:55:52Z)
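The two-stage recipe above (factor matrices pretrained on data-rich RGB, then only a handful of new parameters trained on IR) might look roughly like this minimal sketch; the rank-4 residual factors, the zero initialization, and the linear layer are illustrative assumptions, not the paper's exact tensor factorization:

```python
import torch
import torch.nn as nn


class FactorizedLinear(nn.Module):
    """Weight expressed through factor matrices: W = U @ V + A @ B.
    U, V are pretrained on RGB; A, B are the few IR-specific parameters."""

    def __init__(self, d_in: int, d_out: int, rank: int = 4):
        super().__init__()
        self.U = nn.Parameter(torch.randn(d_out, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(rank, d_in) * 0.02)
        self.A = nn.Parameter(torch.zeros(d_out, rank))        # zero-init, so
        self.B = nn.Parameter(torch.randn(rank, d_in) * 0.02)  # A @ B starts inert

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weight = self.U @ self.V + self.A @ self.B
        return x @ weight.t()


layer = FactorizedLinear(256, 256)
# Stage 1: pretrain all factors on abundant RGB data (not shown). Then:
for name, p in layer.named_parameters():
    p.requires_grad = name in {"A", "B"}  # Stage 2: train only the IR factors
```

Keeping the IR-specific factors low-rank bounds the number of newly trained parameters, which is the stated defense against over-fitting on scarce IR data.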
- Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation [19.41334573257174]
Traditional methods mostly use RGB images, which are heavily affected by lighting conditions, e.g., darkness.
Recent studies show that thermal images are robust to night scenes, serving as a complementary modality for segmentation.
This work proposes a Residual Spatial Fusion Network (RSFNet) for RGB-T semantic segmentation.
arXiv Detail & Related papers (2023-06-17T14:28:08Z)
- CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets [50.6643933702394]
We present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed CoMAE.
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
arXiv Detail & Related papers (2023-02-13T07:09:45Z)
- A Strong Transfer Baseline for RGB-D Fusion in Vision Transformers [0.0]
We propose a recipe for transferring pretrained ViTs in RGB-D domains for single-view 3D object recognition.
We show that our adapted ViTs reach up to 95.1% top-1 accuracy on the Washington RGB-D benchmark, achieving new state-of-the-art results.
arXiv Detail & Related papers (2022-10-03T12:08:09Z)
- Self-Supervised Modality-Aware Multiple Granularity Pre-Training for RGB-Infrared Person Re-Identification [9.624510941236837]
Modality-Aware Multiple Granularity Learning (MMGL) is a self-supervised pre-training alternative to ImageNet pre-training.
MMGL learns better representations (+6.47% Rank-1) than ImageNet pre-training, with faster training (convergence in a few hours) and higher data efficiency (5% of the data size).
Results suggest that it generalizes well to various existing models and losses, and shows promising transferability across datasets.
arXiv Detail & Related papers (2021-12-12T04:40:33Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
The key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that is highly effective in both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- Middle-level Fusion for Lightweight RGB-D Salient Object Detection [81.43951906434175]
A novel lightweight RGB-D SOD model is presented in this paper.
With IMFF and L modules incorporated in the middle-level fusion structure, our proposed model has only 3.9M parameters and runs at 33 FPS.
The experimental results on several benchmark datasets verify the effectiveness and superiority of the proposed method over some state-of-the-art methods.
arXiv Detail & Related papers (2021-04-23T11:37:15Z)
- Self-Supervised Representation Learning for RGB-D Salient Object Detection [93.17479956795862]
We use self-supervised representation learning to design two pretext tasks: cross-modal auto-encoding and depth-contour estimation (a sketch of the former follows this entry).
Our pretext tasks require only a few unlabeled RGB-D datasets for pre-training, which enables the network to capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z)
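As a rough illustration of the cross-modal auto-encoder pretext task mentioned above (encode one modality, reconstruct its paired modality, so the RGB-D pair itself supplies the supervision), consider the sketch below; the tiny encoder/decoder and the L1 objective are assumptions for brevity, not the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAE(nn.Module):
    """Encode an RGB image and reconstruct the paired depth map:
    no annotations are required, only unlabeled RGB-D pairs."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(rgb))


model = CrossModalAE()
rgb, depth = torch.randn(2, 3, 64, 64), torch.rand(2, 1, 64, 64)
loss = F.l1_loss(model(rgb), depth)  # the paired depth map is the target
loss.backward()
```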
- DUT-LFSaliency: Versatile Dataset and Light Field-to-RGB Saliency Detection [104.50425501764806]
We introduce a large-scale dataset to enable versatile applications for light field saliency detection.
We present an asymmetrical two-stream model consisting of the Focal stream and RGB stream.
Experiments demonstrate that our Focal stream achieves state-of-the-art performance.
arXiv Detail & Related papers (2020-12-30T11:53:27Z)
- Siamese Network for RGB-D Salient Object Detection and Beyond [113.30063105890041]
A novel framework is proposed to learn from both RGB and depth inputs through a shared network backbone (a minimal sketch of this idea follows this entry).
Comprehensive experiments using five popular metrics show that the designed framework yields a robust RGB-D saliency detector.
We also link JL-DCF to the RGB-D semantic segmentation field, showing its capability of outperforming several semantic segmentation models.
arXiv Detail & Related papers (2020-08-26T06:01:05Z)
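The shared-backbone (Siamese) design from the entry above can be illustrated as follows: one set of weights processes both modalities, and the two feature maps are fused by a joint head. Replicating depth to three channels and fusing with a 1x1 convolution are common conventions assumed here, not necessarily the paper's exact choices:

```python
import torch
import torch.nn as nn


class SiameseRGBD(nn.Module):
    """One backbone, two modalities: RGB and depth share all convolution
    weights; their features are concatenated for a joint saliency map."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(128, 1, kernel_size=1)  # joint saliency head

    def forward(self, rgb: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
        f_rgb = self.backbone(rgb)
        # Replicate 1-channel depth to 3 channels so it can reuse RGB weights.
        f_depth = self.backbone(depth.repeat(1, 3, 1, 1))
        return self.head(torch.cat([f_rgb, f_depth], dim=1))


model = SiameseRGBD()
saliency = model(torch.randn(1, 3, 64, 64), torch.randn(1, 1, 64, 64))
```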
This list is automatically generated from the titles and abstracts of the papers on this site.