Multi-Modal Hybrid Learning and Sequential Training for RGB-T Saliency
Detection
- URL: http://arxiv.org/abs/2309.07297v1
- Date: Wed, 13 Sep 2023 20:47:29 GMT
- Title: Multi-Modal Hybrid Learning and Sequential Training for RGB-T Saliency
Detection
- Authors: Guangyu Ren, Jitesh Joshi, Youngjun Cho
- Abstract summary: RGB-T saliency detection has emerged as an important computer vision task, identifying conspicuous objects in challenging scenes such as dark environments.
Existing methods neglect the characteristics of cross-modal features and rely solely on network structures to fuse RGB and thermal features.
We first propose a Multi-Modal Hybrid loss (MMHL) that comprises supervised and self-supervised loss functions.
- Score: 10.589062261564631
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: RGB-T saliency detection has emerged as an important computer vision task,
identifying conspicuous objects in challenging scenes such as dark
environments. However, existing methods neglect the characteristics of
cross-modal features and rely solely on network structures to fuse RGB and
thermal features. To address this, we first propose a Multi-Modal Hybrid loss
(MMHL) that comprises supervised and self-supervised loss functions. The
supervised loss component of MMHL distinctly utilizes semantic features from
different modalities, while the self-supervised loss component reduces the
distance between RGB and thermal features. We further consider both spatial and
channel information during feature fusion and propose the Hybrid Fusion Module
to effectively fuse RGB and thermal features. Lastly, instead of jointly
training the network with cross-modal features, we implement a sequential
training strategy which performs training only on RGB images in the first stage
and then learns cross-modal features in the second stage. This training
strategy improves saliency detection performance without computational
overhead. Results from performance evaluation and ablation studies demonstrate
the superior performance achieved by the proposed method compared with the
existing state-of-the-art methods.
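The abstract only states that the MMHL combines a supervised loss over the two modalities' semantic features with a self-supervised loss that reduces the distance between RGB and thermal features. The sketch below is one plausible reading in PyTorch; the `MMHLoss` class name, the BCE choice for the supervised term, the cosine-distance choice for the self-supervised term, and the `lambda_self` weight are all assumptions, not the authors' formulation.

```python
# Hypothetical sketch of a Multi-Modal Hybrid loss (MMHL). The paper only says it
# combines a supervised term over both modalities with a self-supervised term that
# pulls RGB and thermal features together; the concrete choices below are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MMHLoss(nn.Module):
    def __init__(self, lambda_self: float = 0.1):
        super().__init__()
        self.lambda_self = lambda_self          # weight of the self-supervised term (assumed)
        self.bce = nn.BCEWithLogitsLoss()       # supervised saliency loss (assumed choice)

    def forward(self, rgb_logits, thermal_logits, rgb_feat, thermal_feat, gt_mask):
        # Supervised part: each modality's prediction is compared with the ground truth.
        sup = self.bce(rgb_logits, gt_mask) + self.bce(thermal_logits, gt_mask)
        # Self-supervised part: reduce the distance between RGB and thermal features,
        # here via cosine distance on globally pooled, L2-normalised feature vectors.
        r = F.normalize(rgb_feat.flatten(2).mean(-1), dim=1)
        t = F.normalize(thermal_feat.flatten(2).mean(-1), dim=1)
        self_sup = (1.0 - (r * t).sum(dim=1)).mean()
        return sup + self.lambda_self * self_sup


if __name__ == "__main__":
    loss_fn = MMHLoss()
    gt = torch.randint(0, 2, (2, 1, 64, 64)).float()
    loss = loss_fn(torch.randn(2, 1, 64, 64), torch.randn(2, 1, 64, 64),
                   torch.randn(2, 256, 8, 8), torch.randn(2, 256, 8, 8), gt)
    print(loss.item())
```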
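The Hybrid Fusion Module is described only as fusing RGB and thermal features while considering both spatial and channel information. The following sketch is an assumed SE/CBAM-style stand-in; the module name, layer choices, and the `reduction` parameter are hypothetical rather than the authors' design.

```python
# Hypothetical sketch of a Hybrid Fusion Module (HFM): channel attention followed by
# spatial attention over the concatenated RGB and thermal features (assumed design).
import torch
import torch.nn as nn


class HybridFusionModule(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention over the concatenated modalities (assumed design).
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2 * channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(2 * channels // reduction, 2 * channels, 1),
            nn.Sigmoid(),
        )
        # Spatial attention from pooled channel statistics (assumed design).
        self.spatial_att = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )
        self.project = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb_feat: torch.Tensor, thermal_feat: torch.Tensor) -> torch.Tensor:
        x = torch.cat([rgb_feat, thermal_feat], dim=1)
        x = x * self.channel_att(x)                                   # re-weight channels
        pooled = torch.cat([x.mean(1, keepdim=True),
                            x.max(1, keepdim=True).values], dim=1)    # spatial statistics
        x = x * self.spatial_att(pooled)                              # re-weight locations
        return self.project(x)                                        # fused feature map


if __name__ == "__main__":
    hfm = HybridFusionModule(channels=256)
    fused = hfm(torch.randn(2, 256, 32, 32), torch.randn(2, 256, 32, 32))
    print(fused.shape)  # torch.Size([2, 256, 32, 32])
```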
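The sequential training strategy trains on RGB images alone in a first stage and only then learns cross-modal features in a second stage, reusing the stage-one weights. A minimal training-loop sketch follows, assuming a hypothetical model that exposes an RGB-only forward pass, a fused RGB-T forward pass, and an `rgb_parameters()` accessor; the epoch counts, optimizer, and learning rate are placeholders.

```python
# Hypothetical sketch of the two-stage sequential training strategy from the abstract.
# The model interface, dataloaders, and hyperparameters below are assumptions.
import torch


def train_sequentially(model, rgb_loader, rgbt_loader, loss_fn, epochs=(30, 30), lr=1e-4):
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)

    # Stage 1: RGB-only training; the thermal branch is left untouched.
    opt = torch.optim.Adam(model.rgb_parameters(), lr=lr)       # assumed accessor
    for _ in range(epochs[0]):
        for rgb, gt in rgb_loader:
            pred = model(rgb.to(device))                        # RGB-only forward (assumed)
            loss = loss_fn(pred, gt.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()

    # Stage 2: cross-modal training on RGB-T pairs, starting from the stage-1 weights.
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs[1]):
        for rgb, thermal, gt in rgbt_loader:
            pred = model(rgb.to(device), thermal.to(device))    # fused forward (assumed)
            loss = loss_fn(pred, gt.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```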
Related papers
- Channel and Spatial Relation-Propagation Network for RGB-Thermal
Semantic Segmentation [10.344060599932185]
RGB-Thermal (RGB-T) semantic segmentation has shown great potential in handling low-light conditions.
The key to RGB-T semantic segmentation is to effectively leverage the complementary nature of RGB and thermal images.
arXiv Detail & Related papers (2023-08-24T03:43:47Z) - Residual Spatial Fusion Network for RGB-Thermal Semantic Segmentation [19.41334573257174]
Traditional methods mostly use RGB images, which are heavily affected by lighting conditions, e.g., darkness.
Recent studies show thermal images are robust to the night scenario as a compensating modality for segmentation.
This work proposes a Residual Spatial Fusion Network (RSFNet) for RGB-T semantic segmentation.
arXiv Detail & Related papers (2023-06-17T14:28:08Z) - CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets [50.6643933702394]
We present a single-model self-supervised hybrid pre-training framework for RGB and depth modalities, termed as CoMAE.
Our CoMAE presents a curriculum learning strategy to unify the two popular self-supervised representation learning algorithms: contrastive learning and masked image modeling.
arXiv Detail & Related papers (2023-02-13T07:09:45Z) - Interactive Context-Aware Network for RGB-T Salient Object Detection [7.544240329265388]
We propose a novel network called the Interactive Context-Aware Network (ICANet).
ICANet contains three modules that can effectively perform the cross-modal and cross-scale fusions.
Experiments prove that our network performs favorably against the state-of-the-art RGB-T SOD methods.
arXiv Detail & Related papers (2022-11-11T10:04:36Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient
Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - Modality-Adaptive Mixup and Invariant Decomposition for RGB-Infrared
Person Re-Identification [84.32086702849338]
We propose a novel modality-adaptive mixup and invariant decomposition (MID) approach for RGB-infrared person re-identification.
MID designs a modality-adaptive mixup scheme to generate suitable mixed modality images between RGB and infrared images.
Experiments on two challenging benchmarks demonstrate superior performance of MID over state-of-the-art methods.
arXiv Detail & Related papers (2022-03-03T14:26:49Z) - Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that is highly effective in both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z) - Self-Supervised Representation Learning for RGB-D Salient Object
Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets for pre-training, which enables the network to capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z) - Multi-Scale Cascading Network with Compact Feature Learning for
RGB-Infrared Person Re-Identification [35.55895776505113]
Multi-Scale Part-Aware Cascading framework (MSPAC) is formulated by aggregating multi-scale fine-grained features from part to global.
Cross-modality correlations can thus be efficiently explored on salient features for distinctive modality-invariant feature learning.
arXiv Detail & Related papers (2020-12-12T15:39:11Z) - Siamese Network for RGB-D Salient Object Detection and Beyond [113.30063105890041]
A novel framework is proposed to learn from both RGB and depth inputs through a shared network backbone.
Comprehensive experiments using five popular metrics show that the designed framework yields a robust RGB-D saliency detector.
We also link JL-DCF to the RGB-D semantic segmentation field, showing its capability of outperforming several semantic segmentation models.
arXiv Detail & Related papers (2020-08-26T06:01:05Z) - Two-Level Attention-based Fusion Learning for RGB-D Face Recognition [21.735238213921804]
A novel attention-aware method is proposed to fuse two image modalities, RGB and depth, for enhanced RGB-D facial recognition.
The proposed method first extracts features from both modalities using a convolutional feature extractor.
These features are then fused using a two-layer attention mechanism.
arXiv Detail & Related papers (2020-02-29T03:18:52Z)