Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection
- URL: http://arxiv.org/abs/2302.08052v1
- Date: Thu, 16 Feb 2023 03:23:23 GMT
- Title: Hierarchical Cross-modal Transformer for RGB-D Salient Object Detection
- Authors: Hao Chen and Feihong Shen
- Abstract summary: We propose the Hierarchical Cross-modal Transformer (HCT), a new multi-modal transformer, to tackle this problem.
Unlike previous multi-modal transformers that directly connect all patches from the two modalities, we explore cross-modal complementarity hierarchically.
We present a Feature Pyramid module for Transformer (FPT) to boost informative cross-scale integration as well as a consistency-complementarity module to disentangle the multi-modal integration path.
- Score: 6.385624548310884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing RGB-D salient object detection (SOD) methods follow the
CNN-based paradigm, which is unable to model long-range dependencies across
space and modalities due to the natural locality of CNNs. Here we propose the
Hierarchical Cross-modal Transformer (HCT), a new multi-modal transformer, to
tackle this problem. Unlike previous multi-modal transformers that directly
connect all patches from the two modalities, we explore the cross-modal
complementarity hierarchically to respect the modality gap and spatial
discrepancy in unaligned regions. Specifically, we propose to use intra-modal
self-attention to explore complementary global contexts, and measure
spatial-aligned inter-modal attention locally to capture cross-modal
correlations. In addition, we present a Feature Pyramid module for Transformer
(FPT) to boost informative cross-scale integration as well as a
consistency-complementarity module to disentangle the multi-modal integration
path and improve the fusion adaptivity. Comprehensive experiments on a large
variety of public datasets verify the efficacy of our designs and the
consistent improvement over state-of-the-art models.
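To make the two attention patterns and the adaptive fusion concrete, below is a minimal PyTorch sketch pairing global intra-modal self-attention with inter-modal cross-attention restricted to spatially aligned local windows, plus a simple learned gate in the spirit of the consistency-complementarity module. All module names, the window size, and the gate design are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class IntraModalSA(nn.Module):
    """Global self-attention within a single modality (RGB or depth)."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):  # x: (B, N, C) patch tokens of one modality
        h = self.norm(x)
        out, _ = self.attn(h, h, h)  # queries, keys, values from the same modality
        return x + out  # residual connection


class LocalInterModalCA(nn.Module):
    """Cross-modal attention restricted to spatially aligned windows, so
    unaligned regions of the two modalities never attend to each other."""

    def __init__(self, dim, heads=4, window=7):
        super().__init__()
        self.window = window
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def _to_windows(self, t, H, W):
        # H and W are assumed divisible by the window size.
        B, _, C = t.shape
        w = self.window
        t = t.view(B, H // w, w, W // w, w, C)
        return t.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

    def forward(self, q_feat, kv_feat, hw):
        B, N, C = q_feat.shape
        H, W = hw
        w = self.window
        q = self._to_windows(self.norm_q(q_feat), H, W)
        kv = self._to_windows(self.norm_kv(kv_feat), H, W)
        out, _ = self.attn(q, kv, kv)  # per-window cross-modal attention
        out = out.reshape(B, H // w, W // w, w, w, C)
        out = out.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return q_feat + out


class ConsistencyComplementarityGate(nn.Module):
    """Loose reading of the consistency-complementarity idea: mix a
    'consistent' path (agreement of both modalities) and a 'complementary'
    path (union of their cues) with a learned gate."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, rgb, dep):
        consistent = rgb * dep     # cues the modalities agree on
        complementary = rgb + dep  # cues from either modality
        g = self.gate(torch.cat([rgb, dep], dim=-1))
        return g * consistent + (1 - g) * complementary


# Toy usage on a 14x14 grid of 64-dim patch tokens.
rgb = torch.randn(2, 14 * 14, 64)
dep = torch.randn(2, 14 * 14, 64)
rgb = IntraModalSA(64)(rgb)                                # global intra-modal context
rgb = LocalInterModalCA(64, window=7)(rgb, dep, (14, 14))  # local cross-modal cues
fused = ConsistencyComplementarityGate(64)(rgb, dep)
print(fused.shape)  # torch.Size([2, 196, 64])
```

In a full model these blocks would be stacked per scale and combined with FPT-style cross-scale integration; the sketch only isolates the attention and fusion patterns.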
Related papers
- Point-aware Interaction and CNN-induced Refinement Network for RGB-D
Salient Object Detection [95.84616822805664]
We introduce a CNN-assisted Transformer architecture and propose a novel RGB-D SOD network with Point-aware Interaction and CNN-induced Refinement.
To alleviate the block effect and detail-destruction problems naturally introduced by the Transformer, we design a CNN-induced refinement (CNNR) unit for content refinement and supplementation.
arXiv Detail & Related papers (2023-08-17T11:57:49Z)
- Weakly Aligned Feature Fusion for Multimodal Object Detection [52.15436349488198]
Multimodal data often suffer from the position shift problem, i.e., the image pair is not strictly aligned.
This misalignment makes it difficult to fuse multimodal features and complicates convolutional neural network (CNN) training.
In this article, we propose a general multimodal detector named aligned region CNN (AR-CNN) to tackle the position shift problem.
arXiv Detail & Related papers (2022-04-21T02:35:23Z)
- SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection [12.126413875108993]
We propose SwinNet, a cross-modality fusion model for RGB-D and RGB-T salient object detection.
The proposed model outperforms the state-of-the-art models on RGB-D and RGB-T datasets.
arXiv Detail & Related papers (2022-04-12T07:37:39Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- TransCMD: Cross-Modal Decoder Equipped with Transformer for RGB-D Salient Object Detection [86.94578023985677]
In this work, we rethink this task from the perspective of global information alignment and transformation.
Specifically, the proposed method (TransCMD) cascades several cross-modal integration units to construct a top-down transformer-based information propagation path.
Experimental results on seven RGB-D SOD benchmark datasets demonstrate that a simple two-stream encoder-decoder framework can surpass the state-of-the-art purely CNN-based methods.
arXiv Detail & Related papers (2021-12-04T15:45:34Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that is highly effective for both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- Cross-Modality Fusion Transformer for Multispectral Object Detection [0.0]
Multispectral image pairs provide complementary information, making object detection applications more reliable and robust.
In this paper, we present a simple yet effective cross-modality feature fusion approach named Cross-Modality Fusion Transformer (CFT).
arXiv Detail & Related papers (2021-10-30T15:34:12Z)
- MSO: Multi-Feature Space Joint Optimization Network for RGB-Infrared Person Re-Identification [35.97494894205023]
The RGB-infrared cross-modality person re-identification (ReID) task aims to match images of the same identity across the visible and infrared modalities.
Existing methods mainly use a two-stream architecture to eliminate the discrepancy between the two modalities in the final common feature space.
We present a novel multi-feature space joint optimization (MSO) network, which can learn modality-sharable features in both the single-modality space and the common space.
arXiv Detail & Related papers (2021-10-21T16:45:23Z)