MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object
Detection
- URL: http://arxiv.org/abs/2112.01177v1
- Date: Thu, 2 Dec 2021 12:48:37 GMT
- Title: MTFNet: Mutual-Transformer Fusion Network for RGB-D Salient Object
Detection
- Authors: Xixi Wang, Bo Jiang, Xiao Wang, Bin Luo
- Abstract summary: We propose a novel Mutual-Transformer Fusion Network (MTFNet) for RGB-D SOD.
MTFNet contains two main modules, i.e., the Focal Feature Extractor (FFE) and Mutual-Transformer Fusion (MTF).
Comprehensive experimental results on six public benchmarks demonstrate the superiority of our proposed MTFNet.
- Score: 15.371153771528093
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Salient object detection (SOD) on RGB-D images is an active problem in
computer vision. The main challenges of RGB-D SOD are how to 1) extract accurate
features from RGB and depth data with cluttered backgrounds or poor image
quality, and 2) exploit the complementary information between the RGB and depth
modalities. To address these challenges, we propose a novel Mutual-Transformer
Fusion Network (MTFNet) for RGB-D SOD. MTFNet contains two main modules, i.e.,
the Focal Feature Extractor (FFE) and Mutual-Transformer Fusion (MTF). FFE aims
to extract more accurate CNN features for the RGB and depth images by
introducing a novel pixel-level focal regularization to guide the CNN feature
extractor. MTF is designed to deeply exploit the multi-modal interaction between
RGB and depth images at both coarse and fine scales. The main benefit of MTF is
that it performs intra-modality and inter-modality learning simultaneously, and
can therefore achieve communication across the modalities more directly and
sufficiently. Comprehensive experimental results on six public benchmarks
demonstrate the superiority of the proposed MTFNet.
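The abstract does not come with code, but the core MTF idea, running intra-modality self-attention and inter-modality cross-attention over RGB and depth tokens in the same step, can be illustrated with a minimal PyTorch sketch. All names below (MutualTransformerFusion, dim, num_heads, the four nn.MultiheadAttention branches) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a mutual-transformer fusion block: intra-modality
# self-attention plus inter-modality cross-attention between RGB and depth
# tokens. Structure and names are assumptions, not the paper's code.
import torch
import torch.nn as nn


class MutualTransformerFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Intra-modality branches: each modality attends to itself.
        self.self_attn_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.self_attn_dep = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Inter-modality branches: queries from one modality, keys/values
        # from the other, so the two streams exchange information directly.
        self.cross_attn_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn_dep = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_dep = nn.LayerNorm(dim)

    def forward(self, rgb_tokens: torch.Tensor, dep_tokens: torch.Tensor):
        # rgb_tokens, dep_tokens: (B, N, dim) flattened CNN feature maps.
        intra_rgb, _ = self.self_attn_rgb(rgb_tokens, rgb_tokens, rgb_tokens)
        intra_dep, _ = self.self_attn_dep(dep_tokens, dep_tokens, dep_tokens)
        inter_rgb, _ = self.cross_attn_rgb(rgb_tokens, dep_tokens, dep_tokens)
        inter_dep, _ = self.cross_attn_dep(dep_tokens, rgb_tokens, rgb_tokens)
        # Intra- and inter-modality updates are combined in a single step,
        # so cross-modal communication happens alongside self-refinement.
        fused_rgb = self.norm_rgb(rgb_tokens + intra_rgb + inter_rgb)
        fused_dep = self.norm_dep(dep_tokens + intra_dep + inter_dep)
        return fused_rgb, fused_dep


if __name__ == "__main__":
    block = MutualTransformerFusion(dim=256, num_heads=8)
    rgb = torch.randn(2, 14 * 14, 256)   # e.g. a coarse-scale feature map
    dep = torch.randn(2, 14 * 14, 256)
    out_rgb, out_dep = block(rgb, dep)
    print(out_rgb.shape, out_dep.shape)  # torch.Size([2, 196, 256]) each
```

Applying such a block at both coarse and fine feature scales would mirror the multi-scale interaction the abstract describes; how the fused features feed the decoder is not specified here.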
Related papers
- MambaSOD: Dual Mamba-Driven Cross-Modal Fusion Network for RGB-D Salient Object Detection [10.848413136031025]
The purpose of RGB-D Salient Object Detection (SOD) is to pinpoint the most visually conspicuous areas within images accurately.
We propose a dual Mamba-driven cross-modal fusion network for RGB-D SOD, named MambaSOD.
arXiv Detail & Related papers (2024-10-19T07:08:40Z)
- HODINet: High-Order Discrepant Interaction Network for RGB-D Salient Object Detection [4.007827908611563]
RGB-D salient object detection (SOD) aims to detect the prominent regions by jointly modeling RGB and depth information.
Most RGB-D SOD methods apply the same type of backbones and fusion modules to identically learn the multimodality and multistage features.
In this paper, we propose a high-order discrepant interaction network (HODINet) for RGB-D SOD.
arXiv Detail & Related papers (2023-07-03T11:56:21Z)
- Interactive Context-Aware Network for RGB-T Salient Object Detection [7.544240329265388]
We propose a novel network called the Interactive Context-Aware Network (ICANet).
ICANet contains three modules that can effectively perform the cross-modal and cross-scale fusions.
Experiments prove that our network performs favorably against the state-of-the-art RGB-T SOD methods.
arXiv Detail & Related papers (2022-11-11T10:04:36Z)
- MAFNet: A Multi-Attention Fusion Network for RGB-T Crowd Counting [40.4816930622052]
We propose a two-stream RGB-T crowd counting network called the Multi-Attention Fusion Network (MAFNet).
In the encoder part, a Multi-Attention Fusion (MAF) module is embedded into different stages of the two modality-specific branches for cross-modal fusion.
Extensive experiments on two popular datasets show that the proposed MAFNet is effective for RGB-T crowd counting.
arXiv Detail & Related papers (2022-08-14T02:42:09Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that presents great efficacy in both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- Cross-modality Discrepant Interaction Network for RGB-D Salient Object Detection [78.47767202232298]
We propose a novel Cross-modality Discrepant Interaction Network (CDINet) for RGB-D SOD.
Two components are designed to implement the effective cross-modality interaction.
Our network outperforms 15 state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2021-08-04T11:24:42Z)
- Self-Supervised Representation Learning for RGB-D Salient Object Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets to perform pre-training, which makes the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z)
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels, and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
- Is Depth Really Necessary for Salient Object Detection? [50.10888549190576]
We make the first attempt at realizing a unified depth-aware framework with only RGB information as input for inference.
It not only surpasses state-of-the-art performance on five public RGB SOD benchmarks, but also surpasses RGB-D-based methods on five benchmarks by a large margin.
arXiv Detail & Related papers (2020-05-30T13:40:03Z)
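The last entry above points to a pattern in which depth is used only as auxiliary supervision during training, so inference needs nothing but the RGB image. The sketch below illustrates that train-with-depth, infer-with-RGB idea under stated assumptions; the toy network, the aux_depth_head, and the loss weighting are hypothetical and are not the cited paper's architecture.

```python
# Sketch: depth acts only as an auxiliary training signal, so inference
# takes a single RGB image. Network, head names, and loss weight are
# illustrative assumptions, not the cited paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RGBOnlySaliencyNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(             # stand-in RGB encoder
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.saliency_head = nn.Conv2d(64, 1, 1)   # predicts the saliency map
        self.aux_depth_head = nn.Conv2d(64, 1, 1)  # used only during training

    def forward(self, rgb):
        feats = self.backbone(rgb)
        return self.saliency_head(feats), self.aux_depth_head(feats)


def training_step(model, rgb, gt_saliency, gt_depth, depth_weight=0.3):
    # Depth supervises an auxiliary branch; the weight is an assumed value.
    pred_sal, pred_depth = model(rgb)
    loss_sal = F.binary_cross_entropy_with_logits(pred_sal, gt_saliency)
    loss_depth = F.l1_loss(pred_depth, gt_depth)
    return loss_sal + depth_weight * loss_depth


@torch.no_grad()
def predict(model, rgb):
    # Inference path: depth is never needed.
    pred_sal, _ = model(rgb)
    return torch.sigmoid(pred_sal)
```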
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.