DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for
Salient Object Detection
- URL: http://arxiv.org/abs/2203.06429v1
- Date: Sat, 12 Mar 2022 12:59:12 GMT
- Title: DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for
Salient Object Detection
- Authors: Heqin Zhu, Xu Sun, Yuexiang Li, Kai Ma, S. Kevin Zhou, Yefeng Zheng
- Abstract summary: We propose a pure Transformer-based SOD framework, namely Depth-supervised hierarchical feature Fusion TRansformer (DFTR).
We extensively evaluate the proposed DFTR on ten benchmarking datasets. Experimental results show that our DFTR consistently outperforms the existing state-of-the-art methods for both RGB and RGB-D SOD tasks.
- Score: 44.94166578314837
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automated salient object detection (SOD) plays an increasingly crucial role
in many computer vision applications. Although existing frameworks achieve
impressive SOD performance, especially with the development of deep learning
techniques, there is still room for improvement. In this work,
we propose a novel pure Transformer-based SOD framework, namely
Depth-supervised hierarchical feature Fusion TRansformer (DFTR), to further
improve the accuracy of both RGB and RGB-D SOD. The proposed DFTR involves
three primary improvements: 1) The backbone of the feature encoder is switched from
a convolutional neural network to a Swin Transformer for more effective feature
extraction; 2) We propose a multi-scale feature aggregation (MFA) module to
fully exploit the multi-scale features encoded by the Swin Transformer in a
coarse-to-fine manner; 3) Following recent studies, we formulate an auxiliary
task of depth map prediction and use the ground-truth depth maps as extra
supervision signals for network learning. To enable bidirectional information
flow between saliency and depth branches, a novel multi-task feature fusion
(MFF) module is integrated into our DFTR. We extensively evaluate the proposed
DFTR on ten benchmarking datasets. Experimental results show that our DFTR
consistently outperforms the existing state-of-the-art methods for both RGB and
RGB-D SOD tasks. The code and model will be released.
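
The abstract describes a general recipe: extract hierarchical features with a Swin Transformer, aggregate them coarse-to-fine (MFA), and train an auxiliary depth branch alongside the saliency branch (MFF) using ground-truth depth as extra supervision. Since the official code is not yet released, the following is only a minimal PyTorch sketch of that recipe: a tiny CNN encoder stands in for the Swin backbone, the bidirectional MFF fusion is reduced to a shared decoder with two heads, and all module names, channel sizes, and loss weights are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of depth-supervised SOD with coarse-to-fine multi-scale fusion.
# The encoder is a CNN stand-in for a hierarchical (e.g. Swin) backbone; all
# names and hyperparameters below are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyHierarchicalEncoder(nn.Module):
    """Stand-in backbone: returns features at roughly 1/4, 1/8, 1/16 scale."""
    def __init__(self, c=32):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, c, 3, stride=4, padding=1), nn.ReLU())
        self.stage2 = nn.Sequential(nn.Conv2d(c, 2 * c, 3, stride=2, padding=1), nn.ReLU())
        self.stage3 = nn.Sequential(nn.Conv2d(2 * c, 4 * c, 3, stride=2, padding=1), nn.ReLU())

    def forward(self, x):
        f1 = self.stage1(x)
        f2 = self.stage2(f1)
        f3 = self.stage3(f2)
        return [f1, f2, f3]  # ordered fine -> coarse


class CoarseToFineAggregator(nn.Module):
    """Fuses multi-scale features starting from the coarsest level (MFA-like idea)."""
    def __init__(self, chans=(32, 64, 128), out_c=32):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_c, 1) for c in chans])
        self.smooth = nn.ModuleList([nn.Conv2d(out_c, out_c, 3, padding=1) for _ in chans])

    def forward(self, feats):
        feats = [l(f) for l, f in zip(self.lateral, feats)]  # unify channel widths
        x = feats[-1]                                         # start from coarsest map
        for f, s in zip(reversed(feats[:-1]), self.smooth):
            x = F.interpolate(x, size=f.shape[-2:], mode="bilinear", align_corners=False)
            x = s(x + f)                                      # refine with finer features
        return x


class DepthSupervisedSOD(nn.Module):
    """Shared encoder/decoder with two heads: saliency map and auxiliary depth map."""
    def __init__(self):
        super().__init__()
        self.encoder = TinyHierarchicalEncoder()
        self.agg = CoarseToFineAggregator()
        self.sal_head = nn.Conv2d(32, 1, 3, padding=1)
        self.depth_head = nn.Conv2d(32, 1, 3, padding=1)

    def forward(self, rgb):
        fused = self.agg(self.encoder(rgb))
        size = rgb.shape[-2:]
        sal = F.interpolate(self.sal_head(fused), size=size, mode="bilinear", align_corners=False)
        depth = F.interpolate(self.depth_head(fused), size=size, mode="bilinear", align_corners=False)
        return sal, depth


if __name__ == "__main__":
    model = DepthSupervisedSOD()
    rgb = torch.randn(2, 3, 224, 224)
    gt_sal = torch.randint(0, 2, (2, 1, 224, 224)).float()
    gt_depth = torch.rand(2, 1, 224, 224)

    sal_logits, depth_pred = model(rgb)
    # Depth acts purely as an extra supervision signal during training;
    # the 0.5 weight on the depth term is an arbitrary illustrative choice.
    loss = F.binary_cross_entropy_with_logits(sal_logits, gt_sal) \
         + 0.5 * F.l1_loss(torch.sigmoid(depth_pred), gt_depth)
    loss.backward()
    print(sal_logits.shape, depth_pred.shape, float(loss))
```

At inference time only the saliency head would be used; the depth head exists solely to provide the extra supervision signal described in the abstract.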
Related papers
- Depthformer : Multiscale Vision Transformer For Monocular Depth
Estimation With Local Global Information Fusion [6.491470878214977]
This paper benchmarks various transformer-based models for the depth estimation task on the indoor NYUV2 dataset and the outdoor KITTI dataset.
We propose a novel attention-based architecture, Depthformer, for monocular depth estimation.
Our proposed method improves the state of the art by 3.3% and 3.3% on the two benchmarks, respectively, in terms of Root Mean Squared Error (RMSE); the metric is recalled after this list.
arXiv Detail & Related papers (2022-07-10T20:49:11Z) - Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient
Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z) - SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient
object detection [12.126413875108993]
We propose a cross-modality fusion model SwinNet for RGB-D and RGB-T salient object detection.
The proposed model outperforms the state-of-the-art models on RGB-D and RGB-T datasets.
arXiv Detail & Related papers (2022-04-12T07:37:39Z) - Joint Learning of Salient Object Detection, Depth Estimation and Contour
Extraction [91.43066633305662]
We propose a novel multi-task and multi-modal filtered transformer (MMFT) network for RGB-D salient object detection (SOD).
Specifically, we unify three complementary tasks: depth estimation, salient object detection and contour estimation. The multi-task mechanism encourages the model to learn task-aware features from the auxiliary tasks.
Experiments show that it not only significantly surpasses the depth-based RGB-D SOD methods on multiple datasets, but also precisely predicts a high-quality depth map and salient contour at the same time.
arXiv Detail & Related papers (2022-03-09T17:20:18Z) - The Devil is in the Task: Exploiting Reciprocal Appearance-Localization
Features for Monocular 3D Object Detection [62.1185839286255]
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving.
We introduce a Dynamic Feature Reflecting Network, named DFR-Net.
We rank 1st among all the monocular 3D object detectors in the KITTI test set.
arXiv Detail & Related papers (2021-12-28T07:31:18Z) - Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that is highly effective for both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z) - Cross-modality Discrepant Interaction Network for RGB-D Salient Object
Detection [78.47767202232298]
We propose a novel Cross-modality Discrepant Interaction Network (CDINet) for RGB-D SOD.
Two components are designed to implement the effective cross-modality interaction.
Our network outperforms 15 state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2021-08-04T11:24:42Z)