A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection
- URL: http://arxiv.org/abs/2007.06811v2
- Date: Wed, 15 Jul 2020 02:00:22 GMT
- Title: A Single Stream Network for Robust and Real-time RGB-D Salient Object Detection
- Authors: Xiaoqi Zhao, Lihe Zhang, Youwei Pang, Huchuan Lu, Lei Zhang
- Abstract summary: We design a single stream network to use the depth map to guide early fusion and middle fusion between RGB and depth.
This model is 55.5% lighter than the current lightest model and runs at a real-time speed of 32 FPS when processing a $384 \times 384$ image.
- Score: 89.88222217065858
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing RGB-D salient object detection (SOD) approaches concentrate on the
cross-modal fusion between the RGB stream and the depth stream. They do not
deeply explore the effect of the depth map itself. In this work, we design a
single stream network to directly use the depth map to guide early fusion and
middle fusion between RGB and depth, which saves the feature encoder of the
depth stream and yields a lightweight, real-time model. We utilize depth
information from two perspectives: (1) To overcome the incompatibility problem
caused by the large difference between modalities, we build a single-stream
encoder to achieve early fusion, which can take full advantage of an ImageNet
pre-trained backbone to extract rich and discriminative features. (2) We
design a novel depth-enhanced dual attention
module (DEDA) to efficiently provide the fore-/back-ground branches with the
spatially filtered features, which enables the decoder to optimally perform the
middle fusion. Besides, we put forward a pyramidally attended feature
extraction module (PAFE) to accurately localize the objects of different
scales. Extensive experiments demonstrate that the proposed model performs
favorably against most state-of-the-art methods under different evaluation
metrics. Furthermore, this model is 55.5\% lighter than the current lightest
model and runs at a real-time speed of 32 FPS when processing a $384 \times
384$ image.
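As a concrete illustration of the early-fusion and depth-gating ideas described in the abstract, below is a minimal PyTorch sketch based only on that description. It widens the first convolution of an ImageNet-pretrained backbone so a 4-channel RGB-D tensor can enter a single stream, and adds a toy depth-gated dual attention that routes features to fore-/background branches. All names (`build_rgbd_backbone`, `DepthGatedDualAttention`), shapes, and initialization choices are assumptions for illustration; this is not the authors' DEDA or PAFE implementation.

```python
# Minimal sketch (assumptions only, not the authors' code): early RGB-D fusion by
# widening the first convolution of an ImageNet-pretrained backbone, plus a toy
# depth-gated dual attention that splits features into fore-/background branches.
import torch
import torch.nn as nn
import torchvision.models as models


def build_rgbd_backbone():
    """ResNet-50 whose first conv accepts a 4-channel RGB-D input (early fusion)."""
    backbone = models.resnet50(weights=None)  # load ImageNet weights here if desired
    old_conv = backbone.conv1
    new_conv = nn.Conv2d(4, old_conv.out_channels,
                         kernel_size=old_conv.kernel_size,
                         stride=old_conv.stride,
                         padding=old_conv.padding,
                         bias=False)
    with torch.no_grad():
        new_conv.weight[:, :3] = old_conv.weight                        # reuse RGB filters
        new_conv.weight[:, 3:] = old_conv.weight.mean(1, keepdim=True)  # depth channel init
    backbone.conv1 = new_conv
    return backbone


class DepthGatedDualAttention(nn.Module):
    """Hypothetical stand-in for a depth-enhanced dual attention (DEDA-like) gate:
    the depth map yields a spatial mask that routes encoder features to the
    foreground branch and its complement to the background branch."""

    def __init__(self, channels):
        super().__init__()
        self.depth_gate = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feat, depth):
        depth = nn.functional.interpolate(depth, size=feat.shape[-2:],
                                          mode="bilinear", align_corners=False)
        gate = self.depth_gate(depth)
        return feat * gate, feat * (1.0 - gate)  # foreground / background features


if __name__ == "__main__":
    rgb, depth = torch.randn(1, 3, 384, 384), torch.randn(1, 1, 384, 384)
    x = torch.cat([rgb, depth], dim=1)       # early fusion: one 4-channel stream
    logits = build_rgbd_backbone()(x)        # stock classification head, for a smoke test
    fg, bg = DepthGatedDualAttention(64)(torch.randn(1, 64, 96, 96), depth)
    print(logits.shape, fg.shape, bg.shape)
```

In practice one would load pretrained weights (e.g. `weights=models.ResNet50_Weights.IMAGENET1K_V1`) and tap intermediate feature maps for the decoder rather than using the classification head; the snippet keeps the stock head only to stay self-contained.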
Related papers
- RBF Weighted Hyper-Involution for RGB-D Object Detection [0.0]
We propose a real-time, two-stream RGB-D object detection model.
The proposed model consists of two new components: a depth guided hyper-involution that adapts dynamically based on the spatial interaction pattern in the raw depth map and an up-sampling based trainable fusion layer.
We show that the proposed model outperforms other RGB-D based object detection models on NYU Depth v2 dataset and achieves comparable (second best) results on SUN RGB-D.
arXiv Detail & Related papers (2023-09-30T11:25:34Z)
- Pyramid Deep Fusion Network for Two-Hand Reconstruction from RGB-D Images [11.100398985633754]
We propose an end-to-end framework for recovering dense meshes for both hands.
Our framework employs ResNet50 and PointNet++ to extract features from the RGB image and the point cloud, respectively.
We also introduce a novel pyramid deep fusion network (PDFNet) to aggregate features at different scales.
arXiv Detail & Related papers (2023-07-12T09:33:21Z)
- Joint Learning of Salient Object Detection, Depth Estimation and Contour Extraction [91.43066633305662]
We propose a novel multi-task and multi-modal filtered transformer (MMFT) network for RGB-D salient object detection (SOD).
Specifically, we unify three complementary tasks: depth estimation, salient object detection and contour estimation. The multi-task mechanism encourages the model to learn task-aware features from the auxiliary tasks.
Experiments show that it not only significantly surpasses the depth-based RGB-D SOD methods on multiple datasets, but also precisely predicts a high-quality depth map and salient contour at the same time.
arXiv Detail & Related papers (2022-03-09T17:20:18Z)
- Depth-Cooperated Trimodal Network for Video Salient Object Detection [13.727763221832532]
We propose a depth-cooperated trimodal network called DCTNet for video salient object detection (VSOD).
To this end, we first generate depth from RGB frames, and then propose an approach to treat the three modalities unequally.
We also introduce a refinement fusion module (RFM) to suppress noises in each modality and select useful information dynamically for further feature refinement.
arXiv Detail & Related papers (2022-02-12T13:04:16Z)
- EPMF: Efficient Perception-aware Multi-sensor Fusion for 3D Semantic Segmentation [62.210091681352914]
We study multi-sensor fusion for 3D semantic segmentation in applications such as autonomous driving and robotics.
In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF).
We propose a two-stream network to extract features from the two modalities separately. The extracted features are fused by effective residual-based fusion modules.
arXiv Detail & Related papers (2021-06-21T10:47:26Z)
- Adaptive Context-Aware Multi-Modal Network for Depth Completion [107.15344488719322]
We propose to adopt graph propagation to capture the observed spatial contexts.
We then apply an attention mechanism to the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
arXiv Detail & Related papers (2020-08-25T06:00:06Z)
- Progressively Guided Alternate Refinement Network for RGB-D Salient Object Detection [63.18846475183332]
We aim to develop an efficient and compact deep network for RGB-D salient object detection.
We propose a progressively guided alternate refinement network to refine the initial coarse prediction.
Our model outperforms existing state-of-the-art approaches by a large margin.
arXiv Detail & Related papers (2020-08-17T02:55:06Z)
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGBD images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder to not only effectively recalibrate RGB feature responses, but also distill accurate depth information via multiple stages and aggregate the two recalibrated representations alternately. (A simplified gating sketch in this spirit appears after this list.)
arXiv Detail & Related papers (2020-07-17T18:35:24Z)
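Several of the related entries above describe gated cross-modal recalibration (for example, the Separation-and-Aggregation Gate and the symmetric gated fusion strategy). The following is a generic, assumption-level PyTorch sketch of such gating, not any specific paper's module: a sigmoid gate predicted from both modalities mixes the RGB and depth features per pixel and channel.

```python
# Generic sketch of gated cross-modal feature fusion (an assumption-level
# illustration of the gating ideas mentioned above, not any paper's actual module).
import torch
import torch.nn as nn


class GatedCrossModalFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # A shared gate predicted from both modalities decides, per pixel and
        # channel, how much each modality contributes to the fused feature.
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, rgb_feat, depth_feat):
        g = self.gate(torch.cat([rgb_feat, depth_feat], dim=1))
        return g * rgb_feat + (1.0 - g) * depth_feat


if __name__ == "__main__":
    fusion = GatedCrossModalFusion(channels=64)
    fused = fusion(torch.randn(2, 64, 48, 48), torch.randn(2, 64, 48, 48))
    print(fused.shape)  # torch.Size([2, 64, 48, 48])
```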