DS-Net: Dynamic Spatiotemporal Network for Video Salient Object
Detection
- URL: http://arxiv.org/abs/2012.04886v1
- Date: Wed, 9 Dec 2020 06:42:30 GMT
- Title: DS-Net: Dynamic Spatiotemporal Network for Video Salient Object
Detection
- Authors: Yuting Su, Weikang Wang, Jing Liu, Peiguang Jing and Xiaokang Yang
- Abstract summary: We propose a novel dynamic spatiotemporal network (DS-Net) for more effective fusion of temporal and spatial information.
We show that the proposed method outperforms state-of-the-art algorithms.
- Score: 78.04869214450963
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As moving objects always draw more attention from human eyes, temporal
motion information is commonly exploited as a complement to spatial information
to detect salient objects in videos. Although efficient tools such as optical
flow have been proposed to extract temporal motion information, they often
encounter difficulties when used for saliency detection due to camera movement
or the partial movement of salient objects. In this paper, we investigate the
complementary roles of spatial and temporal information and propose a novel
dynamic spatiotemporal network (DS-Net) for more effective fusion of
spatiotemporal information. We construct a symmetric two-bypass network to
explicitly extract spatial and temporal features. A dynamic weight generator
(DWG) is designed to automatically learn the reliability of the corresponding
saliency branch, and a top-down cross-attentive aggregation (CAA) procedure is
designed to facilitate dynamic complementary aggregation of spatiotemporal
features. Finally, the features are refined by spatial attention under the
guidance of a coarse saliency map and then passed through the decoder to
produce the final saliency map. Experimental results on five benchmarks, VOS,
DAVIS, FBMS, SegTrack-v2, and ViSal, demonstrate that the proposed method
outperforms state-of-the-art algorithms. The source code
is available at https://github.com/TJUMMG/DS-Net.
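As a rough illustration of the fusion idea described above, the sketch below weights the spatial and temporal branch features by learned reliability scores before aggregating them. The module name, layer choices, and tensor shapes are assumptions made for this example; the authors' actual DWG/CAA implementation is in the linked repository.

```python
import torch
import torch.nn as nn

class DynamicWeightFusion(nn.Module):
    """Toy sketch of reliability-weighted fusion of spatial and temporal
    saliency features (inspired by the DWG idea; not the official DS-Net code)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        # Predict one scalar reliability weight per branch from pooled features.
        self.weight_gen = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, 2, kernel_size=1),
            nn.Softmax(dim=1),           # weights for (spatial, temporal) sum to 1
        )
        self.fuse = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, f_spatial, f_temporal):
        w = self.weight_gen(torch.cat([f_spatial, f_temporal], dim=1))  # (B, 2, 1, 1)
        w_s, w_t = w[:, 0:1], w[:, 1:2]
        fused = w_s * f_spatial + w_t * f_temporal
        return self.fuse(fused)

if __name__ == "__main__":
    m = DynamicWeightFusion(64)
    s = torch.randn(2, 64, 56, 56)   # spatial branch features
    t = torch.randn(2, 64, 56, 56)   # temporal (e.g., flow-based) branch features
    print(m(s, t).shape)             # torch.Size([2, 64, 56, 56])
```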
Related papers
- DroneAttention: Sparse Weighted Temporal Attention for Drone-Camera
Based Activity Recognition [2.705905918316948]
Human activity recognition (HAR) using drone-mounted cameras has attracted considerable interest from the computer vision research community in recent years.
We propose a novel Sparse Weighted Temporal Attention (SWTA) module to utilize sparsely sampled video frames for obtaining global weighted temporal attention.
The proposed model achieves accuracies of 72.76%, 92.56%, and 78.86% on the respective datasets.
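For readers unfamiliar with weighted temporal attention, a minimal sketch follows: it scores a handful of sparsely sampled frame features and takes their weighted sum. The class name and dimensions are hypothetical and do not reproduce the paper's SWTA module.

```python
import torch
import torch.nn as nn

class WeightedTemporalAttention(nn.Module):
    """Toy weighted temporal attention over sparsely sampled frame features
    (illustrative only; not the SWTA module from the paper)."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # per-frame attention score

    def forward(self, frame_feats):      # (B, T, dim), T = sparsely sampled frames
        weights = torch.softmax(self.score(frame_feats), dim=1)  # (B, T, 1)
        return (weights * frame_feats).sum(dim=1)                # (B, dim)

if __name__ == "__main__":
    # Sample, say, 4 frames uniformly from a longer clip before encoding.
    feats = torch.randn(8, 4, 256)
    print(WeightedTemporalAttention(256)(feats).shape)  # torch.Size([8, 256])
```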
arXiv Detail & Related papers (2022-12-07T00:33:40Z)
- Motion-aware Memory Network for Fast Video Salient Object Detection [15.967509480432266]
We design a space-time memory (STM)-based network, which extracts useful temporal information of the current frame from adjacent frames as the temporal branch of VSOD.
In the encoding stage, we generate high-level temporal features by using high-level features from the current frame and its adjacent frames.
In the decoding stage, we propose an effective fusion strategy for spatial and temporal branches.
The proposed model does not require optical flow or other preprocessing, and can reach a speed of nearly 100 FPS during inference.
arXiv Detail & Related papers (2022-08-01T15:56:19Z)
- Implicit Motion Handling for Video Camouflaged Object Detection [60.98467179649398]
We propose a new video camouflaged object detection (VCOD) framework.
It can exploit both short-term and long-term temporal consistency to detect camouflaged objects from video frames.
arXiv Detail & Related papers (2022-03-14T17:55:41Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to semantically localize a target segment in an untrimmed video according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video, and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
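A bare-bones sketch of co-attention style fusion of low- and high-level features is given below; it is an assumption about the general formulation (mutual gating between levels) rather than the paper's code, and the layer names are hypothetical.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Rough sketch of co-attention style fusion of low- and high-level
    feature maps (an assumed formulation, not the paper's implementation)."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.gate_low = nn.Conv2d(channels, channels, kernel_size=1)
        self.gate_high = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, low, high):
        # Each level gates the other before the features are summed.
        att_low = torch.sigmoid(self.gate_high(high))   # attention from high-level
        att_high = torch.sigmoid(self.gate_low(low))    # attention from low-level
        return low * att_low + high * att_high

if __name__ == "__main__":
    low = torch.randn(1, 64, 64, 64)
    high = torch.randn(1, 64, 64, 64)  # assumed upsampled to the same resolution
    print(CoAttentionFusion(64)(low, high).shape)
```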
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- MFGNet: Dynamic Modality-Aware Filter Generation for RGB-T Tracking [72.65494220685525]
We propose a new dynamic modality-aware filter generation module (named MFGNet) to boost the message communication between visible and thermal data.
We generate dynamic modality-aware filters with two independent networks. The visible and thermal filters will be used to conduct a dynamic convolutional operation on their corresponding input feature maps respectively.
To address issues caused by heavy occlusion, fast motion, and out-of-view targets, we propose to conduct a joint local and global search by exploiting a new direction-aware target-driven attention mechanism.
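The "dynamic filter" idea, generating convolution kernels on the fly from the input and applying them per sample, can be sketched as follows. This toy version operates on a single modality with hypothetical layer sizes; it is not the MFGNet module itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicFilterConv(nn.Module):
    """Toy dynamic convolution: a small network predicts per-sample depthwise
    filters that are then applied to the input feature map. Illustrates the
    general dynamic-filter idea, not MFGNet."""
    def __init__(self, channels: int = 32, k: int = 3):
        super().__init__()
        self.channels, self.k = channels, k
        self.filter_gen = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels * k * k, kernel_size=1),
        )

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        filters = self.filter_gen(x).view(b * c, 1, self.k, self.k)
        # Fold the batch into the channel dim so each sample gets its own filters.
        out = F.conv2d(x.view(1, b * c, h, w), filters,
                       padding=self.k // 2, groups=b * c)
        return out.view(b, c, h, w)

if __name__ == "__main__":
    rgb_feat = torch.randn(2, 32, 40, 40)      # e.g., visible-modality features
    print(DynamicFilterConv(32)(rgb_feat).shape)
```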
arXiv Detail & Related papers (2021-07-22T03:10:51Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- TRAT: Tracking by Attention Using Spatio-Temporal Features [14.520067060603209]
We propose a two-stream deep neural network tracker that uses both spatial and temporal features.
Our architecture is built on the ATOM tracker and contains two backbones: (i) a 2D-CNN network to capture appearance features and (ii) a 3D-CNN network to capture motion features.
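A minimal two-stream sketch in the spirit of this design is shown below, with a 2D convolutional stem for appearance on the current frame and a 3D convolutional stem for motion over a short clip; both stems are illustrative stand-ins, not the ATOM-based backbones used in the paper.

```python
import torch
import torch.nn as nn

class TwoStreamFeatures(nn.Module):
    """Minimal two-stream sketch: a 2D conv stem for appearance and a 3D conv
    stem for motion (illustrative stand-ins, not the paper's backbones)."""
    def __init__(self, out_ch: int = 32):
        super().__init__()
        self.appearance = nn.Sequential(
            nn.Conv2d(3, out_ch, kernel_size=3, padding=1), nn.ReLU())
        self.motion = nn.Sequential(
            nn.Conv3d(3, out_ch, kernel_size=3, padding=1), nn.ReLU())

    def forward(self, frame, clip):
        # frame: (B, 3, H, W); clip: (B, 3, T, H, W)
        app = self.appearance(frame)            # (B, C, H, W)
        mot = self.motion(clip).mean(dim=2)     # pool the temporal dimension
        return torch.cat([app, mot], dim=1)     # fused (B, 2C, H, W) features

if __name__ == "__main__":
    f = torch.randn(1, 3, 64, 64)
    c = torch.randn(1, 3, 8, 64, 64)
    print(TwoStreamFeatures()(f, c).shape)      # torch.Size([1, 64, 64, 64])
```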
arXiv Detail & Related papers (2020-11-18T20:11:12Z)
- Fast Video Salient Object Detection via Spatiotemporal Knowledge Distillation [20.196945571479002]
We present a lightweight network tailored for video salient object detection.
Specifically, we combine a saliency guidance embedding structure and spatial knowledge distillation to refine the spatial features.
In the temporal aspect, we propose a temporal knowledge distillation strategy, which allows the network to learn robust temporal features.
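As an illustration of the general distillation idea, the snippet below shows a simple loss in which a lightweight student saliency map is trained to mimic a frozen teacher's soft predictions; the function name and the plain MSE objective are assumptions for this sketch, and the paper's actual spatial and temporal distillation losses differ.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_map, teacher_map, temperature: float = 1.0):
    """Generic saliency-map distillation: the student mimics the teacher's
    soft predictions (illustrative only, not the paper's exact losses)."""
    s = torch.sigmoid(student_map / temperature)
    t = torch.sigmoid(teacher_map / temperature).detach()   # teacher is frozen
    return F.mse_loss(s, t)

if __name__ == "__main__":
    student = torch.randn(2, 1, 56, 56, requires_grad=True)  # student logits
    teacher = torch.randn(2, 1, 56, 56)                      # teacher logits
    print(distillation_loss(student, teacher).item())
```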
arXiv Detail & Related papers (2020-10-20T04:48:36Z)