Guidance and Teaching Network for Video Salient Object Detection
- URL: http://arxiv.org/abs/2105.10110v1
- Date: Fri, 21 May 2021 03:25:38 GMT
- Title: Guidance and Teaching Network for Video Salient Object Detection
- Authors: Ge-Peng Ji, Xiao Wang, Yu-Cheng Chou, Yuming Fang, Shouyuan Yang, Rong
Zhu, Ge Gao
- Abstract summary: We propose a simple yet efficient architecture, termed Guidance and Teaching Network (GTNet).
GTNet distils effective spatial and temporal cues with implicit guidance and explicit teaching at the feature and decision levels.
This novel learning strategy achieves satisfactory results by decoupling the complex spatial-temporal cues and mapping informative cues across different modalities.
- Score: 38.22880271210646
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Owing to the difficulties of mining spatial-temporal cues, the existing
approaches for video salient object detection (VSOD) are limited in
understanding complex and noisy scenarios, and often fail in inferring
prominent objects. To alleviate such shortcomings, we propose a simple yet
efficient architecture, termed Guidance and Teaching Network (GTNet), to
independently distil effective spatial and temporal cues with implicit guidance
and explicit teaching at the feature and decision levels, respectively. To be
specific, we (a) introduce a temporal modulator to implicitly bridge features
from motion into the appearance branch, which is capable of fusing cross-modal
features collaboratively, and (b) utilise a motion-guided mask to propagate the
explicit cues during feature aggregation. This novel learning strategy
achieves satisfactory results by decoupling the complex spatial-temporal cues
and mapping informative cues across different modalities. Extensive experiments
on three challenging benchmarks show that the proposed method can run at ~28
fps on a single TITAN Xp GPU and perform competitively against 14 cutting-edge
baselines.
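The two mechanisms named in the abstract can be pictured concretely. Below is a minimal PyTorch sketch, assuming a FiLM-style scale-and-shift for the temporal modulator and a soft residual gate for the motion-guided mask; the names TemporalModulator and motion_guided_aggregation are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the two ideas described in the abstract.
# All module names, shapes, and the FiLM-style formulation are
# assumptions for illustration, not the paper's exact design.
import torch
import torch.nn as nn

class TemporalModulator(nn.Module):
    """Implicit guidance: motion features modulate appearance features
    via a learned scale-and-shift (one plausible reading)."""
    def __init__(self, channels: int):
        super().__init__()
        self.scale = nn.Conv2d(channels, channels, kernel_size=1)
        self.shift = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # Gate appearance features by motion-derived weights, then shift.
        return appearance * torch.sigmoid(self.scale(motion)) + self.shift(motion)

def motion_guided_aggregation(feat: torch.Tensor, motion_pred: torch.Tensor) -> torch.Tensor:
    """Explicit teaching: a mask predicted by the motion branch
    gates spatial features during aggregation."""
    mask = torch.sigmoid(motion_pred)   # (B, 1, H, W) soft saliency mask
    return feat * (1.0 + mask)          # residual gating keeps appearance cues

# Toy usage
B, C, H, W = 2, 64, 32, 32
app, mot = torch.randn(B, C, H, W), torch.randn(B, C, H, W)
motion_pred = torch.randn(B, 1, H, W)
fused = TemporalModulator(C)(app, mot)
out = motion_guided_aggregation(fused, motion_pred)
print(out.shape)  # torch.Size([2, 64, 32, 32])
```

The residual form feat * (1 + mask) lets the appearance stream fall back on its own cues where motion is uninformative, which matches the abstract's emphasis on decoupling the two modalities.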
Related papers
- Motion Aware Self-Supervision for Generic Event Boundary Detection [14.637933739152315]
Generic Event Boundary Detection (GEBD) aims to detect moments in videos that are naturally perceived by humans as generic and taxonomy-free event boundaries.
Existing approaches involve complex and sophisticated pipelines in terms of architectural design choices.
We revisit a simple and effective self-supervised method and augment it with a differentiable motion feature learning module to tackle the spatial and temporal diversities in the GEBD task.
arXiv Detail & Related papers (2022-10-11T16:09:13Z)
- Spatiotemporal Multi-scale Bilateral Motion Network for Gait Recognition [3.1240043488226967]
Motivated by optical flow, this paper proposes bilateral motion-oriented features.
We develop a set of multi-scale temporal representations that force the motion context to be richly described at various levels of temporal resolution.
arXiv Detail & Related papers (2022-09-26T01:36:22Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation and performs favorably against state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- Confidence-guided Adaptive Gate and Dual Differential Enhancement for Video Salient Object Detection [47.68968739917077]
Video salient object detection (VSOD) aims to locate and segment the most attractive object by exploiting both spatial cues and temporal cues hidden in video sequences.
We propose a new framework to adaptively capture available information from spatial and temporal cues, which contains Confidence-guided Adaptive Gate (CAG) modules and Dual Differential Enhancement (DDE) modules.
arXiv Detail & Related papers (2021-05-14T08:49:37Z)
- Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning [6.523119805288132]
We present a novel technique for self-supervised video representation learning by: (a) decoupling the learning objective into two contrastive subtasks respectively emphasizing spatial and temporal features, and (b) performing it hierarchically to encourage multi-scale understanding.
arXiv Detail & Related papers (2020-11-23T08:05:39Z)
- BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues [95.8297116307127]
We propose Bi-directional Spatio-Temporal Learning (BiST), a vision-language neural framework for high-resolution queries in videos.
Specifically, our approach exploits both spatial and temporal-level information, and learns dynamic information diffusion between the two feature spaces.
BiST achieves competitive performance and generates reasonable responses on a large-scale AVSD benchmark.
arXiv Detail & Related papers (2020-10-20T07:43:00Z)
- Fast Video Salient Object Detection via Spatiotemporal Knowledge Distillation [20.196945571479002]
We present a lightweight network tailored for video salient object detection.
Specifically, we combine a saliency guidance embedding structure and spatial knowledge distillation to refine the spatial features.
In the temporal aspect, we propose a temporal knowledge distillation strategy, which allows the network to learn the robust temporal features.
arXiv Detail & Related papers (2020-10-20T04:48:36Z)
- TENet: Triple Excitation Network for Video Salient Object Detection [57.72696926903698]
We propose a simple yet effective approach, named Triple Excitation Network, to reinforce the training of video salient object detection (VSOD).
These excitation mechanisms are designed following the spirit of curriculum learning and aim to reduce learning ambiguities at the beginning of training.
Our semi-curriculum learning design enables the first online strategy for VSOD, which allows exciting and boosting saliency responses during testing without re-training.
arXiv Detail & Related papers (2020-07-20T08:45:41Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of such features and their spatial-temporal correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
- Depthwise Non-local Module for Fast Salient Object Detection Using a Single Thread [136.2224792151324]
We propose a new deep learning algorithm for fast salient object detection.
The proposed algorithm achieves competitive accuracy and high inference efficiency simultaneously with a single CPU thread.
arXiv Detail & Related papers (2020-01-22T15:23:48Z)
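For the last entry above, the module name suggests non-local self-attention computed independently per channel, which would avoid the cost of channel mixing. A minimal sketch under that assumption follows; the function name depthwise_nonlocal is hypothetical, and the paper's exact formulation may differ.

```python
# Hedged sketch: "depthwise" non-local attention, i.e. self-attention
# over spatial positions computed independently per channel. This is a
# plausible reading of the module's name, not the paper's exact design.
import torch

def depthwise_nonlocal(x: torch.Tensor) -> torch.Tensor:
    b, c, h, w = x.shape
    flat = x.view(b, c, h * w)                                    # (B, C, N)
    # Per-channel pairwise similarities between spatial positions.
    scores = torch.einsum('bci,bcj->bcij', flat, flat) / (h * w) ** 0.5
    attn = torch.softmax(scores, dim=-1)                          # (B, C, N, N)
    out = torch.einsum('bcij,bcj->bci', attn, flat)               # (B, C, N)
    return out.view(b, c, h, w) + x                               # residual connection

# Toy usage
x = torch.randn(1, 8, 8, 8)
print(depthwise_nonlocal(x).shape)  # torch.Size([1, 8, 8, 8])
```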
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.