TENet: Triple Excitation Network for Video Salient Object Detection
- URL: http://arxiv.org/abs/2007.09943v2
- Date: Sun, 30 Aug 2020 12:59:31 GMT
- Title: TENet: Triple Excitation Network for Video Salient Object Detection
- Authors: Sucheng Ren and Chu Han and Xin Yang and Guoqiang Han and Shengfeng He
- Abstract summary: We propose a simple yet effective approach, named Triple Excitation Network, to reinforce the training of video salient object detection (VSOD)
These excitation mechanisms are designed following the spirit of curriculum learning and aim to reduce learning ambiguities at the beginning of training.
Our semi-curriculum learning design enables the first online refinement strategy for VSOD, which allows exciting and boosting saliency responses during testing without re-training.
- Score: 57.72696926903698
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a simple yet effective approach, named Triple
Excitation Network, to reinforce the training of video salient object detection
(VSOD) from three aspects, spatial, temporal, and online excitations. These
excitation mechanisms are designed following the spirit of curriculum learning
and aim to reduce learning ambiguities at the beginning of training by
selectively exciting feature activations using ground truth. Then we gradually
reduce the weight of ground truth excitations by a curriculum rate and replace
it by a curriculum complementary map for better and faster convergence. In
particular, the spatial excitation strengthens feature activations for clear
object boundaries, while the temporal excitation imposes motions to emphasize
spatio-temporal salient regions. Spatial and temporal excitations can combat
the saliency shifting problem and conflict between spatial and temporal
features of VSOD. Furthermore, our semi-curriculum learning design enables the
first online refinement strategy for VSOD, which allows exciting and boosting
saliency responses during testing without re-training. The proposed triple
excitations can be easily plugged into different VSOD methods. Extensive
experiments show the effectiveness of all three excitation methods, and the
proposed method outperforms state-of-the-art image and video salient object
detection methods.
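The curriculum schedule described in the abstract (gradually reducing the weight of ground-truth excitations by a curriculum rate and replacing them with a complementary map) can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation; the function names, the linear decay schedule, and the gating form `feature * (1 + excitation)` are all assumptions made for clarity.

```python
def curriculum_weight(epoch, total_epochs):
    """Linearly decay the ground-truth excitation weight from 1 to 0.

    Illustrative schedule only; the paper's curriculum rate may differ.
    """
    return max(0.0, 1.0 - epoch / total_epochs)


def excite(feature, gt_map, pred_map, epoch, total_epochs):
    """Blend the ground-truth excitation map with a complementary
    (predicted) map according to the curriculum weight, then gate the
    feature activation with the blended map."""
    alpha = curriculum_weight(epoch, total_epochs)
    excitation = alpha * gt_map + (1.0 - alpha) * pred_map
    # Boost activations in (spatio-temporally) salient regions.
    return feature * (1.0 + excitation)
```

Early in training (`alpha` near 1) the ground truth dominates, reducing ambiguity; as training proceeds, the predicted complementary map takes over, which is what makes the test-time (online) excitation possible without re-training.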
Related papers
- D$^2$ST-Adapter: Disentangled-and-Deformable Spatio-Temporal Adapter for Few-shot Action Recognition [60.84084172829169]
Adapting large pre-trained image models to few-shot action recognition has proven to be an effective strategy for learning robust feature extractors.
We present the Disentangled-and-Deformable Spatio-Temporal Adapter (D$^2$ST-Adapter), which is a novel tuning framework well-suited for few-shot action recognition.
arXiv Detail & Related papers (2023-12-03T15:40:10Z) - Action Recognition with Multi-stream Motion Modeling and Mutual Information Maximization [44.73161606369333]
Action recognition is a fundamental and intriguing problem in artificial intelligence.
We introduce a novel Stream-GCN network equipped with multi-stream components and channel attention.
Our approach sets the new state-of-the-art performance on three benchmark datasets.
arXiv Detail & Related papers (2023-06-13T06:56:09Z) - Towards Active Learning for Action Spotting in Association Football Videos [59.84375958757395]
Analyzing football videos is challenging and requires identifying subtle and diverse spatio-temporal patterns.
Current algorithms face significant challenges when learning from limited annotated data.
We propose an active learning framework that selects the most informative video samples to be annotated next.
arXiv Detail & Related papers (2023-04-09T11:50:41Z) - Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature [26.7937345622207]
Weakly-supervised temporal action localization aims to simultaneously locate action regions and identify action categories in untrimmed videos.
Pseudo label generation is a promising strategy to solve the challenging problem, but the current methods ignore the natural temporal structure of the video.
We propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature.
arXiv Detail & Related papers (2023-03-22T06:08:34Z) - SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video
Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
arXiv Detail & Related papers (2022-07-16T19:25:41Z) - Activation to Saliency: Forming High-Quality Labels for Unsupervised
Salient Object Detection [54.92703325989853]
We propose a two-stage Activation-to-Saliency (A2S) framework that effectively generates high-quality saliency cues.
No human annotations are involved in our framework during the whole training process.
Our framework achieves significant performance gains compared with existing USOD methods.
arXiv Detail & Related papers (2021-12-07T11:54:06Z) - Guidance and Teaching Network for Video Salient Object Detection [38.22880271210646]
We propose a simple yet efficient architecture, termed Guidance and Teaching Network (GTNet).
GTNet distils effective spatial and temporal cues with implicit guidance and explicit teaching at feature- and decision-level.
This novel learning strategy achieves satisfactory results via decoupling the complex spatial-temporal cues and mapping informative cues across different modalities.
arXiv Detail & Related papers (2021-05-21T03:25:38Z) - A Bioinspired Approach-Sensitive Neural Network for Collision Detection in Cluttered and Dynamic Backgrounds [19.93930316898735]
Rapid, accurate, and robust detection of looming objects in moving backgrounds is a significant and challenging problem for robotic visual systems.
Inspired by the neural circuits for elementary vision in the mammalian retina, this paper proposes a bioinspired approach-sensitive neural network (AS)
The proposed model can not only detect collisions accurately and robustly in cluttered and dynamic backgrounds, but also extract additional collision information, such as position and direction, to guide rapid decision making.
arXiv Detail & Related papers (2021-03-01T09:16:18Z) - Hierarchically Decoupled Spatial-Temporal Contrast for Self-supervised Video Representation Learning [6.523119805288132]
We present a novel technique for self-supervised video representation learning by: (a) decoupling the learning objective into two contrastive subtasks respectively emphasizing spatial and temporal features, and (b) performing it hierarchically to encourage multi-scale understanding.
arXiv Detail & Related papers (2020-11-23T08:05:39Z) - A Novel Video Salient Object Detection Method via Semi-supervised Motion Quality Perception [52.40934043694379]
This paper proposes a universal learning scheme to get a further 3% performance improvement for all state-of-the-art (SOTA) methods.
We resort to "motion quality", a brand-new concept, to select a sub-group of video frames from the original testing set to construct a new training set.
The selected frames in this new training set should all contain high-quality motions, in which the salient objects have a large probability of being successfully detected by the "target SOTA method".
arXiv Detail & Related papers (2020-08-07T02:58:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.