NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition
- URL: http://arxiv.org/abs/2207.10388v1
- Date: Thu, 21 Jul 2022 09:41:22 GMT
- Title: NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition
- Authors: Boyang Xia, Wenhao Wu, Haoran Wang, Rui Su, Dongliang He, Haosen Yang,
Xiaoran Fan, Wanli Ouyang
- Abstract summary: A novel Non-saliency Suppression Network (NSNet) is proposed to suppress the responses of non-salient frames.
NSNet achieves a state-of-the-art accuracy-efficiency trade-off and presents a significantly faster (2.4~4.3x) practical inference speed than state-of-the-art methods.
- Score: 89.84188594758588
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: It is challenging for artificial intelligence systems to achieve accurate
video recognition under the scenario of low computation costs. Adaptive
inference based efficient video recognition methods typically preview videos
and focus on salient parts to reduce computation costs. Most existing works
focus on learning complex networks with video-classification-based objectives.
Because they treat all frames as positive samples, few pay attention to
discriminating between positive samples (salient frames) and negative samples
(non-salient frames) in their supervision. To fill this gap, in this paper, we
propose a novel Non-saliency Suppression Network (NSNet), which effectively
suppresses the responses of non-salient frames. Specifically, on the frame
level, effective pseudo labels that can distinguish between salient and
non-salient frames are generated to guide the frame saliency learning. On the
video level, a temporal attention module is learned under dual video-level
supervisions on both the salient and the non-salient representations. Saliency
measurements from both levels are combined to exploit multi-granularity
complementary information. Extensive experiments conducted on four well-known
benchmarks verify that our NSNet not only achieves a state-of-the-art
accuracy-efficiency trade-off but also presents a significantly faster
(2.4~4.3x) practical inference speed than state-of-the-art methods. Our
project page is at https://lawrencexia2008.github.io/projects/nsnet .
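As a rough illustration of the idea described in the abstract, the sketch below fuses a frame-level saliency score with a video-level temporal-attention weight and keeps the top-k most salient frames for the recognition backbone. The product fusion, the `select_salient_frames` helper, and the tensor shapes are assumptions made for illustration only, not NSNet's actual implementation; see the paper and project page for the real design.

```python
# Minimal sketch (not the authors' code): combining frame-level and video-level
# saliency measurements to pick salient frames for the recognition network.
# The simple product fusion and top-k selection are illustrative assumptions.
import torch


def select_salient_frames(frame_saliency: torch.Tensor,
                          video_attention: torch.Tensor,
                          k: int) -> torch.Tensor:
    """frame_saliency: (B, T) scores from a frame-level saliency branch.
    video_attention: (B, T) temporal-attention weights from a video-level branch.
    Returns indices (B, k) of the k frames with the highest combined saliency."""
    combined = frame_saliency * video_attention        # multi-granularity fusion (assumed)
    topk = torch.topk(combined, k, dim=1).indices      # keep the k most salient frames
    return torch.sort(topk, dim=1).values              # restore temporal order


if __name__ == "__main__":
    B, T, k = 2, 16, 4
    frame_scores = torch.rand(B, T)                        # e.g. scores trained on pseudo labels
    attn_weights = torch.softmax(torch.rand(B, T), dim=1)  # video-level temporal attention
    idx = select_salient_frames(frame_scores, attn_weights, k)
    print(idx.shape)  # torch.Size([2, 4])
```

In this sketch the selected frames would then be passed to a heavier recognition network, while the remaining (non-salient) frames are skipped, which is where the computation savings come from.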
Related papers
- SSVOD: Semi-Supervised Video Object Detection with Sparse Annotations [12.139451002212063]
SSVOD exploits motion dynamics of videos to utilize large-scale unlabeled frames with sparse annotations.
Our method achieves significant performance improvements over existing methods on ImageNet-VID, Epic-KITCHENS, and YouTube-VIS.
arXiv Detail & Related papers (2023-09-04T06:41:33Z)
- Look More but Care Less in Video Recognition [57.96505328398205]
Action recognition methods typically sample a few frames to represent each video to avoid the enormous computation.
We propose Ample and Focal Network (AFNet), which is composed of two branches to utilize more frames but with less computation.
arXiv Detail & Related papers (2022-11-18T02:39:56Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method for retrieving key frames that combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory network (LSTM) is added on top of the CNN to further improve classification performance.
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Frame-rate Up-conversion Detection Based on Convolutional Neural Network for Learning Spatiotemporal Features [7.895528973776606]
This paper proposes a frame-rate conversion detection network (FCDNet) that learns forensic features caused by FRUC in an end-to-end fashion.
FCDNet uses a stack of consecutive frames as input and effectively learns FRUC artifacts through its feature-learning network blocks.
arXiv Detail & Related papers (2021-03-25T08:47:46Z)
- Semi-Supervised Action Recognition with Temporal Contrastive Learning [50.08957096801457]
We learn a two-pathway temporal contrastive model using unlabeled videos at two different speeds.
We considerably outperform video extensions of sophisticated state-of-the-art semi-supervised image recognition methods.
arXiv Detail & Related papers (2021-02-04T17:28:35Z)
- A Self-Reasoning Framework for Anomaly Detection Using Video-Level Labels [17.615297975503648]
Anomalous event detection in surveillance videos is a challenging and practical research problem in the image and video processing community.
We propose a weakly supervised anomaly detection framework based on deep neural networks which is trained in a self-reasoning fashion using only video-level labels.
The proposed framework has been evaluated on publicly available real-world anomaly detection datasets including UCF-crime, ShanghaiTech and Ped2.
arXiv Detail & Related papers (2020-08-27T02:14:15Z)
- Temporal Distinct Representation Learning for Action Recognition [139.93983070642412]
A Two-Dimensional Convolutional Neural Network (2D CNN) is used to characterize videos.
Different frames of a video share the same 2D CNN kernels, which may result in repeated and redundant information utilization.
We propose a sequential channel filtering mechanism to excite the discriminative channels of features from different frames step by step, and thus avoid repeated information extraction.
Our method is evaluated on benchmark temporal reasoning datasets Something-Something V1 and V2, and it achieves visible improvements over the best competitor by 2.4% and 1.3%, respectively.
arXiv Detail & Related papers (2020-07-15T11:30:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.