Dynamic Sampling Networks for Efficient Action Recognition in Videos
- URL: http://arxiv.org/abs/2006.15560v1
- Date: Sun, 28 Jun 2020 09:48:29 GMT
- Title: Dynamic Sampling Networks for Efficient Action Recognition in Videos
- Authors: Yin-Dong Zheng, Zhaoyang Liu, Tong Lu, Limin Wang
- Abstract summary: We propose a new framework for action recognition in videos, called Dynamic Sampling Networks (DSN).
DSN is composed of a sampling module and a classification module: the former learns a sampling policy that selects on the fly which clips to keep, and the latter trains a clip-level classifier that performs action recognition on the selected clips.
We study different aspects of the DSN framework on four action recognition datasets: UCF101, HMDB51, THUMOS14, and ActivityNet v1.3.
- Score: 43.51012099839094
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The existing action recognition methods are mainly based on clip-level
classifiers such as two-stream CNNs or 3D CNNs, which are trained on randomly
selected clips and applied to densely sampled clips during testing.
However, this standard setting might be suboptimal for training classifiers and
also requires huge computational overhead when deployed in practice. To address
these issues, we propose a new framework for action recognition in videos,
called Dynamic Sampling Networks (DSN), by designing a dynamic sampling
module to improve the discriminative power of learned clip-level classifiers
as well as to increase inference efficiency during testing. Specifically, DSN
is composed of a sampling module and a classification module: the former learns
a sampling policy that selects on the fly which clips to keep, and the latter
trains a clip-level classifier that performs action recognition on the
selected clips. In particular, given an input video, we train an
observation network in an associative reinforcement learning setting to
maximize the rewards of the selected clips with a correct prediction. We
perform extensive experiments to study different aspects of the DSN framework
on four action recognition datasets: UCF101, HMDB51, THUMOS14, and ActivityNet
v1.3. The experimental results demonstrate that DSN greatly improves inference
efficiency by using fewer than half of the clips while still achieving
recognition accuracy slightly better than or comparable to state-of-the-art
approaches.
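The interaction between the two modules can be sketched compactly. The following PyTorch snippet is a minimal illustration of the idea, not the authors' implementation: the feature dimension, the observation-network architecture, the number of clips kept, and the use of a plain REINFORCE update with a binary reward for a correct prediction are simplifying assumptions made for this sketch.

```python
# Minimal sketch of the DSN idea (illustrative, not the authors' code):
# an observation network scores candidate clips, a categorical policy picks
# a few of them, and a clip-level classifier is trained on the kept clips.
# The policy gets a REINFORCE-style reward of 1 when the prediction is correct.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicSamplingNet(nn.Module):
    def __init__(self, feat_dim=256, num_classes=101, clips_to_keep=2):
        super().__init__()
        # Lightweight observation network: one scalar score per clip.
        self.observation = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, 1))
        # Clip-level classifier applied only to the selected clips.
        self.classifier = nn.Linear(feat_dim, num_classes)
        self.clips_to_keep = clips_to_keep

    def forward(self, clip_feats):
        # clip_feats: (num_clips, feat_dim) pre-extracted features of one video.
        scores = self.observation(clip_feats).squeeze(-1)      # (num_clips,)
        policy = torch.distributions.Categorical(logits=scores)
        idx = policy.sample((self.clips_to_keep,))              # sampled clip indices
        log_prob = policy.log_prob(idx).sum()
        logits = self.classifier(clip_feats[idx]).mean(dim=0)   # average over kept clips
        return logits, log_prob

# One illustrative training step on dummy data.
model = DynamicSamplingNet()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
feats, label = torch.randn(8, 256), torch.tensor(3)             # a "video" with 8 clips
logits, log_prob = model(feats)
reward = (logits.argmax() == label).float()                     # 1 if the prediction is correct
loss = F.cross_entropy(logits.unsqueeze(0), label.unsqueeze(0)) - reward.detach() * log_prob
opt.zero_grad()
loss.backward()
opt.step()
```

In the paper, the observation network is trained in an associative reinforcement learning setting and the classifier is a full clip-level CNN; the sketch above only conveys how the sampling policy and the clip-level classifier are coupled during training.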
Related papers
- Unsupervised Modality-Transferable Video Highlight Detection with Representation Activation Sequence Learning [7.908887001497406]
We propose a novel model with cross-modal perception for unsupervised highlight detection.
The proposed model learns representations with visual-audio level semantics from image-audio pair data via a self-reconstruction task.
The experimental results show that the proposed framework achieves superior performance compared to other state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-14T13:52:03Z) - Adversarial Augmentation Training Makes Action Recognition Models More
Robust to Realistic Video Distribution Shifts [13.752169303624147]
Action recognition models often lack robustness when faced with natural distribution shifts between training and test data.
We propose two novel evaluation methods to assess model resilience to such distribution disparity.
We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models.
arXiv Detail & Related papers (2024-01-21T05:50:39Z) - Semi-supervised Active Learning for Video Action Detection [8.110693267550346]
We develop a novel semi-supervised active learning approach which utilizes both labeled as well as unlabeled data.
We evaluate the proposed approach on three different benchmark datasets: UCF-101-24, JHMDB-21, and Youtube-VOS.
arXiv Detail & Related papers (2023-12-12T11:13:17Z) - Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal
Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training.
We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages.
arXiv Detail & Related papers (2023-05-29T02:48:04Z) - Temporal Contrastive Learning with Curriculum [19.442685015494316]
ConCur is a contrastive video representation learning method that uses curriculum learning to impose a dynamic sampling strategy.
We conduct experiments on two popular action recognition datasets, UCF101 and HMDB51, on which our proposed method achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-09-02T00:12:05Z) - NSNet: Non-saliency Suppression Sampler for Efficient Video Recognition [89.84188594758588]
A novel Non-saliency Suppression Network (NSNet) is proposed to suppress the responses of non-salient frames.
NSNet achieves the state-of-the-art accuracy-efficiency trade-off and delivers a significantly faster (2.4x to 4.3x) practical inference speed than state-of-the-art methods.
arXiv Detail & Related papers (2022-07-21T09:41:22Z) - Active Learning for Deep Visual Tracking [51.5063680734122]
Convolutional neural networks (CNNs) have been successfully applied to the single target tracking task in recent years.
In this paper, we propose an active learning method for deep visual tracking, which selects and annotates the unlabeled samples to train the deep CNNs model.
Under the guidance of active learning, the tracker based on the trained deep CNNs model can achieve competitive tracking performance while reducing the labeling cost.
arXiv Detail & Related papers (2021-10-17T11:47:56Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions only among a few selected foreground objects by a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - Unsupervised Learning of Video Representations via Dense Trajectory
Clustering [86.45054867170795]
This paper addresses the task of unsupervised learning of representations for action recognition in videos.
We first propose to adapt two top-performing objectives in this class: instance recognition and local aggregation.
We observe promising performance, but qualitative analysis shows that the learned representations fail to capture motion patterns.
arXiv Detail & Related papers (2020-06-28T22:23:03Z)