Action Keypoint Network for Efficient Video Recognition
- URL: http://arxiv.org/abs/2201.06304v1
- Date: Mon, 17 Jan 2022 09:35:34 GMT
- Title: Action Keypoint Network for Efficient Video Recognition
- Authors: Xu Chen, Yahong Han, Xiaohan Wang, Yifan Sun, Yi Yang
- Abstract summary: This paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net).
AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of action keypoints and then transforms the video recognition into point cloud classification.
Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
- Score: 63.48422805355741
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Reducing redundancy is crucial for improving the efficiency of video
recognition models. An effective approach is to select informative content from
the holistic video, yielding a popular family of dynamic video recognition
methods. However, existing dynamic methods focus on either temporal or spatial
selection independently, neglecting the reality that the redundancies are
usually both spatial and temporal. Moreover, their selected content
is usually cropped with fixed shapes, while the realistic distribution of
informative content can be much more diverse. With these two insights, this
paper proposes to integrate temporal and spatial selection into an Action
Keypoint Network (AK-Net). From different frames and positions, AK-Net selects
some informative points scattered in arbitrary-shaped regions as a set of
action keypoints and then transforms the video recognition into point cloud
classification. AK-Net has two steps, i.e., the keypoint selection and the
point cloud classification. First, it inputs the video into a baseline network
and outputs a feature map from an intermediate layer. We view each pixel on
this feature map as a spatial-temporal point and select some informative
keypoints using self-attention. Second, AK-Net devises a ranking criterion to
arrange the keypoints into an ordered 1D sequence. Consequently, AK-Net
brings two-fold benefits for efficiency: The keypoint selection step collects
informative content within arbitrary shapes and increases the efficiency for
modeling spatial-temporal dependencies, while the point cloud classification
step further reduces the computational cost by compacting the convolutional
kernels. Experimental results show that AK-Net can consistently improve the
efficiency and performance of baseline methods on several video recognition
benchmarks.
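The two-step pipeline above can be pictured with a minimal PyTorch-style sketch. The scoring head, the top-k budget, and the compact 1D classifier below are illustrative placeholders, not the authors' exact design: the paper selects keypoints with self-attention and orders them with a dedicated ranking criterion, both of which are approximated here by a simple learned score.

```python
import torch
import torch.nn as nn

class KeypointSelector(nn.Module):
    """Scores every spatio-temporal position of a feature map and keeps the top-k.

    Sketch only: a 1x1x1 conv stands in for the paper's self-attention selection,
    and the descending score order stands in for its ranking criterion.
    """
    def __init__(self, channels: int, num_keypoints: int = 256):
        super().__init__()
        self.score = nn.Conv3d(channels, 1, kernel_size=1)  # one score per point
        self.num_keypoints = num_keypoints

    def forward(self, feat):                        # feat: (B, C, T, H, W)
        b, c, t, h, w = feat.shape
        scores = self.score(feat).flatten(1)        # (B, T*H*W)
        points = feat.flatten(2).transpose(1, 2)    # (B, T*H*W, C) spatio-temporal points
        idx = scores.topk(self.num_keypoints, dim=1).indices         # ranked keypoint ids
        keypoints = points.gather(1, idx.unsqueeze(-1).expand(-1, -1, c))
        return keypoints                            # (B, k, C) ordered 1D "point cloud"

class PointSequenceClassifier(nn.Module):
    """Classifies the ordered keypoint sequence with compact 1D convolutions."""
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(channels, num_classes)

    def forward(self, keypoints):                   # (B, k, C)
        x = self.conv(keypoints.transpose(1, 2)).squeeze(-1)
        return self.fc(x)

# Usage with a dummy intermediate feature map from a video backbone.
feat = torch.randn(2, 64, 8, 14, 14)                # (batch, channels, T, H, W)
logits = PointSequenceClassifier(64, num_classes=174)(KeypointSelector(64)(feat))
```

Selecting k points out of the T x H x W grid is what lets the later layers run over a short 1D sequence instead of the dense spatio-temporal volume, which is where the efficiency gain claimed in the abstract comes from.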
Related papers
- Improving Video Violence Recognition with Human Interaction Learning on 3D Skeleton Point Clouds [88.87985219999764]
We develop a method for video violence recognition from a new perspective of skeleton points.
We first formulate 3D skeleton point clouds from human sequences extracted from videos.
We then perform interaction learning on these 3D skeleton point clouds.
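As a rough illustration of the first step only (the joint count and the extra time coordinate below are assumptions; the summary does not specify them), per-frame skeletons can be flattened into one point cloud while keeping track of when each joint was observed:

```python
import numpy as np

def skeletons_to_point_cloud(skeletons: np.ndarray) -> np.ndarray:
    """Flatten per-frame 3D skeletons into a single point cloud.

    skeletons: (T, J, 3) array of J joints with xyz coordinates over T frames.
    Returns a (T*J, 4) array whose 4th column is the normalised frame index,
    so temporal order is preserved inside the cloud.
    """
    t, j, _ = skeletons.shape
    time = np.repeat(np.arange(t) / max(t - 1, 1), j)[:, None]   # (T*J, 1)
    return np.concatenate([skeletons.reshape(t * j, 3), time], axis=1)

# Example: a 30-frame sequence of 17-joint skeletons becomes a 510-point cloud.
cloud = skeletons_to_point_cloud(np.random.randn(30, 17, 3))
```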
arXiv Detail & Related papers (2023-08-26T12:55:18Z)
- Deep Unsupervised Key Frame Extraction for Efficient Video Classification [63.25852915237032]
This work presents an unsupervised method to retrieve the key frames, which combines a Convolutional Neural Network (CNN) with Temporal Segment Density Peaks Clustering (TSDPC).
The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can determine the number of key frames automatically.
Furthermore, a Long Short-Term Memory (LSTM) network is added on top of the CNN to further improve classification performance.
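A hedged illustration of the density-peaks idea behind key-frame selection: frames that are both locally dense and far from any denser frame act as cluster centres, i.e. key frames. The Euclidean distance, Gaussian kernel, and fixed top_k below are assumptions; the paper's TSDPC determines the number of key frames automatically.

```python
import numpy as np

def density_peaks_keyframes(frame_feats: np.ndarray, cutoff: float = 1.0, top_k: int = 5):
    """Pick key frames via density-peaks clustering over per-frame CNN features.

    frame_feats: (T, D) array of frame descriptors (e.g. CNN features).
    Returns the indices of the selected key frames.
    """
    dist = np.linalg.norm(frame_feats[:, None] - frame_feats[None, :], axis=-1)  # (T, T)
    rho = np.exp(-((dist / cutoff) ** 2)).sum(axis=1)        # local density per frame
    delta = np.empty_like(rho)
    for i in range(len(rho)):
        to_denser = dist[i][rho > rho[i]]                    # distances to denser frames
        delta[i] = to_denser.min() if to_denser.size else dist[i].max()
    return np.argsort(rho * delta)[::-1][:top_k]             # highest density-peak scores

# Example: 32 frames with 512-D descriptors, keeping 5 key frames.
keyframes = density_peaks_keyframes(np.random.randn(32, 512), top_k=5)
```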
arXiv Detail & Related papers (2022-11-12T20:45:35Z)
- AdaFocusV3: On Unified Spatial-temporal Dynamic Video Recognition [44.10959567844497]
This paper explores a unified formulation of spatial-temporal dynamic computation on top of the recently proposed AdaFocusV2 algorithm.
AdaFocusV3 can be effectively trained by approximating the non-differentiable cropping operation with the computation of deep features.
arXiv Detail & Related papers (2022-09-27T15:30:52Z)
- Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how to better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
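A minimal sketch of the general direction, assuming temporal-only 3D convolutions at several temporal extents whose outputs are averaged; the kernel design proposed in the paper differs.

```python
import torch
import torch.nn as nn

class MultiScaleTemporalConv(nn.Module):
    """Replaces a single fixed-size temporal kernel with several temporal extents."""
    def __init__(self, channels: int, temporal_sizes=(1, 3, 5)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv3d(channels, channels, kernel_size=(k, 1, 1), padding=(k // 2, 0, 0))
            for k in temporal_sizes
        ])

    def forward(self, x):                       # x: (B, C, T, H, W)
        return torch.stack([branch(x) for branch in self.branches]).mean(dim=0)

out = MultiScaleTemporalConv(32)(torch.randn(2, 32, 16, 7, 7))   # shape preserved
```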
arXiv Detail & Related papers (2021-10-05T15:39:11Z)
- Adaptive Focus for Efficient Video Recognition [29.615394426035074]
We propose a reinforcement-learning-based approach for efficient spatially adaptive video recognition (AdaFocus).
A lightweight ConvNet is first adopted to quickly process the full video sequence, and its features are used by a recurrent policy network to localize the most task-relevant regions.
During offline inference, once the informative patch sequence has been generated, the bulk of computation can be done in parallel, and is efficient on modern GPU devices.
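The pipeline in this summary can be sketched roughly as below, with an assumed tiny global ConvNet, an assumed GRU policy head, and a fixed patch size; the reinforcement-learning training of the policy described in the paper is omitted.

```python
import torch
import torch.nn as nn

class AdaptivePatchSelector(nn.Module):
    """Cheap global glance network + recurrent policy that picks one patch per frame;
    the cropped patches would then go to a heavier local recognition network."""
    def __init__(self, patch: int = 96):
        super().__init__()
        self.glance = nn.Sequential(nn.Conv2d(3, 16, 3, stride=2, padding=1),
                                    nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.policy = nn.GRUCell(16, 32)          # recurrent policy over frames
        self.loc_head = nn.Linear(32, 2)          # normalised patch location in [0, 1]^2
        self.patch = patch

    def forward(self, video):                     # video: (B, T, 3, H, W)
        b, t, _, h, w = video.shape
        state = video.new_zeros(b, 32)
        patches = []
        for i in range(t):
            frame = video[:, i]                                  # (B, 3, H, W)
            state = self.policy(self.glance(frame).flatten(1), state)
            centre = torch.sigmoid(self.loc_head(state))         # where to look
            crops = []
            for j in range(b):
                x0 = int(centre[j, 0] * (w - self.patch))
                y0 = int(centre[j, 1] * (h - self.patch))
                crops.append(frame[j, :, y0:y0 + self.patch, x0:x0 + self.patch])
            patches.append(torch.stack(crops))
        return torch.stack(patches, dim=1)        # (B, T, 3, patch, patch)

patches = AdaptivePatchSelector()(torch.randn(2, 8, 3, 224, 224))
```

Once all patch locations are known, the heavier per-patch computation can be batched, which is what makes the offline inference mentioned above parallel-friendly.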
arXiv Detail & Related papers (2021-05-07T13:24:47Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representations by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
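A minimal, single-scale sketch of propagating keypoint features over a body-topology graph; the toy adjacency and feature sizes are assumptions, and CTL's context-reinforced, multi-scale graph construction is not reproduced here.

```python
import torch
import torch.nn as nn

class KeypointGraphLayer(nn.Module):
    """One graph-convolution step that propagates keypoint features along body connections."""
    def __init__(self, in_dim: int, out_dim: int, adjacency: torch.Tensor):
        super().__init__()
        a = adjacency + torch.eye(adjacency.size(0))      # add self-loops
        self.register_buffer("a_norm", a / a.sum(dim=1, keepdim=True))  # row-normalise
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                                 # x: (B, K, in_dim)
        return torch.relu(self.a_norm @ self.proj(x))     # (B, K, out_dim)

# Toy 5-keypoint "body": head-torso-hips plus two limbs (hypothetical topology).
adj = torch.zeros(5, 5)
for i, j in [(0, 1), (1, 2), (1, 3), (2, 4)]:
    adj[i, j] = adj[j, i] = 1.0
out = KeypointGraphLayer(256, 128, adj)(torch.randn(8, 5, 256))   # (8, 5, 128)
```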
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
- Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture that benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform the state of the art for action detection on public datasets with a significantly reduced compute and memory footprint.
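Roughly, one stream sees a temporally subsampled copy of the clip while the other keeps the full frame rate, and the two are fused; the strides, widths, and fusion below are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class CoarseFineStub(nn.Module):
    """Toy two-stream model: a coarse stream on subsampled frames, a fine stream on all frames."""
    def __init__(self, width: int = 32, coarse_stride: int = 4, num_classes: int = 10):
        super().__init__()
        self.coarse_stride = coarse_stride
        self.coarse = nn.Conv3d(3, width, kernel_size=3, padding=1)
        self.fine = nn.Conv3d(3, width, kernel_size=3, padding=1)
        self.fc = nn.Linear(2 * width, num_classes)

    def forward(self, x):                                    # x: (B, 3, T, H, W)
        coarse = self.coarse(x[:, :, ::self.coarse_stride])  # low temporal resolution
        fine = self.fine(x)                                   # full temporal resolution
        pooled = torch.cat([coarse.mean(dim=(2, 3, 4)), fine.mean(dim=(2, 3, 4))], dim=1)
        return self.fc(pooled)

logits = CoarseFineStub()(torch.randn(2, 3, 16, 56, 56))
```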
arXiv Detail & Related papers (2021-03-01T20:48:01Z)
- FPS-Net: A Convolutional Fusion Network for Large-Scale LiDAR Point Cloud Segmentation [30.736361776703568]
Scene understanding based on LiDAR point clouds is an essential task for autonomous cars to drive safely.
Most existing methods simply stack different point attributes/modalities as image channels to increase information capacity.
We design FPS-Net, a convolutional fusion network that exploits the uniqueness and discrepancy among the projected image channels for optimal point cloud segmentation.
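A rough sketch of the contrast drawn here, with hypothetical modality names (depth, intensity, height): instead of stacking the projected modalities into one input tensor, each gets its own shallow encoder before fusion.

```python
import torch
import torch.nn as nn

class PerModalityFusion(nn.Module):
    """Encodes each projected LiDAR modality separately, then fuses the feature maps."""
    def __init__(self, modalities=("depth", "intensity", "height"), width: int = 16):
        super().__init__()
        self.encoders = nn.ModuleDict({
            name: nn.Sequential(nn.Conv2d(1, width, 3, padding=1), nn.ReLU())
            for name in modalities
        })
        self.fuse = nn.Conv2d(width * len(modalities), width, kernel_size=1)

    def forward(self, images: dict):                  # {modality: (B, 1, H, W)}
        feats = [enc(images[name]) for name, enc in self.encoders.items()]
        return self.fuse(torch.cat(feats, dim=1))

# Dummy 64x2048 range-image projections of one LiDAR sweep.
proj = {m: torch.randn(1, 1, 64, 2048) for m in ("depth", "intensity", "height")}
fused = PerModalityFusion()(proj)
```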
arXiv Detail & Related papers (2021-03-01T04:08:28Z)
- NUTA: Non-uniform Temporal Aggregation for Action Recognition [29.75987323741384]
We propose a method called non-uniform temporal aggregation (NUTA), which aggregates features only from informative temporal segments.
Our model has achieved state-of-the-art performance on four widely used large-scale action-recognition datasets.
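A minimal sketch of weighting temporal segments before pooling, with an assumed linear scoring head; NUTA's actual aggregation module is more involved.

```python
import torch
import torch.nn as nn

class WeightedTemporalAggregation(nn.Module):
    """Pools per-segment features with learned importance weights instead of a uniform average."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, segment_feats):                # (B, S, D): one feature per temporal segment
        weights = torch.softmax(self.score(segment_feats), dim=1)   # (B, S, 1)
        return (weights * segment_feats).sum(dim=1)                 # (B, D) clip-level feature

clip_feat = WeightedTemporalAggregation(512)(torch.randn(4, 8, 512))
```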
arXiv Detail & Related papers (2020-12-15T02:03:37Z)
- Temporal Distinct Representation Learning for Action Recognition [139.93983070642412]
Two-Dimensional Convolutional Neural Networks (2D CNNs) are used to characterize videos.
Different frames of a video share the same 2D CNN kernels, which may result in repeated and redundant information utilization.
We propose a sequential channel filtering mechanism to excite the discriminative channels of features from different frames step by step, and thus avoid repeated information extraction.
Our method is evaluated on benchmark temporal reasoning datasets Something-Something V1 and V2, and it achieves visible improvements over the best competitor by 2.4% and 1.3%, respectively.
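The sequential channel filtering idea can be sketched as a squeeze-and-excitation-style gate applied frame by frame, conditioned on which channels earlier frames already excited; this is a hedged stand-in, not the paper's exact mechanism.

```python
import torch
import torch.nn as nn

class SequentialChannelGate(nn.Module):
    """Gates feature channels frame by frame, conditioned on channels excited in earlier frames."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * channels, channels), nn.Sigmoid())

    def forward(self, feats):                        # feats: (B, T, C, H, W)
        b, t, c, _, _ = feats.shape
        used = feats.new_zeros(b, c)                 # running memory of excited channels
        out = []
        for i in range(t):
            pooled = feats[:, i].mean(dim=(2, 3))    # (B, C) squeeze step
            g = self.gate(torch.cat([pooled, used], dim=1))   # (B, C) per-channel gate
            out.append(feats[:, i] * g[:, :, None, None])
            used = torch.maximum(used, g)            # remember which channels were used
        return torch.stack(out, dim=1)               # (B, T, C, H, W)

refined = SequentialChannelGate(64)(torch.randn(2, 8, 64, 14, 14))
```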
arXiv Detail & Related papers (2020-07-15T11:30:40Z)