Improving Video Violence Recognition with Human Interaction Learning on
3D Skeleton Point Clouds
- URL: http://arxiv.org/abs/2308.13866v1
- Date: Sat, 26 Aug 2023 12:55:18 GMT
- Title: Improving Video Violence Recognition with Human Interaction Learning on
3D Skeleton Point Clouds
- Authors: Yukun Su, Guosheng Lin, Qingyao Wu
- Abstract summary: We develop a method for video violence recognition from a new perspective of skeleton points.
We first formulate 3D skeleton point clouds from human skeleton sequences extracted from videos.
We then perform interaction learning on these 3D skeleton point clouds.
- Score: 88.87985219999764
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning has proved to be very effective in video action recognition.
Video violence recognition attempts to learn human multi-dynamic behaviours
in more complex scenarios. In this work, we develop a method for video violence
recognition from a new perspective of skeleton points. Unlike the previous
works, we first formulate 3D skeleton point clouds from human skeleton
sequences extracted from videos and then perform interaction learning on these
3D skeleton point clouds. Specifically, we propose two types of Skeleton Points
Interaction Learning (SPIL) strategies: (i) Local-SPIL: by constructing a
specific weight distribution strategy between local regional points, Local-SPIL
selectively focuses on the most relevant points based on their features and
spatial-temporal position information. In order to capture diverse
types of relation information, a multi-head mechanism is designed to aggregate
different features from independent heads to jointly handle different types of
relationships between points. (ii) Global-SPIL: to better learn and refine the
features of the unordered and unstructured skeleton points, Global-SPIL employs
the self-attention layer that operates directly on the sampled points, which
can help to make the output more permutation-invariant and well-suited for our
task. Extensive experimental results validate the effectiveness of our approach
and show that our model outperforms the existing networks and achieves new
state-of-the-art performance on video violence datasets.
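To make the Global-SPIL idea more concrete, the sketch below shows multi-head self-attention applied directly to an unordered set of sampled 3D skeleton points, each carrying (x, y, frame-index) coordinates plus a feature vector. This is a minimal illustration under assumptions: the SkeletonPointAttention module name, the tensor shapes, the choice of PyTorch, and the concatenation of coordinates into the attention input are illustrative and not the paper's exact SPIL layers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkeletonPointAttention(nn.Module):
    """Illustrative multi-head self-attention over an unordered set of
    skeleton points (hypothetical module, not the paper's exact SPIL layer)."""

    def __init__(self, feat_dim: int = 64, num_heads: int = 4):
        super().__init__()
        assert feat_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = feat_dim // num_heads
        # Point features are concatenated with (x, y, t) coordinates so the
        # attention weights can depend on spatial-temporal position as well.
        self.qkv = nn.Linear(feat_dim + 3, 3 * feat_dim)
        self.proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, coords: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        # coords: (B, N, 3) sampled point coordinates (x, y, frame index)
        # feats:  (B, N, C) per-point features
        B, N, C = feats.shape
        qkv = self.qkv(torch.cat([feats, coords], dim=-1))       # (B, N, 3C)
        q, k, v = qkv.chunk(3, dim=-1)

        # Split channels into independent heads so each head can model a
        # different type of relationship between points.
        def split(t):
            return t.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = split(q), split(k), split(v)                    # (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5   # (B, H, N, N)
        attn = F.softmax(attn, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)         # (B, N, C)
        return self.proj(out)

if __name__ == "__main__":
    layer = SkeletonPointAttention(feat_dim=64, num_heads=4)
    coords = torch.rand(2, 512, 3)    # 512 sampled skeleton points per clip
    feats = torch.rand(2, 512, 64)
    refined = layer(coords, feats)    # (2, 512, 64) refined point features
    clip_descriptor = refined.max(dim=1).values  # symmetric pooling over points
    print(clip_descriptor.shape)      # torch.Size([2, 64])
```

Because the attention operates on the point set itself rather than on a fixed joint ordering, its per-point outputs are permutation-equivariant; a symmetric pooling such as the max over points then yields a permutation-invariant clip descriptor, which matches the motivation the abstract gives for Global-SPIL.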
Related papers
- CLR-GAM: Contrastive Point Cloud Learning with Guided Augmentation and
Feature Mapping [12.679625717350113]
We present CLR-GAM, a contrastive learning-based framework with Guided Augmentation (GA) for an efficient dynamic exploration strategy.
We empirically demonstrate that the proposed approach achieves state-of-the-art performance on both simulated and real-world 3D point cloud datasets.
arXiv Detail & Related papers (2023-02-28T04:38:52Z) - Action Keypoint Network for Efficient Video Recognition [63.48422805355741]
This paper proposes to integrate temporal and spatial selection into an Action Keypoint Network (AK-Net).
AK-Net selects some informative points scattered in arbitrary-shaped regions as a set of action keypoints and then transforms the video recognition into point cloud classification.
Experimental results show that AK-Net can consistently improve the efficiency and performance of baseline methods on several video recognition benchmarks.
arXiv Detail & Related papers (2022-01-17T09:35:34Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that the current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - Learning Multi-Granular Spatio-Temporal Graph Network for Skeleton-based
Action Recognition [49.163326827954656]
We propose a novel multi-granular spatio-temporal graph network for skeleton-based action classification.
We develop a dual-head graph network consisting of two interleaved branches, which enables us to extract features at two spatio-temporal resolutions.
We conduct extensive experiments on three large-scale datasets.
arXiv Detail & Related papers (2021-08-10T09:25:07Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Group-Skeleton-Based Human Action Recognition in Complex Events [15.649778891665468]
We propose a novel group-skeleton-based human action recognition method in complex events.
This method first utilizes multi-scale spatial-temporal graph convolutional networks (MS-G3Ds) to extract skeleton features from multiple persons.
Results on the HiEve dataset show that our method can give superior performance compared to other state-of-the-art methods.
arXiv Detail & Related papers (2020-11-26T13:19:14Z) - Improving Point Cloud Semantic Segmentation by Learning 3D Object
Detection [102.62963605429508]
Point cloud semantic segmentation plays an essential role in autonomous driving.
Current 3D semantic segmentation networks focus on convolutional architectures that perform well for well-represented classes.
We propose a novel Detection Aware 3D Semantic Segmentation (DASS) framework that explicitly leverages localization features from an auxiliary 3D object detection task.
arXiv Detail & Related papers (2020-09-22T14:17:40Z)