Efficient Modelling Across Time of Human Actions and Interactions
- URL: http://arxiv.org/abs/2110.02120v1
- Date: Tue, 5 Oct 2021 15:39:11 GMT
- Title: Efficient Modelling Across Time of Human Actions and Interactions
- Authors: Alexandros Stergiou
- Abstract summary: We argue that current fixed-sized-temporal kernels in 3 convolutional neural networks (CNNDs) can be improved to better deal with temporal variations in the input.
We study how we can better handle between classes of actions, by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
- Score: 92.39082696657874
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This thesis focuses on video understanding for human action and interaction
recognition. We start by identifying the main challenges related to action
recognition from videos and review how they have been addressed by current
methods.
Based on these challenges, and by focusing on the temporal aspect of actions,
we argue that current fixed-sized spatio-temporal kernels in 3D convolutional
neural networks (CNNs) can be improved to better deal with temporal variations
in the input. Our contributions are based on the enlargement of the
convolutional receptive fields through the introduction of spatio-temporal
size-varying segments of videos, as well as the discovery of the local feature
relevance over the entire video sequence. The resulting extracted features
encapsulate information that includes the importance of local features across
multiple temporal durations, as well as the entire video sequence.
Subsequently, we study how we can better handle variations between classes of
actions, by enhancing their feature differences over different layers of the
architecture. The hierarchical extraction of features models variations of
relatively similar classes the same as very dissimilar classes. Therefore,
distinctions between similar classes are less likely to be modelled. The
proposed approach regularises feature maps by amplifying features that
correspond to the class of the video that is processed. We move away from
class-agnostic networks and make early predictions based on feature
amplification mechanism.
The proposed approaches are evaluated on several benchmark action recognition
datasets and show competitive results. In terms of performance, we compete with
the state-of-the-art while being more efficient in terms of GFLOPs.
Finally, we present a human-understandable approach aimed at providing visual
explanations for features learned over spatio-temporal networks.
Related papers
- Improving Video Violence Recognition with Human Interaction Learning on
3D Skeleton Point Clouds [88.87985219999764]
We develop a method for video violence recognition from a new perspective of skeleton points.
We first formulate 3D skeleton point clouds from human sequences extracted from videos.
We then perform interaction learning on these 3D skeleton point clouds.
arXiv Detail & Related papers (2023-08-26T12:55:18Z) - Self-Supervised Video Representation Learning via Latent Time Navigation [12.721647696921865]
Self-supervised video representation learning aims at maximizing similarity between different temporal segments of one video.
We propose Latent Time Navigation (LTN) to capture fine-grained motions.
Our experimental analysis suggests that learning video representations by LTN consistently improves performance of action classification.
arXiv Detail & Related papers (2023-05-10T20:06:17Z) - TimeBalance: Temporally-Invariant and Temporally-Distinctive Video
Representations for Semi-Supervised Action Recognition [68.53072549422775]
We propose a student-teacher semi-supervised learning framework, TimeBalance.
We distill the knowledge from a temporally-invariant and a temporally-distinctive teacher.
Our method achieves state-of-the-art performance on three action recognition benchmarks.
arXiv Detail & Related papers (2023-03-28T19:28:54Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of architectures convolutional attention-based on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - TAEC: Unsupervised Action Segmentation with Temporal-Aware Embedding and
Clustering [27.52568444236988]
We propose an unsupervised approach for learning action classes from untrimmed video sequences.
In particular, we propose a temporal embedding network that combines relative time prediction, feature reconstruction, and sequence-to-sequence learning.
Based on the identified clusters, we decode the video into coherent temporal segments that correspond to semantically meaningful action classes.
arXiv Detail & Related papers (2023-03-09T10:46:23Z) - Class-Incremental Learning for Action Recognition in Videos [44.923719189467164]
We tackle catastrophic forgetting problem in the context of class-incremental learning for video recognition.
Our framework addresses this challenging task by introducing time-channel importance maps and exploiting the importance maps for learning the representations of incoming examples.
We evaluate the proposed approach on brand-new splits of class-incremental action recognition benchmarks constructed upon the UCF101, HMDB51, and Something-Something V2 datasets.
arXiv Detail & Related papers (2022-03-25T12:15:49Z) - Sequential convolutional network for behavioral pattern extraction in
gait recognition [0.7874708385247353]
We propose a sequential convolutional network (SCN) to learn the walking pattern of individuals.
In SCN, behavioral information extractors (BIE) are constructed to comprehend intermediate feature maps in time series.
A multi-frame aggregator in SCN performs feature integration on a sequence whose length is uncertain, via a mobile 3D convolutional layer.
arXiv Detail & Related papers (2021-04-23T08:44:10Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z) - Coarse Temporal Attention Network (CTA-Net) for Driver's Activity
Recognition [14.07119502083967]
Driver's activities are different since they are executed by the same subject with similar body parts movements, resulting in subtle changes.
Our model is named Coarse Temporal Attention Network (CTA-Net), in which coarse temporal branches are introduced in a trainable glimpse.
The model then uses an innovative attention mechanism to generate high-level action specific contextual information for activity recognition.
arXiv Detail & Related papers (2021-01-17T10:15:37Z) - Intra- and Inter-Action Understanding via Temporal Action Parsing [118.32912239230272]
We construct a new dataset developed on sport videos with manual annotations of sub-actions, and conduct a study on temporal action parsing on top.
Our study shows that a sport activity usually consists of multiple sub-actions and that the awareness of such temporal structures is beneficial to action recognition.
We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing the labels of them.
arXiv Detail & Related papers (2020-05-20T17:45:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.