LAC: Latent Action Composition for Skeleton-based Action Segmentation
- URL: http://arxiv.org/abs/2308.14500v4
- Date: Wed, 21 Feb 2024 18:50:50 GMT
- Title: LAC: Latent Action Composition for Skeleton-based Action Segmentation
- Authors: Di Yang, Yaohui Wang, Antitza Dantcheva, Quan Kong, Lorenzo Garattoni,
Gianpiero Francesca, Francois Bremond
- Abstract summary: Skeleton-based action segmentation requires recognizing composable actions in untrimmed videos.
Current approaches decouple this problem by first extracting local visual features from skeleton sequences and then processing them by a temporal model to classify frame-wise actions.
We propose Latent Action Composition (LAC), a novel self-supervised framework aiming at learning from synthesized composable motions for skeleton-based action segmentation.
- Score: 21.797658771678066
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Skeleton-based action segmentation requires recognizing composable actions in
untrimmed videos. Current approaches decouple this problem by first extracting
local visual features from skeleton sequences and then processing them by a
temporal model to classify frame-wise actions. However, their performances
remain limited as the visual features cannot sufficiently express composable
actions. In this context, we propose Latent Action Composition (LAC), a novel
self-supervised framework aiming at learning from synthesized composable
motions for skeleton-based action segmentation. LAC is composed of a novel
generation module towards synthesizing new sequences. Specifically, we design a
linear latent space in the generator to represent primitive motion. New
composed motions can be synthesized by simply performing arithmetic operations
on latent representations of multiple input skeleton sequences. LAC leverages
such synthesized sequences, which have large diversity and complexity, for
learning visual representations of skeletons in both sequence and frame spaces
via contrastive learning. The resulting visual encoder has a high expressive
power and can be effectively transferred onto action segmentation tasks by
end-to-end fine-tuning without the need for additional temporal models. We
conduct a study focusing on transfer-learning and we show that representations
learned from pre-trained LAC outperform the state-of-the-art by a large margin
on TSU, Charades, PKU-MMD datasets.
Related papers
- An Information Compensation Framework for Zero-Shot Skeleton-based Action Recognition [49.45660055499103]
Zero-shot human skeleton-based action recognition aims to construct a model that can recognize actions outside the categories seen during training.
Previous research has focused on aligning sequences' visual and semantic spatial distributions.
We introduce a new loss function sampling method to obtain a tight and robust representation.
arXiv Detail & Related papers (2024-06-02T06:53:01Z) - Skeleton2vec: A Self-supervised Learning Framework with Contextualized
Target Representations for Skeleton Sequence [56.092059713922744]
We show that using high-level contextualized features as prediction targets can achieve superior performance.
Specifically, we propose Skeleton2vec, a simple and efficient self-supervised 3D action representation learning framework.
Our proposed Skeleton2vec outperforms previous methods and achieves state-of-the-art results.
arXiv Detail & Related papers (2024-01-01T12:08:35Z) - SkeleTR: Towrads Skeleton-based Action Recognition in the Wild [86.03082891242698]
SkeleTR is a new framework for skeleton-based action recognition.
It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions.
It then uses stacked Transformer encoders to capture person interactions that are important for action recognition in general scenarios.
arXiv Detail & Related papers (2023-09-20T16:22:33Z) - Contrastive Learning from Spatio-Temporal Mixed Skeleton Sequences for
Self-Supervised Skeleton-Based Action Recognition [21.546894064451898]
We show that directly extending contrastive pairs based on normal augmentations brings limited returns in terms of performance.
We propose SkeleMixCLR: a contrastive learning framework with atemporal skeleton mixing augmentation (SkeleMix) to complement current contrastive learning approaches.
arXiv Detail & Related papers (2022-07-07T03:18:09Z) - SimMC: Simple Masked Contrastive Learning of Skeleton Representations
for Unsupervised Person Re-Identification [63.903237777588316]
We present a generic Simple Masked Contrastive learning (SimMC) framework to learn effective representations from unlabeled 3D skeletons for person re-ID.
Specifically, to fully exploit skeleton features within each skeleton sequence, we first devise a masked prototype contrastive learning (MPC) scheme.
Then, we propose the masked intra-sequence contrastive learning (MIC) to capture intra-sequence pattern consistency between subsequences.
arXiv Detail & Related papers (2022-04-21T00:19:38Z) - Skeleton-Contrastive 3D Action Representation Learning [35.06361753065124]
This paper strives for self-supervised learning of a feature space suitable for skeleton-based action recognition.
Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets.
arXiv Detail & Related papers (2021-08-08T14:44:59Z) - Tensor Representations for Action Recognition [54.710267354274194]
Human actions in sequences are characterized by the complex interplay between spatial features and their temporal dynamics.
We propose novel tensor representations for capturing higher-order relationships between visual features for the task of action recognition.
We use higher-order tensors and so-called Eigenvalue Power Normalization (NEP) which have been long speculated to perform spectral detection of higher-order occurrences.
arXiv Detail & Related papers (2020-12-28T17:27:18Z) - Skeleton-Aware Networks for Deep Motion Retargeting [83.65593033474384]
We introduce a novel deep learning framework for data-driven motion between skeletons.
Our approach learns how to retarget without requiring any explicit pairing between the motions in the training set.
arXiv Detail & Related papers (2020-05-12T12:51:40Z) - Skeleton Based Action Recognition using a Stacked Denoising Autoencoder
with Constraints of Privileged Information [5.67220249825603]
We propose a new method to study the skeletal representation in a view of skeleton reconstruction.
Based on the concept of learning under privileged information, we integrate action categories and temporal coordinates into a stacked denoising autoencoder.
In order to mitigate the variation resulting from temporary misalignment, a new method of temporal registration is proposed.
arXiv Detail & Related papers (2020-03-12T09:56:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.