ViA: View-invariant Skeleton Action Representation Learning via Motion
Retargeting
- URL: http://arxiv.org/abs/2209.00065v1
- Date: Wed, 31 Aug 2022 18:49:38 GMT
- Title: ViA: View-invariant Skeleton Action Representation Learning via Motion
Retargeting
- Authors: Di Yang, Yaohui Wang, Antitza Dantcheva, Lorenzo Garattoni, Gianpiero
Francesca, Francois Bremond
- Abstract summary: ViA is a novel View-Invariant Autoencoder for self-supervised skeleton action representation learning.
We conduct a study focusing on transfer-learning for skeleton-based action recognition with self-supervised pre-training on real-world data.
Our results showcase that skeleton representations learned from ViA are generic enough to improve upon state-of-the-art action classification accuracy.
- Score: 10.811088895926776
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current self-supervised approaches for skeleton action representation
learning often focus on constrained scenarios, where videos and skeleton data
are recorded in laboratory settings. When dealing with estimated skeleton data
in real-world videos, such methods perform poorly due to the large variations
across subjects and camera viewpoints. To address this issue, we introduce ViA,
a novel View-Invariant Autoencoder for self-supervised skeleton action
representation learning. ViA leverages motion retargeting between different
human performers as a pretext task, in order to disentangle the latent
action-specific `Motion' features on top of the visual representation of a 2D
or 3D skeleton sequence. Such `Motion' features are invariant to skeleton
geometry and camera view and allow ViA to facilitate both cross-subject and
cross-view action classification tasks. We conduct a study focusing on
transfer-learning for skeleton-based action recognition with self-supervised
pre-training on real-world data (e.g., Posetics). Our results showcase that
skeleton representations learned from ViA are generic enough to improve upon
state-of-the-art action classification accuracy, not only on 3D laboratory
datasets such as NTU-RGB+D 60 and NTU-RGB+D 120, but also on real-world
datasets where only 2D data are accurately estimated, e.g., Toyota Smarthome,
UAV-Human and Penn Action.
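As a rough illustration of the idea in the abstract, the sketch below shows how a motion-retargeting pretext task can be wired into a skeleton autoencoder: an encoder splits a sequence into an action-specific "motion" latent and a performer-specific "character" latent, and the decoder reconstructs one performer's motion on another performer's skeleton. The module names, dimensions, and the use of a paired retargeting target are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): a skeleton autoencoder trained
# with a motion-retargeting pretext task. The encoder splits a 2D/3D skeleton
# sequence into an action-specific "motion" latent and a performer-specific
# "character" latent; the decoder recombines the motion latent of clip A with
# the character latent of clip B, so reconstructing "A's motion performed by
# B's skeleton" pushes the motion latent to drop geometry and viewpoint cues.
import torch
import torch.nn as nn


class SkeletonAutoencoder(nn.Module):
    def __init__(self, n_joints=17, coord_dim=2, motion_dim=128, char_dim=64):
        super().__init__()
        in_dim = n_joints * coord_dim
        self.encoder = nn.GRU(in_dim, 256, batch_first=True)
        self.to_motion = nn.Linear(256, motion_dim)    # view/skeleton-invariant part
        self.to_character = nn.Linear(256, char_dim)   # skeleton geometry / viewpoint part
        self.decoder = nn.GRU(motion_dim + char_dim, 256, batch_first=True)
        self.out = nn.Linear(256, in_dim)

    def encode(self, seq):                             # seq: (batch, frames, joints*coords)
        h, _ = self.encoder(seq)
        return self.to_motion(h), self.to_character(h)

    def decode(self, motion, character):
        h, _ = self.decoder(torch.cat([motion, character], dim=-1))
        return self.out(h)


def retargeting_loss(model, seq_a, seq_b, seq_a_on_b):
    """Reconstruct A's motion performed by B's skeleton (assumed pretext target)."""
    motion_a, _ = model.encode(seq_a)
    _, char_b = model.encode(seq_b)
    return nn.functional.mse_loss(model.decode(motion_a, char_b), seq_a_on_b)
```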
Related papers
- DO3D: Self-supervised Learning of Decomposed Object-aware 3D Motion and
Depth from Monocular Videos [76.01906393673897]
We propose a self-supervised method to jointly learn 3D motion and depth from monocular videos.
Our system contains a depth estimation module to predict depth, and a new decomposed object-wise 3D motion (DO3D) estimation module to predict ego-motion and 3D object motion.
Our model delivers superior performance in all evaluated settings.
arXiv Detail & Related papers (2024-03-09T12:22:46Z)
- Improving Video Violence Recognition with Human Interaction Learning on
3D Skeleton Point Clouds [88.87985219999764]
We develop a method for video violence recognition from a new perspective of skeleton points.
We first formulate 3D skeleton point clouds from human sequences extracted from videos.
We then perform interaction learning on these 3D skeleton point clouds.
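One plausible way to realise the point-cloud formulation mentioned above is to stack the joints of every person and frame into a single point set, keeping time as an extra coordinate. The sketch below assumes this simple construction; the array shapes and the time feature are assumptions, not the paper's exact formulation.

```python
# Illustrative sketch (one plausible formulation, not the paper's code):
# stack per-frame skeleton joints from everyone in a clip into a single
# point set so that point-cloud networks can consume the whole interaction.
import numpy as np


def skeletons_to_point_cloud(skeletons):
    """skeletons: (num_people, num_frames, num_joints, 3) joint positions.

    Returns an (N, 4) point cloud where each point is (x, y, z, t), with t the
    normalized frame index, so temporal order is kept as a point feature.
    """
    p, f, j, _ = skeletons.shape
    t = np.broadcast_to(
        np.arange(f, dtype=np.float32)[None, :, None, None] / max(f - 1, 1),
        (p, f, j, 1),
    )
    points = np.concatenate([skeletons.astype(np.float32), t], axis=-1)
    return points.reshape(-1, 4)


# Example: two people, 16 frames, 17 joints each -> 2*16*17 = 544 points
cloud = skeletons_to_point_cloud(np.random.randn(2, 16, 17, 3))
print(cloud.shape)  # (544, 4)
```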
arXiv Detail & Related papers (2023-08-26T12:55:18Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Learning by Aligning 2D Skeleton Sequences and Multi-Modality Fusion [8.153034573979856]
This paper presents a self-supervised temporal video alignment framework which is useful for several fine-grained human activity understanding tasks.
In contrast with the state-of-the-art method CASA, where sequences of 3D skeleton coordinates are taken directly as input, our key idea is to use sequences of 2D skeleton heatmaps as input.
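A common way to produce such 2D skeleton heatmaps is to render a Gaussian around each joint location; the sketch below assumes this standard construction, with resolution and sigma values that are illustrative rather than the paper's settings.

```python
# Illustrative sketch: render per-joint Gaussian heatmaps from 2D keypoints,
# one common way to build skeleton-heatmap inputs (parameters are assumptions,
# not the paper's settings).
import numpy as np


def joints_to_heatmaps(joints_xy, height=64, width=64, sigma=2.0):
    """joints_xy: (num_joints, 2) pixel coordinates -> (num_joints, H, W) maps."""
    ys, xs = np.mgrid[0:height, 0:width]
    maps = []
    for x, y in joints_xy:
        maps.append(np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2.0 * sigma ** 2)))
    return np.stack(maps).astype(np.float32)


# One frame with 17 joints -> a (17, 64, 64) tensor; a clip stacks these over time.
heatmaps = joints_to_heatmaps(np.random.uniform(0, 64, size=(17, 2)))
print(heatmaps.shape)  # (17, 64, 64)
```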
arXiv Detail & Related papers (2023-05-31T01:16:08Z)
- Is an Object-Centric Video Representation Beneficial for Transfer? [86.40870804449737]
We introduce a new object-centric video recognition model on a transformer architecture.
We show that the object-centric model outperforms prior video representations.
arXiv Detail & Related papers (2022-07-20T17:59:44Z)
- Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are available only on the source dataset and are unavailable on the target dataset during training.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
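The temporal variant of such a pretext task can be sketched as follows: cut the skeleton sequence into a few segments, shuffle them with a random permutation, and ask a classifier to predict which permutation was applied. The segment count and labelling scheme below are assumptions for illustration, not the paper's configuration.

```python
# Illustrative sketch of a temporal-permutation pretext task (segment count and
# labelling are assumptions, not the paper's configuration): split a skeleton
# sequence into equal segments, apply a random permutation, and use the
# permutation index as a self-supervised classification label.
import itertools
import random
import numpy as np

PERMS = list(itertools.permutations(range(3)))  # 3 segments -> 6 permutation classes


def permute_segments(sequence):
    """sequence: (num_frames, num_joints, coords). Returns (shuffled, label)."""
    segments = np.array_split(sequence, 3, axis=0)
    label = random.randrange(len(PERMS))
    shuffled = np.concatenate([segments[i] for i in PERMS[label]], axis=0)
    return shuffled, label


clip = np.random.randn(30, 25, 3)    # 30 frames, 25 joints, 3D coordinates
shuffled, label = permute_segments(clip)
print(shuffled.shape, label)         # (30, 25, 3)  and a class index in [0, 5]
```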
arXiv Detail & Related papers (2022-07-17T07:05:39Z)
- PreViTS: Contrastive Pretraining with Video Tracking Supervision [53.73237606312024]
PreViTS is a self-supervised learning (SSL) framework for selecting clips containing the same object.
PreViTS spatially constrains the frame regions to learn from and trains the model to locate meaningful objects.
We train a momentum contrastive (MoCo) encoder on VGG-Sound and Kinetics-400 datasets with PreViTS.
arXiv Detail & Related papers (2021-12-01T19:49:57Z)
- Spot What Matters: Learning Context Using Graph Convolutional Networks
for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and Convolutional Networks to improve human action detection in video.
Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training.
Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
arXiv Detail & Related papers (2021-07-28T21:37:18Z)
- Hindsight for Foresight: Unsupervised Structured Dynamics Models from
Physical Interaction [24.72947291987545]
A key challenge for an agent learning to interact with the world is reasoning about the physical properties of objects.
We propose a novel approach for modeling the dynamics of a robot's interactions directly from unlabeled 3D point clouds and images.
arXiv Detail & Related papers (2020-08-02T11:04:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.