Unsupervised Motion Representation Learning with Capsule Autoencoders
- URL: http://arxiv.org/abs/2110.00529v1
- Date: Fri, 1 Oct 2021 16:52:03 GMT
- Title: Unsupervised Motion Representation Learning with Capsule Autoencoders
- Authors: Ziwei Xu, Xudong Shen, Yongkang Wong, Mohan S Kankanhalli
- Abstract summary: The Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
- Score: 54.81628825371412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose the Motion Capsule Autoencoder (MCAE), which addresses a key
challenge in the unsupervised learning of motion representations:
transformation invariance. MCAE models motion in a two-level hierarchy. In the
lower level, a spatio-temporal motion signal is divided into short, local, and
semantic-agnostic snippets. In the higher level, the snippets are aggregated to
form full-length semantic-aware segments. For both levels, we represent motion
with a set of learned transformation invariant templates and the corresponding
geometric transformations by using capsule autoencoders of a novel design. This
leads to a robust and efficient encoding of viewpoint changes. MCAE is
evaluated on a novel Trajectory20 motion dataset and various real-world
skeleton-based human action datasets. Notably, it achieves better results than
baselines on Trajectory20 with considerably fewer parameters and
state-of-the-art performance on the unsupervised skeleton-based action
recognition task.
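To make the two-level hierarchy concrete, below is a minimal, self-contained sketch (not the authors' code) of how snippet-level capsules can pair transformation-invariant template weights with a per-snippet geometric transform, and how a higher-level module can aggregate them over the full sequence. All module names, layer sizes, and the restriction to a 2-D translation-plus-scale transform are illustrative assumptions rather than the paper's exact design.

```python
# Minimal sketch of the two-level idea described above (not the authors' code):
# snippet-level capsules explain each short snippet as a mixture of learned,
# transformation-invariant templates plus a per-snippet geometric transform,
# and a segment-level module aggregates the snippet descriptors over the full
# sequence. Names, sizes, and the transform family are assumptions.
import torch
import torch.nn as nn


class SnippetCapsules(nn.Module):
    """Describe each snippet by template-mixing weights and a (dx, dy, log-scale) transform."""

    def __init__(self, snippet_len: int, n_templates: int):
        super().__init__()
        # Learned canonical trajectory templates, one per capsule: (K, L, 2).
        self.templates = nn.Parameter(torch.randn(n_templates, snippet_len, 2))
        self.encoder = nn.Sequential(
            nn.Flatten(start_dim=2),                      # (B, S, L, 2) -> (B, S, L*2)
            nn.Linear(snippet_len * 2, 64), nn.ReLU(),
            nn.Linear(64, n_templates + 3),               # mixing logits + (dx, dy, log-scale)
        )

    def forward(self, snippets):                          # snippets: (B, S, L, 2)
        h = self.encoder(snippets)
        logits, trans = h[..., :-3], h[..., -3:]
        weights = logits.softmax(dim=-1)                  # transformation-invariant "what"
        canon = torch.einsum("bsk,kld->bsld", weights, self.templates)
        # Apply the predicted geometric transform ("where"): isotropic scale + translation.
        recon = canon * trans[..., 2:].exp().unsqueeze(-2) + trans[..., :2].unsqueeze(-2)
        return recon, weights, trans


class SegmentAggregator(nn.Module):
    """Crude stand-in for the segment-level capsules: pool snippet descriptors over time."""

    def __init__(self, n_templates: int, segment_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_templates + 3, 64), nn.ReLU(),
            nn.Linear(64, segment_dim),
        )

    def forward(self, weights, trans):                    # (B, S, K), (B, S, 3)
        per_snippet = torch.cat([weights, trans], dim=-1)
        return self.mlp(per_snippet).mean(dim=1)          # (B, segment_dim)


if __name__ == "__main__":
    snips = torch.randn(4, 8, 10, 2)                      # 4 trajectories, 8 snippets of 10 frames (x, y)
    lower = SnippetCapsules(snippet_len=10, n_templates=16)
    upper = SegmentAggregator(n_templates=16, segment_dim=20)
    recon, w, t = lower(snips)
    loss = nn.functional.mse_loss(recon, snips)           # unsupervised reconstruction objective
    segment_repr = upper(w, t)
    print(recon.shape, segment_repr.shape, float(loss))
```

Trained only on the reconstruction loss, the per-snippet transform can absorb viewpoint-style changes while the template weights remain invariant to them, which is the intuition the abstract describes; the actual MCAE capsule design, routing, and losses differ in detail.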
Related papers
- JointMotion: Joint Self-Supervision for Joint Motion Prediction [10.44846560021422]
JointMotion is a self-supervised pre-training method for joint motion prediction in self-driving vehicles.
Our method reduces the joint final displacement error of Wayformer, HPTR, and Scene Transformer models by 3%, 8%, and 12%, respectively.
arXiv Detail & Related papers (2024-03-08T17:54:38Z)
- STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition [50.064502884594376]
We study the problem of human action recognition using motion capture (MoCap) sequences.
We propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences.
The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.
arXiv Detail & Related papers (2023-03-31T16:19:27Z)
- MoDi: Unconditional Motion Synthesis from Diverse Data [51.676055380546494]
We present MoDi, an unconditional generative model that synthesizes diverse motions.
Our model is trained in a completely unsupervised setting from a diverse, unstructured and unlabeled motion dataset.
We show that despite the lack of any structure in the dataset, the latent space can be semantically clustered.
arXiv Detail & Related papers (2022-06-16T09:06:25Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages a Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate spatial-temporal kernels at dynamic scales to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects with a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
- MODETR: Moving Object Detection with Transformers [2.4366811507669124]
Moving Object Detection (MOD) is a crucial task for the Autonomous Driving pipeline.
In this paper, we tackle this problem through multi-head attention mechanisms, both across the spatial and motion streams.
We propose MODETR, a Moving Object DEtection TRansformer network composed of multi-stream transformers for both spatial and motion modalities.
arXiv Detail & Related papers (2021-06-21T21:56:46Z)
- Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot object segmentation.
An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder.
In this way, the encoder becomes deeply interleaved, allowing for close hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)
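For the Motion-Attentive Transition entry directly above, the following is a generic, hedged illustration of an asymmetric (one-way) attention coupling between the two streams of an encoder: motion features produce a gate that re-weights appearance features, while the motion stream itself is left untouched. It is not the authors' MAT block; the layer shapes and the sigmoid gating form are assumptions.

```python
# Illustrative one-way (asymmetric) coupling from a motion stream to an
# appearance stream, in the spirit of the MAT entry above (not the authors'
# implementation; channel sizes and the gating form are assumptions).
import torch
import torch.nn as nn


class AsymmetricMotionAttention(nn.Module):
    """Motion features produce a spatial gate that re-weights appearance features.
    The coupling is one-way: appearance never modulates the motion stream."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_gate = nn.Sequential(
            nn.Conv2d(channels, channels // 2, kernel_size=1), nn.ReLU(),
            nn.Conv2d(channels // 2, 1, kernel_size=1), nn.Sigmoid(),
        )

    def forward(self, appearance, motion):       # both (B, C, H, W)
        gate = self.to_gate(motion)              # (B, 1, H, W), values in [0, 1]
        return appearance * gate + appearance    # residual path keeps ungated content


if __name__ == "__main__":
    app = torch.randn(2, 64, 32, 32)             # appearance-stream feature map
    mot = torch.randn(2, 64, 32, 32)             # motion-stream (e.g. optical-flow) feature map
    fused = AsymmetricMotionAttention(64)(app, mot)
    print(fused.shape)                           # torch.Size([2, 64, 32, 32])
```

The asymmetry comes from the single direction of influence: dropping the gate recovers two independent streams, whereas a symmetric design would also let appearance modulate motion.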