Recurrent Transformer Variational Autoencoders for Multi-Action Motion
Synthesis
- URL: http://arxiv.org/abs/2206.06741v1
- Date: Tue, 14 Jun 2022 10:40:16 GMT
- Title: Recurrent Transformer Variational Autoencoders for Multi-Action Motion
Synthesis
- Authors: Rania Briq, Chuhang Zou, Leonid Pishchulin, Chris Broaddus, Juergen
Gall
- Abstract summary: We consider the problem of synthesizing multi-action human motion sequences of arbitrary lengths.
Existing approaches have mastered motion sequence generation in single-action scenarios, but fail to generalize to multi-action and arbitrary-length sequences.
We propose a novel, efficient approach that leverages the expressiveness of Recurrent Transformers and the generative richness of conditional Variational Autoencoders.
- Score: 17.15415641710113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of synthesizing multi-action human motion sequences
of arbitrary lengths. Existing approaches have mastered motion sequence
generation in single-action scenarios, but fail to generalize to multi-action
and arbitrary-length sequences. We fill this gap by proposing a novel, efficient
approach that leverages the expressiveness of Recurrent Transformers and the
generative richness of conditional Variational Autoencoders. The proposed
iterative approach is able to generate smooth and realistic human motion
sequences with an arbitrary number of actions and frames while doing so in
linear space and time. We train and evaluate the proposed approach on the PROX
dataset, which we augment with ground-truth action labels. Experimental
evaluation shows significant improvements in FID score and semantic consistency
metrics compared to the state-of-the-art.
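To make the iterative scheme more concrete, here is a minimal PyTorch sketch of an action-conditioned decoder that emits one motion chunk per action label while carrying a small recurrent memory forward, so the cost grows linearly with the number of generated chunks. All names, dimensions, and the recurrence scheme (RecurrentMotionCVAEDecoder, chunk_len, the single memory token) are illustrative assumptions, not the paper's actual architecture or code.

```python
# Illustrative sketch (not the authors' code): iterative, action-conditioned
# motion generation with a Transformer decoder and a recurrent memory token.
import torch
import torch.nn as nn

class RecurrentMotionCVAEDecoder(nn.Module):
    def __init__(self, pose_dim=63, latent_dim=256, num_actions=10, chunk_len=16):
        super().__init__()
        self.chunk_len = chunk_len
        self.action_emb = nn.Embedding(num_actions, latent_dim)
        self.query_emb = nn.Parameter(torch.randn(chunk_len, latent_dim))
        layer = nn.TransformerDecoderLayer(d_model=latent_dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=4)
        self.to_pose = nn.Linear(latent_dim, pose_dim)
        self.to_memory = nn.Linear(latent_dim, latent_dim)

    def forward(self, z, actions):
        """z: (B, latent_dim) latent sample; actions: (B, num_chunks) action labels."""
        batch, num_chunks = actions.shape
        memory = z.unsqueeze(1)                        # (B, 1, D) recurrent state
        chunks = []
        for t in range(num_chunks):                    # one decoder pass per action segment
            cond = self.action_emb(actions[:, t]).unsqueeze(1)
            ctx = torch.cat([memory, cond], dim=1)     # condition on memory + current action
            queries = self.query_emb.unsqueeze(0).expand(batch, -1, -1)
            h = self.decoder(tgt=queries, memory=ctx)  # (B, chunk_len, D)
            chunks.append(self.to_pose(h))
            memory = self.to_memory(h[:, -1:, :])      # carry a summary of the last frame forward
        return torch.cat(chunks, dim=1)                # (B, num_chunks * chunk_len, pose_dim)

# Usage: sample a latent from the prior and decode three action segments.
model = RecurrentMotionCVAEDecoder()
z = torch.randn(2, 256)
actions = torch.randint(0, 10, (2, 3))
motion = model(z, actions)                             # shape (2, 48, 63)
```

Because each pass attends only to a fixed-size context (memory plus action token) and a fixed-length chunk of queries, generating more actions or frames adds a constant amount of work per chunk, which is the intuition behind the linear space and time claim.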
Related papers
- ReCoM: Realistic Co-Speech Motion Generation with Recurrent Embedded Transformer [58.49950218437718]
We present ReCoM, an efficient framework for generating high-fidelity and generalizable human body motions synchronized with speech.
The core innovation lies in the Recurrent Embedded Transformer (RET), which integrates Dynamic Embedding Regularization (DER) into a Vision Transformer (ViT) core architecture.
To enhance model robustness, we incorporate the proposed DER strategy, which equips the model with dual capabilities of noise resistance and cross-domain generalization.
arXiv Detail & Related papers (2025-03-27T16:39:40Z)
- Dita: Scaling Diffusion Transformer for Generalist Vision-Language-Action Policy [56.424032454461695]
We present Dita, a scalable framework that leverages Transformer architectures to directly denoise continuous action sequences.
Dita employs in-context conditioning -- enabling fine-grained alignment between denoised actions and raw visual tokens from historical observations.
Dita effectively integrates cross-embodiment datasets across diverse camera perspectives, observation scenes, tasks, and action spaces.
arXiv Detail & Related papers (2025-03-25T15:19:56Z)
- Quantization-Free Autoregressive Action Transformer [18.499864366974613]
Current transformer-based imitation learning approaches introduce discrete action representations and train an autoregressive transformer decoder on the resulting latent code.
We propose a quantization-free method that leverages Generative Infinite-Vocabulary Transformers (GIVT) as a direct, continuous policy parametrization for autoregressive transformers.
arXiv Detail & Related papers (2025-03-18T13:50:35Z)
- Human Motion Synthesis: A Diffusion Approach for Motion Stitching and In-Betweening [2.5165775267615205]
We propose a diffusion model with a transformer-based denoiser to generate realistic human motion.
Our method demonstrated strong performance in generating in-betweening sequences.
We present the performance evaluation of our method using quantitative metrics such as Frechet Inception Distance (FID), Diversity, and Multimodality.
arXiv Detail & Related papers (2024-09-10T18:02:32Z)
- Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers [13.665279127648658]
This research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously.
By integrating spatial attention mechanisms and a token critic, we ensure consistency and naturalness in the generated motions.
arXiv Detail & Related papers (2024-09-03T04:19:27Z)
- Text-guided 3D Human Motion Generation with Keyframe-based Parallel Skip Transformer [62.29951737214263]
Existing algorithms directly generate the full sequence, which is expensive and prone to errors.
We propose KeyMotion, which generates plausible human motion sequences corresponding to input text.
We use a Variational Autoencoder (VAE) with Kullback-Leibler regularization to project the motion into a latent space.
For the reverse diffusion, we propose a novel Parallel Skip Transformer that performs cross-modal attention between the design latents and text condition.
arXiv Detail & Related papers (2024-05-24T11:12:37Z)
- DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions.
Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences.
We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
arXiv Detail & Related papers (2023-12-07T04:39:22Z)
- Ring Attention with Blockwise Transformers for Near-Infinite Context [88.61687950039662]
We present a novel approach, Ring Attention with Blockwise Transformers (Ring Attention), which leverages blockwise computation of self-attention and feedforward to distribute long sequences across multiple devices.
Our approach enables training and inference of sequences that are up to device count times longer than those achievable by prior memory-efficient Transformers.
arXiv Detail & Related papers (2023-10-03T08:44:50Z)
- Synthesizing Long-Term Human Motions with Diffusion Models via Coherent Sampling [74.62570964142063]
Text-to-motion generation has gained increasing attention, but most existing methods are limited to generating short-term motions.
We propose a novel approach that utilizes a past-conditioned diffusion model with two optional coherent sampling methods.
Our proposed method is capable of generating compositional and coherent long-term 3D human motions controlled by a user-instructed long text stream.
arXiv Detail & Related papers (2023-08-03T16:18:32Z)
- Executing your Commands via Motion Diffusion in Latent Space [51.64652463205012]
We propose a Motion Latent-based Diffusion model (MLD) to produce vivid motion sequences conforming to the given conditional inputs.
Our MLD achieves significant improvements over the state-of-the-art methods among extensive human motion generation tasks.
arXiv Detail & Related papers (2022-12-08T03:07:00Z)
- Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction [81.94175022575966]
We introduce the task of action-driven human motion prediction.
It aims to predict multiple plausible future motions given a sequence of action labels and a short motion history.
arXiv Detail & Related papers (2022-05-31T08:38:07Z)
- Implicit Neural Representations for Variable Length Human Motion Generation [11.028791809955276]
We propose an action-conditional human motion generation method using variational implicit neural representations (INRs).
Our method offers variable-length sequence generation by construction because a part of INR is optimized for a whole sequence of arbitrary length with temporal embeddings.
We show that variable-length motions generated by our method are better than fixed-length motions generated by the state-of-the-art method in terms of realism and diversity.
arXiv Detail & Related papers (2022-03-25T15:00:38Z)
- Action-Conditioned 3D Human Motion Synthesis with Transformer VAE [44.523477804533364]
We tackle the problem of action-conditioned generation of realistic and diverse human motion sequences.
In contrast to methods that complete, or extend, motion sequences, this task does not require an initial pose or sequence.
We learn an action-aware latent representation for human motions by training a generative variational autoencoder.
arXiv Detail & Related papers (2021-04-12T17:40:27Z)
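Several of the entries above, like the main paper, rest on the same conditional-VAE training objective: reconstruct the input motion and pull the approximate posterior toward a standard normal prior with a KL term. Below is a minimal, self-contained PyTorch sketch of that objective only; the small MLP Encoder and Decoder stand in for the Transformer modules of the actual papers, and all names, sizes, and the KL weight are illustrative assumptions rather than any listed paper's code.

```python
# Generic conditional-VAE objective for action-conditioned motion (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

POSE_DIM, T, LATENT, NUM_ACTIONS = 63, 32, 128, 10

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.action_emb = nn.Embedding(NUM_ACTIONS, LATENT)
        self.net = nn.Sequential(nn.Linear(T * POSE_DIM + LATENT, 256), nn.ReLU())
        self.to_mu, self.to_logvar = nn.Linear(256, LATENT), nn.Linear(256, LATENT)

    def forward(self, motion, action):
        h = self.net(torch.cat([motion.flatten(1), self.action_emb(action)], dim=1))
        return self.to_mu(h), self.to_logvar(h)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.action_emb = nn.Embedding(NUM_ACTIONS, LATENT)
        self.net = nn.Sequential(nn.Linear(2 * LATENT, 256), nn.ReLU(),
                                 nn.Linear(256, T * POSE_DIM))

    def forward(self, z, action):
        out = self.net(torch.cat([z, self.action_emb(action)], dim=1))
        return out.view(-1, T, POSE_DIM)

def cvae_loss(encoder, decoder, motion, action, kl_weight=1e-3):
    mu, logvar = encoder(motion, action)                      # posterior q(z | x, a)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterization trick
    recon = decoder(z, action)                                # likelihood p(x | z, a)
    recon_loss = F.mse_loss(recon, motion)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon_loss + kl_weight * kl

# Usage: one training step on a random batch.
enc, dec = Encoder(), Decoder()
motion = torch.randn(4, T, POSE_DIM)
action = torch.randint(0, NUM_ACTIONS, (4,))
loss = cvae_loss(enc, dec, motion, action)
loss.backward()
```

At generation time, z is sampled from the standard normal prior and decoded under the desired action labels, which is what makes the learned latent space "action-aware" and sampleable.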
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.