TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of
3D Human Motions and Texts
- URL: http://arxiv.org/abs/2207.01696v1
- Date: Mon, 4 Jul 2022 19:52:18 GMT
- Title: TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of
3D Human Motions and Texts
- Authors: Chuan Guo, Xinxin Zuo, Sen Wang, Li Cheng
- Abstract summary: Inspired by the strong ties between vision and language, our paper aims to explore the generation of 3D human full-body motions from texts.
We propose the use of motion tokens, a discrete and compact motion representation.
Our approach is flexible and can be used for both text2motion and motion2text tasks.
- Score: 20.336481832461168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by the strong ties between vision and language, the two intimate
human sensing and communication modalities, our paper aims to explore the
generation of 3D human full-body motions from texts, as well as its reciprocal
task, abbreviated as text2motion and motion2text, respectively. To tackle the
existing challenges, especially to enable the generation of multiple distinct
motions from the same text, and to avoid the undesirable production of trivial
motionless pose sequences, we propose the use of motion tokens, a discrete and
compact motion representation. This provides a level playing field for the two
modalities, which are handled as motion tokens and text tokens,
respectively. Moreover, our motion2text module is integrated into the inverse
alignment process of our text2motion training pipeline, where a significant
deviation of synthesized text from the input text would be penalized by a large
training loss; empirically this is shown to effectively improve performance.
Finally, the mappings between the two modalities of motions and texts are
facilitated by adapting a neural machine translation (NMT) model to our
context. This autoregressive modeling of the distribution over discrete motion
tokens further enables non-deterministic production of pose sequences, of
variable lengths, from an input text. Our approach is flexible and can be used
for both text2motion and motion2text tasks. Empirical evaluations on two
benchmark datasets demonstrate the superior performance of our approach on both
tasks over a variety of state-of-the-art methods. Project page:
https://ericguo5513.github.io/TM2T/
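
The core mechanism the abstract describes, quantizing pose sequences into discrete motion tokens and then treating text2motion as autoregressive, NMT-style translation over those tokens, can be illustrated with a short sketch. This is not the authors' released implementation: the module names (MotionTokenizer, TextToMotionTranslator), the GRU-based translator, and all sizes are assumptions made for illustration, and the inverse-alignment text loss mentioned above is omitted.

```python
# Illustrative sketch only (hypothetical names and sizes): (1) a VQ-style
# tokenizer that maps pose sequences to discrete motion tokens, and (2) an
# autoregressive text -> motion-token translator in the NMT mold.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionTokenizer(nn.Module):
    """Encode pose frames and snap them to the nearest codebook entry (motion tokens)."""

    def __init__(self, pose_dim=263, code_dim=512, codebook_size=1024):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(pose_dim, code_dim, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(code_dim, code_dim, kernel_size=4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, code_dim)

    def forward(self, poses):                                      # poses: (B, T, pose_dim)
        z = self.encoder(poses.transpose(1, 2)).transpose(1, 2)    # (B, T', code_dim)
        codes = self.codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1)
        return torch.cdist(z, codes).argmin(dim=-1)                # (B, T') token indices


class TextToMotionTranslator(nn.Module):
    """Autoregressive text -> motion-token model; sampling makes outputs stochastic."""

    def __init__(self, text_vocab=10000, codebook_size=1024, hidden=512):
        super().__init__()
        self.start_token = codebook_size                    # two extra control tokens
        self.end_token = codebook_size + 1
        self.text_emb = nn.Embedding(text_vocab, hidden)
        self.text_enc = nn.GRU(hidden, hidden, batch_first=True)
        self.motion_emb = nn.Embedding(codebook_size + 2, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, codebook_size + 2)

    @torch.no_grad()
    def sample(self, text_ids, max_len=50, temperature=1.0):
        """Sample a variable-length motion-token sequence conditioned on the text."""
        _, h = self.text_enc(self.text_emb(text_ids))       # text context as initial state
        tok = torch.full((text_ids.size(0), 1), self.start_token, dtype=torch.long)
        out = []
        for _ in range(max_len):
            step, h = self.decoder(self.motion_emb(tok), h)
            probs = F.softmax(self.head(step[:, -1]) / temperature, dim=-1)
            tok = torch.multinomial(probs, 1)               # sample, not argmax
            if (tok == self.end_token).all():               # variable-length output
                break
            out.append(tok)
        return torch.cat(out, dim=1) if out else tok


if __name__ == "__main__":
    tokenizer, translator = MotionTokenizer(), TextToMotionTranslator()
    poses = torch.randn(2, 64, 263)                          # dummy pose sequences
    print("motion tokens:", tokenizer(poses).shape)          # -> (2, 16)
    captions = torch.randint(0, 10000, (2, 8))               # dummy tokenized captions
    print("sampled tokens:", translator.sample(captions).shape)
```

Because the decoder samples from the predicted token distribution rather than taking the argmax, the same caption can produce distinct token sequences, and the end token lets the sampled motions vary in length. In the full method, the generated motion tokens would additionally be mapped back to text by the motion2text module so that a large deviation from the input caption incurs a training penalty.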
Related papers
- BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics [50.88842027976421]
We propose BOTH57M, a novel multi-modal dataset for two-hand motion generation.
Our dataset includes accurate motion tracking for the human body and hands.
We also provide a strong baseline method, BOTH2Hands, for the novel task.
arXiv Detail & Related papers (2023-12-13T07:30:19Z) - Story-to-Motion: Synthesizing Infinite and Controllable Character
Animation from Long Text [14.473103773197838]
A new task, Story-to-Motion, arises when characters are required to perform specific motions based on a long text description.
Previous works in character control and text-to-motion have addressed related aspects, yet a comprehensive solution remains elusive.
We propose a novel system that generates controllable, infinitely long motions and trajectories aligned with the input text.
arXiv Detail & Related papers (2023-11-13T16:22:38Z) - SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z) - Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion
Model [11.873294782380984]
We propose a fine-grained method for generating high-quality, conditional human motion sequences supporting precise text description.
Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language features to fully utilize text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistic features from shallow and deep graph neural networks to achieve multi-step inference.
arXiv Detail & Related papers (2023-09-12T14:43:47Z) - AttT2M: Text-Driven Human Motion Generation with Multi-Perspective
Attention Mechanism [24.049207982022214]
We propose AttT2M, a two-stage method with a multi-perspective attention mechanism.
Our method outperforms the current state-of-the-art in terms of qualitative and quantitative evaluation.
arXiv Detail & Related papers (2023-09-02T02:18:17Z) - TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration [75.37311932218773]
We propose a novel task for generating 3D dance movements that simultaneously incorporate both text and music modalities.
Our approach can generate realistic and coherent dance movements conditioned on both text and music while maintaining comparable performance with the two single modalities.
arXiv Detail & Related papers (2023-04-05T12:58:33Z) - Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation
with Wordless Training [178.09150600453205]
In this paper, we investigate offline open-vocabulary text-to-motion generation in a zero-shot learning manner.
Inspired by the prompt learning in NLP, we pretrain a motion generator that learns to reconstruct the full motion from the masked motion.
Our method reformulates the input text into a masked motion as the prompt for the motion generator to "reconstruct" the motion.
arXiv Detail & Related papers (2022-10-28T06:20:55Z) - Text-driven Video Prediction [83.04845684117835]
We propose a new task called Text-driven Video Prediction (TVP).
Taking the first frame and text caption as inputs, this task aims to synthesize the following frames.
To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM).
arXiv Detail & Related papers (2022-10-06T12:43:07Z) - TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that the TEMOS framework can produce both skeleton-based animations, as in prior work, as well as more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.