HumanTOMATO: Text-aligned Whole-body Motion Generation
- URL: http://arxiv.org/abs/2310.12978v1
- Date: Thu, 19 Oct 2023 17:59:46 GMT
- Title: HumanTOMATO: Text-aligned Whole-body Motion Generation
- Authors: Shunlin Lu, Ling-Hao Chen, Ailing Zeng, Jing Lin, Ruimao Zhang, Lei
Zhang, Heung-Yeung Shum
- Abstract summary: This work targets a novel text-driven whole-body motion generation task.
It aims at generating high-quality, diverse, and coherent facial expressions, hand gestures, and body motions simultaneously.
- Score: 30.729975715600627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This work targets a novel text-driven whole-body motion generation task,
which takes a given textual description as input and aims at generating
high-quality, diverse, and coherent facial expressions, hand gestures, and body
motions simultaneously. Previous works on text-driven motion generation tasks
mainly have two limitations: they ignore the key role of fine-grained hand and
face controlling in vivid whole-body motion generation, and lack a good
alignment between text and motion. To address such limitations, we propose a
Text-aligned whOle-body Motion generATiOn framework, named HumanTOMATO, which
is the first attempt to our knowledge towards applicable holistic motion
generation in this research area. To tackle this challenging task, our solution
includes two key designs: (1) a Holistic Hierarchical VQ-VAE (aka H$^2$VQ) and
a Hierarchical-GPT for fine-grained body and hand motion reconstruction and
generation with two structured codebooks; and (2) a pre-trained
text-motion-alignment model to help generated motion align with the input
textual description explicitly. Comprehensive experiments verify that our model
has significant advantages in both the quality of generated motions and their
alignment with text.
Related papers
- MotionGPT-2: A General-Purpose Motion-Language Model for Motion Generation and Understanding [76.30210465222218]
MotionGPT-2 is a unified Large Motion-Language Model (LMLMLM)
It supports multimodal control conditions through pre-trained Large Language Models (LLMs)
It is highly adaptable to the challenging 3D holistic motion generation task.
arXiv Detail & Related papers (2024-10-29T05:25:34Z) - Learning Generalizable Human Motion Generator with Reinforcement Learning [95.62084727984808]
Text-driven human motion generation is one of the vital tasks in computer-aided content creation.
Existing methods often overfit specific motion expressions in the training data, hindering their ability to generalize.
We present textbfInstructMotion, which incorporate the trail and error paradigm in reinforcement learning for generalizable human motion generation.
arXiv Detail & Related papers (2024-05-24T13:29:12Z) - Generating Human Motion in 3D Scenes from Text Descriptions [60.04976442328767]
This paper focuses on the task of generating human motions in 3D indoor scenes given text descriptions of the human-scene interactions.
We propose a new approach that decomposes the complex problem into two more manageable sub-problems.
For language grounding of the target object, we leverage the power of large language models; for motion generation, we design an object-centric scene representation.
arXiv Detail & Related papers (2024-05-13T14:30:12Z) - BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics [50.88842027976421]
We propose BOTH57M, a novel multi-modal dataset for two-hand motion generation.
Our dataset includes accurate motion tracking for the human body and hands.
We also provide a strong baseline method, BOTH2Hands, for the novel task.
arXiv Detail & Related papers (2023-12-13T07:30:19Z) - Story-to-Motion: Synthesizing Infinite and Controllable Character
Animation from Long Text [14.473103773197838]
A new task, Story-to-Motion, arises when characters are required to perform specific motions based on a long text description.
Previous works in character control and text-to-motion have addressed related aspects, yet a comprehensive solution remains elusive.
We propose a novel system that generates controllable, infinitely long motions and trajectories aligned with the input text.
arXiv Detail & Related papers (2023-11-13T16:22:38Z) - Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion
Model [11.873294782380984]
We propose a fine-grained method for generating high-quality, conditional human motion sequences supporting precise text description.
Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language feature to fully utilize text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistics features from shallow and deep graph neural networks to achieve a multi-step inference.
arXiv Detail & Related papers (2023-09-12T14:43:47Z) - AttT2M: Text-Driven Human Motion Generation with Multi-Perspective
Attention Mechanism [24.049207982022214]
We propose textbftT2M, a two-stage method with multi-perspective attention mechanism.
Our method outperforms the current state-of-the-art in terms of qualitative and quantitative evaluation.
arXiv Detail & Related papers (2023-09-02T02:18:17Z) - Text-driven Video Prediction [83.04845684117835]
We propose a new task called Text-driven Video Prediction (TVP)
Taking the first frame and text caption as inputs, this task aims to synthesize the following frames.
To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM)
arXiv Detail & Related papers (2022-10-06T12:43:07Z) - TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of
3D Human Motions and Texts [20.336481832461168]
Inspired by the strong ties between vision and language, our paper aims to explore the generation of 3D human full-body motions from texts.
We propose the use of motion token, a discrete and compact motion representation.
Our approach is flexible, could be used for both text2motion and motion2text tasks.
arXiv Detail & Related papers (2022-07-04T19:52:18Z) - TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that TEMOS framework can produce both skeleton-based animations as in prior work, as well more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.