Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation
with Wordless Training
- URL: http://arxiv.org/abs/2210.15929v3
- Date: Fri, 24 Mar 2023 08:43:35 GMT
- Title: Being Comes from Not-being: Open-vocabulary Text-to-Motion Generation
with Wordless Training
- Authors: Junfan Lin, Jianlong Chang, Lingbo Liu, Guanbin Li, Liang Lin, Qi
Tian, Chang Wen Chen
- Abstract summary: In this paper, we investigate offline open-vocabulary text-to-motion generation in a zero-shot learning manner.
Inspired by prompt learning in NLP, we pretrain a motion generator that learns to reconstruct the full motion from a masked motion.
Our method reformulates the input text into a masked motion that serves as the prompt for the motion generator to "reconstruct" the motion.
- Score: 178.09150600453205
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-motion generation is an emerging and challenging problem, which aims
to synthesize motion with the same semantics as the input text. However, due to
the lack of diverse labeled training data, most approaches are either limited to
specific types of text annotations or require online optimization to adapt to
the texts during inference, at the cost of efficiency and stability. In this
paper, we investigate offline open-vocabulary text-to-motion generation in a
zero-shot learning manner that neither requires paired training data nor extra
online optimization to adapt to unseen texts. Inspired by prompt learning
in NLP, we pretrain a motion generator that learns to reconstruct the full
motion from the masked motion. During inference, instead of changing the motion
generator, our method reformulates the input text into a masked motion as the
prompt for the motion generator to "reconstruct" the motion. In constructing
the prompt, the unmasked poses of the prompt are synthesized by a text-to-pose
generator. To supervise the optimization of the text-to-pose generator, we
propose the first text-pose alignment model for measuring the alignment between
texts and 3D poses. To prevent the pose generator from overfitting to
limited training texts, we further propose a novel wordless training mechanism
that optimizes the text-to-pose generator without any training texts. The
comprehensive experimental results show that our method obtains a significant
improvement over the baseline methods. The code is available at
https://github.com/junfanlin/oohmg.
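The abstract packs several components into one pipeline: a motion generator pretrained to reconstruct full motions from masked motions, a text-to-pose generator, a text-pose alignment model, and a wordless training mechanism. The sketch below is only meant to fix the data flow under stated assumptions; the module architectures, dimensions, the zero-filled masking, and the cosine-alignment objective are illustrative guesses, not the released implementation (see the repository above for the authors' code).

```python
# A minimal, illustrative sketch of the masked-motion prompting pipeline.
# All names, sizes, and objectives here are assumptions, not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

POSE_DIM, SEQ_LEN, TEXT_DIM = 72, 60, 512   # assumed dimensions


class MotionGenerator(nn.Module):
    """Pretrained (without text) to reconstruct a full motion from a masked motion."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=POSE_DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, masked_motion):           # (B, SEQ_LEN, POSE_DIM)
        return self.encoder(masked_motion)      # reconstructed full motion


class TextToPose(nn.Module):
    """Maps a text embedding (or a wordless latent) to a single 3D pose."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(TEXT_DIM, 256), nn.ReLU(), nn.Linear(256, POSE_DIM))

    def forward(self, cond):                    # (B, TEXT_DIM)
        return self.net(cond)                   # (B, POSE_DIM)


class PoseEncoder(nn.Module):
    """Text-pose alignment model: embeds a pose into the text embedding space."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(POSE_DIM, 256), nn.ReLU(), nn.Linear(256, TEXT_DIM))

    def forward(self, pose):
        return self.net(pose)


def wordless_training_step(text2pose, pose_encoder, optimizer, batch=32):
    """Optimize the text-to-pose generator without any training texts: sample a
    random latent in the text embedding space and let the alignment model pull
    the generated pose back toward that latent (optimizer updates text2pose only)."""
    latent = F.normalize(torch.randn(batch, TEXT_DIM), dim=-1)   # "wordless" condition
    pose = text2pose(latent)
    loss = 1.0 - F.cosine_similarity(pose_encoder(pose), latent, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


@torch.no_grad()
def generate(text_emb, text2pose, motion_gen, keyframes=(0, SEQ_LEN // 2, SEQ_LEN - 1)):
    """Inference: reformulate the text into a masked-motion prompt, then reconstruct."""
    pose = text2pose(text_emb)                   # (1, POSE_DIM) pose matching the text
    prompt = torch.zeros(1, SEQ_LEN, POSE_DIM)   # masked frames are zeros in this sketch
    for t in keyframes:                          # place the text-driven pose at a few frames
        prompt[:, t] = pose
    return motion_gen(prompt)                    # (1, SEQ_LEN, POSE_DIM) full motion
```

In the actual system the conditioning embedding would come from a frozen text encoder (e.g., CLIP), and the masked-motion pretraining of the generator would be run beforehand on unlabeled motion data; the sketch only illustrates how the pieces connect at training and inference time.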
Related papers
- LaMP: Language-Motion Pretraining for Motion Generation, Retrieval, and Captioning [19.801187860991117]
This work introduces LaMP, a novel Language-Motion Pretraining model.
LaMP generates motion-informative text embeddings, significantly enhancing the relevance and semantics of generated motion sequences.
For captioning, we finetune a large language model with the language-informative motion features to develop a strong motion captioning model.
arXiv Detail & Related papers (2024-10-09T17:33:03Z)
- Dynamic Prompt Optimizing for Text-to-Image Generation [63.775458908172176]
We introduce the Prompt Auto-Editing (PAE) method to improve text-to-image generative models.
We employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to dynamic fine-control prompts.
arXiv Detail & Related papers (2024-04-05T13:44:39Z)
- OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers [45.808597624491156]
We present OMG, a novel framework that enables compelling motion generation from zero-shot open-vocabulary text prompts.
At the pre-training stage, our model improves the generation ability by learning the rich out-of-domain inherent motion traits.
At the fine-tuning stage, we introduce motion ControlNet, which incorporates text prompts as conditioning information.
arXiv Detail & Related papers (2023-12-14T14:31:40Z)
- LivePhoto: Real Image Animation with Text-guided Motion Control [51.31418077586208]
This work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions.
We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as an additional input.
We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions.
arXiv Detail & Related papers (2023-12-05T17:59:52Z)
- Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model [11.873294782380984]
We propose a fine-grained method for generating high-quality, conditional human motion sequences that supports precise text descriptions.
Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language features to fully utilize the text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistic features from shallow and deep graph neural networks to achieve multi-step inference.
arXiv Detail & Related papers (2023-09-12T14:43:47Z)
- TM2T: Stochastic and Tokenized Modeling for the Reciprocal Generation of 3D Human Motions and Texts [20.336481832461168]
Inspired by the strong ties between vision and language, our paper aims to explore the generation of 3D human full-body motions from texts.
We propose the use of motion tokens, a discrete and compact motion representation.
Our approach is flexible and can be used for both text2motion and motion2text tasks.
arXiv Detail & Related papers (2022-07-04T19:52:18Z)
- Event Transition Planning for Open-ended Text Generation [55.729259805477376]
Open-ended text generation tasks require models to generate a coherent continuation given limited preceding context.
We propose a novel two-stage method which explicitly arranges the ensuing events in open-ended text generation.
Our approach can be understood as a specially-trained coarse-to-fine algorithm.
arXiv Detail & Related papers (2022-04-20T13:37:51Z)
- Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes.
An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences.
The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
- PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation [92.7366819044397]
Self-supervised pre-training has emerged as a powerful technique for natural language understanding and generation.
This work presents PALM with a novel scheme that jointly pre-trains an autoencoding and autoregressive language model on a large unlabeled corpus.
An extensive set of experiments show that PALM achieves new state-of-the-art results on a variety of language generation benchmarks.
arXiv Detail & Related papers (2020-04-14T06:25:36Z)