Towards Variable and Coordinated Holistic Co-Speech Motion Generation
- URL: http://arxiv.org/abs/2404.00368v2
- Date: Mon, 15 Apr 2024 11:18:00 GMT
- Title: Towards Variable and Coordinated Holistic Co-Speech Motion Generation
- Authors: Yifei Liu, Qiong Cao, Yandong Wen, Huaiguang Jiang, Changxing Ding,
- Abstract summary: This paper addresses the problem of generating lifelike holistic co-speech motions for 3D avatars.
We aim to achieve both with ProbTalk, a unified probabilistic framework designed to jointly model facial, hand, and body movements in speech.
- Score: 21.244719987338243
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses the problem of generating lifelike holistic co-speech motions for 3D avatars, focusing on two key aspects: variability and coordination. Variability allows the avatar to exhibit a wide range of motions even with similar speech content, while coordination ensures a harmonious alignment among facial expressions, hand gestures, and body poses. We aim to achieve both with ProbTalk, a unified probabilistic framework designed to jointly model facial, hand, and body movements in speech. ProbTalk builds on the variational autoencoder (VAE) architecture and incorporates three core designs. First, we introduce product quantization (PQ) to the VAE, which enriches the representation of complex holistic motion. Second, we devise a novel non-autoregressive model that embeds 2D positional encoding into the product-quantized representation, thereby preserving essential structure information of the PQ codes. Last, we employ a secondary stage to refine the preliminary prediction, further sharpening the high-frequency details. Coupling these three designs enables ProbTalk to generate natural and diverse holistic co-speech motions, outperforming several state-of-the-art methods in qualitative and quantitative evaluations, particularly in terms of realism. Our code and model will be released for research purposes at https://feifeifeiliu.github.io/probtalk/.
Related papers
- Combo: Co-speech holistic 3D human motion generation and efficient customizable adaptation in harmony [55.26315526382004]
We propose a novel framework, Combo, for co-speech holistic 3D human motion generation.
In particular, we identify that one fundamental challenge as the multiple-input-multiple-output nature of the generative model of interest.
Combo is highly effective in generating high-quality motions but also efficient in transferring identity and emotion.
arXiv Detail & Related papers (2024-08-18T07:48:49Z) - Beyond Talking -- Generating Holistic 3D Human Dyadic Motion for Communication [17.294279444027563]
We introduce an innovative task focused on human communication, aiming to generate 3D holistic human motions for both speakers and listeners.
We consider the real-time mutual influence between the speaker and the listener and propose a novel chain-like transformer-based auto-regressive model.
Our approach demonstrates state-of-the-art performance on two benchmark datasets.
arXiv Detail & Related papers (2024-03-28T14:47:32Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z) - From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z) - Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks,
Methods, and Applications [20.842799581850617]
We consider the task of animating 3D facial geometry from speech signal.
Existing works are primarily deterministic, focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers.
arXiv Detail & Related papers (2023-11-30T01:14:43Z) - Generating Holistic 3D Human Motion from Speech [97.11392166257791]
We build a high-quality dataset of 3D holistic body meshes with synchronous speech.
We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately.
arXiv Detail & Related papers (2022-12-08T17:25:19Z) - TEMOS: Generating diverse human motions from textual descriptions [53.85978336198444]
We address the problem of generating diverse 3D human motions from textual descriptions.
We propose TEMOS, a text-conditioned generative model leveraging variational autoencoder (VAE) training with human motion data.
We show that TEMOS framework can produce both skeleton-based animations as in prior work, as well more expressive SMPL body motions.
arXiv Detail & Related papers (2022-04-25T14:53:06Z) - Learned Spatial Representations for Few-shot Talking-Head Synthesis [68.3787368024951]
We propose a novel approach for few-shot talking-head synthesis.
We show that this disentangled representation leads to a significant improvement over previous methods.
arXiv Detail & Related papers (2021-04-29T17:59:42Z) - Pose-Controllable Talking Face Generation by Implicitly Modularized
Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.