AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech
Gesture Synthesis
- URL: http://arxiv.org/abs/2305.01241v2
- Date: Mon, 8 May 2023 11:59:12 GMT
- Title: AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech
Gesture Synthesis
- Authors: Hendric Voß and Stefan Kopp
- Abstract summary: We present an approach to pre-train partial gesture sequences using a generative adversarial network with a quantization pipeline.
By learning the mapping of a latent space representation as opposed to directly mapping it to a vector representation, this framework facilitates the generation of highly realistic and expressive gestures.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The generation of realistic and contextually relevant co-speech gestures is a
challenging yet increasingly important task in the creation of multimodal
artificial agents. Prior methods focused on learning a direct correspondence
between co-speech gesture representations and produced motions, which created
seemingly natural but often unconvincing gestures during human assessment. We
present an approach to pre-train partial gesture sequences using a generative
adversarial network with a quantization pipeline. The resulting codebook
vectors serve as both input and output in our framework, forming the basis for
the generation and reconstruction of gestures. By learning the mapping of a
latent space representation as opposed to directly mapping it to a vector
representation, this framework facilitates the generation of highly realistic
and expressive gestures that closely replicate human movement and behavior,
while simultaneously avoiding artifacts in the generation process. We evaluate
our approach by comparing it with established methods for generating co-speech
gestures as well as with existing datasets of human behavior. We also perform
an ablation study to assess our findings. The results show that our approach
outperforms the current state of the art by a clear margin and is partially
indistinguishable from human gesturing. We make our data pipeline and the
generation framework publicly available.
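A core technical ingredient described above is the quantization pipeline that maps partial gesture sequences onto a learned codebook, whose vectors then serve as both the input and the output of the generation framework. As a rough illustration of this idea only, the sketch below implements a generic VQ-VAE-style codebook lookup with a straight-through gradient; it is not the authors' AQ-GT code, and the module name, codebook size, and latent dimension are hypothetical.

```python
# Minimal, generic VQ-style codebook quantization sketch (PyTorch).
# NOT the AQ-GT implementation; names and dimensions are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GestureCodebook(nn.Module):
    def __init__(self, num_codes: int = 512, code_dim: int = 128):
        super().__init__()
        # Learnable codebook of discrete gesture codes.
        self.codes = nn.Embedding(num_codes, code_dim)

    def forward(self, z_e: torch.Tensor):
        # z_e: continuous encoder output for partial gesture sequences,
        # shape (batch, time, code_dim).
        flat = z_e.reshape(-1, z_e.shape[-1])
        # Squared Euclidean distance from each latent to every codebook vector.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codes.weight.t()
                + self.codes.weight.pow(2).sum(1))
        idx = dist.argmin(dim=1)                      # nearest code per time step
        z_q = self.codes(idx).view_as(z_e)            # quantized latents
        # Straight-through estimator so gradients reach the encoder.
        z_q_st = z_e + (z_q - z_e).detach()
        # Standard VQ codebook + commitment losses.
        vq_loss = F.mse_loss(z_q, z_e.detach()) + 0.25 * F.mse_loss(z_e, z_q.detach())
        return z_q_st, idx.view(z_e.shape[:-1]), vq_loss
```

In such a setup, a speech-conditioned generator (a GRU-Transformer, per the paper's title) would predict codebook entries rather than raw joint coordinates, which is the latent-space mapping the abstract contrasts with direct vector regression.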
Related papers
- Incorporating Spatial Awareness in Data-Driven Gesture Generation for Virtual Agents [17.299991009921307]
This paper focuses on enhancing human-agent communication by integrating spatial context into virtual agents' non-verbal behaviors, specifically gestures.
Recent advances in co-speech gesture generation have primarily utilized data-driven methods, which create natural motion but limit the scope of gestures to those performed in a void.
Our work aims to extend these methods by enabling generative models to incorporate scene information into speech-driven gesture synthesis.
arXiv Detail & Related papers (2024-08-07T23:23:50Z) - Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z) - Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks and achieves new state-of-the-art performance by a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z) - Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction.
Experimental results demonstrate that MPI achieves improvements of 10% to 64% over the previous state of the art on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z) - Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z) - DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z) - From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z) - Co-Speech Gesture Detection through Multi-Phase Sequence Labeling [3.924524252255593]
We introduce a novel framework that reframes the task as a multi-phase sequence labeling problem.
We evaluate our proposal on a large dataset of diverse co-speech gestures in task-oriented face-to-face dialogues.
arXiv Detail & Related papers (2023-08-21T12:27:18Z) - Co-Speech Gesture Synthesis using Discrete Gesture Token Learning [1.1694169299062596]
Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions.
One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance.
We propose a two-stage model to address this uncertainty in gesture synthesis by modeling gesture segments as discrete latent codes.
arXiv Detail & Related papers (2023-03-04T01:42:09Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture
Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Moving fast and slow: Analysis of representations and post-processing in
speech-driven automatic gesture generation [7.6857153840014165]
We extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning.
Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.
We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-production method.
arXiv Detail & Related papers (2020-07-16T07:32:00Z)