Moving fast and slow: Analysis of representations and post-processing in
speech-driven automatic gesture generation
- URL: http://arxiv.org/abs/2007.09170v3
- Date: Thu, 28 Jan 2021 12:49:17 GMT
- Title: Moving fast and slow: Analysis of representations and post-processing in
speech-driven automatic gesture generation
- Authors: Taras Kucherenko, Dai Hasegawa, Naoshi Kaneko, Gustav Eje Henter,
Hedvig Kjellström
- Abstract summary: We extend recent deep-learning-based, data-driven methods for speech-driven gesture generation by incorporating representation learning.
Our model takes speech as input and produces gestures as output, in the form of a sequence of 3D coordinates.
We conclude that it is important to take both motion representation and post-processing into account when designing an automatic gesture-production method.
- Score: 7.6857153840014165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel framework for speech-driven gesture production,
applicable to virtual agents to enhance human-computer interaction.
Specifically, we extend recent deep-learning-based, data-driven methods for
speech-driven gesture generation by incorporating representation learning. Our
model takes speech as input and produces gestures as output, in the form of a
sequence of 3D coordinates. We provide an analysis of different representations
for the input (speech) and the output (motion) of the network by both objective
and subjective evaluations. We also analyse the importance of smoothing of the
produced motion. Our results indicated that the proposed method improved on our
baseline in terms of objective measures. For example, it better captured the
motion dynamics and better matched the motion-speed distribution. Moreover, we
performed user studies on two different datasets. The studies confirmed that
our proposed method is perceived as more natural than the baseline, although
the difference in the studies was eliminated by appropriate post-processing:
hip-centering and smoothing. We conclude that it is important to take both
motion representation and post-processing into account when designing an
automatic gesture-production method.
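The abstract names two post-processing steps, hip-centering and smoothing, and an objective measure based on the motion-speed distribution. The sketch below illustrates what such post-processing and speed measurement could look like for a motion clip stored as a (frames × joints × 3) coordinate array; the hip-joint index, the moving-average filter, and the frame rate are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def hip_center(motion: np.ndarray, hip_index: int = 0) -> np.ndarray:
    """Translate each frame so the hip joint sits at the origin.

    motion: (T, J, 3) array of T frames, J joints, 3D coordinates.
    hip_index: assumed index of the hip/root joint.
    """
    return motion - motion[:, hip_index:hip_index + 1, :]

def smooth(motion: np.ndarray, window: int = 5) -> np.ndarray:
    """Smooth every joint coordinate over time with a moving average.

    A simple box filter stands in for the paper's smoothing step; the
    exact filter the authors used is not specified here.
    """
    T, J, C = motion.shape
    kernel = np.ones(window) / window
    flat = motion.reshape(T, J * C)
    smoothed = np.stack(
        [np.convolve(flat[:, d], kernel, mode="same") for d in range(J * C)],
        axis=1,
    )
    return smoothed.reshape(T, J, C)

def speed_distribution(motion: np.ndarray, fps: float = 20.0, bins: int = 50):
    """Histogram of per-joint, per-frame speeds: one way to compare the
    motion-speed distribution of generated motion against ground truth."""
    velocities = np.diff(motion, axis=0) * fps             # (T-1, J, 3)
    speeds = np.linalg.norm(velocities, axis=-1).ravel()   # scalar speed per joint per frame
    return np.histogram(speeds, bins=bins, density=True)

def postprocess(motion: np.ndarray) -> np.ndarray:
    """Hip-centering followed by smoothing, the two steps named in the abstract."""
    return smooth(hip_center(motion))
```

In this sketch the same post-processing would be applied to both the baseline and the proposed model's output, which is how one could check whether perceived differences survive hip-centering and smoothing, as the user studies in the paper examine.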
Related papers
- DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z) - From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations [107.88375243135579]
Given speech audio, we output multiple possibilities of gestural motion for an individual, including face, body, and hands.
We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures.
Experiments show our model generates appropriate and diverse gestures, outperforming both diffusion- and VQ-only methods.
arXiv Detail & Related papers (2024-01-03T18:55:16Z) - Motion Flow Matching for Human Motion Synthesis and Editing [75.13665467944314]
We propose Motion Flow Matching, a novel generative model for human motion generation featuring efficient sampling and effectiveness in motion editing applications.
Our method reduces the sampling complexity from thousands of steps in previous diffusion models to just ten steps, while achieving comparable performance in text-to-motion and action-to-motion generation benchmarks.
arXiv Detail & Related papers (2023-12-14T12:57:35Z) - SpeechAct: Towards Generating Whole-body Motion from Speech [33.10601371020488]
This paper addresses the problem of generating whole-body motion from speech.
We present a novel hybrid point representation to achieve accurate and continuous motion generation.
We also propose a contrastive motion learning method to encourage the model to produce more distinctive representations.
arXiv Detail & Related papers (2023-11-29T07:57:30Z) - AQ-GT: a Temporally Aligned and Quantized GRU-Transformer for Co-Speech
Gesture Synthesis [0.0]
We present an approach to pre-train partial gesture sequences using a generative adversarial network with a quantization pipeline.
By learning a mapping into a latent-space representation, as opposed to mapping directly to a vector representation, this framework facilitates the generation of highly realistic and expressive gestures.
arXiv Detail & Related papers (2023-05-02T07:59:38Z) - ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model [33.64263969970544]
3D human motion generation is crucial for the creative industry.
Recent advances rely on generative models with domain knowledge for text-driven motion generation.
We propose ReMoDiffuse, a diffusion-model-based motion generation framework.
arXiv Detail & Related papers (2023-04-03T16:29:00Z) - Co-Speech Gesture Synthesis using Discrete Gesture Token Learning [1.1694169299062596]
Synthesizing realistic co-speech gestures is an important and yet unsolved problem for creating believable motions.
One challenge in learning the co-speech gesture model is that there may be multiple viable gesture motions for the same speech utterance.
We propose a two-stage model to address this uncertainty issue in gesture synthesis by modeling the gesture segments as discrete latent codes.
arXiv Detail & Related papers (2023-03-04T01:42:09Z) - Task Formulation Matters When Learning Continually: A Case Study in
Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z) - Improving Generation and Evaluation of Visual Stories via Semantic
Consistency [72.00815192668193]
Given a series of natural language captions, an agent must generate a sequence of images that correspond to the captions.
Prior work has introduced recurrent generative models which outperform text-to-image synthesis models on this task.
We present a number of improvements to prior modeling approaches, including the addition of a dual learning framework.
arXiv Detail & Related papers (2021-05-20T20:42:42Z) - Hierarchical Style-based Networks for Motion Synthesis [150.226137503563]
We propose a self-supervised method for generating long-range, diverse and plausible behaviors to achieve a specific goal location.
Our proposed method learns to model human motion by decomposing the long-range generation task in a hierarchical manner.
On a large-scale skeleton dataset, we show that the proposed method is able to synthesise long-range, diverse and plausible motion.
arXiv Detail & Related papers (2020-08-24T02:11:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.