Audio2Gestures: Generating Diverse Gestures from Speech Audio with
Conditional Variational Autoencoders
- URL: http://arxiv.org/abs/2108.06720v1
- Date: Sun, 15 Aug 2021 11:15:51 GMT
- Title: Audio2Gestures: Generating Diverse Gestures from Speech Audio with
Conditional Variational Autoencoders
- Authors: Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Zhenyu He,
Linchao Bao
- Abstract summary: We propose a novel conditional variational autoencoder (VAE) that explicitly models one-to-many audio-to-motion mapping.
We show that our method generates more realistic and diverse motions than state-of-the-art methods, quantitatively and qualitatively.
- Score: 29.658535633701035
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating conversational gestures from speech audio is challenging due to
the inherent one-to-many mapping between audio and body motions. Conventional
CNNs/RNNs assume one-to-one mapping, and thus tend to predict the average of
all possible target motions, resulting in plain/boring motions during
inference. In order to overcome this problem, we propose a novel conditional
variational autoencoder (VAE) that explicitly models one-to-many
audio-to-motion mapping by splitting the cross-modal latent code into shared
code and motion-specific code. The shared code mainly models the strong
correlation between audio and motion (such as the synchronized audio and motion
beats), while the motion-specific code captures diverse motion information
independent of the audio. However, splitting the latent code into two parts
poses training difficulties for the VAE model. A mapping network facilitating
random sampling along with other techniques including relaxed motion loss,
bicycle constraint, and diversity loss are designed to better train the VAE.
Experiments on both 3D and 2D motion datasets verify that our method generates
more realistic and diverse motions than state-of-the-art methods,
quantitatively and qualitatively. Finally, we demonstrate that our method can
be readily used to generate motion sequences with user-specified motion clips
on the timeline. Code and more results are at
https://jingli513.github.io/audio2gestures.
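For intuition, below is a minimal, non-authoritative sketch of the split-latent design described in the abstract, written in PyTorch. It is not the authors' implementation (see the project page above for that); the module names, GRU backbones, and all dimensions (AUDIO_DIM, MOTION_DIM, SHARED_DIM, SPECIFIC_DIM) are illustrative assumptions, and of the training objectives only a diversity-style loss is sketched.

```python
# Illustrative sketch only, assuming PyTorch and hypothetical dimensions;
# not the authors' code or exact losses.
import torch
import torch.nn as nn

AUDIO_DIM, MOTION_DIM = 64, 96      # assumed per-frame feature sizes
SHARED_DIM, SPECIFIC_DIM = 32, 32   # assumed latent code sizes

class AudioEncoder(nn.Module):
    """Maps audio features to the shared code (audio-motion correlation)."""
    def __init__(self):
        super().__init__()
        self.net = nn.GRU(AUDIO_DIM, SHARED_DIM, batch_first=True)
    def forward(self, audio):                       # (B, T, AUDIO_DIM)
        shared, _ = self.net(audio)                 # (B, T, SHARED_DIM)
        return shared

class MotionEncoder(nn.Module):
    """Maps motion to a shared code plus a motion-specific code."""
    def __init__(self):
        super().__init__()
        self.net = nn.GRU(MOTION_DIM, SHARED_DIM + SPECIFIC_DIM, batch_first=True)
    def forward(self, motion):                      # (B, T, MOTION_DIM)
        h, _ = self.net(motion)
        return h[..., :SHARED_DIM], h[..., SHARED_DIM:]   # shared, specific

class MappingNetwork(nn.Module):
    """Turns random noise into a motion-specific code, so diverse
    gestures can be sampled for the same audio at inference time."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(SPECIFIC_DIM, SPECIFIC_DIM),
                                 nn.ReLU(),
                                 nn.Linear(SPECIFIC_DIM, SPECIFIC_DIM))
    def forward(self, noise):                       # (B, T, SPECIFIC_DIM)
        return self.net(noise)

class Decoder(nn.Module):
    """Decodes concatenated shared + motion-specific codes back to motion."""
    def __init__(self):
        super().__init__()
        self.net = nn.GRU(SHARED_DIM + SPECIFIC_DIM, MOTION_DIM, batch_first=True)
    def forward(self, shared, specific):
        out, _ = self.net(torch.cat([shared, specific], dim=-1))
        return out                                  # (B, T, MOTION_DIM)

def diversity_loss(decoder, shared, mapper, eps=1e-6):
    """Encourage different noise samples (hence different specific codes)
    to decode to visibly different motions for the same audio."""
    z1 = torch.randn(shared.shape[0], shared.shape[1], SPECIFIC_DIM,
                     device=shared.device)
    z2 = torch.randn_like(z1)
    m1 = decoder(shared, mapper(z1))
    m2 = decoder(shared, mapper(z2))
    # Maximize motion distance per unit of latent distance.
    return -(m1 - m2).abs().mean() / ((z1 - z2).abs().mean() + eps)
```

At inference time, the audio is encoded into the shared code and different noise vectors are passed through the mapping network, yielding diverse gestures for the same speech input.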
Related papers
- Loopy: Taming Audio-Driven Portrait Avatar with Long-Term Motion Dependency [15.841490425454344]
We propose an end-to-end audio-only conditioned video diffusion model named Loopy.
Specifically, we designed an inter- and intra-clip temporal module and an audio-to-latents module, enabling the model to leverage long-term motion information.
arXiv Detail & Related papers (2024-09-04T11:55:14Z)
- SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos [77.55518265996312]
We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos.
Our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio, language, and vision when all modality pairs agree.
arXiv Detail & Related papers (2024-04-08T05:19:28Z)
- VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models [58.93124686141781]
Video Motion Customization (VMC) is a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models.
Our approach introduces a novel motion distillation objective using residual vectors between consecutive frames as a motion reference.
We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts.
arXiv Detail & Related papers (2023-12-01T06:50:11Z)
- SpeechAct: Towards Generating Whole-body Motion from Speech [33.10601371020488]
This paper addresses the problem of generating whole-body motion from speech.
We present a novel hybrid point representation to achieve accurate and continuous motion generation.
We also propose a contrastive motion learning method to encourage the model to produce more distinctive representations.
arXiv Detail & Related papers (2023-11-29T07:57:30Z)
- Audio2Gestures: Generating Diverse Gestures from Audio [28.026220492342382]
We propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code.
Our method generates more realistic and diverse motions than previous state-of-the-art methods.
arXiv Detail & Related papers (2023-01-17T04:09:58Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels that adaptively fit the diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine the interactions among only a few selected foreground objects via a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.