Audio2Gestures: Generating Diverse Gestures from Audio
- URL: http://arxiv.org/abs/2301.06690v1
- Date: Tue, 17 Jan 2023 04:09:58 GMT
- Title: Audio2Gestures: Generating Diverse Gestures from Audio
- Authors: Jing Li, Di Kang, Wenjie Pei, Xuefei Zhe, Ying Zhang, Linchao Bao,
Zhenyu He
- Abstract summary: We propose to explicitly model the one-to-many audio-to-motion mapping by splitting the cross-modal latent code into shared code and motion-specific code.
Our method generates more realistic and diverse motions than previous state-of-the-art methods.
- Score: 28.026220492342382
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: People may perform diverse gestures affected by various mental and physical
factors when speaking the same sentences. This inherent one-to-many
relationship makes co-speech gesture generation from audio particularly
challenging. Conventional CNNs/RNNs assume one-to-one mapping, and thus tend to
predict the average of all possible target motions, often resulting in
plain/boring motions during inference. We therefore propose to explicitly model the
one-to-many audio-to-motion mapping by splitting the cross-modal latent code
into shared code and motion-specific code. The shared code is expected to be
responsible for the motion component that is more correlated to the audio while
the motion-specific code is expected to capture diverse motion information that
is more independent of the audio. However, splitting the latent code into two
parts poses extra training difficulties. Several crucial training
losses/strategies, including relaxed motion loss, bicycle constraint, and
diversity loss, are designed to better train the VAE.
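As a rough illustration of the latent-code split described above (not the authors' released code), the sketch below pairs an audio encoder and a motion encoder with a decoder that consumes a shared code concatenated with a motion-specific code, plus a simplified stand-in for the diversity loss. The GRU backbones, layer sizes, and the exact loss form are assumptions made for illustration only.

```python
# Illustrative sketch of a VAE whose latent space is split into an
# audio-correlated "shared" code and a "motion-specific" code.
# All architectural choices here are assumptions, not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitLatentVAE(nn.Module):
    def __init__(self, audio_dim=128, pose_dim=72, shared_dim=32, specific_dim=32, hidden=256):
        super().__init__()
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.motion_enc = nn.GRU(pose_dim, hidden, batch_first=True)
        self.audio_to_shared = nn.Linear(hidden, 2 * shared_dim)       # mean + log-variance
        self.motion_to_specific = nn.Linear(hidden, 2 * specific_dim)  # mean + log-variance
        self.decoder = nn.GRU(shared_dim + specific_dim, hidden, batch_first=True)
        self.to_pose = nn.Linear(hidden, pose_dim)

    @staticmethod
    def sample(stats):
        mu, logvar = stats.chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar), mu, logvar

    def forward(self, audio, motion):
        a_feat, _ = self.audio_enc(audio)    # (B, T, hidden), aligned with audio frames
        m_feat, _ = self.motion_enc(motion)  # (B, T, hidden)
        # Shared code: the motion component expected to be predictable from audio.
        z_shared, mu_s, lv_s = self.sample(self.audio_to_shared(a_feat))
        # Motion-specific code: the diverse component that is largely audio-independent.
        z_specific, mu_m, lv_m = self.sample(self.motion_to_specific(m_feat))
        h, _ = self.decoder(torch.cat([z_shared, z_specific], dim=-1))
        return self.to_pose(h), (mu_s, lv_s), (mu_m, lv_m)


def diversity_loss(model, audio, motion):
    """Simplified stand-in for the diversity loss: push two decoded samples apart."""
    out_a, _, _ = model(audio, motion)
    out_b, _, _ = model(audio, motion)
    return -F.l1_loss(out_a, out_b)  # minimizing this maximizes the pairwise distance
```

Under this split, keeping the audio-derived shared code fixed while sampling different motion-specific codes is what yields different, equally plausible motions for the same speech.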
Experiments on both 3D and 2D motion datasets verify that our method
generates more realistic and diverse motions than previous state-of-the-art
methods, quantitatively and qualitatively. Besides, our formulation is
compatible with discrete cosine transform (DCT) modeling and other popular
backbones (i.e., RNN, Transformer). As for motion losses and quantitative
motion evaluation, we find that structured losses/metrics (e.g., STFT) that
consider temporal and/or spatial context complement the most commonly used
point-wise losses (e.g., PCK), resulting in
better motion dynamics and more nuanced motion details. Finally, we demonstrate
that our method can be readily used to generate motion sequences with
user-specified motion clips on the timeline.
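To make the point about structured versus point-wise losses concrete, here is a minimal sketch that pairs a per-frame L1 term with an STFT magnitude term computed on each coordinate's trajectory; the window length, hop size, and weighting are illustrative assumptions, not values reported in the paper.

```python
# Illustrative sketch: a point-wise motion loss complemented by an STFT-based
# structured loss on joint trajectories. Hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def pointwise_loss(pred, target):
    # pred/target: (B, T, D) joint coordinates; plain per-frame L1 error.
    return F.l1_loss(pred, target)

def stft_loss(pred, target, n_fft=32, hop=8):
    # Compare STFT magnitudes of every coordinate's trajectory over time, so the
    # loss reacts to temporal dynamics that frame-wise averaging smooths away.
    B, T, D = pred.shape
    p = pred.permute(0, 2, 1).reshape(B * D, T)
    t = target.permute(0, 2, 1).reshape(B * D, T)
    window = torch.hann_window(n_fft, device=pred.device)
    P = torch.stft(p, n_fft=n_fft, hop_length=hop, window=window, return_complex=True).abs()
    Q = torch.stft(t, n_fft=n_fft, hop_length=hop, window=window, return_complex=True).abs()
    return F.l1_loss(P, Q)

def motion_loss(pred, target, w_stft=0.5):
    # Point-wise term keeps poses accurate; spectral term keeps the dynamics lively.
    return pointwise_loss(pred, target) + w_stft * stft_loss(pred, target)
```

A spectral term of this kind penalizes over-smoothed predictions whose per-frame error looks small but whose temporal dynamics are flat, which is the gap the abstract attributes to purely point-wise losses and metrics such as PCK.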
Related papers
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z)
- SpeechAct: Towards Generating Whole-body Motion from Speech [33.10601371020488]
This paper addresses the problem of generating whole-body motion from speech.
We present a novel hybrid point representation to achieve accurate and continuous motion generation.
We also propose a contrastive motion learning method to encourage the model to produce more distinctive representations.
arXiv Detail & Related papers (2023-11-29T07:57:30Z)
- MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition [50.345327516891615]
We develop a Motion-augmented Long-short Contrastive Learning (MoLo) method that contains two crucial components, including a long-short contrastive objective and a motion autodecoder.
MoLo can simultaneously learn long-range temporal context and motion cues for comprehensive few-shot matching.
arXiv Detail & Related papers (2023-04-03T13:09:39Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- MotionBERT: A Unified Perspective on Learning Human Motion Representations [46.67364057245364]
We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources.
We propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations.
We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network.
arXiv Detail & Related papers (2022-10-12T19:46:25Z)
- MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model [35.32967411186489]
MotionDiffuse is a diffusion model-based text-driven motion generation framework.
It excels at modeling complicated data distribution and generating vivid motion sequences.
It responds to fine-grained instructions on body parts and supports arbitrary-length motion synthesis with time-varied text prompts.
arXiv Detail & Related papers (2022-08-31T17:58:54Z)
- Weakly-supervised Action Transition Learning for Stochastic Human Motion Prediction [81.94175022575966]
We introduce the task of action-driven human motion prediction.
It aims to predict multiple plausible future motions given a sequence of action labels and a short motion history.
arXiv Detail & Related papers (2022-05-31T08:38:07Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Audio2Gestures: Generating Diverse Gestures from Speech Audio with Conditional Variational Autoencoders [29.658535633701035]
We propose a novel conditional variational autoencoder (VAE) that explicitly models one-to-many audio-to-motion mapping.
We show that our method generates more realistic and diverse motions than state-of-the-art methods, quantitatively and qualitatively.
arXiv Detail & Related papers (2021-08-15T11:15:51Z)
- Neural Monocular 3D Human Motion Capture with Physical Awareness [76.55971509794598]
We present a new trainable system for physically plausible markerless 3D human motion capture.
Unlike most neural methods for human motion capture, our approach is aware of physical and environmental constraints.
It produces smooth and physically principled 3D motions in an interactive frame rate in a wide variety of challenging scenes.
arXiv Detail & Related papers (2021-05-03T17:57:07Z)
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.