LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation
- URL: http://arxiv.org/abs/2309.09294v1
- Date: Sun, 17 Sep 2023 15:06:11 GMT
- Title: LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation
- Authors: Yihao Zhi, Xiaodong Cun, Xuelin Chen, Xi Shen, Wen Guo, Shaoli Huang,
Shenghua Gao
- Abstract summary: We introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation.
Our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement.
Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style.
- Score: 41.42316077949012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Gestures are non-verbal but important behaviors accompanying people's speech.
While previous methods are able to generate speech rhythm-synchronized
gestures, the semantic context of the speech is generally lacking in the
gesticulations. Although semantic gestures do not occur very regularly in human
speech, they are indeed the key for the audience to understand the speech
context in a more immersive environment. Hence, we introduce LivelySpeaker, a
framework that realizes semantics-aware co-speech gesture generation and offers
several control handles. In particular, our method decouples the task into two
stages: script-based gesture generation and audio-guided rhythm refinement.
Specifically, the script-based gesture generation leverages the pre-trained
CLIP text embeddings as the guidance for generating gestures that are highly
semantically aligned with the script. Then, we devise a simple yet effective
diffusion-based gesture generation backbone built on pure MLPs, which is
conditioned only on audio signals and learns to gesticulate with realistic
motions. We use this powerful prior to synchronize the script-guided gestures
with the rhythm of the audio signals, notably in a zero-shot setting. Our novel two-stage
generation framework also enables several applications, such as changing the
gesticulation style, editing the co-speech gestures via textual prompting, and
controlling the semantic awareness and rhythm alignment with guided diffusion.
Extensive experiments demonstrate the advantages of the proposed framework over
competing methods. In addition, our core diffusion-based generative model also
achieves state-of-the-art performance on two benchmarks. The code and model
will be released to facilitate future research.
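Below is a minimal, illustrative PyTorch sketch of the two-stage idea described in the abstract: a script-conditioned generator guided by CLIP text embeddings produces gestures, and a pure-MLP diffusion prior conditioned on audio refines their rhythm, with a guided-diffusion-style blending of the two conditions. All module names, dimensions, the noising schedule, and the guidance weight are assumptions for illustration only, not the authors' released implementation.

```python
# Hypothetical sketch of the two-stage idea: stage 1 generates gestures from a
# CLIP text embedding of the script; stage 2 refines their rhythm with an
# audio-conditioned MLP diffusion prior. Names, dimensions, and the toy
# noising/denoising schedule are illustrative assumptions.
import torch
import torch.nn as nn


class MLPDenoiser(nn.Module):
    """Pure-MLP denoiser: predicts the noise added to a gesture clip."""

    def __init__(self, pose_dim=135, frames=64, cond_dim=512, hidden=1024):
        super().__init__()
        in_dim = pose_dim * frames + cond_dim + 1  # poses + condition + timestep
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, pose_dim * frames),
        )
        self.frames, self.pose_dim = frames, pose_dim

    def forward(self, noisy_poses, t, cond):
        x = torch.cat([noisy_poses.flatten(1), cond, t[:, None].float()], dim=1)
        return self.net(x).view(-1, self.frames, self.pose_dim)


@torch.no_grad()
def refine_with_guidance(script_gestures, audio_feat, text_feat,
                         audio_model, text_model, steps=50, w_rhythm=0.6):
    """Blend the audio-conditioned prior with the script-conditioned score,
    roughly in the spirit of guided diffusion (weights are illustrative)."""
    x = script_gestures.clone()
    for t in reversed(range(steps)):
        tt = torch.full((x.size(0),), t, device=x.device)
        noise = torch.randn_like(x)
        x_noisy = x + 0.1 * noise            # toy noising step
        eps_audio = audio_model(x_noisy, tt, audio_feat)
        eps_text = text_model(x_noisy, tt, text_feat)
        eps = w_rhythm * eps_audio + (1 - w_rhythm) * eps_text
        x = x_noisy - 0.1 * eps              # toy denoising update
    return x


if __name__ == "__main__":
    B = 2
    audio_model = MLPDenoiser()                  # rhythm prior (stage 2)
    text_model = MLPDenoiser()                   # script-conditioned generator (stage 1)
    script_gestures = torch.randn(B, 64, 135)    # stage-1 output (placeholder)
    audio_feat = torch.randn(B, 512)             # pooled audio features (placeholder)
    clip_text_feat = torch.randn(B, 512)         # CLIP text embedding (placeholder)
    refined = refine_with_guidance(script_gestures, audio_feat, clip_text_feat,
                                   audio_model, text_model)
    print(refined.shape)  # torch.Size([2, 64, 135])
```

In the described framework, the stage-1 gestures would come from the CLIP-text-conditioned generator and the audio condition from an audio encoder; random tensors stand in for both here, and the guidance weight plays the role of the semantic-versus-rhythm control mentioned above.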
Related papers
- ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance [11.207513771079705]
We introduce ExpGest, a novel framework leveraging synchronized text and audio information to generate expressive full-body gestures.
Unlike AdaIN or one-hot encoding methods, we design a noise emotion classifier for optimizing adversarial direction noise.
We show that ExpGest achieves more expressive, natural, and controllable global motion in speakers compared to state-of-the-art models.
arXiv Detail & Related papers (2024-10-12T07:01:17Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - Freetalker: Controllable Speech and Text-Driven Gesture Generation Based
on Diffusion Models for Enhanced Speaker Naturalness [45.90256126021112]
We introduce FreeTalker, the first framework for generating both spontaneous (e.g., co-speech gestures) and non-spontaneous (e.g., moving around the podium) speaker motions.
Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions.
arXiv Detail & Related papers (2024-01-07T13:01:29Z) - QPGesture: Quantization-Based and Phase-Guided Motion Matching for
Natural Speech-Driven Gesture Generation [8.604430209445695]
Speech-driven gesture generation is highly challenging due to the random jitters of human motion.
We introduce a novel quantization-based and phase-guided motion-matching framework.
Our method outperforms recent approaches on speech-driven gesture generation.
arXiv Detail & Related papers (2023-05-18T16:31:25Z) - Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture
Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce a novel freeform motion generation model (FreeMo) equipped with a two-stream architecture.
Experiments demonstrate superior performance against several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z) - Gesticulator: A framework for semantically-aware speech-driven gesture
generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.