Freetalker: Controllable Speech and Text-Driven Gesture Generation Based
on Diffusion Models for Enhanced Speaker Naturalness
- URL: http://arxiv.org/abs/2401.03476v1
- Date: Sun, 7 Jan 2024 13:01:29 GMT
- Title: Freetalker: Controllable Speech and Text-Driven Gesture Generation Based
on Diffusion Models for Enhanced Speaker Naturalness
- Authors: Sicheng Yang, Zunnan Xu, Haiwei Xue, Yongkang Cheng, Shaoli Huang,
Mingming Gong, Zhiyong Wu
- Abstract summary: We introduce FreeTalker, which is the first framework for the generation of both spontaneous (e.g., co-speech gesture) and non-spontaneous (e.g., moving around the podium) speaker motions.
Specifically, we train a diffusion-based model for speaker motion generation that employs unified representations of both speech-driven gestures and text-driven motions.
- Score: 45.90256126021112
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current talking avatars mostly generate co-speech gestures based on audio and
text of the utterance, without considering the non-speaking motion of the
speaker. Furthermore, previous works on co-speech gesture generation have
designed network structures based on individual gesture datasets, which results
in limited data volume, compromised generalizability, and restricted speaker
movements. To tackle these issues, we introduce FreeTalker, which, to the best
of our knowledge, is the first framework for the generation of both spontaneous
(e.g., co-speech gesture) and non-spontaneous (e.g., moving around the podium)
speaker motions. Specifically, we train a diffusion-based model for speaker
motion generation that employs unified representations of both speech-driven
gestures and text-driven motions, utilizing heterogeneous data sourced from
various motion datasets. During inference, we utilize classifier-free guidance
to tightly control the style of the clips (see the sketch following the abstract). Additionally, to create smooth
transitions between clips, we utilize DoubleTake, a method that leverages a
generative prior and ensures seamless motion blending. Extensive experiments
show that our method generates natural and controllable speaker movements. Our
code, model, and demo are available at
\url{https://youngseng.github.io/FreeTalker/}.
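The classifier-free guidance step mentioned in the abstract admits a compact illustration. The sketch below is not the authors' released code; the denoiser interface, argument names, and guidance scale are assumptions chosen only to show how a conditional and an unconditional prediction are combined at each sampling step of a motion diffusion model.

```python
import torch


def cfg_denoise(model, x_t: torch.Tensor, t: torch.Tensor, cond,
                guidance_scale: float = 2.5) -> torch.Tensor:
    """One classifier-free-guidance step for a motion diffusion model (illustrative).

    model          -- denoiser taking (noisy motion, timestep, conditioning);
                      a hypothetical interface, not FreeTalker's actual API
    x_t            -- noisy motion clip, shape (batch, frames, pose_dim)
    t              -- diffusion timesteps, shape (batch,)
    cond           -- speech/text features, or None for the unconditional branch
    guidance_scale -- values > 1 push the sample harder toward the conditioning,
                      i.e. stronger style control at some cost in diversity
    """
    pred_cond = model(x_t, t, cond)      # prediction with speech/text conditioning
    pred_uncond = model(x_t, t, None)    # prediction with conditioning dropped
    # Classifier-free guidance: extrapolate from the unconditional prediction
    # toward the conditional one.
    return pred_uncond + guidance_scale * (pred_cond - pred_uncond)
```

In such a setup the conditioning is typically dropped at random during training so that a single network learns both branches; at inference, the guidance scale is the knob that trades motion diversity against how closely a clip follows its speech or text prompt.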
Related papers
- ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance [11.207513771079705]
We introduce ExpGest, a novel framework leveraging synchronized text and audio information to generate expressive full-body gestures.
Unlike AdaIN or one-hot encoding methods, we design a noise emotion classifier for optimizing adversarial direction noise.
We show that ExpGest achieves more expressive, natural, and controllable global motion in speakers compared to state-of-the-art models.
arXiv Detail & Related papers (2024-10-12T07:01:17Z)
- DiffTED: One-shot Audio-driven TED Talk Video Generation with Diffusion-based Co-speech Gestures [27.763304632981882]
We introduce DiffTED, a new approach for one-shot audio-driven talking video generation from a single image.
We leverage a diffusion model to generate sequences of keypoints for a Thin-Plate Spline motion model.
Experiments demonstrate that DiffTED generates temporally coherent talking videos with diverse co-speech gestures.
arXiv Detail & Related papers (2024-09-11T22:31:55Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation [41.42316077949012]
We introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation.
Our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement.
Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style.
arXiv Detail & Related papers (2023-09-17T15:06:11Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- Zero-Shot Style Transfer for Gesture Animation driven by Text and Speech using Adversarial Disentanglement of Multimodal Style Encoding [3.2116198597240846]
We propose an efficient yet effective machine learning approach to synthesize gestures driven by prosodic features and text in the style of different speakers.
Our model performs zero-shot multimodal style transfer driven by multimodal data from the PATS database, which contains videos of various speakers.
arXiv Detail & Related papers (2022-08-03T08:49:55Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
- Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce a novel freeform motion generation model (FreeMo) by equipping a two-stream architecture.
Experiments demonstrate superior performance against several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.