UnifiedGesture: A Unified Gesture Synthesis Model for Multiple Skeletons
- URL: http://arxiv.org/abs/2309.07051v1
- Date: Wed, 13 Sep 2023 16:07:25 GMT
- Authors: Sicheng Yang, Zilin Wang, Zhiyong Wu, Minglei Li, Zhensong Zhang, Qiaochu Huang, Lei Hao, Songcen Xu, Xiaofei Wu, Changpeng Yang, Zonghong Dai
- Abstract summary: We present a novel diffusion model-based speech-driven gesture synthesis approach, trained on multiple gesture datasets with different skeletons.
We then capture the correlation between speech and gestures based on a diffusion model architecture using cross-local attention and self-attention.
Experiments show that UnifiedGesture outperforms recent approaches on speech-driven gesture generation in terms of CCA, FGD, and human-likeness.
- Score: 16.52004713662265
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automatic co-speech gesture generation has drawn much attention in computer
animation. Previous works designed network structures on individual datasets,
which resulted in a lack of data volume and generalizability across different
motion capture standards. In addition, the task is challenging due to the weak
correlation between speech and gestures. To address these problems, we present
UnifiedGesture, a novel diffusion model-based speech-driven gesture synthesis
approach, trained on multiple gesture datasets with different skeletons.
Specifically, we first present a retargeting network to learn latent
homeomorphic graphs for different motion capture standards, unifying the
representations of various gestures while extending the dataset. We then
capture the correlation between speech and gestures based on a diffusion model
architecture using cross-local attention and self-attention to generate better
speech-matched and realistic gestures. To further align speech and gesture and
increase diversity, we incorporate reinforcement learning on the discrete
gesture units with a learned reward function. Extensive experiments show that
UnifiedGesture outperforms recent approaches on speech-driven gesture
generation in terms of CCA, FGD, and human-likeness. All code, pre-trained
models, databases, and demos are available to the public at
https://github.com/YoungSeng/UnifiedGesture.
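The abstract reports FGD (Fréchet Gesture Distance) as one of its metrics. FGD is a Fréchet distance between two Gaussians fitted to latent features of real and generated gestures. The following is only a minimal sketch of that distance, not the authors' evaluation code. It assumes diagonal covariances so it can run on the standard library alone, whereas FGD as normally computed uses full covariance matrices and features from a pretrained gesture feature extractor. The function names and toy feature vectors are hypothetical.

```python
import math
from statistics import fmean, pvariance


def column_stats(features):
    """Per-dimension mean and population variance of equal-length vectors."""
    columns = list(zip(*features))
    means = [fmean(col) for col in columns]
    variances = [pvariance(col) for col in columns]
    return means, variances


def frechet_distance_diag(real_features, generated_features):
    """Fréchet distance between two Gaussians with DIAGONAL covariance.

    Assumption: off-diagonal covariance terms are ignored, so the trace term
    Tr(S1 + S2 - 2*sqrt(S1*S2)) reduces to sum((sqrt(v1) - sqrt(v2))**2).
    """
    mu_r, var_r = column_stats(real_features)
    mu_g, var_g = column_stats(generated_features)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g)
    )
    return mean_term + cov_term


# Toy "latent features" (hypothetical values, two samples of dimension 2).
real = [[0.0, 0.0], [2.0, 2.0]]
shifted = [[3.0, 0.0], [5.0, 2.0]]

print(frechet_distance_diag(real, real))     # 0.0
print(frechet_distance_diag(real, shifted))  # 9.0
```

Lower values mean the generated feature distribution is closer to the real one. In the toy example the second set differs from the first only by a shift of 3 in one dimension, so the distance is the squared mean difference.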
Related papers
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained either for generating monologue gestures or even the conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z)
- EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling [57.08286593059137]
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures.
We first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset.
Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance.
arXiv Detail & Related papers (2023-12-31T02:25:41Z)
- BodyFormer: Semantics-guided 3D Body Gesture Synthesis with Transformer [42.87095473590205]
We propose a novel framework for automatic 3D body gesture synthesis from speech.
Our system is trained with either the Trinity speech-gesture dataset or the Talking With Hands 16.2M dataset.
The results show that our system can produce more realistic, appropriate, and diverse body gestures compared to existing state-of-the-art approaches.
arXiv Detail & Related papers (2023-09-07T01:11:11Z)
- Audio is all in one: speech-driven gesture synthetics using WavLM pre-trained model [2.827070255699381]
diffmotion-v2 is a speech-conditional diffusion-based generative model with WavLM pre-trained model.
It can produce individual and stylized full-body co-speech gestures only using raw speech audio.
arXiv Detail & Related papers (2023-08-11T08:03:28Z)
- QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation [8.604430209445695]
Speech-driven gesture generation is highly challenging due to the random jitters of human motion.
We introduce a novel quantization-based and phase-guided motion-matching framework.
Our method outperforms recent approaches on speech-driven gesture generation.
arXiv Detail & Related papers (2023-05-18T16:31:25Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
- Audio-Visual Fusion Layers for Event Type Aware Video Recognition [86.22811405685681]
We propose a new model to address the multisensory integration problem with individual event-specific layers in a multi-task learning scheme.
We show that our network is formulated with single labels, but it can output additional true multi-labels to represent the given videos.
arXiv Detail & Related papers (2022-02-12T02:56:22Z)
- General-Purpose Speech Representation Learning through a Self-Supervised Multi-Granularity Framework [114.63823178097402]
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning.
Specifically, we propose to use generative learning approaches to capture fine-grained information at small time scales and use discriminative learning approaches to distill coarse-grained or semantic information at large time scales.
arXiv Detail & Related papers (2021-02-03T08:13:21Z)
- Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
arXiv Detail & Related papers (2020-09-04T11:42:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.