Related papers: Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation

URL: http://arxiv.org/abs/2309.05455v1
Date: Mon, 11 Sep 2023 13:51:06 GMT
Title: Diffusion-Based Co-Speech Gesture Generation Using Joint Text and Audio Representation
Authors: Anna Deichler, Shivam Mehta, Simon Alexanderson, Jonas Beskow
Abstract summary: This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model.
Score: 18.04996323708772
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: This paper describes a system developed for the GENEA (Generation and Evaluation of Non-verbal Behaviour for Embodied Agents) Challenge 2023. Our solution builds on an existing diffusion-based motion synthesis model. We propose a contrastive speech and motion pretraining (CSMP) module, which learns a joint embedding for speech and gesture with the aim to learn a semantic coupling between these modalities. The output of the CSMP module is used as a conditioning signal in the diffusion-based gesture synthesis model in order to achieve semantically-aware co-speech gesture generation. Our entry achieved highest human-likeness and highest speech appropriateness rating among the submitted entries. This indicates that our system is a promising approach to achieve human-like co-speech gestures in agents that carry semantic meaning.

Related papers

MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions.<n>Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction.<n>We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z)
Social Agent: Mastering Dyadic Nonverbal Behavior Generation via Conversational LLM Agents [13.902411927285328]
Social Agent is a novel framework for synthesizing realistic and contextually appropriate co-speech nonverbal behaviors in dyadic conversations.<n>We develop an agentic system driven by a Large Language Model (LLM) to direct the conversation flow and determine appropriate interactive behaviors for both participants.<n>We propose a novel dual-person gesture generation model based on an auto-regressive diffusion model, which synthesizes coordinated motions from speech signals.
arXiv Detail & Related papers (2025-10-06T09:41:37Z)
MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance.<n>Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning [0.6249768559720122]
We propose a novel approach for semantic grounding in co-speech gesture generation.<n>Our approach starts with learning the motion prior through a vector-quantized variational autoencoder.<n>Our method outperforms state-of-the-art approaches across two benchmarks in co-speech gesture generation.
arXiv Detail & Related papers (2025-07-25T15:10:15Z)
HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation [42.30003982604611]
Co-speech gestures are crucial non-verbal cues that enhance speech clarity and strides in human communication. We propose a novel method named HOP for co-speech gesture generation, capturing heterogeneous entanglement between gesture motion, audio rhythm, and text semantics. HOP achieves state-of-the-art offering more natural and expressive co-speech gesture generation.
arXiv Detail & Related papers (2025-03-03T04:47:39Z)
Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis [55.45253486141108]
RAG-Gesture is a diffusion-based gesture generation approach to produce semantically rich gestures. We achieve this by using explicit domain knowledge to retrieve motions from a database of co-speech gestures. We propose a control paradigm for guidance, that allows the users to modulate the amount of influence each retrieval insertion has over the generated sequence.
arXiv Detail & Related papers (2024-12-09T18:59:46Z)
Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence. Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit. Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z)
ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis. Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities. Our method is versatile in that it can be trained either for generating monologue gestures or even the conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation. SpeechGPT-Gen is efficient in semantic and perceptual information modeling. It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z)
Speech-Gesture GAN: Gesture Generation for Robots and Embodied Agents [5.244401764969407]
Embodied agents, in the form of virtual agents or social robots, are rapidly becoming more widespread. We propose a novel framework that can generate sequences of joint angles from the speech text and speech audio utterances.
arXiv Detail & Related papers (2023-09-17T18:46:25Z)
A Survey on Audio Diffusion Models: Text To Speech Synthesis and Enhancement in Generative AI [64.71397830291838]
Generative AI has demonstrated impressive performance in various fields, among which speech synthesis is an interesting direction. With the diffusion model as the most popular generative model, numerous works have attempted two active tasks: text to speech and speech enhancement. This work conducts a survey on audio diffusion model, which is complementary to existing surveys.
arXiv Detail & Related papers (2023-03-23T15:17:15Z)
Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation [41.292644854306594]
We propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture) DiffGesture achieves state-of-theart performance, which renders coherent gestures with better mode coverage and stronger audio correlations.
arXiv Detail & Related papers (2023-03-16T07:32:31Z)
Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation. The proposed method renders realistic co-speech gestures and outperforms previous methods in a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech. During training time, the multi-scale style model could be jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze input acoustic signal to understand its linguistic content and make predictions. Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text. We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
Gesticulator: A framework for semantically-aware speech-driven gesture generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together. Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output. The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)

This list is automatically generated from the titles and abstracts of the papers in this site.