CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation
- URL: http://arxiv.org/abs/2511.22863v1
- Date: Fri, 28 Nov 2025 03:38:08 GMT
- Title: CoordSpeaker: Exploiting Gesture Captioning for Coordinated Caption-Empowered Co-Speech Gesture Generation
- Authors: Fengyi Fang, Sicheng Yang, Wenming Yang
- Abstract summary: CoordSpeaker is a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our method produces high-quality gestures that are both rhythmically synchronized with speech and semantically coherent with arbitrary captions.
- Score: 44.84719308595376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Co-speech gesture generation has significantly advanced human-computer interaction, yet speaker movements remain constrained due to the omission of text-driven non-spontaneous gestures (e.g., bowing while talking). Existing methods face two key challenges: 1) the semantic prior gap due to the lack of descriptive text annotations in gesture datasets, and 2) the difficulty in achieving coordinated multimodal control over gesture generation. To address these challenges, this paper introduces CoordSpeaker, a comprehensive framework that enables coordinated caption-empowered co-speech gesture synthesis. Our approach first bridges the semantic prior gap through a novel gesture captioning framework, leveraging a motion-language model to generate descriptive captions at multiple granularities. Building upon this, we propose a conditional latent diffusion model with unified cross-dataset motion representation and a hierarchically controlled denoiser to achieve highly controlled, coordinated gesture generation. CoordSpeaker pioneers the first exploration of gesture understanding and captioning to tackle the semantic gap in gesture generation while offering a novel perspective of bidirectional gesture-text mapping. Extensive experiments demonstrate that our method produces high-quality gestures that are both rhythmically synchronized with speech and semantically coherent with arbitrary captions, achieving superior performance with higher efficiency compared to existing approaches.
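The abstract describes a latent diffusion denoiser conditioned jointly on speech and a text caption. Below is a minimal sketch of what such dual conditioning could look like; the module names, feature dimensions, and injection scheme are illustrative assumptions, not CoordSpeaker's actual architecture.

```python
# Sketch of a dual-conditioned latent diffusion denoiser: frame-aligned speech
# features carry the rhythmic signal, a pooled caption embedding carries the
# semantic intent. All shapes and module names are assumptions for illustration.
import torch
import torch.nn as nn

class DualConditionDenoiser(nn.Module):
    def __init__(self, latent_dim=256, audio_dim=128, text_dim=512, hidden=512):
        super().__init__()
        self.time_embed = nn.Sequential(nn.Linear(1, hidden), nn.SiLU(), nn.Linear(hidden, hidden))
        self.audio_proj = nn.Linear(audio_dim, hidden)   # rhythm conditioning from speech
        self.text_proj = nn.Linear(text_dim, hidden)     # semantic conditioning from the caption
        self.in_proj = nn.Linear(latent_dim, hidden)
        self.out_proj = nn.Linear(hidden, latent_dim)
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden, nhead=8, batch_first=True),
            num_layers=4,
        )

    def forward(self, z_t, t, audio_feats, caption_emb):
        # z_t: (B, T, latent_dim) noisy motion latents; audio_feats: (B, T, audio_dim)
        # caption_emb: (B, text_dim); t: (B,) diffusion timestep
        h = self.in_proj(z_t) + self.audio_proj(audio_feats)            # frame-aligned speech condition
        tok = self.time_embed(t.float().unsqueeze(-1)) + self.text_proj(caption_emb)
        h = torch.cat([tok.unsqueeze(1), h], dim=1)                     # prepend a global (time + caption) token
        h = self.backbone(h)[:, 1:]                                     # drop the global token
        return self.out_proj(h)                                         # predicted noise (or clean latent)

# Example: one denoising call
model = DualConditionDenoiser()
eps_hat = model(torch.randn(2, 60, 256), torch.randint(0, 1000, (2,)),
                torch.randn(2, 60, 128), torch.randn(2, 512))
print(eps_hat.shape)  # torch.Size([2, 60, 256])
```

In this toy setup the caption acts as a single global token while the speech features are injected per frame; the paper's hierarchically controlled denoiser and cross-dataset motion representation are more elaborate than this sketch.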
Related papers
- MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions.
Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction.
We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z) - SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning [0.6249768559720122]
We propose a novel approach for semantic grounding in co-speech gesture generation.
Our approach starts with learning the motion prior through a vector-quantized variational autoencoder.
Our method outperforms state-of-the-art approaches across two benchmarks in co-speech gesture generation.
arXiv Detail & Related papers (2025-07-25T15:10:15Z) - Understanding Co-speech Gestures in-the-wild [52.5993021523165]
We introduce a new framework for co-speech gesture understanding in the wild.
We propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-speech-text associations.
We present a new approach that learns a tri-modal video-gesture-speech-text representation to solve these tasks.
arXiv Detail & Related papers (2025-03-28T17:55:52Z) - Contextual Gesture: Co-Speech Gesture Video Generation through Context-aware Gesture Representation [18.980289980431426]
Contextual Gesture is a framework that improves co-speech gesture video generation through three innovative components.
Our experiments demonstrate that Contextual Gesture not only produces realistic and speech-aligned gesture videos but also supports long-sequence generation and video gesture editing applications.
arXiv Detail & Related papers (2025-02-11T04:09:12Z) - Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis [55.45253486141108]
RAG-Gesture is a diffusion-based gesture generation approach to produce semantically rich gestures.
We achieve this by using explicit domain knowledge to retrieve motions from a database of co-speech gestures.
We propose a control paradigm for guidance, that allows the users to modulate the amount of influence each retrieval insertion has over the generated sequence.
arXiv Detail & Related papers (2024-12-09T18:59:46Z) - Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation [44.78811546051805]
Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal.
Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence.
We propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture.
arXiv Detail & Related papers (2024-10-17T17:22:59Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained either for generating monologue gestures or even the conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation [41.42316077949012]
We introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation.
Our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement.
Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style.
arXiv Detail & Related papers (2023-09-17T15:06:11Z) - QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation [8.604430209445695]
Speech-driven gesture generation is highly challenging due to the random jitters of human motion.
We introduce a novel quantization-based and phase-guided motion-matching framework.
Our method outperforms recent approaches on speech-driven gesture generation.
arXiv Detail & Related papers (2023-05-18T16:31:25Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
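For a concrete, if simplified, picture of the hierarchical audio-to-gesture idea named in the last entry, here is a hedged sketch in which audio features at several temporal resolutions drive coarse-to-fine pose decoders. The level count, dimensions, and module names are assumptions for illustration, not the published HA2G architecture.

```python
# Rough sketch of a hierarchical audio-to-gesture mapping: coarser (subsampled)
# audio drives a coarse pose estimate that finer levels refine with residuals.
# All names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class HierarchicalAudio2Gesture(nn.Module):
    def __init__(self, audio_dim=64, pose_dim=48, hidden=128, levels=3):
        super().__init__()
        # One GRU encoder per level; deeper levels see temporally pooled (coarser) audio.
        self.encoders = nn.ModuleList([nn.GRU(audio_dim, hidden, batch_first=True) for _ in range(levels)])
        # Each level predicts a residual refinement on top of the coarser estimate.
        self.decoders = nn.ModuleList([nn.Linear(hidden, pose_dim) for _ in range(levels)])
        self.levels = levels

    def forward(self, audio):
        # audio: (B, T, audio_dim); returns (B, T, pose_dim) pose sequence
        B, T, _ = audio.shape
        pose = audio.new_zeros(B, T, self.decoders[0].out_features)
        for lvl in range(self.levels):
            stride = 2 ** (self.levels - 1 - lvl)             # coarse levels first
            coarse = audio[:, ::stride]                       # subsample in time
            feats, _ = self.encoders[lvl](coarse)
            residual = self.decoders[lvl](feats)              # (B, T // stride, pose_dim)
            # upsample back to full length and accumulate coarse-to-fine
            residual = nn.functional.interpolate(
                residual.transpose(1, 2), size=T, mode="linear", align_corners=False
            ).transpose(1, 2)
            pose = pose + residual
        return pose

poses = HierarchicalAudio2Gesture()(torch.randn(2, 120, 64))
print(poses.shape)  # torch.Size([2, 120, 48])
```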