Intentional Gesture: Deliver Your Intentions with Gestures for Speech
- URL: http://arxiv.org/abs/2505.15197v1
- Date: Wed, 21 May 2025 07:24:51 GMT
- Title: Intentional Gesture: Deliver Your Intentions with Gestures for Speech
- Authors: Pinxin Liu, Haiyang Liu, Luchuan Song, Chenliang Xu
- Abstract summary: Intentional-Gesture casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI.
- Score: 34.4247439245772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When humans speak, gestures help convey communicative intentions, such as adding emphasis or describing concepts. However, current co-speech gesture generation methods rely solely on superficial linguistic cues (e.g., speech audio or text transcripts), neglecting to understand and leverage the communicative intention that underpins human gestures. This results in outputs that are rhythmically synchronized with speech but semantically shallow. To address this gap, we introduce Intentional-Gesture, a novel framework that casts gesture generation as an intention-reasoning task grounded in high-level communicative functions. First, we curate the InG dataset by augmenting BEAT-2 with gesture-intention annotations (i.e., text sentences summarizing intentions), which are automatically annotated using large vision-language models. Next, we introduce the Intentional Gesture Motion Tokenizer to leverage these intention annotations. It injects high-level communicative functions (e.g., intentions) into tokenized motion representations to enable intention-aware gesture synthesis that is both temporally aligned and semantically meaningful, achieving new state-of-the-art performance on the BEAT-2 benchmark. Our framework offers a modular foundation for expressive gesture generation in digital humans and embodied AI. Project Page: https://andypinxinliu.github.io/Intentional-Gesture
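The abstract does not describe the tokenizer's internals. As a generic illustration of what "tokenized motion representations" means, the minimal sketch below maps each motion frame to the index of its nearest codebook vector, which is the quantization step used by VQ-style tokenizers. The function names, the toy 2-D frames, and the hand-picked codebook are all hypothetical. This is not the authors' Intentional Gesture Motion Tokenizer, and it does not model intention injection.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def tokenize_motion(frames: Sequence[Sequence[float]],
                    codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector.

    Hypothetical helper: a real tokenizer would first encode frames with a
    learned network and learn the codebook during training.
    """
    tokens = []
    for frame in frames:
        distances = [squared_distance(frame, code) for code in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens


def detokenize(tokens: Sequence[int],
               codebook: Sequence[Sequence[float]]) -> List[List[float]]:
    """Replace each token with its codebook vector (lossy reconstruction)."""
    return [list(codebook[t]) for t in tokens]


# Toy data (assumed for illustration): three codebook entries, four frames.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [0.6, 0.1]]

tokens = tokenize_motion(frames, codebook)
reconstruction = detokenize(tokens, codebook)
print(tokens)
```

The discrete token sequence is what a downstream generator would predict from speech. Per the abstract, the paper's contribution lies in making such tokens carry intention information, which this sketch leaves out.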
Related papers
- SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning [0.6249768559720122]
We propose a novel approach for semantic grounding in co-speech gesture generation. Our approach starts with learning the motion prior through a vector-quantized variational autoencoder. Our method outperforms state-of-the-art approaches across two benchmarks in co-speech gesture generation.
arXiv Detail & Related papers (2025-07-25T15:10:15Z) - Understanding Co-speech Gestures in-the-wild [52.5993021523165]
We introduce a new framework for co-speech gesture understanding in the wild. We propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations. We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks.
arXiv Detail & Related papers (2025-03-28T17:55:52Z) - Enhancing Spoken Discourse Modeling in Language Models Using Gestural Cues [56.36041287155606]
We investigate whether the joint modeling of gestures using human motion sequences and language can improve spoken discourse modeling. To integrate gestures into language models, we first encode 3D human motion sequences into discrete gesture tokens using a VQ-VAE. Results show that incorporating gestures enhances marker prediction accuracy across the three tasks.
arXiv Detail & Related papers (2025-03-05T13:10:07Z) - Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis [55.45253486141108]
RAG-Gesture is a diffusion-based gesture generation approach to produce semantically rich gestures. We achieve this by using explicit domain knowledge to retrieve motions from a database of co-speech gestures. We propose a control paradigm for guidance, that allows the users to modulate the amount of influence each retrieval insertion has over the generated sequence.
arXiv Detail & Related papers (2024-12-09T18:59:46Z) - Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained either for generating monologue gestures or even the conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation [41.42316077949012]
We introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation.
Our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement.
Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style.
arXiv Detail & Related papers (2023-09-17T15:06:11Z) - QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation [8.604430209445695]
Speech-driven gesture generation is highly challenging due to the random jitters of human motion.
We introduce a novel quantization-based and phase-guided motion-matching framework.
Our method outperforms recent approaches on speech-driven gesture generation.
arXiv Detail & Related papers (2023-05-18T16:31:25Z) - Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Gesticulator: A framework for semantically-aware speech-driven gesture generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
arXiv Detail & Related papers (2020-01-25T14:42:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.