3DGesPolicy: Phoneme-Aware Holistic Co-Speech Gesture Generation Based on Action Control
- URL: http://arxiv.org/abs/2601.18451v1
- Date: Mon, 26 Jan 2026 12:57:36 GMT
- Title: 3DGesPolicy: Phoneme-Aware Holistic Co-Speech Gesture Generation Based on Action Control
- Authors: Xuanmeng Sha, Liyun Zhang, Tomohiro Mashita, Naoya Chiba, Yuki Uranishi,
- Abstract summary: 3DGesPolicy is an action-based framework that reformulates holistic gesture generation as a continuous trajectory control problem.<n>By modeling frame-to-frame variations as unified holistic actions, our method effectively learns inter-frame holistic gesture motion patterns.<n>To further bridge the gap in expressive alignment, we propose a Gesture-Audio-Phoneme (GAP) fusion module.
- Score: 3.606473077857744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating holistic co-speech gestures that integrate full-body motion with facial expressions suffers from semantically incoherent coordination on body motion and spatially unstable meaningless movements due to existing part-decomposed or frame-level regression methods, We introduce 3DGesPolicy, a novel action-based framework that reformulates holistic gesture generation as a continuous trajectory control problem through diffusion policy from robotics. By modeling frame-to-frame variations as unified holistic actions, our method effectively learns inter-frame holistic gesture motion patterns and ensures both spatially and semantically coherent movement trajectories that adhere to realistic motion manifolds. To further bridge the gap in expressive alignment, we propose a Gesture-Audio-Phoneme (GAP) fusion module that can deeply integrate and refine multi-modal signals, ensuring structured and fine-grained alignment between speech semantics, body motion, and facial expressions. Extensive quantitative and qualitative experiments on the BEAT2 dataset demonstrate the effectiveness of our 3DGesPolicy across other state-of-the-art methods in generating natural, expressive, and highly speech-aligned holistic gestures.
Related papers
- MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions.<n>Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction.<n>We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z) - Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning [56.6025512458557]
Motion-language retrieval aims to bridge the semantic gap between natural language and human motion.<n>Existing approaches predominantly focus on aligning entire motion sequences with global textual representations.<n>We propose a novel Pyramidal Shapley-Taylor (PST) learning framework for fine-grained motion-language retrieval.
arXiv Detail & Related papers (2026-01-29T16:00:12Z) - Towards Unified Co-Speech Gesture Generation via Hierarchical Implicit Periodicity Learning [13.132419390712807]
We argue that prevailing learning schemes fail to model crucial inter- and intra-correlations across different motion units.<n>We propose a unified Hierarchical Implicit Periodicity (HIP) learning approach for audio-inspired 3D gesture generation.
arXiv Detail & Related papers (2025-12-15T09:43:08Z) - InteracTalker: Prompt-Based Human-Object Interaction with Co-Speech Gesture Generation [1.7523719472700858]
We introduce InteracTalker, a novel framework that seamlessly integrates prompt-based object-aware interactions with co-speech gesture generation.<n>Our framework utilizes a Generalized Motion Adaptation Module that enables independent training, adapting to the corresponding motion condition.<n>InteracTalker successfully unifies these previously separate tasks, outperforming prior methods in both co-speech gesture generation and object-interaction synthesis.
arXiv Detail & Related papers (2025-12-14T12:29:49Z) - MoReact: Generating Reactive Motion from Textual Descriptions [57.642436102978245]
MoReact is a diffusion-based method designed to disentangle the generation of global trajectories and local motions sequentially.<n>Our experiments, utilizing data adapted from a two-person motion dataset, demonstrate the efficacy of our approach.
arXiv Detail & Related papers (2025-09-28T14:31:41Z) - HOP: Heterogeneous Topology-based Multimodal Entanglement for Co-Speech Gesture Generation [42.30003982604611]
Co-speech gestures are crucial non-verbal cues that enhance speech clarity and strides in human communication.<n>We propose a novel method named HOP for co-speech gesture generation, capturing heterogeneous entanglement between gesture motion, audio rhythm, and text semantics.<n>HOP achieves state-of-the-art offering more natural and expressive co-speech gesture generation.
arXiv Detail & Related papers (2025-03-03T04:47:39Z) - Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained either for generating monologue gestures or even the conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD)
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z) - DiverseMotion: Towards Diverse Human Motion Generation via Discrete
Diffusion [70.33381660741861]
We present DiverseMotion, a new approach for synthesizing high-quality human motions conditioned on textual descriptions.
We show that our DiverseMotion achieves the state-of-the-art motion quality and competitive motion diversity.
arXiv Detail & Related papers (2023-09-04T05:43:48Z) - Pose-Controllable 3D Facial Animation Synthesis using Hierarchical
Audio-Vertex Attention [52.63080543011595]
A novel pose-controllable 3D facial animation synthesis method is proposed by utilizing hierarchical audio-vertex attention.
The proposed method can produce more realistic facial expressions and head posture movements.
arXiv Detail & Related papers (2023-02-24T09:36:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.