SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning
- URL: http://arxiv.org/abs/2507.19359v1
- Date: Fri, 25 Jul 2025 15:10:15 GMT
- Title: SemGes: Semantics-aware Co-Speech Gesture Generation using Semantic Coherence and Relevance Learning
- Authors: Lanmiao Liu, Esam Ghaleb, Aslı Özyürek, Zerrin Yumak
- Abstract summary: We propose a novel approach for semantic grounding in co-speech gesture generation. Our approach starts with learning the motion prior through a vector-quantized variational autoencoder. Our method outperforms state-of-the-art approaches across two benchmarks in co-speech gesture generation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Creating a virtual avatar with semantically coherent gestures that are aligned with speech is a challenging task. Existing gesture generation research has mainly focused on generating rhythmic beat gestures, neglecting the semantic context of the gestures. In this paper, we propose a novel approach for semantic grounding in co-speech gesture generation that integrates semantic information at both fine-grained and global levels. Our approach starts with learning a motion prior through a vector-quantized variational autoencoder. Building on this model, a second-stage module automatically generates gestures from speech, text-based semantics, and speaker identity, while semantic coherence and relevance modules ensure consistency between the generated gestures and the semantics of the co-occurring speech. Experimental results demonstrate that our approach enhances the realism and coherence of semantic gestures. Extensive experiments and user studies show that our method outperforms state-of-the-art approaches on two benchmarks for co-speech gesture generation in both objective and subjective metrics. Qualitative results of our model, along with code, dataset, and pre-trained models, can be viewed at https://semgesture.github.io/.
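The abstract describes a two-stage design: a VQ-VAE motion prior followed by a speech/text/speaker-conditioned generator trained with semantic coherence and relevance objectives. Below is a minimal PyTorch-style sketch of that pipeline; all module names, feature dimensions, and loss formulations (e.g., `MotionVQVAE`, the in-batch contrastive relevance term) are illustrative assumptions, not the authors' released implementation.

```python
# Minimal two-stage sketch (assumptions throughout; not the released code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionVQVAE(nn.Module):
    """Stage 1: learn a discrete motion prior with a vector-quantized VAE."""

    def __init__(self, pose_dim=165, latent_dim=256, codebook_size=512):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(pose_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, latent_dim))
        self.codebook = nn.Embedding(codebook_size, latent_dim)
        self.decoder = nn.Sequential(nn.Linear(latent_dim, latent_dim), nn.ReLU(),
                                     nn.Linear(latent_dim, pose_dim))

    def quantize(self, z):
        # Squared distances from each frame latent to every codebook entry.
        dist = (z.pow(2).sum(-1, keepdim=True)
                - 2 * z @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(-1))
        idx = dist.argmin(dim=-1)                      # (B, T) code indices
        z_q = self.codebook(idx)
        return z + (z_q - z).detach(), idx             # straight-through estimator

    def forward(self, poses):                          # poses: (B, T, pose_dim)
        z = self.encoder(poses)
        z_q, idx = self.quantize(z)
        return self.decoder(z_q), z, z_q


class SpeechToGestureLatents(nn.Module):
    """Stage 2: predict motion latents from speech audio, text semantics, and speaker id."""

    def __init__(self, audio_dim=80, text_dim=768, n_speakers=30, latent_dim=256):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, latent_dim)
        self.fuse = nn.GRU(audio_dim + text_dim + latent_dim, latent_dim, batch_first=True)
        self.head = nn.Linear(latent_dim, latent_dim)

    def forward(self, audio, text, speaker_id):        # (B,T,audio), (B,T,text), (B,)
        spk = self.speaker_emb(speaker_id).unsqueeze(1).expand(-1, audio.size(1), -1)
        h, _ = self.fuse(torch.cat([audio, text, spk], dim=-1))
        return self.head(h)                            # predicted latents, decoded by stage 1


def semantic_losses(pred_latents, text_global, motion_proj, text_proj):
    """Illustrative coherence/relevance terms that pull the generated-gesture
    representation toward the co-occurring utterance-level text semantics."""
    g = motion_proj(pred_latents.mean(dim=1))          # global gesture feature (B, D)
    t = text_proj(text_global)                         # global text feature   (B, D)
    coherence = 1.0 - F.cosine_similarity(g, t, dim=-1).mean()
    logits = F.normalize(g, dim=-1) @ F.normalize(t, dim=-1).t() / 0.07
    relevance = F.cross_entropy(logits, torch.arange(g.size(0), device=g.device))
    return coherence, relevance
```

In this sketch, stage 1 would first be trained on motion reconstruction; stage 2 then learns to predict latents that decode into gestures, with the coherence and relevance terms tying them to utterance-level semantics.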
Related papers
- Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge. We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z) - Understanding Co-speech Gestures in-the-wild [52.5993021523165]
We introduce a new framework for co-speech gesture understanding in the wild. We propose three new tasks and benchmarks to evaluate a model's capability to comprehend gesture-text-speech associations. We present a new approach that learns a tri-modal speech-text-video-gesture representation to solve these tasks.
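As a rough illustration of the tri-modal representation learning mentioned above, the following sketch aligns speech, text, and gesture-video embeddings with pairwise contrastive losses; the encoder choices, dimensions, and temperature are assumptions, not the paper's setup.

```python
# Tri-modal alignment sketch (encoders, dims, temperature are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class TriModalProjector(nn.Module):
    """Project clip-level speech, text, and gesture-video features into one space."""

    def __init__(self, audio_dim=80, text_dim=768, video_dim=512, embed_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.video_proj = nn.Linear(video_dim, embed_dim)

    def forward(self, audio, text, video):              # each (B, dim)
        return (F.normalize(self.audio_proj(audio), dim=-1),
                F.normalize(self.text_proj(text), dim=-1),
                F.normalize(self.video_proj(video), dim=-1))


def info_nce(a, b, temperature=0.07):
    # Symmetric in-batch contrastive loss between two embedding sets.
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


def tri_modal_loss(audio_z, text_z, video_z):
    # Pull co-occurring speech, transcript, and gesture clips together in the shared space.
    return info_nce(audio_z, video_z) + info_nce(text_z, video_z) + info_nce(audio_z, text_z)
```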
arXiv Detail & Related papers (2025-03-28T17:55:52Z) - SARGes: Semantically Aligned Reliable Gesture Generation via Intent Chain [25.895313346558464]
We propose SARGes, a novel framework that leverages large language models to generate semantically meaningful gestures. SARGes achieves highly semantically-aligned gesture labeling with efficient single-pass inference. The proposed method provides an interpretable intent reasoning pathway for semantic gesture synthesis.
arXiv Detail & Related papers (2025-03-26T03:55:41Z) - Retrieving Semantics from the Deep: an RAG Solution for Gesture Synthesis [55.45253486141108]
RAG-Gesture is a diffusion-based gesture generation approach that produces semantically rich gestures. We achieve this by using explicit domain knowledge to retrieve motions from a database of co-speech gestures. We propose a guidance control paradigm that allows users to modulate how much influence each retrieval insertion has over the generated sequence.
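A hedged sketch of the retrieval and guidance-control ideas summarized above: nearest-neighbor retrieval from a gesture database plus a user-set guidance weight that scales how strongly the retrieved condition steers the diffusion update. The database layout, embedding model, and combination rule are assumptions, not RAG-Gesture's actual pipeline.

```python
# Retrieval + guidance-weight sketch (assumptions; not RAG-Gesture's pipeline).
import torch.nn.functional as F


def retrieve_gestures(query_emb, db_embs, db_motions, k=3):
    """Return the k database motions whose semantic embeddings are closest to the
    query (e.g., an utterance-level speech/text embedding)."""
    sims = F.normalize(query_emb, dim=-1) @ F.normalize(db_embs, dim=-1).t()   # (1, N)
    topk = sims.topk(k, dim=-1).indices.squeeze(0)
    return [db_motions[int(i)] for i in topk]


def guided_noise(eps_uncond, eps_retrieval_cond, guidance_scale=2.0):
    """Blend unconditional and retrieval-conditioned noise predictions; the
    guidance_scale is the user-facing knob that modulates how strongly the
    retrieved insertion shapes the generated sequence."""
    return eps_uncond + guidance_scale * (eps_retrieval_cond - eps_uncond)
```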
arXiv Detail & Related papers (2024-12-09T18:59:46Z) - Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation [44.78811546051805]
Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal.
Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence.
We propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture.
arXiv Detail & Related papers (2024-10-17T17:22:59Z) - Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z) - Iconic Gesture Semantics [87.00251241246136]
Informational evaluation is spelled out as extended exemplification (extemplification) in terms of perceptual classification of a gesture's visual iconic model.
We argue that the perceptual classification of instances of visual communication requires a notion of meaning different from Frege/Montague frameworks.
An iconic gesture semantics is introduced which covers the full range from gesture representations over model-theoretic evaluation to inferential interpretation in dynamic semantic frameworks.
arXiv Detail & Related papers (2024-04-29T13:58:03Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
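The per-modality guidance idea can be illustrated with a classifier-free-guidance-style combination in which each conditioning modality gets its own weight; the weights and the combination rule below are assumptions, not ConvoFusion's exact objectives.

```python
# Modality-wise classifier-free guidance sketch (weights and rule are assumptions).
import torch


def multimodal_cfg(eps_uncond: torch.Tensor,
                   eps_audio: torch.Tensor,
                   eps_text: torch.Tensor,
                   w_audio: float = 1.5,
                   w_text: float = 1.0) -> torch.Tensor:
    """eps_* are a diffusion model's noise predictions with no conditioning,
    audio-only conditioning, and text-only conditioning; the weights let a user
    dial how much each modality shapes the denoised gesture."""
    return (eps_uncond
            + w_audio * (eps_audio - eps_uncond)
            + w_text * (eps_text - eps_uncond))
```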
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z) - Speech Gesture Generation from the Trimodal Context of Text, Audio, and Speaker Identity [21.61168067832304]
We present an automatic gesture generation model that uses the multimodal context of speech text, audio, and speaker identity to reliably generate gestures.
Experiments with the introduced metric and subjective human evaluation showed that the proposed gesture generation model is better than existing end-to-end generation models.
arXiv Detail & Related papers (2020-09-04T11:42:45Z) - Gesticulator: A framework for semantically-aware speech-driven gesture generation [17.284154896176553]
We present a model designed to produce arbitrary beat and semantic gestures together.
Our deep-learning based model takes both acoustic and semantic representations of speech as input, and generates gestures as a sequence of joint angle rotations as output.
The resulting gestures can be applied to both virtual agents and humanoid robots.
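As a rough sketch of the interface described above (per-frame acoustic plus semantic speech features in, joint-angle rotations out), the following placeholder network is an assumption about shapes only, not Gesticulator's architecture.

```python
# Placeholder input/output sketch (dimensions and network body are assumptions).
import torch
import torch.nn as nn


class SpeechDrivenGestureNet(nn.Module):
    def __init__(self, acoustic_dim=26, semantic_dim=300, n_joints=15):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(acoustic_dim + semantic_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_joints * 3),               # 3 rotation angles per joint
        )

    def forward(self, acoustic, semantic):              # (B, T, acoustic_dim), (B, T, semantic_dim)
        x = torch.cat([acoustic, semantic], dim=-1)
        return self.net(x).view(*x.shape[:2], -1, 3)    # (B, T, n_joints, 3) joint rotations
```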
arXiv Detail & Related papers (2020-01-25T14:42:23Z)