GesGPT: Speech Gesture Synthesis With Text Parsing from ChatGPT
- URL: http://arxiv.org/abs/2303.13013v3
- Date: Tue, 28 May 2024 02:53:44 GMT
- Title: GesGPT: Speech Gesture Synthesis With Text Parsing from ChatGPT
- Authors: Nan Gao, Zeyu Zhao, Zhi Zeng, Shuwu Zhang, Dongdong Weng, Yihua Bao
- Abstract summary: Gesture synthesis has gained significant attention as a critical research field.
Deep learning-based approaches have achieved remarkable progress, but they often overlook the rich semantic information present in the text.
We propose GesGPT, a novel approach to gesture generation that leverages the semantic analysis capabilities of large language models.
- Score: 8.18076191897917
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Gesture synthesis has gained significant attention as a critical research field, aiming to produce contextually appropriate and natural gestures corresponding to speech or textual input. Although deep learning-based approaches have achieved remarkable progress, they often overlook the rich semantic information present in the text, leading to less expressive and meaningful gestures. In this letter, we propose GesGPT, a novel approach to gesture generation that leverages the semantic analysis capabilities of large language models (LLMs), such as ChatGPT. By capitalizing on the strengths of LLMs for text analysis, we adopt a controlled approach that generates and integrates professional gestures and base gestures through a text parsing script, resulting in diverse and meaningful gestures. First, we develop prompt principles that transform gesture generation into an intention classification problem using ChatGPT, and we further analyze emphasis words and semantic words to aid gesture generation. Subsequently, we construct a specialized gesture lexicon with multiple semantic annotations, decoupling gesture synthesis into professional gestures and base gestures. Finally, we merge the professional gestures with the base gestures. Experimental results demonstrate that GesGPT effectively generates contextually appropriate and expressive gestures.
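The pipeline described in the abstract (LLM-based intention classification, lookup in a semantically annotated gesture lexicon, and merging of professional gestures with base gestures) can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not the authors' released code: the names GESTURE_LEXICON, classify_intention, and merge_gestures, the intention labels, and the keyword fallback standing in for an actual ChatGPT call are all hypothetical placeholders.

```python
# Minimal sketch of a GesGPT-style pipeline (illustrative placeholders only):
# (1) classify the communicative intention of a sentence (GesGPT uses ChatGPT),
# (2) retrieve a "professional" gesture clip from a semantically annotated lexicon,
# (3) merge it with a "base" (rhythmic) gesture track.
from dataclasses import dataclass
from typing import List


@dataclass
class GestureClip:
    name: str
    frames: List[list]  # per-frame pose parameters (toy placeholder values)


# Toy lexicon: intention label -> professional gesture clip (hypothetical entries).
GESTURE_LEXICON = {
    "emphasis": GestureClip("beat_strong", [[0.3]] * 30),
    "enumeration": GestureClip("count_fingers", [[0.1]] * 30),
    "negation": GestureClip("head_shake_palm", [[0.2]] * 30),
    "neutral": GestureClip("rest_pose", [[0.0]] * 30),
}

# Prompt in the spirit of the paper's "prompt principles": gesture generation
# is recast as an intention classification question for the LLM.
PROMPT_TEMPLATE = (
    "Classify the speaker's intention in the sentence below as one of "
    "[emphasis, enumeration, negation, neutral], and list any emphasis words.\n"
    "Sentence: {sentence}\nAnswer with the label only."
)


def classify_intention(sentence: str) -> str:
    """Placeholder for the LLM call (ChatGPT with PROMPT_TEMPLATE).
    A trivial keyword rule is used here so the sketch runs offline."""
    lowered = sentence.lower()
    if "not" in lowered or "never" in lowered:
        return "negation"
    if any(w in lowered for w in ("first", "second", "third")):
        return "enumeration"
    return "neutral"


def merge_gestures(professional: GestureClip, base: GestureClip,
                   weight: float = 0.7) -> GestureClip:
    """Blend a professional gesture onto the base gesture track frame by frame."""
    n = min(len(professional.frames), len(base.frames))
    frames = [
        [weight * p + (1.0 - weight) * b
         for p, b in zip(professional.frames[i], base.frames[i])]
        for i in range(n)
    ]
    return GestureClip(f"{base.name}+{professional.name}", frames)


if __name__ == "__main__":
    sentence = "We will never compromise on quality."
    label = classify_intention(sentence)            # answered by ChatGPT in GesGPT
    professional = GESTURE_LEXICON[label]
    base = GestureClip("idle_sway", [[0.05]] * 30)  # base gesture from a separate generator
    merged = merge_gestures(professional, base)
    print(label, merged.name, len(merged.frames))
```

In the actual system the classification would be produced by ChatGPT and the merged clip would drive a full-body animation; the blend weight and 30-frame clips above are arbitrary values chosen only to keep the sketch self-contained.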
Related papers
- Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation [44.78811546051805]
Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal.
Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence.
We propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture.
arXiv Detail & Related papers (2024-10-17T17:22:59Z)
- ExpGest: Expressive Speaker Generation Using Diffusion Model and Hybrid Audio-Text Guidance [11.207513771079705]
We introduce ExpGest, a novel framework leveraging synchronized text and audio information to generate expressive full-body gestures.
Unlike AdaIN or one-hot encoding methods, we design a noise emotion classifier for optimizing adversarial direction noise.
We show that ExpGest achieves more expressive, natural, and controllable global motion in speakers compared to state-of-the-art models.
arXiv Detail & Related papers (2024-10-12T07:01:17Z)
- Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z)
- Dynamic Typography: Bringing Text to Life via Video Diffusion Prior [73.72522617586593]
We present an automated text animation scheme, termed "Dynamic Typography".
It deforms letters to convey semantic meaning and infuses them with vibrant movements based on user prompts.
Our technique harnesses vector graphics representations and an end-to-end optimization-based framework.
arXiv Detail & Related papers (2024-04-17T17:59:55Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation [41.42316077949012]
We introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation.
Our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement.
Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style.
arXiv Detail & Related papers (2023-09-17T15:06:11Z)
- QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation [8.604430209445695]
Speech-driven gesture generation is highly challenging due to the random jitters of human motion.
We introduce a novel quantization-based and phase-guided motion-matching framework.
Our method outperforms recent approaches on speech-driven gesture generation.
arXiv Detail & Related papers (2023-05-18T16:31:25Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that the co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- Vision-Language Pre-Training for Boosting Scene Text Detectors [57.08046351495244]
We specifically adapt vision-language joint learning for scene text detection.
We propose to learn contextualized, joint representations through vision-language pre-training.
The pre-trained model is able to produce more informative representations with richer semantics.
arXiv Detail & Related papers (2022-04-29T03:53:54Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)