Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation
- URL: http://arxiv.org/abs/2410.13786v1
- Date: Thu, 17 Oct 2024 17:22:59 GMT
- Title: Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation
- Authors: Fengqi Liu, Hexiang Wang, Jingyu Gong, Ran Yi, Qianyu Zhou, Xuequan Lu, Jiangbo Lu, Lizhuang Ma
- Abstract summary: Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal.
Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence.
We propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture.
- Score: 44.78811546051805
- Abstract: Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal. Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence, ignoring the semantic association of different modalities and failing to deal with salient gestures. In this paper, we propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture. Specifically, we first learn a joint manifold space for the individual representation of audio and body pose to exploit the inherent semantic association between two modalities, and propose to enforce semantic consistency via a consistency loss. Furthermore, we emphasize the semantic consistency of salient postures by introducing a weakly-supervised detector to identify salient postures, and reweighting the consistency loss to focus more on learning the correspondence between salient postures and the high-level semantics of speech content. In addition, we propose to extract audio features dedicated to facial expression and body gesture separately, and design separate branches for face and body gesture synthesis. Extensive experimental results demonstrate the superiority of our method over the state-of-the-art approaches.
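To make the training objective described in the abstract more concrete, below is a minimal sketch of a salience-reweighted cross-modal consistency loss in PyTorch. It is only an illustration under stated assumptions: the encoder architectures, the cosine-similarity consistency term, and the `1 + salience` weighting are placeholders, not the paper's actual networks, loss form, or salient-posture detector.

```python
# Minimal sketch (PyTorch) of a salience-reweighted cross-modal consistency loss.
# The encoder shapes, the cosine-similarity formulation, and the weighting scheme
# are illustrative assumptions, not the paper's released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointManifoldConsistency(nn.Module):
    def __init__(self, audio_dim=128, pose_dim=165, embed_dim=64):
        super().__init__()
        # Project per-frame audio features and body poses into a shared embedding space.
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))
        self.pose_encoder = nn.Sequential(
            nn.Linear(pose_dim, 256), nn.ReLU(), nn.Linear(256, embed_dim))

    def forward(self, audio_feat, pose, salience):
        # audio_feat: (B, T, audio_dim)  per-frame audio features
        # pose:       (B, T, pose_dim)   per-frame body poses
        # salience:   (B, T)             soft salient-posture scores in [0, 1],
        #                                e.g. from a weakly-supervised detector
        za = F.normalize(self.audio_encoder(audio_feat), dim=-1)
        zp = F.normalize(self.pose_encoder(pose), dim=-1)
        # Per-frame consistency term: 0 when paired embeddings are aligned.
        per_frame = 1.0 - (za * zp).sum(dim=-1)
        # Reweight so frames flagged as salient contribute more to the loss.
        weights = 1.0 + salience
        return (weights * per_frame).mean()


if __name__ == "__main__":
    loss_fn = JointManifoldConsistency()
    audio = torch.randn(2, 120, 128)      # 2 clips, 120 frames
    pose = torch.randn(2, 120, 165)
    salience = torch.rand(2, 120)         # placeholder detector output
    print(loss_fn(audio, pose, salience))
```

In practice such a term would be added alongside the usual reconstruction and adversarial losses on the generated gestures, with the salience scores supplied by the weakly-supervised detector mentioned in the abstract.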
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- Integrating Representational Gestures into Automatically Generated Embodied Explanations and its Effects on Understanding and Interaction Quality [0.0]
This study investigates how different types of gestures influence perceived interaction quality and listener understanding.
Our model combines beat gestures generated by a learned speech-driven module with manually captured iconic gestures.
Findings indicate that neither the use of iconic gestures alone nor their combination with beat gestures outperforms the baseline or beat-only conditions in terms of understanding.
arXiv Detail & Related papers (2024-06-18T12:23:00Z)
- Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition [96.62264528407863]
We propose a self-supervised contrastive learning framework to excavate rich context via spatial-temporal consistency.
Inspired by the complementary property of motion and joint modalities, we first introduce first-order motion information into sign language modeling.
Our method is evaluated with extensive experiments on four public benchmarks, and achieves new state-of-the-art performance by a notable margin.
arXiv Detail & Related papers (2024-06-15T04:50:19Z)
- Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z)
- ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z)
- LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation [41.42316077949012]
We introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation.
Our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement.
Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style.
arXiv Detail & Related papers (2023-09-17T15:06:11Z)
- Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings [27.352570417976153]
We present a novel co-speech gesture synthesis method that achieves convincing results both on the rhythm and semantics.
For rhythm, our system contains a robust rhythm-based segmentation pipeline that explicitly ensures temporal coherence between vocalization and gestures.
For the gesture semantics, we devise a mechanism to effectively disentangle both low- and high-level neural embeddings of speech and motion based on linguistic theory.
arXiv Detail & Related papers (2022-10-04T08:19:06Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
- Speech Drives Templates: Co-Speech Gesture Synthesis with Learned Templates [30.32106465591015]
Co-speech gesture generation aims to synthesize a gesture sequence that not only looks realistic but also matches the input speech audio.
Our method generates the movements of a complete upper body, including arms, hands, and the head.
arXiv Detail & Related papers (2021-08-18T07:53:36Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)