Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with
Hierarchical Neural Embeddings
- URL: http://arxiv.org/abs/2210.01448v3
- Date: Thu, 4 May 2023 12:13:11 GMT
- Title: Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with
Hierarchical Neural Embeddings
- Authors: Tenglong Ao, Qingzhe Gao, Yuke Lou, Baoquan Chen, Libin Liu
- Abstract summary: We present a novel co-speech gesture synthesis method that achieves convincing results both on the rhythm and semantics.
For the rhythm, our system contains a robust rhythm-based segmentation pipeline to ensure the temporal coherence between the vocalization and gestures explicitly.
For the gesture semantics, we devise a mechanism to effectively disentangle both low- and high-level neural embeddings of speech and motion based on linguistic theory.
- Score: 27.352570417976153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic synthesis of realistic co-speech gestures is an increasingly
important yet challenging task in artificial embodied agent creation. Previous
systems mainly focus on generating gestures in an end-to-end manner, which
leads to difficulties in mining the clear rhythm and semantics due to the
complex yet subtle harmony between speech and gestures. We present a novel
co-speech gesture synthesis method that achieves convincing results both on the
rhythm and semantics. For the rhythm, our system contains a robust rhythm-based
segmentation pipeline to ensure the temporal coherence between the vocalization
and gestures explicitly. For the gesture semantics, we devise a mechanism to
effectively disentangle both low- and high-level neural embeddings of speech
and motion based on linguistic theory. The high-level embedding corresponds to
semantics, while the low-level embedding relates to subtle variations. Lastly,
we build correspondence between the hierarchical embeddings of the speech and
the motion, resulting in rhythm- and semantics-aware gesture synthesis.
Evaluations with existing objective metrics, a newly proposed rhythmic metric,
and human feedback show that our method outperforms state-of-the-art systems by
a clear margin.
Related papers
- Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation [44.78811546051805]
Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal.
Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence.
We propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture.
arXiv Detail & Related papers (2024-10-17T17:22:59Z)
- Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z)
- Unified speech and gesture synthesis using flow matching [24.2094371314481]
This paper presents a novel, unified architecture for jointly synthesising speech acoustics and skeleton-based 3D gesture motion from text.
The proposed architecture is simpler than the previous state of the art, has a smaller memory footprint, and can capture the joint distribution of speech and gestures.
arXiv Detail & Related papers (2023-10-08T14:37:28Z)
- LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation [41.42316077949012]
We introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation.
Our method decouples the task into two stages: script-based gesture generation and audio-guided rhythm refinement.
Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style.
arXiv Detail & Related papers (2023-09-17T15:06:11Z)
- Revisiting Conversation Discourse for Dialogue Disentanglement [88.3386821205896]
We propose enhancing dialogue disentanglement by taking full advantage of the dialogue discourse characteristics.
We develop a structure-aware framework to integrate the rich structural features for better modeling the conversational semantic context.
Our work has great potential to facilitate broader multi-party multi-thread dialogue applications.
arXiv Detail & Related papers (2023-06-06T19:17:47Z)
- Exploration strategies for articulatory synthesis of complex syllable onsets [20.422871314256266]
High-quality articulatory speech synthesis has many potential applications in speech science and technology.
We construct an optimisation-based framework as a first step towards learning these mappings without manual intervention.
arXiv Detail & Related papers (2022-04-20T10:47:28Z)
- Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments were additionally performed to show that gestural scores indeed code phonological information successfully.
arXiv Detail & Related papers (2022-04-01T14:25:19Z)
- Learning Hierarchical Cross-Modal Association for Co-Speech Gesture Generation [107.10239561664496]
We propose a novel framework named Hierarchical Audio-to-Gesture (HA2G) for co-speech gesture generation.
The proposed method renders realistic co-speech gestures and outperforms previous methods by a clear margin.
arXiv Detail & Related papers (2022-03-24T16:33:29Z)
- Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce a novel freeform motion generation model (FreeMo) built on a two-stream architecture.
Experiments demonstrate the superior performance against several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.