SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis
- URL: http://arxiv.org/abs/2412.16563v2
- Date: Wed, 15 Jan 2025 13:34:12 GMT
- Title: SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis
- Authors: Xiangyue Zhang, Jianfang Li, Jiaxu Zhang, Ziqiang Dang, Jianqiang Ren, Liefeng Bo, Zhigang Tu
- Abstract summary: Good co-speech motion generation cannot be achieved without carefully integrating common rhythmic motion with rare yet essential semantic motion.
We propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis.
- Score: 19.764460501254607
- Abstract: Good co-speech motion generation cannot be achieved without carefully integrating common rhythmic motion with rare yet essential semantic motion. In this work, we propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis. Our key insight is to separately learn general motions and sparse motions, and then adaptively fuse them. In particular, rhythmic consistency learning is explored to establish rhythm-related base motion, ensuring a coherent foundation that synchronizes gestures with the speech rhythm. Subsequently, semantic emphasis learning is designed to generate semantic-aware sparse motion, focusing on frame-level semantic cues. Finally, to integrate sparse motion into the base motion and generate semantic-emphasized co-speech gestures, we further leverage a learned semantic score for adaptive synthesis. Qualitative and quantitative comparisons on two public datasets demonstrate that our method outperforms the state-of-the-art, delivering high-quality co-speech motion with enhanced semantic richness over a stable base motion.
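The adaptive synthesis step described in the abstract can be illustrated with a minimal sketch: a per-frame semantic score in [0, 1] gates how much sparse (semantic) motion is blended into the base (rhythmic) motion. This is an assumed simplification for illustration only; in the paper the score and both motion streams are learned by neural networks, and the function and variable names below are hypothetical.

```python
import numpy as np

def adaptive_fuse(base_motion, sparse_motion, semantic_score):
    """Blend two motion streams frame by frame.

    base_motion, sparse_motion: arrays of shape (T, D), T frames of D-dim pose features.
    semantic_score: array of shape (T,) in [0, 1]; higher values emphasize
    the sparse semantic motion on that frame.
    """
    w = semantic_score[:, None]           # broadcast score over pose dimensions
    return base_motion + w * sparse_motion

# Toy example: 8 frames, 6 pose dimensions, with one emphasized frame.
T, D = 8, 6
rng = np.random.default_rng(0)
base = rng.normal(size=(T, D))
sparse = rng.normal(size=(T, D))
score = np.zeros(T)
score[3] = 1.0                            # fully emphasize frame 3
fused = adaptive_fuse(base, sparse, score)
```

Frames with score 0 pass the base motion through unchanged, so the rhythmic foundation stays stable while semantic gestures are injected only where the score indicates them.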
Related papers
- KinMo: Kinematic-aware Human Motion Understanding and Generation [6.962697597686156]
Controlling human motion based on text presents an important challenge in computer vision.
Traditional approaches often rely on holistic action descriptions for motion synthesis.
We propose a novel motion representation that decomposes motion into distinct body joint group movements.
arXiv Detail & Related papers (2024-11-23T06:50:11Z)
- Neuron: Learning Context-Aware Evolving Representations for Zero-Shot Skeleton Action Recognition [64.56321246196859]
We propose a novel dyNamically Evolving dUal skeleton-semantic syneRgistic framework.
We first construct the spatial-temporal evolving micro-prototypes and integrate dynamic context-aware side information.
We introduce the spatial compression and temporal memory mechanisms to guide the growth of spatial-temporal micro-prototypes.
arXiv Detail & Related papers (2024-11-18T05:16:11Z)
- Emphasizing Semantic Consistency of Salient Posture for Speech-Driven Gesture Generation [44.78811546051805]
Speech-driven gesture generation aims at synthesizing a gesture sequence synchronized with the input speech signal.
Previous methods leverage neural networks to directly map a compact audio representation to the gesture sequence.
We propose a novel speech-driven gesture generation method by emphasizing the semantic consistency of salient posture.
arXiv Detail & Related papers (2024-10-17T17:22:59Z)
- Semantic Gesticulator: Semantics-Aware Co-Speech Gesture Synthesis [25.822870767380685]
We present Semantic Gesticulator, a framework designed to synthesize realistic gestures with strong semantic correspondence.
Our system demonstrates robustness in generating gestures that are rhythmically coherent and semantically explicit.
Our system outperforms state-of-the-art systems in terms of semantic appropriateness by a clear margin.
arXiv Detail & Related papers (2024-05-16T05:09:01Z)
- Semantics-aware Motion Retargeting with Vision-Language Models [19.53696208117539]
We present a novel Semantics-aware Motion reTargeting (SMT) method with the advantage of vision-language models to extract and maintain meaningful motion semantics.
We utilize a differentiable module to render 3D motions, and high-level motion semantics are incorporated into the retargeting process by feeding the renderings to the vision-language model and aligning the extracted semantic embeddings.
To preserve both fine-grained motion details and high-level semantics, we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantic and geometric constraints.
arXiv Detail & Related papers (2023-12-04T15:23:49Z)
- SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD).
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z)
- MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework.
We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z)
- Rhythmic Gesticulator: Rhythm-Aware Co-Speech Gesture Synthesis with Hierarchical Neural Embeddings [27.352570417976153]
We present a novel co-speech gesture synthesis method that achieves convincing results both on the rhythm and semantics.
For the rhythm, our system contains a robust rhythm-based segmentation pipeline that explicitly ensures temporal coherence between vocalization and gestures.
For the gesture semantics, we devise a mechanism to effectively disentangle both low- and high-level neural embeddings of speech and motion based on linguistic theory.
arXiv Detail & Related papers (2022-10-04T08:19:06Z)
- Curriculum Learning for Goal-Oriented Semantic Communications with a Common Language [60.85719227557608]
A holistic goal-oriented semantic communication framework is proposed to enable a speaker and a listener to cooperatively execute a set of sequential tasks.
A common language based on a hierarchical belief set is proposed to enable semantic communications between speaker and listener.
An optimization problem is defined to determine the perfect and abstract description of the events.
arXiv Detail & Related papers (2022-04-21T22:36:06Z)
- Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce a novel freeform motion generation model (FreeMo) equipped with a two-stream architecture.
Experiments demonstrate superior performance over several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.