EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation
- URL: http://arxiv.org/abs/2504.09209v2
- Date: Tue, 15 Apr 2025 15:41:20 GMT
- Title: EchoMask: Speech-Queried Attention-based Mask Modeling for Holistic Co-Speech Motion Generation
- Authors: Xiangyue Zhang, Jianfang Li, Jiaxu Zhang, Jianqiang Ren, Liefeng Bo, Zhigang Tu
- Abstract summary: We propose a speech-queried attention-based mask modeling framework for co-speech motion generation. Our key insight is to leverage motion-aligned speech features to guide the masked motion modeling process.
- Score: 20.755563616260016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Masked modeling frameworks have shown promise in co-speech motion generation. However, they struggle to identify semantically significant frames for effective motion masking. In this work, we propose a speech-queried attention-based mask modeling framework for co-speech motion generation. Our key insight is to leverage motion-aligned speech features to guide the masked motion modeling process, selectively masking rhythm-related and semantically expressive motion frames. Specifically, we first propose a motion-audio alignment module (MAM) to construct a latent motion-audio joint space. In this space, both low-level and high-level speech features are projected, enabling motion-aligned speech representation using learnable speech queries. Then, a speech-queried attention mechanism (SQA) is introduced to compute frame-level attention scores through interactions between motion keys and speech queries, guiding selective masking toward motion frames with high attention scores. Finally, the motion-aligned speech features are injected into the generation network to facilitate co-speech motion generation. Qualitative and quantitative evaluations confirm that our method outperforms existing state-of-the-art approaches, successfully producing high-quality co-speech motion.
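As a rough illustration of the speech-queried masking idea described in the abstract, the sketch below scores motion frames against learnable speech queries and masks the highest-scoring frames. The module name, tensor shapes, single-head cross-attention layout, and top-k masking rule are illustrative assumptions, not the authors' MAM/SQA implementation.

```python
# Minimal sketch, assuming a single-head attention layout and a top-k masking
# rule; the paper's MAM/SQA modules are more involved than this.
import torch
import torch.nn as nn


class SpeechQueriedMasking(nn.Module):
    """Score motion frames against learnable speech queries and mask the
    highest-scoring (rhythm- and semantics-heavy) frames."""

    def __init__(self, motion_dim=256, audio_dim=256, d_model=256,
                 num_queries=8, mask_ratio=0.4):
        super().__init__()
        self.speech_queries = nn.Parameter(torch.randn(num_queries, d_model))
        self.audio_proj = nn.Linear(audio_dim, d_model)    # speech -> joint space
        self.motion_key = nn.Linear(motion_dim, d_model)   # motion frames -> keys
        self.mask_ratio = mask_ratio

    def forward(self, motion, audio):
        # motion: (B, T, motion_dim), audio: (B, T, audio_dim)
        B, T, _ = motion.shape
        scale = self.speech_queries.shape[-1] ** 0.5

        # Motion-aligned speech representation: queries cross-attend over audio.
        q = self.speech_queries.unsqueeze(0).expand(B, -1, -1)        # (B, Q, D)
        a = self.audio_proj(audio)                                    # (B, T, D)
        attn = torch.softmax(q @ a.transpose(1, 2) / scale, dim=-1)   # (B, Q, T)
        speech_repr = attn @ a                                        # (B, Q, D)

        # Frame-level score: each motion key's best match among speech queries.
        k = self.motion_key(motion)                                   # (B, T, D)
        scores = (k @ speech_repr.transpose(1, 2)).amax(dim=-1)       # (B, T)

        # Mask the top-scoring frames for masked motion modeling.
        num_mask = max(1, int(self.mask_ratio * T))
        top_idx = scores.topk(num_mask, dim=-1).indices               # (B, num_mask)
        mask = torch.zeros(B, T, dtype=torch.bool, device=motion.device)
        mask[torch.arange(B, device=motion.device).unsqueeze(1), top_idx] = True
        return mask, speech_repr
```

In the full framework, the masked frames would then be reconstructed by the generation network, which additionally receives the motion-aligned speech features as conditioning.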
Related papers
- FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis [12.987186425491242]
We propose a novel framework to generate high-fidelity, coherent talking portraits with controllable motion dynamics.
In the first stage, we employ a clip-level training scheme to establish coherent global motion.
In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals.
arXiv Detail & Related papers (2025-04-07T08:56:01Z)
- Audio-driven Gesture Generation via Deviation Feature in the Latent Space [2.8952735126314733]
We introduce a weakly supervised framework that learns latent representation deviations, tailored for co-speech gesture video generation. Our approach employs a diffusion model to integrate latent motion features, enabling more precise and nuanced gesture representation. Experiments show our method significantly improves video quality, surpassing current state-of-the-art techniques.
arXiv Detail & Related papers (2025-03-27T15:37:16Z)
- Through-The-Mask: Mask-based Motion Trajectories for Image-to-Video Generation [52.337472185022136]
We consider the task of Image-to-Video (I2V) generation, which involves transforming static images into realistic video sequences based on a textual description. We propose a two-stage compositional framework that decomposes I2V generation into: (i) an explicit intermediate representation generation stage, followed by (ii) a video generation stage that is conditioned on this representation. We evaluate our method on challenging benchmarks with multi-object and high-motion scenarios and empirically demonstrate that the proposed method achieves state-of-the-art consistency.
arXiv Detail & Related papers (2025-01-06T14:49:26Z)
- SemTalk: Holistic Co-speech Motion Generation with Frame-level Semantic Emphasis [19.764460501254607]
Good co-speech motion generation cannot be achieved without carefully integrating common rhythmic motion and rare yet essential semantic motion. We propose SemTalk for holistic co-speech motion generation with frame-level semantic emphasis.
arXiv Detail & Related papers (2024-12-21T10:16:07Z)
- Dynamic Motion Synthesis: Masked Audio-Text Conditioned Spatio-Temporal Transformers [13.665279127648658]
This research presents a novel motion generation framework designed to produce whole-body motion sequences conditioned on multiple modalities simultaneously.
By integrating spatial attention mechanisms and a token critic, we ensure consistency and naturalness in the generated motions.
arXiv Detail & Related papers (2024-09-03T04:19:27Z)
- Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z)
- EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling [57.08286593059137]
We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures.
We first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset.
Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance.
arXiv Detail & Related papers (2023-12-31T02:25:41Z)
- SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)
- MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework (a generic sketch of such a loss appears after this list).
We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z)
- Audio-Driven Co-Speech Gesture Video Generation [92.15661971086746]
We define and study this challenging problem of audio-driven co-speech gesture video generation.
Our key insight is that co-speech gestures can be decomposed into common motion patterns and subtle rhythmic dynamics.
We propose a novel framework, Audio-driveN Gesture vIdeo gEneration (ANGIE), to effectively capture the reusable co-speech gesture patterns.
arXiv Detail & Related papers (2022-12-05T15:28:22Z)
- Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce a novel freeform motion generation model (FreeMo) built on a two-stream architecture.
Experiments demonstrate superior performance against several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z)
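As referenced in the MoFusion entry above, kinematic losses for motion plausibility are often simple finite-difference penalties on generated joint trajectories. The sketch below is a generic, assumed formulation (velocity and acceleration consistency), not MoFusion's actual objective.

```python
# Generic sketch of kinematic motion-plausibility losses (assumed formulation):
# match velocities and accelerations of predicted motion to ground truth.
import torch
import torch.nn.functional as F


def kinematic_loss(pred, target, w_vel=1.0, w_acc=1.0):
    """pred, target: (B, T, J, 3) joint positions over T frames."""
    # First-order finite difference ~ per-frame velocity.
    vel_pred = pred[:, 1:] - pred[:, :-1]
    vel_gt = target[:, 1:] - target[:, :-1]
    # Second-order finite difference ~ per-frame acceleration.
    acc_pred = vel_pred[:, 1:] - vel_pred[:, :-1]
    acc_gt = vel_gt[:, 1:] - vel_gt[:, :-1]
    return w_vel * F.l1_loss(vel_pred, vel_gt) + w_acc * F.l1_loss(acc_pred, acc_gt)
```

Terms of this kind are typically added to a diffusion model's reconstruction objective with small weights so that denoised motions remain temporally smooth.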
This list is automatically generated from the titles and abstracts of the papers on this site.