MimicParts: Part-aware Style Injection for Speech-Driven 3D Motion Generation
- URL: http://arxiv.org/abs/2510.13208v1
- Date: Wed, 15 Oct 2025 06:53:15 GMT
- Title: MimicParts: Part-aware Style Injection for Speech-Driven 3D Motion Generation
- Authors: Lianlian Liu, YongKang He, Zhaojie Chu, Xiaofen Xing, Xiangmin Xu
- Abstract summary: MimicParts is a novel framework designed to enhance stylized motion generation through part-aware style injection and a part-aware denoising network. It divides the body into different regions to encode localized motion styles, enabling the model to capture fine-grained regional differences. Our method outperforms existing methods, producing natural and expressive 3D human motion sequences.
- Score: 30.215940521087642
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Generating stylized 3D human motion from speech signals presents substantial challenges, primarily due to the intricate and fine-grained relationships among speech signals, individual styles, and the corresponding body movements. Current style encoding approaches either oversimplify stylistic diversity or ignore regional motion style differences (e.g., upper vs. lower body), limiting motion realism. Additionally, motion style should dynamically adapt to changes in speech rhythm and emotion, but existing methods often overlook this. To address these issues, we propose MimicParts, a novel framework designed to enhance stylized motion generation through part-aware style injection and a part-aware denoising network. It divides the body into different regions to encode localized motion styles, enabling the model to capture fine-grained regional differences. Furthermore, our part-aware attention block allows rhythm and emotion cues to guide each body region precisely, ensuring that the generated motion aligns with variations in speech rhythm and emotional state. Experimental results show that our method outperforms existing methods, producing natural and expressive 3D human motion sequences.
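The abstract describes a concrete mechanism: per-region style codes plus speech-guided attention over body parts. Below is a minimal PyTorch sketch of how such part-aware style injection could look; the three-region split, module layout, and dimensions are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch of part-aware style injection (hypothetical layout,
# not the authors' released code). Each body region gets its own style
# encoder, and speech cues (rhythm/emotion) guide each region via
# cross-attention, echoing the part-aware attention block described above.
import torch
import torch.nn as nn

class PartAwareStyleInjection(nn.Module):
    def __init__(self, pose_dim=64, speech_dim=128, n_regions=3, n_heads=4):
        super().__init__()
        # One style encoder per region (e.g., upper body, lower body, hands);
        # the three-way split and dimensions are assumptions for illustration.
        self.style_encoders = nn.ModuleList(
            [nn.Linear(pose_dim, pose_dim) for _ in range(n_regions)]
        )
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(pose_dim, n_heads, kdim=speech_dim,
                                   vdim=speech_dim, batch_first=True)
             for _ in range(n_regions)]
        )

    def forward(self, region_feats, speech_feats):
        # region_feats: list of n_regions tensors, each (B, T, pose_dim)
        # speech_feats: (B, T, speech_dim) rhythm/emotion features
        out = []
        for feats, enc, attn in zip(region_feats, self.style_encoders,
                                    self.cross_attn):
            style = enc(feats)                      # localized style code
            guided, _ = attn(style, speech_feats, speech_feats)
            out.append(feats + guided)              # residual injection
        return torch.cat(out, dim=-1)               # (B, T, n_regions * pose_dim)
```

Keeping one encoder and one attention block per region lets rhythm and emotion cues modulate, say, the hands independently of the lower body, which is the fine-grained regional control the abstract emphasizes.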
Related papers
- Dance Style Classification using Laban-Inspired and Frequency-Domain Motion Features [0.13048920509133805]
We present a framework for classifying dance styles based on pose estimates extracted from videos. Laban-inspired features capture local joint dynamics such as velocity, acceleration, and angular movement of the upper body. To further encode rhythmic and periodic aspects of movement, we integrate Fast Fourier Transform features that characterize movement patterns in the frequency domain.
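The described feature pipeline is easy to sketch. Below is a minimal NumPy illustration of velocity/acceleration statistics plus FFT-magnitude features computed from pose estimates; the shapes, frame rate, and bin count are assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the features described above (assumed shapes, frame rate,
# and bin count; not the paper's exact pipeline).
import numpy as np

def motion_features(poses, fps=30, n_freq_bins=8):
    """poses: (T, J, 2) array of 2D joint positions from a pose estimator."""
    vel = np.diff(poses, axis=0) * fps            # (T-1, J, 2) joint velocity
    acc = np.diff(vel, axis=0) * fps              # (T-2, J, 2) joint acceleration
    speed = np.linalg.norm(vel, axis=-1)          # (T-1, J) scalar speed

    # Frequency-domain descriptors: FFT magnitudes of per-joint speed capture
    # rhythmic and periodic movement patterns.
    spectrum = np.abs(np.fft.rfft(speed, axis=0))          # (F, J)
    freq_feats = spectrum[1:1 + n_freq_bins].mean(axis=1)  # skip the DC bin

    stats = np.array([speed.mean(), speed.std(),
                      np.linalg.norm(acc, axis=-1).mean()])
    return np.concatenate([stats, freq_feats])

# Example: a 4-second clip at 30 fps with 17 COCO-style joints.
feats = motion_features(np.random.rand(120, 17, 2))
```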
arXiv Detail & Related papers (2025-11-25T16:33:45Z)
- SceneAdapt: Scene-aware Adaptation of Human Motion Diffusion [74.70024991949269]
We introduce SceneAdapt, a framework that injects scene awareness into text-conditioned motion models. The key idea is to use motion inbetweening, learnable without text, as a proxy task to bridge two distinct datasets. Results show that SceneAdapt effectively injects scene awareness into text-to-motion models.
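Motion inbetweening as a text-free proxy task can be pictured as a masked-reconstruction objective. The sketch below assumes a generic `model(masked, scene=...)` interface and a fixed number of visible endpoint frames; both are hypothetical, not SceneAdapt's actual training code.

```python
# Hedged sketch of motion inbetweening as a text-free proxy task: hide the
# middle frames of a clip and train the model to reconstruct them from the
# endpoints plus scene features. The model interface and frame counts are
# hypothetical, not SceneAdapt's actual training code.
import torch

def inbetweening_loss(model, motion, scene_feats, keep=10):
    """motion: (B, T, D) joint features; scene_feats: scene conditioning."""
    B, T, D = motion.shape
    masked = motion.clone()
    masked[:, keep:T - keep] = 0.0              # hide the middle frames
    mask = torch.zeros(B, T, 1, device=motion.device)
    mask[:, keep:T - keep] = 1.0                # 1 where frames are hidden

    pred = model(masked, scene=scene_feats)     # assumed model interface
    # Supervise only the hidden frames; no text labels are required.
    return ((pred - motion) ** 2 * mask).mean()
```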
arXiv Detail & Related papers (2025-10-14T23:42:10Z)
- SMooGPT: Stylized Motion Generation using Large Language Models [23.476473154719514]
Stylized motion generation is actively studied in computer graphics, especially benefiting from the rapid advances in diffusion models. Existing research attempts to address this problem via motion style transfer or conditional motion generation. We propose utilizing a body-part text space as an intermediate representation, and present SMooGPT.
arXiv Detail & Related papers (2025-09-04T09:41:18Z)
- Think2Sing: Orchestrating Structured Motion Subtitles for Singing-Driven 3D Head Animation [69.50178144839275]
Singing involves richer emotional nuance, dynamic prosody, and lyric-based semantics than speech. Existing speech-driven approaches often produce oversimplified, emotionally flat, and semantically inconsistent results. Think2Sing generates semantically coherent and temporally consistent 3D head animations conditioned on both lyrics and acoustics.
arXiv Detail & Related papers (2025-09-02T12:59:27Z)
- X-Dyna: Expressive Dynamic Human Image Animation [49.896933584815926]
X-Dyna is a zero-shot, diffusion-based pipeline for animating a single human image. It generates realistic, context-aware dynamics for both the subject and the surrounding environment.
arXiv Detail & Related papers (2025-01-17T08:10:53Z)
- MikuDance: Animating Character Art with Mixed Motion Dynamics [28.189884806755153]
We propose MikuDance, a diffusion-based pipeline incorporating mixed motion dynamics to animate character art.
Specifically, a Scene Motion Tracking strategy is presented to explicitly model the dynamic camera in pixel-wise space, enabling unified character-scene motion modeling.
A Motion-Adaptive Normalization module is incorporated to effectively inject global scene motion, paving the way for comprehensive character art animation.
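A motion-adaptive normalization layer is commonly realized in the AdaIN/SPADE style: normalize the features, then modulate them with a scale and shift predicted from the conditioning signal. The sketch below follows that pattern under assumed names and shapes; it is not MikuDance's actual module.

```python
# Illustrative motion-adaptive normalization in the AdaIN/SPADE spirit:
# normalize character features, then modulate them with a scale and shift
# predicted from global scene motion. Names and shapes are assumptions,
# not MikuDance's actual module.
import torch
import torch.nn as nn

class MotionAdaptiveNorm(nn.Module):
    def __init__(self, channels, motion_dim, groups=8):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        # Scene motion predicts per-channel modulation parameters.
        self.to_scale_shift = nn.Linear(motion_dim, channels * 2)

    def forward(self, x, scene_motion):
        # x: (B, C, H, W) character features; scene_motion: (B, motion_dim)
        scale, shift = self.to_scale_shift(scene_motion).chunk(2, dim=-1)
        scale = scale[:, :, None, None]
        shift = shift[:, :, None, None]
        return self.norm(x) * (1 + scale) + shift
```

Predicting the modulation from a global scene-motion vector is what lets one signal (camera or scene dynamics) influence every spatial location of the character features at once.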
arXiv Detail & Related papers (2024-11-13T14:46:41Z)
- Generating Human Interaction Motions in Scenes with Text Control [66.74298145999909]
We present TeSMo, a method for text-controlled scene-aware motion generation based on denoising diffusion models.
Our approach begins with pre-training a scene-agnostic text-to-motion diffusion model.
To facilitate training, we embed annotated navigation and interaction motions within scenes.
arXiv Detail & Related papers (2024-04-16T16:04:38Z)
- MotionCrafter: One-Shot Motion Customization of Diffusion Models [66.44642854791807]
We introduce MotionCrafter, a one-shot instance-guided motion customization method.
MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model.
During training, a frozen base model provides appearance normalization, effectively separating appearance from motion.
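One way to picture injecting a reference motion into the temporal pathway is a residual temporal-attention block that receives a projected motion embedding while the spatial (appearance) layers stay frozen. The layout below is an assumption for illustration, not MotionCrafter's architecture.

```python
# Rough sketch of injecting a reference motion into the temporal pathway:
# a residual temporal self-attention block receives a projected motion
# embedding while the (frozen) spatial layers elsewhere handle appearance.
# The layout is an assumption for illustration, not MotionCrafter's code.
import torch
import torch.nn as nn

class TemporalMotionInjection(nn.Module):
    def __init__(self, channels, motion_dim, n_heads=4):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(channels, n_heads,
                                                   batch_first=True)
        self.motion_proj = nn.Linear(motion_dim, channels)

    def forward(self, x, motion_emb):
        # x: (B*HW, T, C) features along time; motion_emb: (B*HW, T, motion_dim)
        h = x + self.motion_proj(motion_emb)    # inject the reference motion
        out, _ = self.temporal_attn(h, h, h)    # temporal self-attention
        return x + out                          # residual update
```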
arXiv Detail & Related papers (2023-12-08T16:31:04Z)
- Speech-Driven 3D Face Animation with Composite and Regional Facial Movements [30.348768852726295]
Speech-driven 3D face animation poses significant challenges due to the intricacy and variability inherent in human facial movements.
This paper emphasizes the importance of considering both the composite and regional natures of facial movements in speech-driven 3D face animation.
arXiv Detail & Related papers (2023-08-10T08:42:20Z)
- Freeform Body Motion Generation from Speech [53.50388964591343]
Body motion generation from speech is inherently difficult due to the non-deterministic mapping from speech to body motions.
We introduce FreeMo, a novel freeform motion generation model built on a two-stream architecture.
Experiments demonstrate superior performance against several baselines.
arXiv Detail & Related papers (2022-03-04T13:03:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.