Enhanced Fine-grained Motion Diffusion for Text-driven Human Motion
Synthesis
- URL: http://arxiv.org/abs/2305.13773v2
- Date: Sun, 24 Dec 2023 06:54:46 GMT
- Title: Enhanced Fine-grained Motion Diffusion for Text-driven Human Motion
Synthesis
- Authors: Dong Wei, Xiaoning Sun, Huaijiang Sun, Bin Li, Shengxiang Hu, Weiqing
Li, Jianfeng Lu
- Abstract summary: We propose DiffKFC, a conditional diffusion model for text-driven motion synthesis with KeyFrames Collaborated.
Our model achieves state-of-the-art performance in terms of semantic fidelity, but more importantly, is able to satisfy animator requirements through fine-grained guidance without tedious labor.
- Score: 21.57205701909026
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The emergence of text-driven motion synthesis techniques provides animators
with great potential to create efficiently. However, in most cases, textual
expressions contain only general and qualitative motion descriptions and lack
fine-grained depiction and sufficient intensity, so the synthesized motions are
either (a) semantically compliant but uncontrollable over specific pose details,
or (b) even deviate from the provided descriptions, leaving animators with
undesired results. In this paper, we propose DiffKFC, a
conditional diffusion model for text-driven motion synthesis with KeyFrames
Collaborated, enabling realistic generation with collaborative and efficient
dual-level control: coarse guidance at the semantic level, with only a few keyframes
for direct and fine-grained depiction down to the body-posture level. Unlike
existing inference-editing diffusion models that incorporate conditions without
training, our conditional diffusion model is explicitly trained and can fully
exploit correlations among texts, keyframes and the diffused target frames. To
preserve the control capability of discrete and sparse keyframes, we customize
dilated mask attention modules in which only the valid tokens indicated by the
dilated keyframe mask participate in local-to-global attention.
Additionally, we develop a simple yet effective smoothness prior, which steers
the generated frames towards seamless keyframe transitions at inference.
Extensive experiments show that our model not only achieves state-of-the-art
performance in terms of semantic fidelity, but more importantly, is able to
satisfy animator requirements through fine-grained guidance without tedious
labor.
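
As a rough illustration of the two inference-time ideas above, the sketch below shows one plausible reading of how a dilated keyframe mask and a smoothness prior could steer a denoised motion estimate. This is not the paper's implementation: the tensor shapes, function names, and the specific acceleration penalty are assumptions.

```python
import torch
import torch.nn.functional as F

def dilate_keyframe_mask(key_mask, radius=2):
    # Expand each keyframe position to a small temporal window; one plausible
    # reading of the "dilated keyframe mask" that gates which tokens attend.
    m = key_mask.float().view(1, 1, -1)                                   # (1, 1, T)
    dilated = F.max_pool1d(m, kernel_size=2 * radius + 1, stride=1, padding=radius)
    return dilated.view(-1).bool()                                        # (T,)

def apply_keyframe_guidance(x0_pred, keyframes, key_mask, smooth_weight=0.1):
    # x0_pred:   (T, D) denoised motion estimate at the current diffusion step
    # keyframes: (T, D) animator-given poses, valid only where key_mask is True
    # key_mask:  (T,)   boolean mask marking keyframe positions
    x0 = x0_pred.clone()
    x0[key_mask] = keyframes[key_mask]        # hard-anchor the sparse keyframes

    # Simple smoothness prior: penalize frame-to-frame acceleration and take one
    # gradient step on the non-keyframe frames to ease keyframe transitions.
    x0 = x0.detach().requires_grad_(True)
    velocity = x0[1:] - x0[:-1]
    smooth_loss = (velocity[1:] - velocity[:-1]).pow(2).mean()
    (grad,) = torch.autograd.grad(smooth_loss, x0)
    with torch.no_grad():
        step = smooth_weight * grad
        step[key_mask] = 0.0                  # keep keyframes fixed
        x0 = x0 - step
    return x0.detach()
```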
Related papers
- Less is More: Improving Motion Diffusion Models with Sparse Keyframes [21.48244441857993]
We propose a novel diffusion framework explicitly designed around sparse and geometrically meaningful keyframes.
Our method reduces computation by masking non-keyframes and efficiently interpolating missing frames.
Our approach consistently outperforms state-of-the-art methods in text alignment and motion realism.
arXiv Detail & Related papers (2025-03-18T03:20:02Z) - SALAD: Skeleton-aware Latent Diffusion for Text-driven Motion Generation and Editing [5.123822132804602]
We introduce a skeleton-aware latent diffusion (SALAD) model that captures the intricate inter-relationships between joints, frames, and words.
By leveraging cross-attention maps produced during the generation process, we enable attention-based zero-shot text-driven motion editing.
Our approach significantly outperforms previous methods in terms of text-motion alignment without compromising generation quality.
arXiv Detail & Related papers (2025-03-18T02:20:11Z) - Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis [27.43583075023949]
Ditto is a diffusion-based talking head framework that enables fine-grained controls and real-time inference.
We show that Ditto generates compelling talking head videos and exhibits superiority in both controllability and real-time performance.
arXiv Detail & Related papers (2024-11-29T07:01:31Z) - KinMo: Kinematic-aware Human Motion Understanding and Generation [6.962697597686156]
Controlling human motion based on text presents an important challenge in computer vision.
Traditional approaches often rely on holistic action descriptions for motion synthesis.
We propose a novel motion representation that decomposes motion into distinct body joint group movements.
arXiv Detail & Related papers (2024-11-23T06:50:11Z) - Flexible Motion In-betweening with Diffusion Models [16.295323675781184]
We investigate the potential of diffusion models in generating diverse human motions guided by keyframes.
Unlike previous inbetweening methods, we propose a simple unified model capable of generating precise and diverse motions.
We evaluate the performance of CondMDI on the text-conditioned HumanML3D dataset.
arXiv Detail & Related papers (2024-05-17T23:55:51Z) - FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, intra-frame correspondence alongside inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z) - Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs.
SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions.
Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z) - Act As You Wish: Fine-Grained Control of Motion Diffusion Model with
Hierarchical Semantic Graphs [31.244039305932287]
We propose hierarchical semantic graphs for fine-grained control over motion generation.
We disentangle motion descriptions into hierarchical semantic graphs including three levels of motions, actions, and specifics.
Our method can continuously refine the generated motion, which may have a far-reaching impact on the community.
arXiv Detail & Related papers (2023-11-02T06:20:23Z) - SemanticBoost: Elevating Motion Generation with Augmented Textual Cues [73.83255805408126]
Our framework comprises a Semantic Enhancement module and a Context-Attuned Motion Denoiser (CAMD)
The CAMD approach provides an all-encompassing solution for generating high-quality, semantically consistent motion sequences.
Our experimental results demonstrate that SemanticBoost, as a diffusion-based method, outperforms auto-regressive-based techniques.
arXiv Detail & Related papers (2023-10-31T09:58:11Z) - Progressive Text-to-Image Diffusion with Soft Latent Direction [17.120153452025995]
This paper introduces an innovative progressive synthesis and editing operation that systematically incorporates entities into the target image.
Our proposed framework yields notable advancements in object synthesis, particularly when confronted with intricate and lengthy textual inputs.
arXiv Detail & Related papers (2023-09-18T04:01:25Z) - MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis.
We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework.
We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z) - Text-driven Video Prediction [83.04845684117835]
We propose a new task called Text-driven Video Prediction (TVP)
Taking the first frame and text caption as inputs, this task aims to synthesize the following frames.
To investigate the capability of text in causal inference for progressive motion information, our TVP framework contains a Text Inference Module (TIM)
arXiv Detail & Related papers (2022-10-06T12:43:07Z) - MoDi: Unconditional Motion Synthesis from Diverse Data [51.676055380546494]
We present MoDi, an unconditional generative model that synthesizes diverse motions.
Our model is trained in a completely unsupervised setting from a diverse, unstructured and unlabeled motion dataset.
We show that despite the lack of any structure in the dataset, the latent space can be semantically clustered.
arXiv Detail & Related papers (2022-06-16T09:06:25Z)