FUSION: Full-Body Unified Motion Prior for Body and Hands via Diffusion
- URL: http://arxiv.org/abs/2601.03959v1
- Date: Wed, 07 Jan 2026 14:18:59 GMT
- Title: FUSION: Full-Body Unified Motion Prior for Body and Hands via Diffusion
- Authors: Enes Duran, Nikos Athanasiou, Muhammed Kocabas, Michael J. Black, Omid Taheri
- Abstract summary: Hands are central to interacting with our surroundings and conveying gestures. Existing human motion synthesis methods fall short. A key obstacle is the lack of large-scale datasets that jointly capture diverse full-body motion.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Hands are central to interacting with our surroundings and conveying gestures, making their inclusion essential for full-body motion synthesis. Despite this, existing human motion synthesis methods fall short: some ignore hand motions entirely, while others generate full-body motions only for narrowly scoped tasks under highly constrained settings. A key obstacle is the lack of large-scale datasets that jointly capture diverse full-body motion with detailed hand articulation. While some datasets capture both, they are limited in scale and diversity. Conversely, large-scale datasets typically focus either on body motion without hands or on hand motions without the body. To overcome this, we curate and unify existing hand motion datasets with large-scale body motion data to generate full-body sequences that capture both hand and body. We then propose the first diffusion-based unconditional full-body motion prior, FUSION, which jointly models body and hand motion. Despite using a pose-based motion representation, FUSION surpasses state-of-the-art skeletal control models on the Keypoint Tracking task in the HumanML3D dataset and achieves superior motion naturalness. Beyond standard benchmarks, we demonstrate that FUSION can go beyond typical uses of motion priors through two applications: (1) generating detailed full-body motion including fingers during interaction given the motion of an object, and (2) generating Self-Interaction motions using an LLM to transform natural language cues into actionable motion constraints. For these applications, we develop an optimization pipeline that refines the latent space of our diffusion model to generate task-specific motions. Experiments on these tasks highlight precise control over hand motion while maintaining plausible full-body coordination. The code will be public.
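The abstract describes an optimization pipeline that refines the latent space of a frozen diffusion prior to satisfy task constraints (e.g., object or self-interaction targets). The sketch below is purely illustrative and hypothetical: FUSION's code is not yet public, so the frozen diffusion model is replaced by a toy linear decoder and the task constraint by a simple keypoint-matching cost. Only the overall pattern, gradient descent on the latent while the generative model stays fixed, mirrors what the abstract describes.

```python
import numpy as np

# Hypothetical stand-in for latent-space optimization over a frozen motion prior.
# W plays the role of the (frozen) diffusion model's decoder; in the real system
# this would be a full denoising network, not a linear map.
rng = np.random.default_rng(0)
LATENT_DIM, MOTION_DIM = 8, 16
W = rng.standard_normal((MOTION_DIM, LATENT_DIM)) * 0.1  # toy frozen "decoder"

def decode(z):
    """Stand-in for running the frozen generative model from latent z."""
    return W @ z

def constraint_cost(motion, target, idx):
    """Toy task constraint: selected 'keypoints' should match target values."""
    return float(np.sum((motion[idx] - target) ** 2))

def optimize_latent(z, target, idx, lr=0.5, steps=1000):
    """Gradient descent on the latent; the decoder itself is never updated."""
    for _ in range(steps):
        motion = decode(z)
        grad_motion = np.zeros(MOTION_DIM)
        grad_motion[idx] = 2.0 * (motion[idx] - target)  # d cost / d motion
        z = z - lr * (W.T @ grad_motion)  # chain rule through the linear decoder
    return z

idx = np.array([0, 3, 7])                 # hypothetical constrained "keypoints"
target = np.array([0.5, -0.2, 0.1])       # hypothetical task targets
z0 = rng.standard_normal(LATENT_DIM)
z_opt = optimize_latent(z0.copy(), target, idx)
```

In the actual method the cost would encode contact or interaction constraints (possibly produced by an LLM from language cues, per application 2), and the optimization would run through the diffusion sampling process rather than a single decode.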
Related papers
- CoDA: Coordinated Diffusion Noise Optimization for Whole-Body Manipulation of Articulated Objects
Whole-body manipulation of articulated objects is a critical yet challenging task with broad applications in virtual humans and robotics. We propose a novel coordinated diffusion noise optimization framework to achieve realistic whole-body motion. We conduct extensive experiments demonstrating that our method outperforms existing approaches in motion quality and physical plausibility.
arXiv Detail & Related papers (2025-05-27T17:11:50Z)
- Diffgrasp: Whole-Body Grasping Synthesis Guided by Object Motion Using a Diffusion Model
We propose a simple yet effective framework that jointly models the relationship between the body, hands, and the given object motion sequences. We introduce novel contact-aware losses and incorporate a data-driven, carefully designed guidance. Experimental results demonstrate that our approach outperforms the state-of-the-art method and generates plausible whole-body motion sequences.
arXiv Detail & Related papers (2024-12-30T02:21:43Z)
- FreeMotion: MoCap-Free Human Motion Synthesis with Multimodal Large Language Models
We explore open-set human motion synthesis based on MLLMs, using natural language instructions as user control signals.
Our method can achieve general human motion synthesis for many downstream tasks.
arXiv Detail & Related papers (2024-06-15T21:10:37Z)
- FreeMotion: A Unified Framework for Number-free Text-to-Motion Synthesis
This paper reconsiders motion generation and proposes to unify single- and multi-person motion via a conditional motion distribution.
Based on our framework, the current single-person motion spatial control method could be seamlessly integrated, achieving precise control of multi-person motion.
arXiv Detail & Related papers (2024-05-24T17:57:57Z)
- Universal Humanoid Motion Representations for Physics-Based Control
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control.
We first learn a motion imitator that can imitate all of human motion from a large, unstructured motion dataset.
We then create our motion representation by distilling skills directly from the imitator.
arXiv Detail & Related papers (2023-10-06T20:48:43Z)
- Object Motion Guided Human Motion Synthesis
We study the problem of full-body human motion synthesis for the manipulation of large-sized objects.
We propose Object MOtion guided human MOtion synthesis (OMOMO), a conditional diffusion framework.
We develop a novel system that captures full-body human manipulation motions by simply attaching a smartphone to the object being manipulated.
arXiv Detail & Related papers (2023-09-28T08:22:00Z)
- Priority-Centric Human Motion Generation in Discrete Latent Space
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
arXiv Detail & Related papers (2023-08-28T10:40:16Z)
- GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency
Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment.
Modeling realistic hand-object interactions is critical for applications in computer graphics, computer vision, and mixed reality.
GRIP is a learning-based method that takes as input the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction.
arXiv Detail & Related papers (2023-08-22T17:59:51Z)
- Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations
TOHO: Task-Oriented Human-Object Interactions Generation with Implicit Neural Representations.
Our method generates continuous motions that are parameterized only by the temporal coordinate.
This work takes a step further toward general human-scene interaction simulation.
arXiv Detail & Related papers (2023-03-23T09:31:56Z)
- Task-Generic Hierarchical Human Motion Prior using VAEs
A deep generative model that describes human motions can benefit a wide range of fundamental computer vision and graphics tasks.
We present a method for learning complex human motions independent of specific tasks using a combined global and local latent space.
We demonstrate the effectiveness of our hierarchical motion variational autoencoder in a variety of tasks including video-based human pose estimation.
arXiv Detail & Related papers (2021-06-07T23:11:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.