Characterizing Motion Encoding in Video Diffusion Timesteps
- URL: http://arxiv.org/abs/2512.22175v1
- Date: Thu, 18 Dec 2025 21:20:54 GMT
- Title: Characterizing Motion Encoding in Video Diffusion Timesteps
- Authors: Vatsal Baherwani, Yixuan Ren, Abhinav Shrivastava
- Abstract summary: We study how motion is encoded in video diffusion timesteps via the trade-off between appearance editing and motion preservation. We identify an early, motion-dominant regime and a later, appearance-dominant regime, yielding an operational motion-appearance boundary in timestep space.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-video diffusion models synthesize temporal motion and spatial appearance through iterative denoising, yet how motion is encoded across timesteps remains poorly understood. Practitioners often exploit the empirical heuristic that early timesteps mainly shape motion and layout while later ones refine appearance, but this behavior has not been systematically characterized. In this work, we proxy motion encoding in video diffusion timesteps by the trade-off between appearance editing and motion preservation induced when injecting new conditions over specified timestep ranges, and characterize this proxy through a large-scale quantitative study. This protocol allows us to factor motion from appearance by quantitatively mapping how they compete along the denoising trajectory. Across diverse architectures, we consistently identify an early, motion-dominant regime and a later, appearance-dominant regime, yielding an operational motion-appearance boundary in timestep space. Building on this characterization, we simplify the current one-shot motion customization paradigm by restricting training and inference to the motion-dominant regime, achieving strong motion transfer without auxiliary debiasing modules or specialized objectives. Our analysis turns a widely used heuristic into a spatiotemporal disentanglement principle, and our timestep-constrained recipe can be readily integrated into existing motion transfer and editing methods.
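As a rough illustration of the timestep-constrained recipe described in the abstract, the sketch below restricts training-time timestep sampling to the early, motion-dominant regime. The `boundary` fraction and the schedule length are hypothetical placeholders, not values reported by the paper.

```python
import numpy as np

def motion_dominant_timesteps(num_steps=1000, boundary=0.6):
    """Early (high-noise) timesteps of a diffusion schedule.

    Timesteps run from num_steps - 1 (pure noise) down to 0 (clean video);
    the motion-dominant regime is the high-noise end, above the boundary.
    The 0.6 boundary here is an illustrative assumption.
    """
    cutoff = int(num_steps * boundary)  # hypothetical motion-appearance boundary
    return np.arange(num_steps - 1, cutoff - 1, -1)

def sample_training_timestep(rng, num_steps=1000, boundary=0.6):
    """Draw a timestep only from the motion-dominant regime, mimicking
    timestep-constrained motion-customization training."""
    return int(rng.choice(motion_dominant_timesteps(num_steps, boundary)))
```

Inference can use the same mask: inject the motion condition only while the sampler is inside this range, and denoise the remaining (appearance-dominant) steps unconditionally.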
Related papers
- Towards Arbitrary Motion Completing via Hierarchical Continuous Representation [64.6525112550758]
We propose a novel parametric activation-induced hierarchical implicit representation framework, called NAME, based on implicit neural representations (INRs). Our method introduces a hierarchical temporal encoding mechanism that extracts features from motion sequences at multiple temporal scales, enabling effective capture of intricate temporal patterns.
arXiv Detail & Related papers (2025-12-24T14:07:04Z) - FunPhase: A Periodic Functional Autoencoder for Motion Generation via Phase Manifolds [2.6041136107390037]
We introduce FunPhase, a functional periodic autoencoder that learns a phase manifold for motion and replaces discrete temporal decoding with a function-space formulation. FunPhase supports downstream tasks such as super-resolution and partial-body motion completion, generalizes across skeletons and datasets, and unifies motion prediction and generation within a single interpretable manifold.
arXiv Detail & Related papers (2025-12-10T08:46:53Z) - FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation [51.110607281391154]
FlowMo is a training-free guidance method for enhancing motion coherence in text-to-video models. It estimates motion coherence by measuring the patch-wise variance across the temporal dimension and guides the model to reduce this variance dynamically during sampling.
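The variance signal FlowMo describes can be sketched as follows; the patch size and the `(T, H, W)` latent layout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def patchwise_temporal_variance(latents, patch=4):
    """Proxy for motion (in)coherence: variance of patch means over time.

    latents: array of shape (T, H, W); H and W must be divisible by `patch`.
    Returns a scalar -- lower values mean more temporally stable patches.
    """
    t, h, w = latents.shape
    # Average each non-overlapping spatial patch, keeping the time axis.
    patches = latents.reshape(t, h // patch, patch, w // patch, patch).mean(axis=(2, 4))
    # Variance across time per patch, averaged into one guidance scalar.
    return float(patches.var(axis=0).mean())
```

A guidance loop would differentiate this scalar with respect to the current latent and nudge sampling toward lower variance.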
arXiv Detail & Related papers (2025-06-01T19:55:33Z) - REWIND: Real-Time Egocentric Whole-Body Motion Diffusion with Exemplar-Based Identity Conditioning [95.07708090428814]
We present REWIND, a one-step diffusion model for real-time, high-fidelity human motion estimation from egocentric image inputs. We introduce cascaded body-hand denoising diffusion, which effectively models the correlation between egocentric body and hand motions. We also propose a novel identity conditioning method based on a small set of pose exemplars of the target identity, which further enhances motion estimation quality.
arXiv Detail & Related papers (2025-04-07T11:44:11Z) - Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
arXiv Detail & Related papers (2025-03-25T17:58:48Z) - Training-Free Motion-Guided Video Generation with Enhanced Temporal Consistency Using Motion Consistency Loss [35.69606926024434]
We propose a simple yet effective solution that combines an initial-noise-based approach with a novel motion consistency loss. We then design a motion consistency loss to maintain similar feature correlation patterns in the generated video. This approach improves temporal consistency across various motion control tasks while preserving the benefits of a training-free setup.
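One way to read "maintain similar feature correlation patterns" is a loss between frame-feature similarity matrices; the `(T, D)` feature layout and the use of cosine similarity here are assumptions for illustration, not the paper's exact loss.

```python
import numpy as np

def frame_correlation(features):
    """Cosine-similarity matrix (T x T) between per-frame features of shape (T, D)."""
    f = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)
    return f @ f.T

def motion_consistency_loss(ref_feats, gen_feats):
    """Mean squared difference between reference and generated correlation patterns."""
    diff = frame_correlation(ref_feats) - frame_correlation(gen_feats)
    return float(np.mean(diff ** 2))
```

Matching the pattern of inter-frame similarities, rather than the features themselves, constrains how motion evolves while leaving appearance free to change.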
arXiv Detail & Related papers (2025-01-13T18:53:08Z) - Motion Inversion for Video Customization [31.607669029754874]
We present a novel approach for motion customization in video generation, addressing the widespread gap in the exploration of motion representation within video models.
We introduce Motion Embeddings, a set of explicit, temporally coherent embeddings derived from a given video.
Our contributions include a tailored motion embedding for customization tasks and a demonstration of the practical advantages and effectiveness of our method.
arXiv Detail & Related papers (2024-03-29T14:14:22Z) - DiffusionPhase: Motion Diffusion in Frequency Domain [69.811762407278]
We introduce a learning-based method for generating high-quality human motion sequences from text descriptions.
Existing techniques struggle with motion diversity and smooth transitions in generating arbitrary-length motion sequences.
We develop a network encoder that converts the motion space into a compact yet expressive parameterized phase space.
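A minimal sketch of a periodic phase parameterization in the spirit of the encoder described above; the frequency set and the sin/cos feature layout are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def phase_encode(t, freqs=(1.0, 2.0, 4.0)):
    """Map scalar times t (shape (N,)) to periodic phase features (N, 2F).

    Each frequency contributes a (sin, cos) pair, so motion parameterized in
    this space repeats smoothly with period 1 for integer frequencies --
    convenient for arbitrary-length generation and smooth transitions.
    """
    angles = 2.0 * np.pi * np.outer(np.asarray(t, dtype=float), freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)
```

Because the features are periodic, sampling beyond the training time range wraps around the phase manifold instead of extrapolating.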
arXiv Detail & Related papers (2023-12-07T04:39:22Z) - Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition [42.175450800733785]
We propose a rich motion representation based on space-time self-similarity (STSS). We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it.
The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision.
arXiv Detail & Related papers (2021-02-14T07:32:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.