Related papers: Video Diffusion Models are Training-free Motion Interpreter and Controller

Video Diffusion Models are Training-free Motion Interpreter and Controller

URL: http://arxiv.org/abs/2405.14864v3
Date: Tue, 12 Nov 2024 01:31:41 GMT
Title: Video Diffusion Models are Training-free Motion Interpreter and Controller
Authors: Zeqi Xiao, Yifan Zhou, Shuai Yang, Xingang Pan,
Abstract summary: This paper introduces a novel perspective to understand, localize, and manipulate motion-aware features in video diffusion models. We present a new MOtion FeaTure (MOFT) by eliminating content correlation information and filtering motion channels.
Score: 20.361790608772157
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Video generation primarily aims to model authentic and customized motion across frames, making understanding and controlling the motion a crucial topic. Most diffusion-based studies on video motion focus on motion customization with training-based paradigms, which, however, demands substantial training resources and necessitates retraining for diverse models. Crucially, these approaches do not explore how video diffusion models encode cross-frame motion information in their features, lacking interpretability and transparency in their effectiveness. To answer this question, this paper introduces a novel perspective to understand, localize, and manipulate motion-aware features in video diffusion models. Through analysis using Principal Component Analysis (PCA), our work discloses that robust motion-aware feature already exists in video diffusion models. We present a new MOtion FeaTure (MOFT) by eliminating content correlation information and filtering motion channels. MOFT provides a distinct set of benefits, including the ability to encode comprehensive motion information with clear interpretability, extraction without the need for training, and generalizability across diverse architectures. Leveraging MOFT, we propose a novel training-free video motion control framework. Our method demonstrates competitive performance in generating natural and faithful motion, providing architecture-agnostic insights and applicability in a variety of downstream tasks.

Related papers

Moaw: Unleashing Motion Awareness for Video Diffusion Models [71.34328578845721]
Moaw is a framework that unleashes motion awareness for video diffusion models.<n>We train a diffusion model for motion perception, shifting its modality from image-to-video generation to video-to-dense-tracking.<n>We then construct a motion-labeled dataset to identify features that encode the strongest motion information, and inject them into a structurally identical video generation model.
arXiv Detail & Related papers (2026-01-19T06:45:46Z)
Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video [38.71994714429696]
We propose a novel and general framework to disentangle video data into its dynamic motion and static content components.<n>Our proposed method is a self-supervised pipeline with less assumptions and inductive biases than previous works.<n>We validate our disentangled representation learning framework on real-world talking head videos with motion transfer and auto-regressive motion generation tasks.
arXiv Detail & Related papers (2025-09-10T08:14:45Z)
SynMotion: Semantic-Visual Adaptation for Motion Customized Video Generation [56.90807453045657]
SynMotion is a motion-customized video generation model that jointly leverages semantic guidance and visual adaptation.<n>At the semantic level, we introduce the dual-em semantic comprehension mechanism which disentangles subject and motion representations.<n>At the visual level, we integrate efficient motion adapters into a pre-trained video generation model to enhance motion fidelity and temporal coherence.
arXiv Detail & Related papers (2025-06-30T10:09:32Z)
Self-Supervised Learning of Motion Concepts by Optimizing Counterfactuals [13.202236467650033]
Estimating motion in videos is an essential computer vision problem with many downstream applications. We develop Opt-CWM, a self-supervised technique for flow and occlusion estimation from a pre-trained next-frame prediction model. We achieve state-of-the-art performance for motion estimation on real-world videos while requiring no labeled data.
arXiv Detail & Related papers (2025-03-25T17:58:52Z)
MotionMatcher: Motion Customization of Text-to-Video Diffusion Models via Motion Feature Matching [27.28898943916193]
Text-to-video (T2V) diffusion models have promising capabilities in synthesizing realistic videos from input text prompts. In this work, we tackle the motion customization problem, where a reference video is provided as motion guidance. We propose MotionMatcher, a motion customization framework that fine-tunes the pre-trained T2V diffusion model at the feature level.
arXiv Detail & Related papers (2025-02-18T19:12:51Z)
VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models [71.9811050853964]
VideoJAM is a novel framework that instills an effective motion prior to video generators. VideoJAM achieves state-of-the-art performance in motion coherence. These findings emphasize that appearance and motion can be complementary and, when effectively integrated, enhance both the visual quality and the coherence of video generation.
arXiv Detail & Related papers (2025-02-04T17:07:10Z)
MotionFlow: Attention-Driven Motion Transfer in Video Diffusion Models [3.2311303453753033]
We introduce MotionFlow, a novel framework designed for motion transfer in video diffusion models. Our method utilizes cross-attention maps to accurately capture and manipulate spatial and temporal dynamics. Our experiments demonstrate that MotionFlow significantly outperforms existing models in both fidelity and versatility even during drastic scene alterations.
arXiv Detail & Related papers (2024-12-06T18:59:12Z)
MoTrans: Customized Motion Transfer with Text-driven Video Diffusion Models [59.10171699717122]
MoTrans is a customized motion transfer method enabling video generation of similar motion in new context. multimodal representations from recaptioned prompt and video frames promote the modeling of appearance. Our method effectively learns specific motion pattern from singular or multiple reference videos.
arXiv Detail & Related papers (2024-12-02T10:07:59Z)
MotionCom: Automatic and Motion-Aware Image Composition with LLM and Video Diffusion Prior [51.672193627686]
MotionCom is a training-free motion-aware diffusion based image composition. It enables seamless integration of target objects into new scenes with dynamically coherent results.
arXiv Detail & Related papers (2024-09-16T08:44:17Z)
Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
arXiv Detail & Related papers (2024-03-22T14:47:18Z)
Animate Your Motion: Turning Still Images into Dynamic Videos [58.63109848837741]
We introduce Scene and Motion Conditional Diffusion (SMCD), a novel methodology for managing multimodal inputs. SMCD incorporates a recognized motion conditioning module and investigates various approaches to integrate scene conditions. Our design significantly enhances video quality, motion precision, and semantic coherence.
arXiv Detail & Related papers (2024-03-15T10:36:24Z)
Customizing Motion in Text-to-Video Diffusion Models [79.4121510826141]
We introduce an approach for augmenting text-to-video generation models with customized motions. By leveraging a few video samples demonstrating specific movements as input, our method learns and generalizes the input motion patterns for diverse, text-specified scenarios.
arXiv Detail & Related papers (2023-12-07T18:59:03Z)
TrackDiffusion: Tracklet-Conditioned Video Generation via Diffusion Models [75.20168902300166]
We propose TrackDiffusion, a novel video generation framework affording fine-grained trajectory-conditioned motion control. A pivotal component of TrackDiffusion is the instance enhancer, which explicitly ensures inter-frame consistency of multiple objects. generated video sequences by our TrackDiffusion can be used as training data for visual perception models.
arXiv Detail & Related papers (2023-12-01T15:24:38Z)
Motion-Conditioned Diffusion Model for Controllable Video Synthesis [75.367816656045]
We introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes. We show that MCDiff achieves the state-the-art visual quality in stroke-guided controllable video synthesis.
arXiv Detail & Related papers (2023-04-27T17:59:32Z)
Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
Motion-Contrastive Perception Network (MCPNet) MCPNet consists of two branches, namely, Motion Information Perception (MIP) and Contrastive Instance Perception (CIP) Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
MotionSqueeze: Neural Motion Feature Learning for Video Understanding [46.82376603090792]
Motion plays a crucial role in understanding videos and most state-of-the-art neural models for video classification incorporate motion information. In this work, we replace external and heavy computation of optical flows with internal and light-weight learning of motion features. We demonstrate that the proposed method provides a significant gain on four standard benchmarks for action recognition with only a small amount of additional cost.
arXiv Detail & Related papers (2020-07-20T08:30:14Z)

This list is automatically generated from the titles and abstracts of the papers in this site.