MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training
- URL: http://arxiv.org/abs/2511.21592v1
- Date: Wed, 26 Nov 2025 17:09:03 GMT
- Title: MoGAN: Improving Motion Quality in Video Diffusion via Few-Step Motion Adversarial Post-Training
- Authors: Haotian Xue, Qi Chen, Zhonghao Wang, Xun Huang, Eli Shechtman, Jinrong Xie, Yongxin Chen
- Abstract summary: Video diffusion models achieve strong frame-level fidelity but struggle with motion coherence, dynamics, and realism. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data.
- Score: 46.09617860476419
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Video diffusion models achieve strong frame-level fidelity but still struggle with motion coherence, dynamics, and realism, often producing jitter, ghosting, or implausible dynamics. A key limitation is that the standard denoising MSE objective provides no direct supervision on temporal consistency, allowing models to achieve low loss while still generating poor motion. We propose MoGAN, a motion-centric post-training framework that improves motion realism without reward models or human preference data. Building atop a 3-step distilled video diffusion model, we train a DiT-based optical-flow discriminator to differentiate real from generated motion, combined with a distribution-matching regularizer to preserve visual fidelity. In experiments on Wan2.1-T2V-1.3B, MoGAN substantially improves motion quality across benchmarks. On VBench, MoGAN boosts the motion score by +7.3% over the 50-step teacher and +13.3% over the 3-step DMD model. On VideoJAM-Bench, MoGAN improves the motion score by +7.4% over the teacher and +8.8% over DMD, while maintaining comparable or even better aesthetic and image-quality scores. A human study further confirms that MoGAN is preferred for motion quality (52% vs. 38% for the teacher; 56% vs. 29% for DMD). Overall, MoGAN delivers significantly more realistic motion without sacrificing visual fidelity or efficiency, offering a practical path toward fast, high-quality video generation. Project webpage: https://xavihart.github.io/mogan.
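The abstract describes the training recipe at a high level: a few-step distilled generator, a discriminator that operates on optical flow rather than RGB frames, and a distribution-matching term for appearance. The PyTorch sketch below is a hypothetical illustration of that loop, not the authors' released code: `FlowDiscriminator`, `flow_net`, `generator`, and `dmd_reg` are invented placeholder names, the toy 3D-conv discriminator stands in for the paper's DiT, and the hinge-GAN objective is an assumption.

```python
# Hypothetical sketch of MoGAN-style motion-adversarial post-training.
# All module names and the hinge loss are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowDiscriminator(nn.Module):
    """Stand-in for the paper's DiT-based discriminator over optical flow.
    A toy 3D-conv stack is used here purely for illustration."""
    def __init__(self, in_ch: int = 2, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, dim, kernel_size=3, stride=2, padding=1),
            nn.SiLU(),
            nn.Conv3d(dim, 1, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, flow: torch.Tensor) -> torch.Tensor:
        # flow: (B, 2, T-1, H, W) -> one realism logit per clip
        return self.net(flow).mean(dim=(1, 2, 3, 4))

def train_step(generator, disc, flow_net, dmd_reg,
               real_video, noise, prompt_emb,
               opt_g, opt_d, lambda_dmd: float = 1.0):
    # 1) Few-step generation (the paper uses a 3-step distilled student).
    fake_video = generator(noise, prompt_emb)            # (B, 3, T, H, W)

    # 2) Optical flow from real and generated clips. flow_net is any frozen
    #    pretrained flow estimator over consecutive frames; its parameters
    #    are frozen but gradients still flow through fake_video.
    with torch.no_grad():
        real_flow = flow_net(real_video)
    fake_flow = flow_net(fake_video)

    # 3) Discriminator step: separate real motion from generated motion.
    #    (Hinge loss is an assumption; the paper's exact GAN loss may differ.)
    d_loss = (F.relu(1.0 - disc(real_flow)).mean()
              + F.relu(1.0 + disc(fake_flow.detach())).mean())
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # 4) Generator step: fool the flow discriminator on motion, while a
    #    DMD-style distribution-matching term anchors visual fidelity.
    g_loss = -disc(fake_flow).mean() + lambda_dmd * dmd_reg(fake_video, prompt_emb)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```

The salient design choice the abstract highlights is that the discriminator never sees RGB frames, only flow, so the adversarial gradient pushes exclusively on motion; the distribution-matching regularizer is what keeps frame-level appearance from drifting.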
Related papers
- Motion Attribution for Video Generation [97.2515042185441]
We present Motive, a motion-centric, gradient-based data attribution framework. We use it to study which fine-tuning clips improve or degrade temporal dynamics. To our knowledge, this is the first framework to attribute motion rather than visual appearance in video generative models.
arXiv Detail & Related papers (2026-01-13T18:59:09Z) - Real-Time Motion-Controllable Autoregressive Video Diffusion [79.32730467857535]
We propose AR-Drag, the first RL-enhanced few-step AR video diffusion model for real-time image-to-video generation with diverse motion control. We first fine-tune a base I2V model to support basic motion control, then further improve it via reinforcement learning with a trajectory-based reward model. Our design preserves the Markov property through a Self-Rollout learning mechanism and accelerates training via selective denoising steps.
arXiv Detail & Related papers (2025-10-09T12:17:11Z) - VimoRAG: Video-based Retrieval-augmented 3D Motion Generation for Motion Language Models [110.32291962407078]
VimoRAG is a video-based retrieval-augmented motion generation framework for motion large language models. We develop an effective motion-centered video retrieval model and mitigate the issue of error propagation caused by suboptimal retrieval results. Experimental results show that VimoRAG significantly boosts the performance of motion LLMs constrained to text-only input.
arXiv Detail & Related papers (2025-08-16T15:31:14Z) - Physics-Guided Motion Loss for Video Generation Model [8.083315267770255]
Current video diffusion models generate visually compelling content but often violate basic laws of physics. We introduce a frequency-domain physics prior that improves motion plausibility without modifying model architectures. (A generic sketch of the frequency-domain idea appears after this list.)
arXiv Detail & Related papers (2025-06-02T20:42:54Z) - Teller: Real-Time Streaming Audio-Driven Portrait Animation with Autoregressive Motion Generation [18.45773436423025]
We introduce the first autoregressive framework for real-time, audio-driven portrait animation, a.k.a. talking head. We propose Teller, the first streaming audio-driven portrait animation framework with autoregressive motion generation.
arXiv Detail & Related papers (2025-03-24T08:16:47Z) - MotionStone: Decoupled Motion Intensity Modulation with Diffusion Transformer for Image-to-Video Generation [55.238542326124545]
Image-to-video (I2V) generation is conditioned on a static image and has recently been enhanced with motion intensity as an additional control signal. These motion-aware models are appealing for generating diverse motion patterns, yet a reliable motion estimator for training such models on large-scale in-the-wild video sets has been lacking. This paper addresses the challenge with a new motion estimator capable of measuring the decoupled motion intensities of objects and cameras in video.
arXiv Detail & Related papers (2024-12-08T08:12:37Z) - Spectral Motion Alignment for Video Motion Transfer using Diffusion Models [54.32923808964701]
Spectral Motion Alignment (SMA) is a framework that refines and aligns motion vectors using Fourier and wavelet transforms. SMA learns motion patterns by incorporating frequency-domain regularization, facilitating the learning of whole-frame global motion dynamics. Extensive experiments demonstrate SMA's efficacy in improving motion transfer while maintaining computational efficiency and compatibility across various video customization frameworks.
arXiv Detail & Related papers (2024-03-22T14:47:18Z) - MotionMix: Weakly-Supervised Diffusion for Controllable Motion Generation [19.999239668765885]
MotionMix is a weakly-supervised diffusion model that leverages both noisy and unannotated motion sequences.
Our framework consistently achieves state-of-the-art performance on text-to-motion, action-to-motion, and music-to-dance tasks.
arXiv Detail & Related papers (2024-01-20T04:58:06Z) - Motion Sensitive Contrastive Learning for Self-supervised Video Representation [34.854431881562576]
Motion Sensitive Contrastive Learning (MSCL) injects the motion information captured by optical flow into RGB frames to strengthen feature learning. It introduces Local Motion Contrastive Learning (LMCL) with frame-level contrastive objectives across the two modalities, Flow Rotation Augmentation (FRA) to generate extra motion-shuffled negative samples, and Motion Differential Sampling (MDS) to accurately screen training samples.
arXiv Detail & Related papers (2022-08-12T04:06:56Z)
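Two entries above (the physics-guided motion loss and Spectral Motion Alignment) regularize motion in the frequency domain. As a rough illustration of that general idea, and not either paper's actual objective, the sketch below penalizes high temporal-frequency energy in frame differences, which corresponds to frame-to-frame jitter. The function name and the cutoff heuristic are invented for this example.

```python
# Generic frequency-domain motion regularizer (illustrative only; not the
# loss from either cited paper). Assumes enough frames that the temporal
# spectrum has bins above the cutoff.
import torch

def temporal_spectrum_penalty(video: torch.Tensor, cutoff: int = 4) -> torch.Tensor:
    """video: (B, C, T, H, W). Returns the fraction of motion energy above a
    temporal-frequency cutoff, discouraging high-frequency jitter."""
    motion = video[:, :, 1:] - video[:, :, :-1]   # frame differences, (B, C, T-1, H, W)
    spec = torch.fft.rfft(motion, dim=2)          # FFT along the time axis
    energy = spec.abs().pow(2)
    high = energy[:, :, cutoff:].mean()           # energy in the high-frequency band
    total = energy.mean() + 1e-8
    return high / total

# Usage: add `lambda_f * temporal_spectrum_penalty(fake_video)` to a generator loss.
```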
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality or accuracy of this information and is not responsible for any consequences of its use.