Self-Refining Video Sampling
- URL: http://arxiv.org/abs/2601.18577v1
- Date: Mon, 26 Jan 2026 15:22:27 GMT
- Title: Self-Refining Video Sampling
- Authors: Sangwon Jang, Taekyung Ki, Jaehyeong Jo, Saining Xie, Jaehong Yoon, Sung Ju Hwang
- Abstract summary: We present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment.
- Score: 91.0784344916165
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, which prevents artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements in motion coherence and physics alignment, achieving over 70% human preference over both the default sampler and a guidance-based sampler.
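The abstract describes the mechanism only in prose. Below is a minimal PyTorch sketch of how an uncertainty-aware inner-loop refinement of this kind could be organized; the `denoiser` and `add_noise` callables, the refinement noise level, the number of consistency draws, and the variance-threshold mask are illustrative assumptions rather than the authors' implementation.

```python
import torch

@torch.no_grad()
def self_refine(video, denoiser, add_noise, t_refine=0.4, n_iters=3, n_draws=2, tau=0.05):
    """Illustrative inner-loop refinement of an already-sampled video.

    video:     (B, C, T, H, W) clip from the default sampler
    denoiser:  callable(noisy, t) -> denoised clip; the pre-trained generator
               viewed as a denoising autoencoder
    add_noise: callable(clean, t) -> partially re-noised clip at noise level t
    """
    for _ in range(n_iters):
        # Draw several independent re-noise/denoise passes to estimate self-consistency.
        draws = []
        for _ in range(n_draws):
            noisy = add_noise(video, t_refine)       # partially re-noise
            draws.append(denoiser(noisy, t_refine))  # denoise back
        draws = torch.stack(draws, dim=0)            # (n_draws, B, C, T, H, W)

        refined = draws.mean(dim=0)
        # Per-pixel disagreement across draws is used here as an uncertainty proxy:
        # low variance -> self-consistent -> accept the refinement; otherwise keep the original.
        uncertainty = draws.var(dim=0).mean(dim=1, keepdim=True)  # (B, 1, T, H, W)
        mask = (uncertainty < tau).float()

        video = mask * refined + (1.0 - mask) * video
    return video
```

In this sketch, regions where repeated re-noise/denoise passes disagree are left untouched, which mirrors the stated goal of avoiding artifacts from over-refinement.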
Related papers
- LSA: Localized Semantic Alignment for Enhancing Temporal Consistency in Traffic Video Generation [44.62533878314138]
Localized Semantic Alignment (LSA) is a framework for fine-tuning pre-trained video generation models. LSA enhances temporal consistency by aligning semantic features between ground-truth and generated video clips. Experiments on the nuScenes and KITTI datasets show the effectiveness of the approach.
arXiv Detail & Related papers (2026-02-05T18:21:02Z)
- Video Prediction of Dynamic Physical Simulations With Pixel-Space Spatiotemporal Transformers [3.951575888190684]
This study investigates a transformer adaptation for video prediction with a simple end-to-end approach, comparing various spatiotemporal self-attention layouts. We introduce a simple yet effective transformer for autoregressive video prediction that operates on continuous pixel-space representations over the prediction horizon.
arXiv Detail & Related papers (2025-10-23T17:58:45Z)
- Training-Free Motion Customization for Distilled Video Generators with Adaptive Test-Time Distillation [53.877572078307935]
Distilled video generation models offer fast and efficient sampling but struggle with motion customization when guided by reference videos. We propose MotionEcho, a training-free test-time distillation framework that enables motion customization by leveraging diffusion teacher forcing.
arXiv Detail & Related papers (2025-06-24T06:20:15Z)
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must, at inference time, generate sequences conditioned on their own imperfect outputs; a minimal generic sketch of this train-test gap appears after the list below.
arXiv Detail & Related papers (2025-06-09T17:59:55Z)
- Refining Pre-Trained Motion Models [56.18044168821188]
We take on the challenge of improving state-of-the-art supervised models with self-supervised training.
We focus on obtaining a "clean" training signal from real-world unlabelled video.
We show that our method yields reliable gains over fully-supervised methods in real videos.
arXiv Detail & Related papers (2024-01-01T18:59:33Z)
- StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation [73.54398908446906]
We introduce a novel motion generator design that uses a learning-based GAN inversion network.
Our method supports style transfer with simple fine-tuning when the encoder is paired with a pretrained StyleGAN generator.
arXiv Detail & Related papers (2023-08-31T17:59:33Z)
- Autoencoding Video Latents for Adversarial Video Generation [0.0]
AVLAE is a two-stream latent autoencoder in which the video distribution is learned by adversarial training.
We demonstrate that our approach learns to disentangle motion and appearance codes even without the explicit structural composition in the generator.
arXiv Detail & Related papers (2022-01-18T11:42:14Z)
- CCVS: Context-aware Controllable Video Synthesis [95.22008742695772]
The paper introduces a self-supervised learning approach to the synthesis of new video clips from old ones.
It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control.
arXiv Detail & Related papers (2021-07-16T17:57:44Z)
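As referenced in the Self Forcing entry above, the train-test gap arises because teacher-forced training conditions on ground-truth frames while inference conditions on the model's own outputs. The sketch below is a generic illustration of closing that gap by rolling the model out on its own predictions during training; the model interface and the MSE objective are placeholders, not the paper's actual architecture or loss.

```python
import torch
import torch.nn.functional as F

def self_forcing_loss(model, first_frame, gt_frames):
    """Generic illustration of training on a model's own rollout (not the paper's code).

    first_frame: (B, C, H, W) initial context frame
    gt_frames:   (B, T, C, H, W) ground-truth continuation, used only as a target
    model:       assumed to map a (B, t, C, H, W) context to the next frame (B, C, H, W)
    """
    context = [first_frame]
    loss = 0.0
    num_steps = gt_frames.shape[1]
    for t in range(num_steps):
        # Condition on the model's own predictions rather than ground-truth frames,
        # so the training-time context matches what the model sees at test time.
        pred = model(torch.stack(context, dim=1))
        loss = loss + F.mse_loss(pred, gt_frames[:, t])
        context.append(pred)  # feed the prediction back as context (self forcing)
    return loss / num_steps
```

Under plain teacher forcing, `context.append(pred)` would instead append `gt_frames[:, t]`, which is exactly the mismatch the blurb describes.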