Sequence-Adaptive Video Prediction in Continuous Streams using Diffusion Noise Optimization
- URL: http://arxiv.org/abs/2511.18255v2
- Date: Wed, 26 Nov 2025 16:28:59 GMT
- Title: Sequence-Adaptive Video Prediction in Continuous Streams using Diffusion Noise Optimization
- Authors: Sina Mokhtarzadeh Azar, Emad Bahrami, Enrico Pallotta, Gianpiero Francesca, Radu Timofte, Juergen Gall
- Abstract summary: We propose an approach that continuously adapts a pre-trained diffusion model to a video stream. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube.
- Score: 63.37868191173104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we investigate diffusion-based video prediction models, which forecast future video frames, for continuous video streams. In this context, the models observe continuously new training samples, and we aim to leverage this to improve their predictions. We thus propose an approach that continuously adapts a pre-trained diffusion model to a video stream. Since fine-tuning the parameters of a large diffusion model is too expensive, we refine the diffusion noise during inference while keeping the model parameters frozen, allowing the model to adaptively determine suitable sampling noise. We term the approach Sequence Adaptive Video Prediction with Diffusion Noise Optimization (SAVi-DNO). To validate our approach, we introduce a new evaluation setting on the Ego4D dataset, focusing on simultaneous adaptation and evaluation on long continuous videos. Empirical results demonstrate improved performance based on FVD, SSIM, and PSNR metrics on long videos of Ego4D and OpenDV-YouTube, as well as videos of UCF-101 and SkyTimelapse, showcasing SAVi-DNO's effectiveness.
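The core idea of the abstract — adapting the sampling noise by gradient descent while the pre-trained model stays frozen — can be illustrated with a toy sketch. Everything here is illustrative, not the authors' implementation: a small linear map stands in for the diffusion denoiser, and a single reconstruction loss stands in for the paper's adaptation objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# Frozen "denoiser": a fixed linear map from sampling noise to a predicted
# frame. In SAVi-DNO this would be a large pre-trained diffusion model; a
# random matrix keeps the sketch self-contained.
W = rng.normal(size=(16, 16))

def predict(z):
    """Map sampling noise z to a predicted frame (weights stay frozen)."""
    return W @ z

def loss(z, target):
    """Squared reconstruction error against a newly observed frame."""
    diff = predict(z) - target
    return float(diff @ diff)

# A newly observed frame from the stream that the noise is adapted towards.
target = rng.normal(size=16)

# Refine the noise by gradient descent; W is never updated.
z = rng.normal(size=16)
lr = 1e-3
initial = loss(z, target)
for _ in range(200):
    grad = 2.0 * W.T @ (predict(z) - target)  # analytic gradient of the loss
    z -= lr * grad

assert loss(z, target) < initial  # the adapted noise fits the stream better
```

The point of the sketch is only the division of labor: the expensive model parameters are untouched, and all adaptation happens in the (much smaller) noise variable.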
Related papers
- Diffusion-DRF: Differentiable Reward Flow for Video Diffusion Fine-Tuning [72.16213872139748]
Diffusion-DRF is a differentiable reward flow for fine-tuning video diffusion models. It backpropagates VLM feedback through the diffusion denoising chain. It improves video quality and semantic alignment while mitigating reward hacking and collapse.
arXiv Detail & Related papers (2026-01-07T18:05:08Z) - FlowMo: Variance-Based Flow Guidance for Coherent Motion in Video Generation [51.110607281391154]
FlowMo is a training-free guidance method for enhancing motion coherence in text-to-video models. It estimates motion coherence by measuring the patch-wise variance across the temporal dimension and guides the model to reduce this variance dynamically during sampling.
arXiv Detail & Related papers (2025-06-01T19:55:33Z) - AccVideo: Accelerating Video Diffusion Model with Synthetic Dataset [55.82208863521353]
We propose AccVideo to reduce the inference steps for accelerating video diffusion models with a synthetic dataset. Our model achieves 8.5x improvements in generation speed compared to the teacher model. Compared to previous acceleration methods, our approach is capable of generating videos with higher quality and resolution.
arXiv Detail & Related papers (2025-03-25T08:52:07Z) - Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate the video generation problem as non-quantized autoregressive modeling of temporal frame-by-frame prediction. With the proposed approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z) - Diffusion-based Unsupervised Audio-visual Speech Enhancement [26.937216751657697]
This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach. It combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model. Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised-generative AVSE method.
arXiv Detail & Related papers (2024-10-04T12:22:54Z) - Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models [96.97910688908956]
We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models.
We propose a framework tailored for VSS based on pre-trained image and video diffusion models.
Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches.
arXiv Detail & Related papers (2024-05-27T08:39:38Z) - Exploring Iterative Refinement with Diffusion Models for Video Grounding [17.435735275438923]
Video grounding aims to localize the target moment in an untrimmed video corresponding to a given sentence query.
We propose DiffusionVG, a novel framework with diffusion models that formulates video grounding as a conditional generation task.
arXiv Detail & Related papers (2023-10-26T07:04:44Z) - APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency [9.07931905323022]
We propose a novel text-to-video (T2V) generation network structure based on diffusion models.
Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks.
We leverage a hybrid architecture of transformers and convolutions to compensate for temporal intricacies, enhancing consistency between different frames within the video.
arXiv Detail & Related papers (2023-08-24T07:11:00Z) - Diffusion Probabilistic Modeling for Video Generation [17.48026395867434]
Denoising diffusion probabilistic models are a promising new class of generative models that are competitive with GANs on perceptual metrics.
Inspired by recent advances in neural video compression, we use denoising diffusion models to generate a residual baseline to a deterministic next-frame prediction.
We find significant improvements in terms of perceptual quality on all data and improvements in terms of frame forecasting for complex high-resolution videos.
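The residual formulation described in this summary — a diffusion model that generates a correction on top of a deterministic next-frame prediction — can be sketched minimally. Both stand-in functions are illustrative placeholders, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(0)

def deterministic_predictor(prev_frame: np.ndarray) -> np.ndarray:
    """Stand-in deterministic next-frame model; in practice a learned
    regressor, here simply a copy of the previous frame."""
    return prev_frame.copy()

def sample_residual(shape) -> np.ndarray:
    """Stand-in for the diffusion model's sampled residual; the real model
    denoises Gaussian noise conditioned on past frames."""
    return 0.1 * rng.normal(size=shape)

prev_frame = rng.normal(size=(8, 8))
base = deterministic_predictor(prev_frame)
next_frame = base + sample_residual(base.shape)  # residual parameterization
assert next_frame.shape == prev_frame.shape
```

The design choice, borrowed from neural video compression, is that the diffusion model only has to capture the smaller residual signal rather than the full frame.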
arXiv Detail & Related papers (2022-03-16T03:52:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.