Temporal Regularization Makes Your Video Generator Stronger
- URL: http://arxiv.org/abs/2503.15417v1
- Date: Wed, 19 Mar 2025 16:59:32 GMT
- Title: Temporal Regularization Makes Your Video Generator Stronger
- Authors: Harold Haodong Chen, Haojian Huang, Xianfeng Wu, Yexin Liu, Yajing Bai, Wen-Jie Shu, Harry Yang, Ser-Nam Lim
- Abstract summary: Temporal quality is a critical aspect of video generation, as it ensures consistent motion and realistic dynamics across frames. We explore temporal augmentation in video generation for the first time and introduce FluxFlow as an initial investigation. Experiments on the UCF-101 and VBench benchmarks demonstrate that FluxFlow significantly improves temporal coherence and diversity across various video generation models.
- Score: 34.33572297364156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal quality is a critical aspect of video generation, as it ensures consistent motion and realistic dynamics across frames. However, achieving high temporal coherence and diversity remains challenging. In this work, we explore temporal augmentation in video generation for the first time and, as an initial investigation, introduce FluxFlow, a strategy designed to enhance temporal quality. Operating at the data level, FluxFlow applies controlled temporal perturbations without requiring architectural modifications. Extensive experiments on the UCF-101 and VBench benchmarks demonstrate that FluxFlow significantly improves temporal coherence and diversity across various video generation models, including U-Net, DiT, and AR-based architectures, while preserving spatial fidelity. These findings highlight the potential of temporal augmentation as a simple yet effective approach to advancing video generation quality.
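To make the data-level idea concrete, below is a minimal sketch of one possible controlled temporal perturbation: frames are re-ordered within small local windows before a clip enters training. The function name, window size, and probability are illustrative assumptions, not details taken from the paper.

```python
import torch

def temporal_perturb(frames: torch.Tensor, window: int = 3, p: float = 0.5) -> torch.Tensor:
    """Data-level temporal perturbation sketch (not the official FluxFlow code).

    frames: one video clip of shape (T, C, H, W).
    With probability p, the frame order is shuffled inside non-overlapping
    windows of length `window`; spatial content is left untouched.
    """
    if torch.rand(()).item() > p:
        return frames
    t = frames.shape[0]
    order = torch.arange(t)
    for start in range(0, t - window + 1, window):
        order[start:start + window] = start + torch.randperm(window)
    return frames[order]
```

Used inside a dataloader, such a perturbed clip would replace the original with probability p, so the generator sees temporally disturbed sequences during training while its architecture and loss stay unchanged.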
Related papers
- All-in-One Video Restoration under Smoothly Evolving Unknown Weather Degradations [102.94052335735326]
All-in-one image restoration aims to recover clean images from diverse unknown degradations using a single model.
Existing approaches primarily focus on frame-wise degradation variation, overlooking the temporal continuity that naturally exists in real-world degradation processes.
We introduce the Smoothly Evolving Unknown Degradations (SEUD) scenario, where both the active degradation set and degradation intensity change continuously over time.
arXiv Detail & Related papers (2026-01-02T02:20:57Z) - STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows [35.05757953878183]
STARFlow-V is a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation.
Results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation.
arXiv Detail & Related papers (2025-11-25T16:27:58Z) - Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events [71.2439653098351]
Continuous space-time video super-resolution (C-STVSR) has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary temporal scales.
We present EvEnhancer, a novel approach that marries the unique properties of high temporal resolution and high dynamic range encapsulated in event streams.
Our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining generalizability at out-of-distribution (OOD) scales.
arXiv Detail & Related papers (2025-10-04T15:23:07Z) - WorldWeaver: Generating Long-Horizon Video Worlds via Rich Perception [40.96323549891244]
Current methods predominantly rely on RGB signals, leading to accumulated errors in object structure and motion over extended durations.
We introduce WorldWeaver, a robust framework for long video generation that jointly models RGB frames and perceptual conditions within a unified long-horizon modeling scheme.
Our training framework offers three key advantages. First, by jointly predicting perceptual conditions and color information from a unified representation, it significantly enhances temporal consistency and motion dynamics.
arXiv Detail & Related papers (2025-08-21T16:57:33Z) - STAGE: A Stream-Centric Generative World Model for Long-Horizon Driving-Scene Simulation [24.86836673853292]
STAGE is an auto-regressive framework that pioneers hierarchical feature coordination and multiphase optimization for sustainable video synthesis.
HTFT enhances temporal consistency between video frames throughout the video generation process.
We generated 600 frames of high-quality driving videos on the nuScenes dataset, far exceeding the maximum length achievable by existing methods.
arXiv Detail & Related papers (2025-06-16T06:53:05Z) - Temporal In-Context Fine-Tuning for Versatile Control of Video Diffusion Models [34.131515004434846]
We introduce Temporal In-Context Fine-Tuning (TIC-FT), an efficient approach for adapting pretrained video diffusion models to conditional generation tasks.
TIC-FT requires no architectural changes and achieves strong performance with as few as 10-30 training samples.
We validate our method across a range of tasks, including image-to-video and video-to-video generation, using large-scale base models such as CogVideoX-5B and Wan-14B.
arXiv Detail & Related papers (2025-06-01T12:57:43Z) - Generative Pre-trained Autoregressive Diffusion Transformer [54.476056835275415]
GPDiT is a Generative Pre-trained Autoregressive Diffusion Transformer that unifies the strengths of diffusion and autoregressive modeling for long-range video synthesis.
It autoregressively predicts future latent frames using a diffusion loss, enabling natural modeling of motion dynamics.
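As a rough illustration of "autoregressively predicts future latent frames using a diffusion loss", the sketch below trains one next-frame prediction step conditioned on past latents. The `model` and `scheduler` interfaces are assumptions made for illustration, not GPDiT's actual API.

```python
import torch
import torch.nn.functional as F

def ar_diffusion_step(model, past_latents, next_latent, scheduler):
    """One next-frame training step in the spirit of the summary: a diffusion
    loss on the future latent frame, conditioned on previously available
    latents. `model` and `scheduler` are assumed interfaces, not GPDiT's API."""
    batch = next_latent.shape[0]
    t = torch.randint(0, scheduler.num_steps, (batch,))        # random timesteps
    noise = torch.randn_like(next_latent)
    noisy = scheduler.add_noise(next_latent, noise, t)          # assumed helper
    pred = model(noisy, timestep=t, context=past_latents)       # assumed signature
    return F.mse_loss(pred, noise)                              # epsilon-prediction loss
```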
arXiv Detail & Related papers (2025-05-12T08:32:39Z) - Raccoon: Multi-stage Diffusion Training with Coarse-to-Fine Curating Videos [15.781862060265519]
CFC-VIDS-1M is a high-quality video dataset constructed through a systematic coarse-to-fine curation pipeline.
We develop RACCOON, a transformer-based architecture with decoupled spatial-temporal attention mechanisms.
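A minimal sketch of what a decoupled spatial-temporal attention block can look like is shown below, assuming tokenized frames of shape (B, T, N, D); this is a generic factorized design, not RACCOON's implementation.

```python
import torch
import torch.nn as nn

class DecoupledSTBlock(nn.Module):
    """Generic factorized (decoupled) spatial-temporal attention block
    (illustrative; not RACCOON's implementation). Spatial attention runs
    within each frame, then temporal attention runs across frames."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) - N spatial tokens per frame, D channels.
        b, t, n, d = x.shape
        s = x.reshape(b * t, n, d)                        # attend over space
        s = s + self.spatial(s, s, s)[0]
        tt = s.reshape(b, t, n, d).permute(0, 2, 1, 3).reshape(b * n, t, d)
        tt = tt + self.temporal(tt, tt, tt)[0]            # attend over time
        return tt.reshape(b, n, t, d).permute(0, 2, 1, 3)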
arXiv Detail & Related papers (2025-02-28T18:56:35Z) - BF-STVSR: B-Splines and Fourier-Best Friends for High Fidelity Spatial-Temporal Video Super-Resolution [14.082598088990352]
We propose a C-STVSR framework with two key modules tailored to better represent spatial and temporal characteristics of video.
Our approach achieves state-of-the-art PSNR and SSIM performance, showing enhanced spatial details and natural temporal consistency.
arXiv Detail & Related papers (2025-01-19T13:29:41Z) - DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations [25.756755602342942]
We present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes the learning burden through staged training.
Our framework additionally incorporates an Interweaved Latent Transition (ILT) technique that maintains competitive temporal consistency without additional training overhead.
arXiv Detail & Related papers (2025-01-17T10:53:03Z) - ByTheWay: Boost Your Text-to-Video Generation Model to Higher Quality in a Training-free Way [72.1984861448374]
ByTheWay is a training-free method to improve the quality of text-to-video generation without introducing additional parameters, augmenting memory, or increasing sampling time.
It improves the structural plausibility and temporal consistency of generated videos by reducing the disparity between the temporal attention maps across various decoder blocks.
It also enhances the magnitude and richness of motion by amplifying the energy of the temporal attention map.
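The two operations described above can be sketched roughly as follows, assuming access to temporal attention maps from an earlier and a later decoder block; the blending and gain factors are illustrative, not values from the ByTheWay paper.

```python
import torch

def guide_and_energize(attn_late: torch.Tensor,
                       attn_early: torch.Tensor,
                       blend: float = 0.3,
                       gain: float = 1.5) -> torch.Tensor:
    """Toy sketch of the two operations named in the summary
    (not the official ByTheWay code).

    attn_late / attn_early: temporal attention maps of shape (..., T, T)
    taken from a later and an earlier decoder block.
    1) Reduce inter-block disparity by blending toward the earlier map.
    2) Enhance motion by amplifying the map's energy around its mean.
    """
    guided = (1.0 - blend) * attn_late + blend * attn_early
    mean = guided.mean(dim=-1, keepdim=True)
    energized = mean + gain * (guided - mean)
    # Keep each row a valid attention distribution after amplification.
    energized = energized.clamp_min(0)
    return energized / energized.sum(dim=-1, keepdim=True).clamp_min(1e-8)
```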
arXiv Detail & Related papers (2024-10-08T17:56:33Z) - Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution [19.748048455806305]
We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach.
We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality.
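The "inflated architecture" likely builds on the common 2D-to-3D inflation trick; the sketch below shows that generic trick as an assumption about the technique, not the paper's exact scheme.

```python
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    """Generic 2D-to-3D inflation sketch (assumes integer padding and groups=1;
    not necessarily the scheme used by the paper). The pretrained 2D kernel is
    copied into the temporal center of a freshly zero-initialized 3D kernel."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        conv3d.weight.zero_()
        conv3d.weight[:, :, time_kernel // 2] = conv2d.weight
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d
```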
arXiv Detail & Related papers (2024-01-18T22:25:16Z) - E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning [53.63364311738552]
Bio-inspired event cameras, or dynamic vision sensors, capture per-pixel brightness changes (called event streams) with high temporal resolution and high dynamic range.
This calls for events-to-video (E2V) solutions that take event streams as input and generate high-quality video frames for intuitive visualization.
We propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events.
arXiv Detail & Related papers (2024-01-16T05:10:50Z) - Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence by locally integrating temporal layers into the U-Net and VAE decoder, maintaining consistency within short sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
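The locally inserted temporal layers can be pictured with a generic sketch like the one below, which attends only along the time axis of a video latent; it is illustrative and not Upscale-A-Video's actual module.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Illustrative temporal layer of the kind the summary mentions
    inserting into the U-Net / VAE decoder (not Upscale-A-Video's module).
    Attention runs only along the time axis, independently per spatial site."""

    def __init__(self, channels: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, T, H, W) -> treat each (h, w) position as a length-T sequence.
        b, c, t, h, w = x.shape
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        attended, _ = self.attn(self.norm(seq), self.norm(seq), self.norm(seq))
        out = seq + attended  # residual connection
        return out.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
```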
arXiv Detail & Related papers (2023-12-11T18:54:52Z) - Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation [49.298187741014345]
Current methods intertwine spatial content and temporal dynamics, increasing the complexity of text-to-video (T2V) generation.
We propose HiGen, a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives.
arXiv Detail & Related papers (2023-12-07T17:59:07Z) - StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation [73.54398908446906]
We introduce a novel motion generator design that uses a learning-based GAN inversion network.
Our method supports style transfer with simple fine-tuning when the encoder is paired with a pretrained StyleGAN generator.
arXiv Detail & Related papers (2023-08-31T17:59:33Z) - Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces low-confidence outputs for randomly shuffled frames.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
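A minimal sketch of the two auxiliary signals described above is given below, assuming a video classifier with an extra order-prediction head; the uniform-target KL term is one common way to enforce low-confidence outputs and may differ from the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def temporal_selfsup_losses(model, clip: torch.Tensor):
    """Sketch of the auxiliary objectives described in the summary
    (model.order_head and model.cls_head are hypothetical heads).

    clip: (B, T, C, H, W) batch of video clips.
    - order loss: classify whether a clip is in its natural temporal order.
    - confidence loss: shuffled clips should produce near-uniform
      (low-confidence) class predictions.
    """
    b, t = clip.shape[:2]
    shuffled = clip[:, torch.randperm(t)]

    # Binary "natural order vs. shuffled" prediction.
    logits_order = torch.cat([model.order_head(clip), model.order_head(shuffled)])
    labels = torch.cat([torch.ones(b), torch.zeros(b)]).long()
    order_loss = F.cross_entropy(logits_order, labels)

    # Push class predictions for shuffled clips toward the uniform distribution.
    logits_cls = model.cls_head(shuffled)
    uniform = torch.full_like(logits_cls, 1.0 / logits_cls.shape[-1])
    conf_loss = F.kl_div(F.log_softmax(logits_cls, dim=-1), uniform,
                         reduction="batchmean")
    return order_loss, conf_loss
```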
arXiv Detail & Related papers (2022-07-19T04:44:08Z) - Exploring Temporal Coherence for More General Video Face Forgery Detection [22.003901822221227]
We propose a novel end-to-end framework, which consists of two major stages.
The first stage is a fully temporal convolution network (FTCN), which reduces the spatial convolution kernel size to 1 while keeping the temporal convolution kernel size unchanged.
The second stage is a Temporal Transformer network, which aims to explore the long-term temporal coherence.
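For intuition, a "fully temporal" convolution can be sketched as a 3D convolution whose spatial kernel is collapsed to 1x1, so only the time axis is convolved; the layer below is illustrative, not the authors' code.

```python
import torch.nn as nn

# Sketch of a "fully temporal" convolution in the spirit of FTCN: the spatial
# kernel is collapsed to 1x1 so only the time dimension is convolved
# (illustrative layer, not the authors' implementation).
temporal_conv = nn.Conv3d(
    in_channels=64,          # illustrative channel count
    out_channels=64,
    kernel_size=(3, 1, 1),   # (T, H, W): temporal kernel kept, spatial kernel = 1
    padding=(1, 0, 0),
)
```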
arXiv Detail & Related papers (2021-08-15T08:45:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.