Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events
- URL: http://arxiv.org/abs/2510.03833v1
- Date: Sat, 04 Oct 2025 15:23:07 GMT
- Title: Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events
- Authors: Shuoyan Wei, Feng Li, Shengeng Tang, Runmin Cong, Yao Zhao, Meng Wang, Huihui Bai
- Abstract summary: Continuous space-time video super-resolution (C-STVSR) has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary spatial and temporal scales. We present EvEnhancer, a novel approach that marries the unique properties of high temporal resolution and high dynamic range encapsulated in event streams. Our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining generalizability at OOD scales.
- Score: 71.2439653098351
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Continuous space-time video super-resolution (C-STVSR) has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary spatial and temporal scales. However, prevailing methods often generalize poorly, producing unsatisfactory results when applied to out-of-distribution (OOD) scales. To overcome this limitation, we present EvEnhancer, a novel approach that marries the unique properties of high temporal resolution and high dynamic range encapsulated in event streams to achieve robust and generalizable C-STVSR. Our approach incorporates event-adapted synthesis that capitalizes on the spatiotemporal correlations between frames and events to capture long-term motion trajectories, enabling adaptive interpolation and fusion across space and time. This is then coupled with a local implicit video transformer that integrates local implicit video neural function with cross-scale spatiotemporal attention to learn continuous video representations and generate plausible videos at arbitrary resolutions and frame rates. We further develop EvEnhancerPlus, which builds a controllable switching mechanism that dynamically determines the reconstruction difficulty for each spatiotemporal pixel based on local event statistics. This allows the model to adaptively route reconstruction along the most suitable pathways at a fine-grained pixel level, substantially reducing computational overhead while maintaining excellent performance. Furthermore, we devise a cross-derivative training strategy that stabilizes the convergence of such a multi-pathway framework through staged cross-optimization. Extensive experiments demonstrate that our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining superior generalizability at OOD scales. The code is available at https://github.com/W-Shuoyan/EvEnhancerPlus.
Related papers
- UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models [54.564740558030245]
We present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. We also introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting.
arXiv Detail & Related papers (2026-02-26T12:54:46Z)
- OSDEnhancer: Taming Real-World Space-Time Video Super-Resolution with One-Step Diffusion [64.10689934231165]
Diffusion models (DMs) have demonstrated exceptional success in video super-resolution (VSR). Space-time video super-resolution (STVSR) requires not only recovering realistic visual content from low to high resolution but also improving the frame rate with coherent dynamics. We propose OSDEnhancer, a framework that represents the first method to initialize real-world STVSR through an efficient one-step diffusion process. Experiments demonstrate that the proposed method achieves state-of-the-art performance while maintaining superior capability in real-world scenarios.
arXiv Detail & Related papers (2026-01-28T06:59:55Z)
- Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers [95.68243351895107]
We propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per
arXiv Detail & Related papers (2026-01-21T12:58:52Z)
- VDEGaussian: Video Diffusion Enhanced 4D Gaussian Splatting for Dynamic Urban Scenes Modeling [68.65587507038539]
We present a novel video diffusion-enhanced 4D Gaussian Splatting framework for dynamic urban scene modeling. Our key insight is to distill robust, temporally consistent priors from a test-time adapted video diffusion model. Our method significantly enhances dynamic modeling, especially for fast-moving objects, achieving an approximate PSNR gain of 2 dB.
arXiv Detail & Related papers (2025-08-04T07:24:05Z)
- Exploiting Temporal State Space Sharing for Video Semantic Segmentation [53.8810901249897]
Video semantic segmentation (VSS) plays a vital role in understanding the temporal evolution of scenes. Traditional methods often segment videos frame-by-frame or in a short temporal window, leading to limited temporal context, redundant computations, and heavy memory requirements. We introduce a Temporal Video State Space Sharing architecture to leverage Mamba state space models for temporal feature sharing. Our model features a selective gating mechanism that efficiently propagates relevant information across video frames, eliminating the need for a memory-heavy feature pool.
arXiv Detail & Related papers (2025-03-26T01:47:42Z)
- HR-INR: Continuous Space-Time Video Super-Resolution via Event Camera [23.121972339114322]
Continuous space-time video super-resolution (C-STVSR) aims to simultaneously enhance video resolution and frame rate at an arbitrary scale. Implicit neural representation (INR) has been applied to video restoration, representing videos as implicit fields that can be decoded at an arbitrary scale. We propose a novel C-STVSR framework, named HR-INR, which captures both holistic dependencies and regional motions based on INR.
arXiv Detail & Related papers (2024-05-22T06:51:32Z)
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
- Continuous Space-Time Video Super-Resolution Utilizing Long-Range Temporal Information [48.20843501171717]
We propose a continuous ST-VSR (CSTVSR) method that can convert the given video to any frame rate and spatial resolution.
We show that the proposed algorithm has good flexibility and achieves better performance on various datasets.
arXiv Detail & Related papers (2023-02-26T08:02:39Z)
- Enhancing Space-time Video Super-resolution via Spatial-temporal Feature Interaction [11.041058494002467]
The aim of space-time video super-resolution (STVSR) is to increase both the frame rate and the spatial resolution of a video. Recent approaches solve STVSR using end-to-end deep neural networks. We propose a spatial-temporal feature interaction network to enhance STVSR by exploiting both spatial and temporal correlations.
arXiv Detail & Related papers (2022-07-18T22:10:57Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.