SeeClear: Semantic Distillation Enhances Pixel Condensation for Video Super-Resolution
- URL: http://arxiv.org/abs/2410.05799v4
- Date: Sat, 26 Oct 2024 06:11:30 GMT
- Title: SeeClear: Semantic Distillation Enhances Pixel Condensation for Video Super-Resolution
- Authors: Qi Tang, Yao Zhao, Meiqin Liu, Chao Yao
- Abstract summary: Diffusion-based Video Super-Resolution (VSR) is renowned for generating perceptually realistic videos.
We introduce SeeClear--a novel VSR framework leveraging conditional video generation, orchestrated by instance-centric and channel-wise semantic controls.
Our framework integrates a Semantic Distiller and a Pixel Condenser, which synergize to extract and upscale semantic details from low-resolution frames.
- Score: 35.894647722880805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion-based Video Super-Resolution (VSR) is renowned for generating perceptually realistic videos, yet it grapples with maintaining detail consistency across frames due to stochastic fluctuations. The traditional approach of pixel-level alignment is ineffective for diffusion-processed frames because of iterative disruptions. To overcome this, we introduce SeeClear--a novel VSR framework leveraging conditional video generation, orchestrated by instance-centric and channel-wise semantic controls. This framework integrates a Semantic Distiller and a Pixel Condenser, which synergize to extract and upscale semantic details from low-resolution frames. The Instance-Centric Alignment Module (InCAM) utilizes video-clip-wise tokens to dynamically relate pixels within and across frames, enhancing coherency. Additionally, the Channel-wise Texture Aggregation Memory (CaTeGory) infuses extrinsic knowledge, capitalizing on long-standing semantic textures. Our method also innovates the blurring diffusion process with the ResShift mechanism, finely balancing between sharpness and diffusion effects. Comprehensive experiments confirm our framework's advantage over state-of-the-art diffusion-based VSR techniques. The code is available: https://github.com/Tang1705/SeeClear-NeurIPS24.
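For intuition only, the following is a minimal PyTorch sketch of the general idea behind instance-centric, clip-wise token conditioning: a small set of shared tokens summarizes the whole clip, and every pixel then attends to those tokens, coupling frames without explicit flow-based alignment. The class name, token count, and tensor shapes are hypothetical simplifications, not the authors' InCAM; the official implementation is in the linked repository.

```python
# Hypothetical sketch of clip-wise token conditioning for cross-frame alignment.
# Not the SeeClear/InCAM implementation; see the official repository for that.
import torch
import torch.nn as nn


class ClipTokenCrossFrameAttention(nn.Module):
    """Relate pixels within and across frames via a shared set of semantic tokens."""

    def __init__(self, dim: int, num_tokens: int = 8, num_heads: int = 4):
        super().__init__()
        # Learnable clip-wise tokens shared by all frames of a clip (assumed design).
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.to_token = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.to_pixel = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W) low-resolution feature maps.
        b, t, c, h, w = frames.shape
        pixels = frames.permute(0, 1, 3, 4, 2).reshape(b, t * h * w, c)

        # 1) Summarize the whole clip into a few semantic tokens.
        tokens = self.tokens.unsqueeze(0).expand(b, -1, -1)
        tokens, _ = self.to_token(tokens, pixels, pixels)

        # 2) Let every pixel attend to the shared tokens, which couples pixels
        #    across frames without explicit optical-flow alignment.
        out, _ = self.to_pixel(pixels, tokens, tokens)
        out = self.norm(out + pixels)
        return out.reshape(b, t, h, w, c).permute(0, 1, 4, 2, 3)


if __name__ == "__main__":
    module = ClipTokenCrossFrameAttention(dim=64)
    feats = torch.randn(2, 5, 64, 16, 16)  # 2 clips, 5 frames each
    print(module(feats).shape)  # torch.Size([2, 5, 64, 16, 16])
```

Routing the cross-frame interaction through a small shared token set keeps the computation far cheaper than dense pixel-to-pixel attention across frames, which is one plausible reading of why clip-wise tokens are attractive for diffusion-based VSR.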
Related papers
- OutDreamer: Video Outpainting with a Diffusion Transformer [37.512451098188635]
We introduce OutDreamer, a DiT-based video outpainting framework.
We propose a mask-driven self-attention layer that dynamically integrates the given mask information.
For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content.
arXiv Detail & Related papers (2025-06-27T15:08:54Z)
- FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors [64.54220123913154]
We introduce FramePainter as an efficient instantiation of image-to-video generation problem.
It only uses a lightweight sparse control encoder to inject editing signals.
It dominantly outperforms previous state-of-the-art methods with far less training data.
arXiv Detail & Related papers (2025-01-14T16:09:16Z)
- CoordFlow: Coordinate Flow for Pixel-wise Neural Video Representation [11.364753833652182]
Implicit Neural Representation (INR) is a promising alternative to traditional transform-based methodologies.
We introduce CoordFlow, a novel pixel-wise INR for video compression.
It yields state-of-the-art results compared to other pixel-wise INRs and on-par performance compared to leading frame-wise techniques.
arXiv Detail & Related papers (2025-01-01T22:58:06Z)
- DaBiT: Depth and Blur informed Transformer for Joint Refocusing and Super-Resolution [4.332534893042983]
In many real-world scenarios, recorded videos suffer from accidental focus blur.
This paper introduces a framework optimised for focal deblurring (refocusing) and video super-resolution (VSR).
We achieve state-of-the-art results with an average PSNR performance over 1.9dB greater than comparable existing video restoration methods.
arXiv Detail & Related papers (2024-07-01T12:22:16Z)
- Perception-Oriented Video Frame Interpolation via Asymmetric Blending [20.0024308216849]
Previous methods for Video Frame Interpolation (VFI) have encountered challenges, notably the manifestation of blur and ghosting effects.
We propose PerVFI (Perception-oriented Video Frame Interpolation) to mitigate these challenges.
Experimental results validate the superiority of PerVFI, demonstrating significant improvements in perceptual quality compared to existing methods.
arXiv Detail & Related papers (2024-04-10T02:40:17Z)
- Learning Spatiotemporal Inconsistency via Thumbnail Layout for Face Deepfake Detection [41.35861722481721]
Deepfake threats to society and cybersecurity have provoked significant public apprehension.
This paper introduces an elegantly simple yet effective strategy named Thumbnail Layout (TALL)
TALL transforms a video clip into a pre-defined layout, preserving spatial and temporal dependencies.
arXiv Detail & Related papers (2024-03-15T12:48:44Z)
- VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
arXiv Detail & Related papers (2023-12-17T09:05:56Z)
- Reuse and Diffuse: Iterative Denoising for Text-to-Video Generation [92.55296042611886]
We propose a framework called "Reuse and Diffuse", dubbed VidRD, to produce more frames following the frames already generated by an LDM.
We also propose a set of strategies for composing video-text data that involve diverse content from multiple existing datasets.
arXiv Detail & Related papers (2023-09-07T08:12:58Z)
- RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images.
Existing methods invert video frames individually, often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID).
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- FFNeRV: Flow-Guided Frame-Wise Neural Representations for Videos [5.958701846880935]
We propose FFNeRV, a novel method for incorporating flow information into frame-wise representations to exploit the temporal redundancy across the frames in videos.
With model compression techniques, FFNeRV outperforms widely-used standard video codecs (H.264 and HEVC) and performs on par with state-of-the-art video compression algorithms.
arXiv Detail & Related papers (2022-12-23T12:51:42Z)
- Memory-Augmented Non-Local Attention for Video Super-Resolution [61.55700315062226]
We propose a novel video super-resolution method that aims at generating high-fidelity high-resolution (HR) videos from low-resolution (LR) ones.
Previous methods predominantly leverage temporal neighbor frames to assist the super-resolution of the current frame.
In contrast, we devise a cross-frame non-local attention mechanism that allows video super-resolution without frame alignment.
arXiv Detail & Related papers (2021-08-25T05:12:14Z)