Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration
- URL: http://arxiv.org/abs/2508.14483v3
- Date: Fri, 26 Sep 2025 08:11:21 GMT
- Title: Vivid-VR: Distilling Concepts from Text-to-Video Diffusion Transformer for Photorealistic Video Restoration
- Authors: Haoran Bai, Xiaoxu Chen, Canqian Yang, Zongyao He, Sibin Deng, Ying Chen
- Abstract summary: We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model. We show that Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks.
- Score: 13.36145927735113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model, where ControlNet is leveraged to control the generation process, ensuring content consistency. However, conventional fine-tuning of such controllable pipelines frequently suffers from distribution drift due to imperfect multimodal alignment, resulting in compromised texture realism and temporal coherence. To tackle this challenge, we propose a concept distillation training strategy that utilizes the pretrained T2V model to synthesize training samples with embedded textual concepts, thereby distilling its conceptual understanding to preserve texture and temporal quality. To enhance generation controllability, we redesign the control architecture with two key components: 1) a control feature projector that filters degradation artifacts from input video latents to minimize their propagation through the generation pipeline, and 2) a new ControlNet connector employing a dual-branch design. This connector synergistically combines MLP-based feature mapping with a cross-attention mechanism for dynamic control-feature retrieval, enabling both content preservation and adaptive control signal modulation. Extensive experiments show that Vivid-VR performs favorably against existing approaches on both synthetic and real-world benchmarks, as well as AIGC videos, achieving impressive texture realism, visual vividness, and temporal consistency. The code and checkpoints are publicly available at https://github.com/csbhr/Vivid-VR.
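To make the redesigned control path more concrete, below is a minimal PyTorch sketch (not the authors' released code) of the two components named in the abstract: a control feature projector that maps degraded input-video latents into a cleaner control-feature space, and a dual-branch connector that combines MLP-based feature mapping with cross-attention for dynamic control-feature retrieval. All module names, dimensions, the gating scheme, and the injection point are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch of a control feature projector + dual-branch ControlNet connector.
# Assumptions: flattened spatio-temporal tokens, a frozen DiT backbone providing
# hidden_states, and additive gated injection of the control signal.
import torch
import torch.nn as nn


class ControlFeatureProjector(nn.Module):
    """Hypothetical projector that maps degraded video latents into a control-feature
    space before they enter the ControlNet branch (to limit artifact propagation)."""

    def __init__(self, latent_dim: int, control_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(latent_dim, control_dim),
            nn.GELU(),
            nn.Linear(control_dim, control_dim),
        )

    def forward(self, video_latents: torch.Tensor) -> torch.Tensor:
        # video_latents: (batch, tokens, latent_dim) flattened spatio-temporal tokens
        return self.proj(video_latents)


class DualBranchConnector(nn.Module):
    """Hypothetical dual-branch connector: an MLP branch maps control features
    directly (content preservation), while a cross-attention branch lets the DiT
    hidden states retrieve control features adaptively (dynamic modulation)."""

    def __init__(self, control_dim: int, hidden_dim: int, num_heads: int = 8):
        super().__init__()
        self.mlp_branch = nn.Sequential(
            nn.Linear(control_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, hidden_dim),
        )
        self.to_kv = nn.Linear(control_dim, hidden_dim)
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=hidden_dim, num_heads=num_heads, batch_first=True
        )
        # Zero-initialized gate so the control path starts as an identity mapping.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states: torch.Tensor, control_feats: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, tokens, hidden_dim) from the frozen DiT backbone
        # control_feats: (batch, tokens, control_dim) from the projector above
        mapped = self.mlp_branch(control_feats)                 # direct feature mapping
        kv = self.to_kv(control_feats)
        retrieved, _ = self.cross_attn(hidden_states, kv, kv)   # dynamic retrieval
        return hidden_states + self.gate * (mapped + retrieved)


if __name__ == "__main__":
    # Toy shapes only; real latent/hidden sizes depend on the T2V backbone used.
    projector = ControlFeatureProjector(latent_dim=16, control_dim=1024)
    connector = DualBranchConnector(control_dim=1024, hidden_dim=1024)
    control = projector(torch.randn(1, 256, 16))
    out = connector(torch.randn(1, 256, 1024), control)
    print(out.shape)  # torch.Size([1, 256, 1024])
```

The zero-initialized gate is a common ControlNet-style choice so that, at the start of fine-tuning, the control branch does not perturb the pretrained generator; whether Vivid-VR uses this exact injection scheme is not stated in the abstract.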
Related papers
- UCM: Unifying Camera Control and Memory with Time-aware Positional Encoding Warping for World Models [54.564740558030245]
We present UCM, a novel framework that unifies long-term memory and precise camera control via a time-aware positional encoding warping mechanism. We also introduce a scalable data curation strategy utilizing point-cloud-based rendering to simulate scene revisiting.
arXiv Detail & Related papers (2026-02-26T12:54:46Z) - CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion [62.04833878126661]
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. We propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic). Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
arXiv Detail & Related papers (2025-11-26T07:27:11Z) - BridgeIV: Bridging Customized Image and Video Generation through Test-Time Autoregressive Identity Propagation [47.21414443162965]
We propose an autoregressive structure and texture propagation module (STPM) for customized text-to-video (CT2V) generation. STPM extracts key structural and texture features from the reference subject and injects them autoregressively into each video frame to enhance consistency. We also introduce a test-time reward optimization (TTRO) method to further refine fine-grained details.
arXiv Detail & Related papers (2025-05-11T14:11:12Z) - DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z) - Enhancing Low-Cost Video Editing with Lightweight Adaptors and Temporal-Aware Inversion [28.38307253613529]
We propose a framework that integrates temporal-spatial and semantic consistency with Bilateral DDIM inversion. Our method significantly improves perceptual quality, text-image alignment, and temporal coherence, as demonstrated on the MSR-VTT dataset.
arXiv Detail & Related papers (2025-01-08T16:41:31Z) - Identity-Preserving Text-to-Video Generation by Frequency Decomposition [52.19475797580653]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video.
arXiv Detail & Related papers (2024-11-26T13:58:24Z) - Low-Light Video Enhancement via Spatial-Temporal Consistent Decomposition [52.89441679581216]
Low-Light Video Enhancement (LLVE) seeks to restore dynamic or static scenes plagued by severe invisibility and noise. We present an innovative video decomposition strategy that incorporates view-independent and view-dependent components. Our framework consistently outperforms existing methods, establishing a new SOTA performance.
arXiv Detail & Related papers (2024-05-24T15:56:40Z) - Control-A-Video: Controllable Text-to-Video Diffusion Models with Motion Prior and Reward Feedback Learning [50.60891619269651]
Control-A-Video is a controllable T2V diffusion model that can generate videos conditioned on text prompts and reference control maps like edge and depth maps.
We propose novel strategies to incorporate content prior and motion prior into the diffusion-based generation process.
Our framework generates higher-quality, more consistent videos compared to existing state-of-the-art methods in controllable text-to-video generation.
arXiv Detail & Related papers (2023-05-23T09:03:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.