Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising
- URL: http://arxiv.org/abs/2511.14719v1
- Date: Tue, 18 Nov 2025 18:06:29 GMT
- Title: Zero-shot Synthetic Video Realism Enhancement via Structure-aware Denoising
- Authors: Yifan Wang, Liya Ji, Zhanghan Ke, Harry Yang, Ser-Nam Lim, Qifeng Chen
- Abstract summary: We propose an approach to enhancing synthetic video realism that can re-render synthetic videos from a simulator in a photorealistic fashion. Our framework focuses on preserving the multi-level structures of the synthetic video in the enhanced one, in both the spatial and temporal domains.
- Score: 83.09163795450407
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose an approach to enhancing synthetic video realism that can re-render synthetic videos from a simulator in a photorealistic fashion. Our realism enhancement approach is a zero-shot framework, built upon a video diffusion foundation model without further fine-tuning, that focuses on preserving the multi-level structures of the synthetic video in the enhanced one, in both the spatial and temporal domains. Specifically, we incorporate an effective modification that conditions the generation/denoising process on structure-aware information estimated from the synthetic video by auxiliary models, such as depth maps, semantic maps, and edge maps, rather than extracting this information from the simulator. This guidance ensures that the enhanced videos are consistent with the original synthetic video at both the structural and semantic levels. Our approach is simple yet general and powerful: in our experiments, it outperforms existing baselines in structural consistency with the original video while maintaining state-of-the-art photorealism quality.
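To make the conditioning idea concrete, below is a minimal PyTorch-style sketch of structure-conditioned denoising. It is not the authors' code: `denoiser`, `depth_estimator`, `seg_estimator`, `edge_estimator`, and `scheduler` are hypothetical stand-ins, and the sketch denoises directly in pixel space for brevity, whereas the paper builds on a video diffusion foundation model (which typically operates in a latent space).

```python
# Minimal sketch of structure-aware conditioned denoising (not the authors' code).
# All modules below are hypothetical stand-ins for pretrained networks.
import torch

@torch.no_grad()
def enhance_video(synthetic_video, denoiser, depth_estimator, seg_estimator,
                  edge_estimator, scheduler, num_steps=50):
    """synthetic_video: (B, C, T, H, W) tensor of rendered frames in [0, 1]."""
    # 1. Estimate structure-aware guidance from the synthetic video itself,
    #    rather than reading auxiliary buffers out of the simulator.
    depth = depth_estimator(synthetic_video)      # (B, 1, T, H, W)
    semantics = seg_estimator(synthetic_video)    # (B, K, T, H, W)
    edges = edge_estimator(synthetic_video)       # (B, 1, T, H, W)
    structure = torch.cat([depth, semantics, edges], dim=1)

    # 2. Start from noise and denoise, conditioning every step on the structure
    #    maps so spatial layout and temporal dynamics are preserved.
    x = torch.randn_like(synthetic_video)
    for t in scheduler.timesteps(num_steps):      # assumed scheduler interface
        noise_pred = denoiser(x, t, cond=structure)
        x = scheduler.step(noise_pred, t, x)      # e.g. a DDIM-style update
    return x.clamp(0, 1)
```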
Related papers
- SkyReels-V3 Technique Report [23.793804843416606]
Video robustness serves as a cornerstone for building world models, where contextual inference stands as the defining test of capability. We present SkyReels-V3, a conditional video generation model built upon a unified multimodal in-context learning framework with diffusion Transformers. The SkyReels-V3 model supports three core generative paradigms within a single architecture: images-to-video synthesis, video-to-video extension, and audio-guided video generation.
arXiv Detail & Related papers (2026-01-24T06:08:12Z) - Lumen: Consistent Video Relighting and Harmonious Background Replacement with Video Generative Models [18.008901495139717]
We propose Lumen, an end-to-end video relighting framework developed on large-scale video generative models. For the synthetic domain, we leverage an advanced 3D rendering engine to curate video pairs in diverse environments. For the realistic domain, we adapt an HDR-based lighting simulation to compensate for the lack of paired in-the-wild videos.
arXiv Detail & Related papers (2025-08-18T14:21:22Z) - Compositional Video Synthesis by Temporal Object-Centric Learning [3.2228025627337864]
We present a novel framework for compositional video synthesis that leverages temporally consistent object-centric representations. Our approach explicitly captures temporal dynamics by learning pose-invariant object-centric slots and conditioning them on pretrained diffusion models. This design enables high-quality, pixel-level video synthesis with superior temporal coherence.
arXiv Detail & Related papers (2025-07-28T14:11:04Z) - Lumiere: A Space-Time Diffusion Model for Video Generation [75.54967294846686]
We introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once.
This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution.
By deploying both spatial and (importantly) temporal down- and up-sampling, our model learns to directly generate a full-frame-rate, low-resolution video.
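To illustrate the joint space-time down- and up-sampling, here is a rough PyTorch sketch; it is not Lumiere's released code, and the channel sizes, kernel choices, and activations are illustrative assumptions about the kind of blocks a Space-Time U-Net could stack.

```python
# Rough sketch of space-time down/up-sampling blocks (not Lumiere's code).
import torch
import torch.nn as nn

class SpaceTimeDown(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Stride 2 in time *and* space halves T, H, W together.
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)

    def forward(self, x):                      # x: (B, C, T, H, W)
        return torch.relu(self.conv(x))

class SpaceTimeUp(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Transposed conv doubles T, H, W back to the input resolution.
        self.conv = nn.ConvTranspose3d(in_ch, out_ch, kernel_size=4, stride=2, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x))

# Example: a 16-frame 64x64 clip is reduced to 8x32x32 and restored in one pass.
x = torch.randn(1, 8, 16, 64, 64)
y = SpaceTimeUp(32, 8)(SpaceTimeDown(8, 32)(x))
print(y.shape)  # torch.Size([1, 8, 16, 64, 64])
```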
arXiv Detail & Related papers (2024-01-23T18:05:25Z) - Diffusion Priors for Dynamic View Synthesis from Monocular Videos [59.42406064983643]
Dynamic novel view synthesis aims to capture the temporal evolution of visual content within videos.
We first finetune a pretrained RGB-D diffusion model on the video frames using a customization technique.
We distill the knowledge from the finetuned model to a 4D representation encompassing both dynamic and static Neural Radiance Fields.
arXiv Detail & Related papers (2024-01-10T23:26:41Z) - Make-Your-Video: Customized Video Generation Using Textual and Structural Guidance [36.26032505627126]
Recent advancements in text-to-video synthesis have unveiled the potential to create vivid videos from prompts alone.
In this paper, we explore customized video generation by utilizing text as context description and motion structure.
Our method, dubbed Make-Your-Video, involves joint-conditional video generation using a Latent Diffusion Model.
arXiv Detail & Related papers (2023-06-01T17:43:27Z) - Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures trained on Kinetics-400.
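As a loose illustration of this style of model inversion (not the LEAPS implementation; the frozen `model`, step count, learning rate, and smoothness weight are assumptions), one can optimize a random clip so that a frozen video recognizer favours a chosen class:

```python
# Toy feature/logit inversion for a frozen video model (not the LEAPS code).
import torch
import torch.nn.functional as F

def invert_video(model, target_class, shape=(1, 3, 16, 112, 112),
                 steps=200, lr=0.05):
    model.eval()
    video = torch.randn(shape, requires_grad=True)   # synthesized clip (B, C, T, H, W)
    opt = torch.optim.Adam([video], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(video)
        # Push the clip toward the target class of the frozen recognizer.
        loss = F.cross_entropy(logits, torch.tensor([target_class]))
        # Small penalty encouraging temporal smoothness between consecutive frames.
        loss = loss + 1e-4 * (video[:, :, 1:] - video[:, :, :-1]).abs().mean()
        loss.backward()
        opt.step()
    return video.detach()
```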
arXiv Detail & Related papers (2023-03-17T12:55:22Z) - Condensing a Sequence to One Informative Frame for Video Recognition [113.3056598548736]
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then exploits an off-the-shelf image recognition system on the synthetic frame.
A valid question is how to define "useful information" and then distill from a sequence down to one synthetic frame.
The proposed IFS consistently demonstrates clear improvements on image-based 2D networks and clip-based 3D networks.
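A simplified sketch of the condensation step (not the paper's actual module; the per-frame scorer and softmax weighting are assumptions) could score each frame and take a weighted average that a standard 2D image network can then consume:

```python
# Toy frame condenser: one synthetic frame from a clip (not the paper's IFS module).
import torch
import torch.nn as nn

class FrameCondenser(nn.Module):
    def __init__(self, channels=3):
        super().__init__()
        # Tiny scorer: one scalar per frame from globally pooled pixel features.
        self.scorer = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(channels, 1))

    def forward(self, clip):                       # clip: (B, T, C, H, W)
        b, t, c, h, w = clip.shape
        scores = self.scorer(clip.reshape(b * t, c, h, w)).reshape(b, t)
        weights = scores.softmax(dim=1).reshape(b, t, 1, 1, 1)
        return (weights * clip).sum(dim=1)         # one (B, C, H, W) frame

clip = torch.randn(2, 16, 3, 112, 112)
frame = FrameCondenser()(clip)
print(frame.shape)  # torch.Size([2, 3, 112, 112]) -- ready for a 2D recognizer
```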
arXiv Detail & Related papers (2022-01-11T16:13:43Z) - Deep CG2Real: Synthetic-to-Real Translation via Image Disentanglement [78.58603635621591]
Training an unpaired synthetic-to-real translation network in image space is severely under-constrained.
We propose a semi-supervised approach that operates on the disentangled shading and albedo layers of the image.
Our two-stage pipeline first learns to predict accurate shading in a supervised fashion using physically-based renderings as targets.
arXiv Detail & Related papers (2020-03-27T21:45:41Z)
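To illustrate what operating on disentangled layers might look like (this is not the Deep CG2Real pipeline; `shading_net` and the multiplicative image model are assumptions), a rendering can be split into albedo and shading, the shading refined, and the image recomposed:

```python
# Illustrative intrinsic-layer refinement (not the Deep CG2Real code).
import torch

def refine_render(render, albedo, shading_net, eps=1e-4):
    """render, albedo: (B, 3, H, W) in [0, 1]; assumes render ~ albedo * shading."""
    shading = render / (albedo + eps)        # recover the shading layer
    shading = shading_net(shading)           # hypothetical net producing realistic shading
    return (albedo * shading).clamp(0, 1)    # recombine into the enhanced image
```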
This list is automatically generated from the titles and abstracts of the papers on this site.