ANYPORTAL: Zero-Shot Consistent Video Background Replacement
- URL: http://arxiv.org/abs/2509.07472v1
- Date: Tue, 09 Sep 2025 07:50:53 GMT
- Title: ANYPORTAL: Zero-Shot Consistent Video Background Replacement
- Authors: Wenshuo Gao, Xicheng Lan, Shuai Yang
- Abstract summary: ANYPORTAL is a zero-shot framework for video background replacement. It integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. It overcomes the challenges of achieving foreground consistency and temporally coherent relighting.
- Score: 8.690698677022992
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite the rapid advancements in video generation technology, creating high-quality videos that precisely align with user intentions remains a significant challenge. Existing methods often fail to achieve fine-grained control over video details, limiting their practical applicability. We introduce ANYPORTAL, a novel zero-shot framework for video background replacement that leverages pre-trained diffusion models. Our framework collaboratively integrates the temporal prior of video diffusion models with the relighting capabilities of image diffusion models in a zero-shot setting. To address the critical challenge of foreground consistency, we propose a Refinement Projection Algorithm, which enables pixel-level detail manipulation to ensure precise foreground preservation. ANYPORTAL is training-free and overcomes the challenges of achieving foreground consistency and temporally coherent relighting. Experimental results demonstrate that ANYPORTAL achieves high-quality results on consumer-grade GPUs, offering a practical and efficient solution for video content creation and editing.
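The abstract names a Refinement Projection Algorithm for pixel-level foreground preservation but does not spell it out. As a rough illustration of the idea, the hedged PyTorch sketch below pulls the generated frame back toward the source foreground through a matte; the function `refine_project` and its blending rule are hypothetical stand-ins, not the paper's actual algorithm.

```python
import torch

def refine_project(generated, source, fg_mask, strength=1.0):
    """Pixel-level foreground projection (illustrative sketch only).

    generated, source: (B, C, H, W) frames in [0, 1]
    fg_mask:           (B, 1, H, W) soft foreground matte in [0, 1]
    strength:          how strongly to pull the foreground back to the source
    """
    # Blend the generated frame toward the original foreground pixels,
    # leaving the newly generated background untouched.
    correction = (source - generated) * fg_mask * strength
    return (generated + correction).clamp(0.0, 1.0)

# Toy usage with random tensors standing in for decoded video frames.
gen, src = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()
refined = refine_project(gen, src, mask)
```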
Related papers
- VDOT: Efficient Unified Video Creation via Optimal Transport Distillation [70.02065520468726]
We propose an efficient unified video creation model, named VDOT. We employ a novel computational optimal transport (OT) technique to minimize the discrepancy between the real and fake score distributions. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering.
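To ground the phrase "computational optimal transport technique", here is a minimal, self-contained Sinkhorn sketch that measures the entropic OT cost between two sample batches (e.g., "real" and "fake" score samples). It is a generic textbook routine, not VDOT's actual objective; `sinkhorn_cost` and its cost normalization are assumptions.

```python
import torch

def sinkhorn_cost(x, y, eps=0.1, iters=100):
    """Entropy-regularized OT cost between two sample batches (toy Sinkhorn)."""
    cost = torch.cdist(x, y) ** 2          # (n, m) squared-Euclidean costs
    cost = cost / cost.mean()              # normalize for numerical stability
    n, m = cost.shape
    mu = torch.full((n,), 1.0 / n)         # uniform source weights
    nu = torch.full((m,), 1.0 / m)         # uniform target weights
    K = torch.exp(-cost / eps)             # Gibbs kernel
    v = torch.ones(m)
    for _ in range(iters):                 # alternating scaling updates
        u = mu / (K @ v)
        v = nu / (K.t() @ u)
    plan = u[:, None] * K * v[None, :]     # entropic transport plan
    return (plan * cost).sum()

# Toy usage: discrepancy between "real" and "fake" score samples.
real, fake = torch.randn(128, 16), torch.randn(128, 16) + 0.5
loss = sinkhorn_cost(real, fake)
```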
arXiv Detail & Related papers (2025-12-07T11:31:00Z)
- Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence [81.82643953694485]
We present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint. Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video. We verify FRESCO on two zero-shot tasks: video-to-video translation and text-guided video editing.
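One way to read "explicitly optimize features" under a spatial-temporal constraint is as a loss with an inter-frame term on corresponding tokens plus an intra-frame self-similarity term. The toy objective below illustrates that reading only; the tensor layout and `st_consistency_loss` are hypothetical, not FRESCO's exact formulation.

```python
import torch

def st_consistency_loss(feats, ref_feats, corr_idx, w_t=1.0, w_s=1.0):
    """Toy spatial-temporal feature objective (not FRESCO's exact loss).

    feats:     (T, N, C) per-frame feature tokens being optimized
    ref_feats: (T, N, C) frozen features of the input video
    corr_idx:  (T-1, N) index of each frame-t token's match in frame t+1
    """
    T, N, C = feats.shape
    idx = corr_idx.unsqueeze(-1).expand(-1, -1, C)
    nxt = torch.gather(feats[1:], 1, idx)        # matched tokens at t+1
    temporal = (feats[:-1] - nxt).pow(2).mean()  # inter-frame agreement
    sim = torch.einsum("tnc,tmc->tnm", feats, feats)
    sim_ref = torch.einsum("tnc,tmc->tnm", ref_feats, ref_feats)
    spatial = (sim - sim_ref).pow(2).mean()      # intra-frame self-similarity
    return w_t * temporal + w_s * spatial

# Toy usage: 8 frames, 64 tokens, 32 channels.
feats = torch.randn(8, 64, 32, requires_grad=True)
ref = torch.randn(8, 64, 32)
idx = torch.randint(0, 64, (7, 64))
st_consistency_loss(feats, ref, idx).backward()
```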
arXiv Detail & Related papers (2025-12-03T15:51:11Z)
- ReLumix: Extending Image Relighting to Video via Video Diffusion Models [5.890782804843724]
Controlling illumination during video post-production is a crucial yet elusive goal in computational photography. This paper introduces ReLumix, a novel framework that decouples relighting from temporal synthesis. Although trained on synthetic data, ReLumix shows competitive generalization to real-world videos.
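The stated decoupling can be pictured as a two-stage pipeline: relight one anchor frame with an image model, then let a video model propagate the edit over time. The structural sketch below uses hypothetical callables (`relight_image`, `propagate_video`); it mirrors the stated design, not ReLumix's real interface.

```python
def relight_video(frames, relight_image, propagate_video, ref_index=0):
    """Structural sketch of a relight-then-propagate pipeline.

    frames:          list of video frames
    relight_image:   callable(frame) -> relit frame (image relighting model)
    propagate_video: callable(frames, relit_ref, ref_index) -> relit frames
                     (video model that spreads the edit over time)
    """
    relit_ref = relight_image(frames[ref_index])  # edit one anchor frame
    return propagate_video(frames, relit_ref, ref_index)

# Toy usage with string placeholders standing in for frames and models.
frames = [f"frame_{i}" for i in range(4)]
relit = relight_video(
    frames,
    relight_image=lambda f: f + "_relit",
    propagate_video=lambda fs, ref, i: [f + "_relit" for f in fs],
)
```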
arXiv Detail & Related papers (2025-09-28T09:35:33Z)
- FADE: Frequency-Aware Diffusion Model Factorization for Video Editing [34.887298437323295]
FADE is a training-free yet highly effective video editing approach. We propose a factorization strategy to optimize each component's specialized role. Experiments on real-world videos demonstrate that our method consistently delivers high-quality, realistic, and temporally coherent editing results.
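As one plausible reading of "frequency-aware" factorization, the sketch below splits a latent into low- and high-frequency bands with a radial FFT mask, so each band can be handled by a different component. The cutoff and the `split_frequencies` helper are assumptions for illustration, not FADE's actual decomposition.

```python
import torch

def split_frequencies(latent, cutoff=0.25):
    """Split a latent into low/high-frequency bands with a radial FFT mask."""
    B, C, H, W = latent.shape
    spec = torch.fft.fftshift(torch.fft.fft2(latent), dim=(-2, -1))
    yy = torch.linspace(-1, 1, H).view(H, 1).expand(H, W)
    xx = torch.linspace(-1, 1, W).view(1, W).expand(H, W)
    low_mask = ((xx ** 2 + yy ** 2).sqrt() <= cutoff).to(latent.dtype)
    low = torch.fft.ifft2(
        torch.fft.ifftshift(spec * low_mask, dim=(-2, -1))
    ).real
    return low, latent - low                 # (low band, high band)

# Toy usage: keep the source's low band (layout), edit the high band.
low, high = split_frequencies(torch.randn(1, 4, 32, 32))
```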
arXiv Detail & Related papers (2025-06-06T10:00:39Z)
- DiffuseSlide: Training-Free High Frame Rate Video Generation Diffusion [4.863177884263436]
We present a training-free approach for high FPS video generation using pre-trained diffusion models. Our method, DiffuseSlide, introduces a new pipeline that leverages key frames from low FPS videos and applies innovative techniques, including noise re-injection and sliding window latent denoising. Through extensive experiments, we demonstrate that our approach significantly improves video quality, offering enhanced temporal coherence and spatial fidelity.
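Sliding window latent denoising is a known long-video trick: denoise overlapping temporal windows and average the overlaps. The sketch below shows that generic pattern plus a toy noise re-injection step; the interfaces are hypothetical and simplified relative to DiffuseSlide's pipeline.

```python
import torch

def sliding_window_denoise(latents, denoise_fn, window=16, stride=8):
    """Denoise a long latent video in overlapping windows, averaging overlaps.

    latents:    (T, C, H, W) noisy latent frames
    denoise_fn: callable mapping a (w, C, H, W) window to its denoised window
    """
    T = latents.shape[0]
    out = torch.zeros_like(latents)
    weight = torch.zeros(T, 1, 1, 1)
    for start in range(0, max(T - window, 0) + 1, stride):
        end = min(start + window, T)
        out[start:end] += denoise_fn(latents[start:end])
        weight[start:end] += 1.0
    return out / weight.clamp(min=1.0)

# Toy usage; the lambda stands in for one diffusion denoising pass.
lat = torch.randn(40, 4, 8, 8)
out = sliding_window_denoise(lat, denoise_fn=lambda w: 0.9 * w)

# Noise re-injection (sketch): interpolate key-frame latents, then perturb
# them back to the current noise level so the model can re-denoise them.
interp = 0.5 * (lat[:-1] + lat[1:])
reinjected = interp + 0.6 * torch.randn_like(interp)
```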
arXiv Detail & Related papers (2025-06-02T09:12:41Z)
- Subject-driven Video Generation via Disentangled Identity and Motion [52.54835936914813]
We propose to train a subject-driven customized video generation model by decoupling subject-specific learning from temporal dynamics, enabling zero-shot use without additional tuning. Our method achieves strong subject consistency and scalability, outperforming existing video customization models in zero-shot settings.
arXiv Detail & Related papers (2025-04-23T06:48:31Z)
- Taming Rectified Flow for Inversion and Editing [57.3742655030493]
Rectified-flow-based diffusion transformers like FLUX and OpenSora have demonstrated outstanding performance in the field of image and video generation. Despite their robust generative capabilities, these models often struggle with inversion inaccuracies. We propose RF-Solver, a training-free sampler that effectively enhances inversion precision by mitigating the errors in the inversion process of rectified flow.
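For context, plain first-order (Euler) inversion of a rectified-flow ODE accumulates local truncation error, which is the kind of inaccuracy the summary refers to; a higher-order step such as the midpoint rule reduces it. Both toy integrators below are generic numerical sketches, not RF-Solver's actual solver.

```python
import torch

def rf_invert_euler(x0, velocity_fn, steps=28):
    """First-order inversion of a rectified-flow ODE dx/dt = v(x, t),
    integrating a clean sample at t=0 toward noise at t=1."""
    x, dt = x0.clone(), 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        x = x + velocity_fn(x, t) * dt          # Euler step toward noise
    return x

def rf_invert_midpoint(x0, velocity_fn, steps=28):
    """Second-order midpoint variant: one generic way to cut the local
    truncation error that plain Euler inversion accumulates."""
    x, dt = x0.clone(), 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt)
        half = x + 0.5 * dt * velocity_fn(x, t)
        x = x + dt * velocity_fn(half, t + 0.5 * dt)
    return x

# Toy usage with a constant velocity field standing in for the network.
noise = rf_invert_midpoint(torch.randn(2, 4), lambda z, t: torch.ones_like(z))
```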
arXiv Detail & Related papers (2024-11-07T14:29:02Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task in which most current methods adopt a pre-trained text-to-image (T2I) diffusion model to edit the source video. We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing. COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
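Diffusion feature correspondence is commonly obtained by cosine-similarity matching of per-token features across frames. The snippet below shows that generic recipe; `feature_correspondence` is an illustrative stand-in and differs from COVE's exact scheme.

```python
import torch
import torch.nn.functional as F

def feature_correspondence(feat_a, feat_b, topk=1):
    """Match frame-A tokens to frame-B tokens by cosine similarity.

    feat_a: (N, C), feat_b: (M, C) -> (N, topk) indices into feat_b
    """
    a = F.normalize(feat_a, dim=-1)
    b = F.normalize(feat_b, dim=-1)
    sim = a @ b.t()                          # (N, M) cosine similarities
    return sim.topk(topk, dim=-1).indices

# Toy usage: diffusion feature tokens of two adjacent frames.
fa, fb = torch.randn(256, 64), torch.randn(256, 64)
match = feature_correspondence(fa, fb)       # (256, 1)
```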
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- NaRCan: Natural Refined Canonical Image with Integration of Diffusion Prior for Video Editing [3.6344789837383145]
We propose a video editing framework, NaRCan, which integrates a hybrid deformation field with a diffusion prior to generate high-quality, natural canonical images.
Our approach utilizes homography to model global motion and employs multi-layer perceptrons (MLPs) to capture local residual deformations.
Our method outperforms existing approaches in various video editing tasks and produces coherent and high-quality edited video sequences.
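A minimal sketch of the hybrid deformation field described above, assuming a single frame and toy shapes: a learnable homography supplies the global warp and a small MLP adds local residual offsets. `HybridDeformation` and its parameterization are illustrative, not NaRCan's implementation.

```python
import torch
import torch.nn as nn

class HybridDeformation(nn.Module):
    """Learnable homography (global motion) plus MLP residual (local motion)."""

    def __init__(self, hidden=64):
        super().__init__()
        # 8 free homography parameters; the ninth entry is fixed to 1.
        self.h = nn.Parameter(torch.eye(3).flatten()[:8])
        self.mlp = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, xy):
        """xy: (N, 2) pixel coordinates -> (N, 2) canonical coordinates."""
        H = torch.cat([self.h, self.h.new_ones(1)]).view(3, 3)
        ones = torch.ones(xy.shape[0], 1)
        warped = torch.cat([xy, ones], dim=1) @ H.t()
        warped = warped[:, :2] / warped[:, 2:3]     # projective divide
        return warped + self.mlp(xy)                # add local residual

canon = HybridDeformation()(torch.rand(1024, 2))
```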
arXiv Detail & Related papers (2024-06-10T17:59:46Z)
- Zero-Shot Video Editing through Adaptive Sliding Score Distillation [51.57440923362033]
This study proposes a novel paradigm of video-based score distillation, facilitating direct manipulation of original video content.
We propose an Adaptive Sliding Score Distillation strategy, which incorporates both global and local video guidance to reduce the impact of editing errors.
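For reference, the vanilla score-distillation (SDS-style) gradient that video-based score distillation builds on looks like the sketch below; the adaptive sliding strategy would apply such guidance over global and local temporal windows. The function and its toy usage are assumptions, not the paper's exact update.

```python
import torch

def sds_grad(latent, eps_pred_fn, t, alpha_bar):
    """Vanilla SDS-style gradient on a latent.

    latent:      (B, C, H, W) latent being optimized
    eps_pred_fn: callable(noisy_latent, t) -> predicted noise
    alpha_bar:   scalar tensor, cumulative alpha for timestep t
    """
    noise = torch.randn_like(latent)
    noisy = alpha_bar.sqrt() * latent + (1.0 - alpha_bar).sqrt() * noise
    eps = eps_pred_fn(noisy, t)
    return eps - noise    # score-distillation gradient direction

# Toy usage: one gradient step on the latent (zero predictor as stand-in).
lat = torch.randn(1, 4, 8, 8)
g = sds_grad(lat, lambda z, t: torch.zeros_like(z),
             t=torch.tensor([500]), alpha_bar=torch.tensor(0.5))
lat = lat - 0.1 * g
```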
arXiv Detail & Related papers (2024-06-07T12:33:59Z)
- Dreamix: Video Diffusion Models are General Video Editors [22.127604561922897]
Text-driven image and video diffusion models have recently achieved unprecedented generation realism.
We present the first diffusion-based method that is able to perform text-based motion and appearance editing of general videos.
arXiv Detail & Related papers (2023-02-02T18:58:58Z)