EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition
- URL: http://arxiv.org/abs/2512.21865v1
- Date: Fri, 26 Dec 2025 04:57:59 GMT
- Title: EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition
- Authors: Yihan Hu, Xuelin Chen, Xiaodong Cun,
- Abstract summary: We introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. We finetune a video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact. During sampling, the Effect Expert is used for denoising at early, high-noise steps, while the Quality Expert takes over at later, low-noise steps.
- Score: 26.91723676903844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that if a video inpainting model can be finetuned to remove foreground-associated effects, then it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naïvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture associated effects. Our systematic analysis reveals that this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact: an Effect Expert, in which LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and its associated effects, and a fully LoRA-finetuned Quality Expert that learns to refine the alpha matte. During sampling, the Effect Expert handles denoising at early, high-noise steps, while the Quality Expert takes over at later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state-of-the-art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.
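The Dual-Expert sampling schedule described above lends itself to a short illustration. Below is a minimal, self-contained sketch (not the authors' code; the module names, the 50/50 step split, the placeholder update rule, and the choice of effect-sensitive block indices are all illustrative assumptions): a single frozen denoiser carries two LoRA adapter sets, the Effect Expert enables LoRA only in effect-sensitive blocks during early high-noise steps, and the Quality Expert enables LoRA in all blocks for the later low-noise steps.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus two switchable low-rank adapters."""
    def __init__(self, dim, rank=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # One low-rank (down, up) pair per expert; only these would be trained.
        self.adapters = nn.ModuleDict({
            "effect": nn.Sequential(nn.Linear(dim, rank, bias=False),
                                    nn.Linear(rank, dim, bias=False)),
            "quality": nn.Sequential(nn.Linear(dim, rank, bias=False),
                                     nn.Linear(rank, dim, bias=False)),
        })
        self.active = None  # which adapter is enabled, if any

    def forward(self, x):
        out = self.base(x)
        if self.active is not None:
            out = out + self.adapters[self.active](x)
        return out

class ToyDenoiser(nn.Module):
    """Stand-in for the DiT: a stack of blocks, some 'effect-sensitive'."""
    def __init__(self, dim=64, n_blocks=8, effect_blocks=(2, 3)):
        super().__init__()
        self.blocks = nn.ModuleList(LoRALinear(dim) for _ in range(n_blocks))
        self.effect_blocks = set(effect_blocks)  # assumed sensitive indices

    def set_expert(self, expert):
        for i, blk in enumerate(self.blocks):
            if expert == "quality":
                blk.active = "quality"  # Quality Expert: LoRA in every block
            else:
                # Effect Expert: LoRA only in effect-sensitive blocks
                blk.active = "effect" if i in self.effect_blocks else None

    def forward(self, x, t):
        h = x
        for blk in self.blocks:
            h = torch.relu(blk(h))
        return h  # toy noise prediction; a real model also conditions on t

@torch.no_grad()
def dual_expert_sample(model, x, timesteps, switch_frac=0.5):
    """One diffusion pass; the active expert switches mid-trajectory."""
    n_effect = int(len(timesteps) * switch_frac)  # assumed switch point
    for i, t in enumerate(timesteps):
        model.set_expert("effect" if i < n_effect else "quality")
        eps = model(x, t)
        x = x - 0.1 * eps  # placeholder update, not a real noise scheduler
    return x

model = ToyDenoiser()
result = dual_expert_sample(model, torch.randn(1, 64), list(range(50)))
```

Because both experts share the same frozen base weights and only the adapter routing changes between steps, the switch happens inside a single diffusion pass, which is the source of the efficiency gain the abstract claims.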
Related papers
- IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning [13.89445714667069]
IC-Effect is an instruction-guided computation framework for few-shot video VFX editing. It synthesizes complex effects while preserving spatial and temporal consistency. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning, ensures strong instruction following and robust effect modeling.
arXiv Detail & Related papers (2025-12-17T17:47:18Z) - UniSER: A Foundation Model for Unified Soft Effects Removal [72.60782767314713]
We introduce UniSER, capable of addressing diverse degradations caused by soft effects within a single framework. Our methodology centers on curating a massive 3.8M-pair dataset to ensure robustness and generalization. This synergistic approach allows UniSER to significantly outperform both specialist and generalist models.
arXiv Detail & Related papers (2025-11-18T06:39:39Z) - Towards One-step Causal Video Generation via Adversarial Self-Distillation [71.30373662465648]
Recent hybrid video generation models combine autoregressive temporal dynamics with diffusion-based spatial denoising. Our framework produces a single distilled model that flexibly supports multiple inference-step settings.
arXiv Detail & Related papers (2025-11-03T10:12:47Z) - VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning [67.44716618860544]
We introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, we propose an efficient one-shot effect adaptation mechanism that rapidly improves generalization to difficult unseen effects from a single user-provided video.
arXiv Detail & Related papers (2025-10-29T17:59:53Z) - Dual-Expert Consistency Model for Efficient and High-Quality Video Generation [57.33788820909211]
We propose a parameter-efficient Dual-Expert Consistency Model (DCM), where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine-detail refinement. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation.
arXiv Detail & Related papers (2025-06-03T17:55:04Z) - Coherent Video Inpainting Using Optical Flow-Guided Efficient Diffusion [15.188335671278024]
We propose a new video inpainting framework using optical Flow-guided Efficient Diffusion (FloED) for higher video coherence. FloED employs a dual-branch architecture, where the time-agnostic flow branch restores corrupted flow first, and the multi-scale flow adapters provide motion guidance to the main inpainting branch. Experiments on background restoration and object removal tasks show that FloED outperforms state-of-the-art diffusion-based methods in both quality and efficiency. (A rough sketch of this dual-branch layout appears after this list.)
arXiv Detail & Related papers (2024-12-01T15:45:26Z) - Video Diffusion Models are Strong Video Inpainter [14.402778136825642]
We propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video. (A sketch of this first-frame filling step appears after this list.)
arXiv Detail & Related papers (2024-08-21T08:01:00Z) - COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video. We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing. COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z) - Boosting Visual Recognition in Real-world Degradations via Unsupervised Feature Enhancement Module with Deep Channel Prior [22.323789227447755]
Fog, low-light, and motion blur degrade image quality and pose threats to the safety of autonomous driving.
This work proposes a novel Deep Channel Prior (DCP) for degraded visual recognition.
Based on this, a novel plug-and-play Unsupervised Feature Enhancement Module (UFEM) is proposed to achieve unsupervised feature correction.
arXiv Detail & Related papers (2024-04-02T07:16:56Z) - Learning Task-Oriented Flows to Mutually Guide Feature Alignment in
Synthesized and Real Video Denoising [137.5080784570804]
Video denoising aims at removing noise from videos to recover clean ones.
Some existing works show that optical flow can help denoising by exploiting additional spatial-temporal clues from nearby frames.
We propose a new multi-scale refined optical flow-guided video denoising method, which is more robust to different noise levels.
arXiv Detail & Related papers (2022-08-25T00:09:18Z) - Investigating Tradeoffs in Real-World Video Super-Resolution [90.81396836308085]
Real-world video super-resolution (VSR) models are often trained with diverse degradations to improve generalizability.
To alleviate the first tradeoff, we propose a degradation scheme that reduces up to 40% of training time without sacrificing performance.
To facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences.
arXiv Detail & Related papers (2021-11-24T18:58:21Z)
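For the FloED entry above, a toy illustration of the dual-branch layout may help (a sketch under assumed shapes and module names, not FloED's actual architecture): the flow branch first restores the corrupted optical flow, and a small adapter turns the restored flow into motion-guidance features for the main inpainting branch.

```python
import torch
import torch.nn as nn

class FlowGuidedInpainter(nn.Module):
    """Toy dual-branch model: restore the flow first, then guide inpainting."""
    def __init__(self, ch=32):
        super().__init__()
        self.flow_branch = nn.Conv2d(2, 2, 3, padding=1)   # restores 2-ch flow
        self.flow_adapter = nn.Conv2d(2, ch, 1)            # flow -> guidance
        self.inpaint_branch = nn.Conv2d(3 + 1, ch, 3, padding=1)  # RGB + mask
        self.head = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, frame, mask, corrupted_flow):
        flow = self.flow_branch(corrupted_flow)        # step 1: fix the flow
        guidance = self.flow_adapter(flow)             # step 2: motion guidance
        feat = self.inpaint_branch(torch.cat([frame, mask], dim=1))
        return self.head(torch.relu(feat + guidance)), flow

model = FlowGuidedInpainter()
inpainted, restored_flow = model(torch.randn(1, 3, 64, 64),
                                 torch.ones(1, 1, 64, 64),
                                 torch.randn(1, 2, 64, 64))
```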
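For the FFF-VDI entry above, the first-frame filling step can be pictured as copying noise-latent values from later frames into the first frame's masked region (illustrative only; the tensor names, shapes, and the direct copy standing in for the paper's learned propagation are assumptions).

```python
import torch

def fill_first_frame_latent(latents, masks):
    """Fill the masked region of the first frame's noise latent with values
    observed in later frames.
    latents: (T, C, H, W) per-frame noise latents
    masks:   (T, 1, H, W), 1 where content is missing
    """
    first = latents[0].clone()
    unfilled = masks[0].clone()
    for t in range(1, latents.shape[0]):
        # Usable: first frame still masked AND frame t observed there.
        usable = unfilled * (1 - masks[t])
        first = first * (1 - usable) + latents[t] * usable
        unfilled = unfilled * masks[t]  # shrink the remaining hole
    return first

latents = torch.randn(8, 4, 16, 16)
masks = (torch.rand(8, 1, 16, 16) > 0.7).float()
filled = fill_first_frame_latent(latents, masks)
```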