EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition
- URL: http://arxiv.org/abs/2512.21865v1
- Date: Fri, 26 Dec 2025 04:57:59 GMT
- Title: EasyOmnimatte: Taming Pretrained Inpainting Diffusion Models for End-to-End Video Layered Decomposition
- Authors: Yihan Hu, Xuelin Chen, Xiaodong Cun,
- Abstract summary: We introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. We finetune a video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact. During sampling, the Effect Expert is used for denoising at early, high-noise steps, while the Quality Expert takes over at later, low-noise steps.
- Score: 26.91723676903844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing video omnimatte methods typically rely on slow, multi-stage, or inference-time optimization pipelines that fail to fully exploit powerful generative priors, producing suboptimal decompositions. Our key insight is that if a video inpainting model can be finetuned to remove foreground-associated effects, then it must be inherently capable of perceiving these effects, and hence can also be finetuned for the complementary task: foreground layer decomposition with associated effects. However, although naïvely finetuning the inpainting model with LoRA applied to all blocks can produce high-quality alpha mattes, it fails to capture associated effects. Our systematic analysis reveals that this arises because effect-related cues are primarily encoded in specific DiT blocks and become suppressed when LoRA is applied across all blocks. To address this, we introduce EasyOmnimatte, the first unified, end-to-end video omnimatte method. Concretely, we finetune a pretrained video inpainting diffusion model to learn dual complementary experts while keeping its original weights intact: an Effect Expert, in which LoRA is applied only to effect-sensitive DiT blocks to capture the coarse structure of the foreground and its associated effects, and a fully LoRA-finetuned Quality Expert that learns to refine the alpha matte. During sampling, the Effect Expert handles denoising at early, high-noise steps, while the Quality Expert takes over at later, low-noise steps. This design eliminates the need for two full diffusion passes, significantly reducing computational cost without compromising output quality. Ablation studies validate the effectiveness of this Dual-Expert strategy. Experiments demonstrate that EasyOmnimatte sets a new state-of-the-art for video omnimatte and enables various downstream tasks, significantly outperforming baselines in both quality and efficiency.
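The Dual-Expert sampling schedule described above lends itself to a short illustration. Below is a minimal, self-contained sketch (not the authors' code; the module names, the 50/50 step split, the placeholder update rule, and the choice of effect-sensitive block indices are all illustrative assumptions): a single frozen denoiser carries two LoRA adapter sets, the Effect Expert enables LoRA only in effect-sensitive blocks during early high-noise steps, and the Quality Expert enables LoRA in all blocks for the later low-noise steps.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus two switchable low-rank adapters."""
    def __init__(self, dim, rank=4):
        super().__init__()
        self.base = nn.Linear(dim, dim)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        # One low-rank (down, up) pair per expert; only these would be trained.
        self.adapters = nn.ModuleDict({
            "effect": nn.Sequential(nn.Linear(dim, rank, bias=False),
                                    nn.Linear(rank, dim, bias=False)),
            "quality": nn.Sequential(nn.Linear(dim, rank, bias=False),
                                     nn.Linear(rank, dim, bias=False)),
        })
        self.active = None  # which adapter is enabled, if any

    def forward(self, x):
        out = self.base(x)
        if self.active is not None:
            out = out + self.adapters[self.active](x)
        return out

class ToyDenoiser(nn.Module):
    """Stand-in for the DiT: a stack of blocks, some 'effect-sensitive'."""
    def __init__(self, dim=64, n_blocks=8, effect_blocks=(2, 3)):
        super().__init__()
        self.blocks = nn.ModuleList(LoRALinear(dim) for _ in range(n_blocks))
        self.effect_blocks = set(effect_blocks)  # assumed sensitive indices

    def set_expert(self, expert):
        for i, blk in enumerate(self.blocks):
            if expert == "quality":
                blk.active = "quality"  # Quality Expert: LoRA in every block
            else:
                # Effect Expert: LoRA only in effect-sensitive blocks
                blk.active = "effect" if i in self.effect_blocks else None

    def forward(self, x, t):
        h = x
        for blk in self.blocks:
            h = torch.relu(blk(h))
        return h  # toy noise prediction; a real model also conditions on t

@torch.no_grad()
def dual_expert_sample(model, x, timesteps, switch_frac=0.5):
    """One diffusion pass; the active expert switches mid-trajectory."""
    n_effect = int(len(timesteps) * switch_frac)  # assumed switch point
    for i, t in enumerate(timesteps):
        model.set_expert("effect" if i < n_effect else "quality")
        eps = model(x, t)
        x = x - 0.1 * eps  # placeholder update, not a real noise scheduler
    return x

model = ToyDenoiser()
result = dual_expert_sample(model, torch.randn(1, 64), list(range(50)))
```

Because both experts share the same frozen base weights and only the adapter routing changes between steps, the switch happens inside a single diffusion pass, which is the source of the efficiency gain the abstract claims.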
Related papers
- IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning [13.89445714667069]
IC-Effect is an instruction-guided computation framework for few-shot video VFX editing. It synthesizes complex effects while preserving spatial and temporal consistency. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning, ensures strong instruction following and robust effect modeling.
arXiv Detail & Related papers (2025-12-17T17:47:18Z) - UniSER: A Foundation Model for Unified Soft Effects Removal [72.60782767314713]
We introduce UniSER, capable of addressing diverse degradations caused by soft effects within a single framework. Our methodology centers on curating a massive 3.8M-pair dataset to ensure robustness and generalization. This synergistic approach allows UniSER to significantly outperform both specialist and generalist models.
arXiv Detail & Related papers (2025-11-18T06:39:39Z) - Towards One-step Causal Video Generation via Adversarial Self-Distillation [71.30373662465648]
Recent hybrid video generation models combine autoregressive temporal dynamics with diffusion-based spatial denoising. Our framework produces a single distilled model that flexibly supports multiple inference-step settings.
arXiv Detail & Related papers (2025-11-03T10:12:47Z) - VFXMaster: Unlocking Dynamic Visual Effect Generation via In-Context Learning [67.44716618860544]
We introduce VFXMaster, the first unified, reference-based framework for VFX video generation. It recasts effect generation as an in-context learning task, enabling it to reproduce diverse dynamic effects from a reference video onto target content. In addition, we propose an efficient one-shot effect adaptation mechanism that rapidly improves generalization to difficult unseen effects from a single user-provided video.
arXiv Detail & Related papers (2025-10-29T17:59:53Z) - Dual-Expert Consistency Model for Efficient and High-Quality Video Generation [57.33788820909211]
We propose a parameter-efficient Dual-Expert Consistency Model (DCM), where a semantic expert focuses on learning semantic layout and motion, while a detail expert specializes in fine-detail refinement. Our approach achieves state-of-the-art visual quality with significantly reduced sampling steps, demonstrating the effectiveness of expert specialization in video diffusion model distillation.
arXiv Detail & Related papers (2025-06-03T17:55:04Z) - Coherent Video Inpainting Using Optical Flow-Guided Efficient Diffusion [15.188335671278024]
We propose a new video inpainting framework using optical Flow-guided Efficient Diffusion (FloED) for higher video coherence. FloED employs a dual-branch architecture, where the time-agnostic flow branch restores corrupted flow first, and the multi-scale flow adapters provide motion guidance to the main inpainting branch. Experiments on background restoration and object removal tasks show that FloED outperforms state-of-the-art diffusion-based methods in both quality and efficiency. (A rough sketch of this dual-branch layout appears after this list.)
arXiv Detail & Related papers (2024-12-01T15:45:26Z) - Video Diffusion Models are Strong Video Inpainter [14.402778136825642]
We propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video. (A sketch of this first-frame filling step appears after this list.)
arXiv Detail & Related papers (2024-08-21T08:01:00Z) - COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video. We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing. COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z) - Boosting Visual Recognition in Real-world Degradations via Unsupervised Feature Enhancement Module with Deep Channel Prior [22.323789227447755]
Fog, low-light, and motion blur degrade image quality and pose threats to the safety of autonomous driving.
This work proposes a novel Deep Channel Prior (DCP) for degraded visual recognition.
Based on this, a novel plug-and-play Unsupervised Feature Enhancement Module (UFEM) is proposed to achieve unsupervised feature correction.
arXiv Detail & Related papers (2024-04-02T07:16:56Z) - Learning Task-Oriented Flows to Mutually Guide Feature Alignment in
Synthesized and Real Video Denoising [137.5080784570804]
Video denoising aims at removing noise from videos to recover clean ones.
Some existing works show that optical flow can help denoising by exploiting additional spatial-temporal clues from nearby frames.
We propose a new multi-scale refined optical flow-guided video denoising method, which is more robust to different noise levels.
arXiv Detail & Related papers (2022-08-25T00:09:18Z) - Investigating Tradeoffs in Real-World Video Super-Resolution [90.81396836308085]
Real-world video super-resolution (VSR) models are often trained with diverse degradations to improve generalizability.
To alleviate the first tradeoff, we propose a degradation scheme that reduces up to 40% of training time without sacrificing performance.
To facilitate fair comparisons, we propose the new VideoLQ dataset, which contains a large variety of real-world low-quality video sequences.
arXiv Detail & Related papers (2021-11-24T18:58:21Z)
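For the FloED entry above, a toy illustration of the dual-branch layout may help (a sketch under assumed shapes and module names, not FloED's actual architecture): the flow branch first restores the corrupted optical flow, and a small adapter turns the restored flow into motion-guidance features for the main inpainting branch.

```python
import torch
import torch.nn as nn

class FlowGuidedInpainter(nn.Module):
    """Toy dual-branch model: restore the flow first, then guide inpainting."""
    def __init__(self, ch=32):
        super().__init__()
        self.flow_branch = nn.Conv2d(2, 2, 3, padding=1)   # restores 2-ch flow
        self.flow_adapter = nn.Conv2d(2, ch, 1)            # flow -> guidance
        self.inpaint_branch = nn.Conv2d(3 + 1, ch, 3, padding=1)  # RGB + mask
        self.head = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, frame, mask, corrupted_flow):
        flow = self.flow_branch(corrupted_flow)        # step 1: fix the flow
        guidance = self.flow_adapter(flow)             # step 2: motion guidance
        feat = self.inpaint_branch(torch.cat([frame, mask], dim=1))
        return self.head(torch.relu(feat + guidance)), flow

model = FlowGuidedInpainter()
inpainted, restored_flow = model(torch.randn(1, 3, 64, 64),
                                 torch.ones(1, 1, 64, 64),
                                 torch.randn(1, 2, 64, 64))
```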
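For the FFF-VDI entry above, the first-frame filling step can be pictured as copying noise-latent values from later frames into the first frame's masked region (illustrative only; the tensor names, shapes, and the direct copy standing in for the paper's learned propagation are assumptions).

```python
import torch

def fill_first_frame_latent(latents, masks):
    """Fill the masked region of the first frame's noise latent with values
    observed in later frames.
    latents: (T, C, H, W) per-frame noise latents
    masks:   (T, 1, H, W), 1 where content is missing
    """
    first = latents[0].clone()
    unfilled = masks[0].clone()
    for t in range(1, latents.shape[0]):
        # Usable: first frame still masked AND frame t observed there.
        usable = unfilled * (1 - masks[t])
        first = first * (1 - usable) + latents[t] * usable
        unfilled = unfilled * masks[t]  # shrink the remaining hole
    return first

latents = torch.randn(8, 4, 16, 16)
masks = (torch.rand(8, 1, 16, 16) > 0.7).float()
filled = fill_first_frame_latent(latents, masks)
```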