Over++: Generative Video Compositing for Layer Interaction Effects
- URL: http://arxiv.org/abs/2512.19661v1
- Date: Mon, 22 Dec 2025 18:39:58 GMT
- Title: Over++: Generative Video Compositing for Layer Interaction Effects
- Authors: Luchao Qi, Jiaye Wu, Jun Myeong Choi, Cary Phillips, Roni Sengupta, Dan B Goldman
- Abstract summary: We present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in effect generation and scene preservation.
- Score: 9.071187452755206
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In professional video compositing workflows, artists must manually create environmental interactions, such as shadows, reflections, dust, and splashes, between foreground subjects and background layers. Existing video generative models struggle to preserve the input video while adding such effects, and current video inpainting methods either require costly per-frame masks or yield implausible results. We introduce augmented compositing, a new task that synthesizes realistic, semi-transparent environmental effects conditioned on text prompts and input video layers, while preserving the original scene. To address this task, we present Over++, a video effect generation framework that makes no assumptions about camera pose, scene stationarity, or depth supervision. We construct a paired effect dataset tailored for this task and introduce an unpaired augmentation strategy that preserves text-driven editability. Our method also supports optional mask control and keyframe guidance without requiring dense annotations. Despite training on limited data, Over++ produces diverse and realistic environmental effects and outperforms existing baselines in both effect generation and scene preservation.
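The paper's name plays on the classical Porter-Duff "over" operator, the basic primitive for compositing a semi-transparent layer onto a background. The sketch below is a minimal per-pixel illustration of that standard operator only, not the Over++ model; the function name and pixel-tuple representation are hypothetical choices for this example.

```python
def over(fg_rgb, fg_a, bg_rgb, bg_a):
    """Composite a foreground pixel over a background pixel (straight alpha).

    fg_rgb, bg_rgb: (r, g, b) tuples with channels in [0, 1]
    fg_a, bg_a:     non-premultiplied alpha values in [0, 1]
    Returns (out_rgb, out_a).
    """
    # Resulting coverage: foreground plus whatever background shows through.
    out_a = fg_a + bg_a * (1.0 - fg_a)
    if out_a == 0.0:
        # Fully transparent result; color is undefined, return black.
        return (0.0, 0.0, 0.0), 0.0
    # Blend colors weighted by their alpha contributions, then un-premultiply.
    out_rgb = tuple(
        (f * fg_a + b * bg_a * (1.0 - fg_a)) / out_a
        for f, b in zip(fg_rgb, bg_rgb)
    )
    return out_rgb, out_a

# A half-transparent red effect layer over an opaque blue background:
rgb, a = over((1.0, 0.0, 0.0), 0.5, (0.0, 0.0, 1.0), 1.0)
# rgb == (0.5, 0.0, 0.5), a == 1.0
```

Generating effects like shadows or splashes as semi-transparent layers lets a standard compositor like this merge them onto the original footage without disturbing the underlying scene.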
Related papers
- Tuning-free Visual Effect Transfer across Videos [91.93897438317397]
RefVFX is a framework that transfers complex temporal effects from a reference video onto a target video or image in a feed-forward manner. We introduce a large-scale dataset of triplets, where each triplet consists of a reference effect video, an input image or video, and a corresponding output video. We show that RefVFX produces visually consistent and temporally coherent edits, generalizes across unseen effect categories, and outperforms prompt-only baselines in both quantitative metrics and human preference.
arXiv Detail & Related papers (2026-01-12T18:59:32Z) - IC-Effect: Precise and Efficient Video Effects Editing via In-Context Learning [13.89445714667069]
IC-Effect is an instruction-guided computation framework for few-shot video VFX editing. It synthesizes complex effects while preserving spatial and temporal consistency. A two-stage training strategy, consisting of general editing adaptation followed by effect-specific learning, ensures strong instruction following and robust effect modeling.
arXiv Detail & Related papers (2025-12-17T17:47:18Z) - GenCompositor: Generative Video Compositing with Diffusion Transformer [68.00271033575736]
Traditional pipelines require intensive labor efforts and expert collaboration, resulting in lengthy production cycles and high manpower costs. This new task strives to adaptively inject identity and motion information of foreground video to the target video in an interactive manner. Experiments demonstrate that our method effectively realizes generative video compositing, outperforming existing possible solutions in fidelity and consistency.
arXiv Detail & Related papers (2025-09-02T16:10:13Z) - VideoPainter: Any-length Video Inpainting and Editing with Plug-and-Play Context Control [47.34885131252508]
Video inpainting aims to restore corrupted video content. We propose a novel dual-stream paradigm, VideoPainter, to process masked videos. We also introduce a novel target region ID resampling technique that enables any-length video inpainting.
arXiv Detail & Related papers (2025-03-07T17:59:46Z) - FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors [64.54220123913154]
We introduce FramePainter as an efficient instantiation of the image-to-video generation problem. It only uses a lightweight sparse control encoder to inject editing signals. It dominantly outperforms previous state-of-the-art methods with far less training data.
arXiv Detail & Related papers (2025-01-14T16:09:16Z) - Generative Image Layer Decomposition with Visual Effects [49.75021036203426]
LayerDecomp is a generative framework for image layer decomposition. It produces clean backgrounds and high-quality transparent foregrounds with faithfully preserved visual effects. Our method achieves superior quality in layer decomposition, outperforming existing approaches in object removal and spatial editing tasks.
arXiv Detail & Related papers (2024-11-26T20:26:49Z) - Generative Omnimatte: Learning to Decompose Video into Layers [29.098471541412113]
We present a novel generative layered video decomposition framework to address the omnimatte problem. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset.
arXiv Detail & Related papers (2024-11-25T18:59:57Z) - ActAnywhere: Subject-Aware Video Background Generation [62.57759679425924]
Generating video background that tailors to foreground subject motion is an important problem for the movie industry and visual effects community.
This task involves generating a background that aligns with the motion and appearance of the foreground subject while also complying with the artist's creative intention.
We introduce ActAnywhere, a generative model that automates this process, which traditionally requires tedious manual effort.
arXiv Detail & Related papers (2024-01-19T17:16:16Z) - WALDO: Future Video Synthesis using Object Layer Decomposition and Parametric Flow Prediction [82.79642869586587]
WALDO is a novel approach to the prediction of future video frames from past ones.
Individual images are decomposed into multiple layers combining object masks and a small set of control points.
The layer structure is shared across all frames in each video to build dense inter-frame connections.
arXiv Detail & Related papers (2022-11-25T18:59:46Z) - Text2LIVE: Text-Driven Layered Image and Video Editing [13.134513605107808]
We present a method for zero-shot, text-driven appearance manipulation in natural images and videos.
Given an input image or video and a target text prompt, our goal is to edit the appearance of existing objects.
We demonstrate localized, semantic edits on high-resolution natural images and videos across a variety of objects and scenes.
arXiv Detail & Related papers (2022-04-05T21:17:34Z) - Omnimatte: Associating Objects and Their Effects in Video [100.66205249649131]
Scene effects related to objects in video are typically overlooked by computer vision.
In this work, we take a step towards solving this novel problem of automatically associating objects with their effects in video.
Our model is trained only on the input video in a self-supervised manner, without any manual labels, and is generic: it produces omnimattes automatically for arbitrary objects and a variety of effects.
arXiv Detail & Related papers (2021-05-14T17:57:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.