Generative Omnimatte: Learning to Decompose Video into Layers
- URL: http://arxiv.org/abs/2411.16683v1
- Date: Mon, 25 Nov 2024 18:59:57 GMT
- Title: Generative Omnimatte: Learning to Decompose Video into Layers
- Authors: Yao-Chih Lee, Erika Lu, Sarah Rumbley, Michal Geyer, Jia-Bin Huang, Tali Dekel, Forrester Cole
- Abstract summary: We present a novel generative layered video decomposition framework to address the omnimatte problem.
Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object.
We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset.
- Score: 29.098471541412113
- Abstract: Given a video and a set of input object masks, an omnimatte method aims to decompose the video into semantically meaningful layers containing individual objects along with their associated effects, such as shadows and reflections. Existing omnimatte methods assume a static background or accurate pose and depth estimation and produce poor decompositions when these assumptions are violated. Furthermore, due to the lack of generative prior on natural videos, existing methods cannot complete dynamic occluded regions. We present a novel generative layered video decomposition framework to address the omnimatte problem. Our method does not assume a stationary scene or require camera pose or depth information and produces clean, complete layers, including convincing completions of occluded dynamic regions. Our core idea is to train a video diffusion model to identify and remove scene effects caused by a specific object. We show that this model can be finetuned from an existing video inpainting model with a small, carefully curated dataset, and demonstrate high-quality decompositions and editing results for a wide range of casually captured videos containing soft shadows, glossy reflections, splashing water, and more.
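The core idea above (finetuning a video inpainting diffusion model so that, given an object mask, it removes the object together with its shadows and reflections) can be illustrated with a short training-step sketch. This is a minimal sketch rather than the authors' implementation: the denoiser interface, the conditioning layout, and the paired clean-background targets are all assumptions made for illustration.

```python
# Minimal sketch (not the authors' code) of finetuning a video diffusion
# denoiser to remove an object and its associated effects. The `denoiser`
# call signature, conditioning layout, and paired targets are assumptions.
import torch
import torch.nn.functional as F

def finetune_step(denoiser, scheduler, video, object_mask, clean_background, optimizer):
    """One training step.

    denoiser:         video UNet predicting noise, conditioned on the masked
                      video and the object mask (assumed interface).
    scheduler:        DDPM-style noise scheduler exposing `add_noise(x0, noise, t)`
                      and `num_train_timesteps` (assumed, similar to diffusers).
    video:            (B, T, C, H, W) input clip.
    object_mask:      (B, T, 1, H, W) mask of the object whose effects are removed.
    clean_background: (B, T, C, H, W) target clip without the object or its effects,
                      from a small curated paired dataset.
    """
    b = video.shape[0]
    t = torch.randint(0, scheduler.num_train_timesteps, (b,), device=video.device)
    noise = torch.randn_like(clean_background)
    noisy_target = scheduler.add_noise(clean_background, noise, t)

    # Condition on the video with the object region masked out plus the mask
    # itself, so the model must also identify effects (shadows, reflections)
    # that lie outside the mask and remove them.
    masked_video = video * (1.0 - object_mask)
    cond = torch.cat([masked_video, object_mask], dim=2)

    pred_noise = denoiser(noisy_target, t, cond)  # assumed call signature
    loss = F.mse_loss(pred_noise, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, sampling such a model once per object mask would produce the per-object clean-plate videos from which the layered decomposition described in the abstract could be composed.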
Related papers
- Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering [56.68286440268329]
Correct insertion of virtual objects into images of real-world scenes requires a deep understanding of the scene's lighting, geometry, and materials.
We propose using a personalized large diffusion model as guidance to a physically based inverse rendering process.
Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes.
arXiv Detail & Related papers (2024-08-19T05:15:45Z) - Lester: rotoscope animation through video object segmentation and tracking [0.0]
Lester is a novel method to automatically synthesize retro-style 2D animations from videos.
Video frames are processed with the Segment Anything Model (SAM) and the resulting masks are tracked through subsequent frames with DeAOT.
Results show that the method exhibits excellent temporal consistency and correctly handles videos with varied poses and appearances; a rough pipeline sketch is given below.
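The SAM calls in the sketch below follow the public `segment_anything` package; the mask-propagation step is a hypothetical stand-in, since DeAOT does not ship a single standard Python interface.

```python
# Rough sketch of a SAM-then-track pipeline like the one described for Lester.
# SAM usage follows the public `segment_anything` package; `propagate_masks`
# is a hypothetical stand-in for a DeAOT-style tracker.
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

def segment_first_frame(frame_rgb: np.ndarray, checkpoint: str, top_k: int = 5):
    """Run SAM automatic mask generation on the first frame and keep the
    largest masks as candidate subjects for rotoscoping."""
    sam = sam_model_registry["vit_h"](checkpoint=checkpoint)
    generator = SamAutomaticMaskGenerator(sam)
    masks = generator.generate(frame_rgb)  # list of dicts with 'segmentation', 'area', ...
    masks.sort(key=lambda m: m["area"], reverse=True)
    return [m["segmentation"] for m in masks[:top_k]]

def propagate_masks(frames, init_masks):
    """Placeholder for a DeAOT-style video object tracker: given masks on the
    first frame, return per-frame masks for the whole clip (interface assumed)."""
    raise NotImplementedError("plug in a DeAOT / aot-benchmark tracker here")
```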
arXiv Detail & Related papers (2024-02-15T11:15:54Z) - ActAnywhere: Subject-Aware Video Background Generation [62.57759679425924]
Generating video backgrounds tailored to foreground subject motion is an important problem for the movie industry and the visual effects community.
The task requires a background that aligns with the motion and appearance of the foreground subject while also complying with the artist's creative intention.
We introduce ActAnywhere, a generative model that automates this process, which traditionally requires tedious manual effort.
arXiv Detail & Related papers (2024-01-19T17:16:16Z) - OmnimatteRF: Robust Omnimatte with 3D Background Modeling [42.844343885602214]
We propose a novel video matting method, OmnimatteRF, that combines dynamic 2D foreground layers and a 3D background model.
The 2D layers preserve the details of the subjects, while the 3D background robustly reconstructs scenes in real-world videos.
arXiv Detail & Related papers (2023-09-14T14:36:22Z) - Make-It-4D: Synthesizing a Consistent Long-Term Dynamic Scene Video from a Single Image [59.18564636990079]
We study the problem of synthesizing a long-term dynamic video from only a single image.
Existing methods either hallucinate inconsistent perpetual views or struggle with long camera trajectories.
We present Make-It-4D, a novel method that can generate a consistent long-term dynamic video from a single image.
arXiv Detail & Related papers (2023-08-20T12:53:50Z) - Unsupervised Object Learning via Common Fate [61.14802390241075]
Learning generative object models from unlabelled videos is a long-standing problem and is required for causal scene modeling.
We decompose this problem into three easier subtasks, and provide candidate solutions for each of them.
We show that our approach allows learning generative models that generalize beyond the occlusions present in the input videos.
arXiv Detail & Related papers (2021-10-13T08:22:04Z) - Omnimatte: Associating Objects and Their Effects in Video [100.66205249649131]
Scene effects related to objects in video are typically overlooked by computer vision methods.
In this work, we take a step towards solving this novel problem of automatically associating objects with their effects in video.
Our model is trained only on the input video in a self-supervised manner, without any manual labels, and is generic: it produces omnimattes automatically for arbitrary objects and a variety of effects.
arXiv Detail & Related papers (2021-05-14T17:57:08Z) - Moving SLAM: Fully Unsupervised Deep Learning in Non-Rigid Scenes [85.56602190773684]
We build on the idea of view synthesis, which uses classical camera geometry to re-render a source image from a different point of view.
By minimizing the error between the synthesized image and the corresponding real frame of the video, the deep network that predicts pose and depth can be trained in a completely unsupervised manner (a minimal sketch of such a loss follows this entry).
arXiv Detail & Related papers (2021-05-05T17:08:10Z)
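The view-synthesis objective sketched in the Moving SLAM summary is essentially a photometric reprojection loss. Below is a minimal, self-contained version of such a loss (not the paper's code); it assumes known pinhole camera intrinsics, and all tensor shapes are chosen purely for illustration.

```python
# Illustrative photometric reprojection loss (not the paper's code): warp a
# source frame into the target view using predicted depth and relative pose,
# then penalize the difference from the real target frame.
import torch
import torch.nn.functional as F

def reprojection_loss(target, source, depth, pose, K):
    """
    target, source: (B, 3, H, W) adjacent video frames.
    depth:          (B, 1, H, W) predicted depth for the target frame.
    pose:           (B, 4, 4) predicted relative pose (target -> source).
    K:              (B, 3, 3) pinhole intrinsics (assumed known here).
    """
    B, _, H, W = target.shape
    device = target.device

    # Homogeneous pixel grid of the target frame.
    ys, xs = torch.meshgrid(
        torch.arange(H, device=device, dtype=torch.float32),
        torch.arange(W, device=device, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(1, 3, -1).expand(B, -1, -1)

    # Back-project with predicted depth, move to the source camera, re-project.
    cam = (torch.linalg.inv(K) @ pix) * depth.reshape(B, 1, -1)
    cam_h = torch.cat([cam, torch.ones(B, 1, H * W, device=device)], dim=1)
    src = K @ (pose @ cam_h)[:, :3]
    src = src[:, :2] / src[:, 2:3].clamp(min=1e-6)

    # Normalize to [-1, 1] and synthesize the target view from the source frame.
    u = 2.0 * src[:, 0] / (W - 1) - 1.0
    v = 2.0 * src[:, 1] / (H - 1) - 1.0
    grid = torch.stack([u, v], dim=-1).reshape(B, H, W, 2)
    synthesized = F.grid_sample(source, grid, align_corners=True)

    return F.l1_loss(synthesized, target)
```

Minimizing this loss jointly over the depth and pose networks reproduces, in spirit, the unsupervised training signal described in the summary above.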
This list is automatically generated from the titles and abstracts of the papers on this site.