Occlusion-Aware Video Object Inpainting
- URL: http://arxiv.org/abs/2108.06765v1
- Date: Sun, 15 Aug 2021 15:46:57 GMT
- Title: Occlusion-Aware Video Object Inpainting
- Authors: Lei Ke, Yu-Wing Tai, Chi-Keung Tang
- Abstract summary: This paper presents occlusion-aware video object inpainting, which recovers both the complete shape and appearance for occluded objects in videos.
Our technical contribution VOIN jointly performs video object shape completion and occluded texture generation.
For more realistic results, VOIN is optimized using both T-PatchGAN and a new spatio-temporal attention-based multi-class discriminator.
- Score: 72.38919601150175
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Conventional video inpainting is neither object-oriented nor occlusion-aware,
making it liable to obvious artifacts when large occluded object regions are
inpainted. This paper presents occlusion-aware video object inpainting, which
recovers both the complete shape and appearance for occluded objects in videos
given their visible mask segmentation.
To facilitate this new research, we construct the first large-scale video
object inpainting benchmark YouTube-VOI to provide realistic occlusion
scenarios with both occluded and visible object masks available. Our technical
contribution VOIN jointly performs video object shape completion and occluded
texture generation. In particular, the shape completion module models
long-range object coherence while the flow completion module recovers accurate
flow with sharp motion boundary, for propagating temporally-consistent texture
to the same moving object across frames. For more realistic results, VOIN is
optimized using both T-PatchGAN and a new spatio-temporal attention-based
multi-class discriminator.
Finally, we compare VOIN and strong baselines on YouTube-VOI. Experimental
results clearly demonstrate the efficacy of our method, including inpainting
complex and dynamic objects. VOIN degrades gracefully with inaccurate input
visible masks.
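The abstract above outlines a two-stage design: amodal shape completion followed by occluded texture generation, supervised with a T-PatchGAN-style spatio-temporal discriminator. As a rough illustration only, the PyTorch sketch below wires these stages together with placeholder modules; all class names, channel widths, and tensor shapes are hypothetical and do not come from the paper, and the flow-completion stage is simplified away.

```python
# Minimal, hypothetical sketch of an occlusion-aware video object inpainting
# pipeline in PyTorch. Module names, channel widths, and shapes are
# illustrative placeholders; this is NOT the authors' VOIN implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ShapeCompletion(nn.Module):
    """Predicts the complete (amodal) object mask from the visible mask."""
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, 1, 3, padding=1),
        )

    def forward(self, visible_mask):              # (B, 1, T, H, W)
        return torch.sigmoid(self.net(visible_mask))


class TextureGenerator(nn.Module):
    """Synthesizes occluded texture, conditioned on frames and the completed mask."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(4, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv3d(ch, 3, 3, padding=1),
        )

    def forward(self, frames, completed_mask):    # (B, 3, T, H, W), (B, 1, T, H, W)
        x = torch.cat([frames * (1 - completed_mask), completed_mask], dim=1)
        return torch.sigmoid(self.net(x))


class TPatchGAN3D(nn.Module):
    """T-PatchGAN-style discriminator: 3D convs scoring spatio-temporal patches."""
    def __init__(self, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, ch, 3, stride=(1, 2, 2), padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(ch, ch * 2, 3, stride=(1, 2, 2), padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv3d(ch * 2, 1, 3, padding=1),
        )

    def forward(self, video):                     # (B, 3, T, H, W) -> per-patch logits
        return self.net(video)


if __name__ == "__main__":
    B, T, H, W = 1, 5, 64, 64
    frames = torch.rand(B, 3, T, H, W)
    visible_mask = (torch.rand(B, 1, T, H, W) > 0.7).float()

    shape_net, tex_net, disc = ShapeCompletion(), TextureGenerator(), TPatchGAN3D()
    completed_mask = shape_net(visible_mask)             # amodal mask estimate
    inpainted = tex_net(frames, completed_mask)          # texture for the occluded region
    # Composite: keep visible pixels, fill the rest with generated texture.
    output = frames * visible_mask + inpainted * (1 - visible_mask)
    scores = disc(output)                                # spatio-temporal patch scores
    adv_loss = F.binary_cross_entropy_with_logits(scores, torch.ones_like(scores))
    print(output.shape, adv_loss.item())
```

In the actual method, the flow-completion module omitted here recovers optical flow with sharp motion boundaries and propagates visible texture of the same object across frames, so that texture is hallucinated only where no temporal evidence exists.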
Related papers
- InVi: Object Insertion In Videos Using Off-the-Shelf Diffusion Models [46.587906540660455]
We introduce InVi, an approach for inserting or replacing objects within videos using off-the-shelf, text-to-image latent diffusion models.
InVi achieves realistic object insertion with consistent blending and coherence across frames, outperforming existing methods.
arXiv Detail & Related papers (2024-07-15T17:55:09Z)
- Rethinking Image-to-Video Adaptation: An Object-centric Perspective [61.833533295978484]
We propose a novel and efficient image-to-video adaptation strategy from the object-centric perspective.
Inspired by human perception, we integrate a proxy task of object discovery into image-to-video transfer learning.
arXiv Detail & Related papers (2024-07-09T13:58:10Z)
- OR-NeRF: Object Removing from 3D Scenes Guided by Multiview Segmentation with Neural Radiance Fields [53.32527220134249]
The emergence of Neural Radiance Fields (NeRF) for novel view synthesis has increased interest in 3D scene editing.
Current methods face challenges such as time-consuming object labeling, limited capability to remove specific targets, and compromised rendering quality after removal.
This paper proposes a novel object-removing pipeline, named OR-NeRF, that can remove objects from 3D scenes with user-given points or text prompts on a single view.
arXiv Detail & Related papers (2023-05-17T18:18:05Z)
- One-Shot Video Inpainting [5.7120338754738835]
We propose a unified pipeline for one-shot video inpainting (OSVI).
By jointly learning mask prediction and video completion in an end-to-end manner, the results can be optimal for the entire task.
Our method is more reliable because the predicted masks can be used as the network's internal guidance.
arXiv Detail & Related papers (2023-02-28T07:30:36Z)
- Breaking the "Object" in Video Object Segmentation [36.20167854011788]
We present a dataset for Video Object Segmentation under Transformations (VOST).
It consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long on average and densely labeled with instance masks.
A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent.
We show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues.
arXiv Detail & Related papers (2022-12-12T19:22:17Z)
- Neural Assets: Volumetric Object Capture and Rendering for Interactive Environments [8.258451067861932]
We propose an approach for capturing real-world objects in everyday environments faithfully and fast.
We use a novel neural representation to reconstruct effects, such as translucent object parts, and preserve object appearance.
This leads to a seamless integration of the proposed neural assets with existing mesh environments and objects.
arXiv Detail & Related papers (2022-12-12T18:55:03Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Decoupled Spatial-Temporal Transformer for Video Inpainting [77.8621673355983]
Video inpainting aims to fill the given holes with realistic appearance but is still a challenging task even with prosperous deep learning approaches.
Recent works introduce the promising Transformer architecture into deep video inpainting and achieve better performance.
We propose a Decoupled Spatial-Temporal Transformer (DSTT) for improving video inpainting with exceptional efficiency.
arXiv Detail & Related papers (2021-04-14T05:47:46Z)
- Learning Joint Spatial-Temporal Transformations for Video Inpainting [58.939131620135235]
We propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting.
We simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss.
arXiv Detail & Related papers (2020-07-20T16:35:48Z)
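The STTN and DSTT entries above both hinge on joint spatio-temporal self-attention: every patch token of every frame attends to all other tokens in the clip, so visible content anywhere in the video can fill a hole in any frame. The sketch below is a generic, hypothetical illustration of that mechanism, not the released code of either paper; the patch size, embedding width, and class name are assumptions.

```python
# Hypothetical sketch of joint spatio-temporal self-attention over video patch
# tokens (in the spirit of STTN/DSTT); not the authors' released code.
import torch
import torch.nn as nn


class SpatioTemporalAttention(nn.Module):
    """Attends jointly over all patches of all frames in a clip."""
    def __init__(self, dim=64, patch=8, heads=4):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)       # patchify each frame
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_pixels = nn.ConvTranspose2d(dim, 3, kernel_size=patch, stride=patch)

    def forward(self, frames):                          # (B, T, 3, H, W)
        B, T, C, H, W = frames.shape
        tok = self.embed(frames.reshape(B * T, C, H, W))                      # (B*T, D, h, w)
        D, h, w = tok.shape[1], tok.shape[2], tok.shape[3]
        tok = tok.reshape(B, T, D, h, w).permute(0, 1, 3, 4, 2).reshape(B, T * h * w, D)
        out, _ = self.attn(tok, tok, tok)               # every token attends to all frames/positions
        out = out.reshape(B, T, h, w, D).permute(0, 1, 4, 2, 3).reshape(B * T, D, h, w)
        return self.to_pixels(out).reshape(B, T, C, H, W)


if __name__ == "__main__":
    clip = torch.rand(2, 4, 3, 64, 64)                  # 2 clips of 4 RGB frames each
    print(SpatioTemporalAttention()(clip).shape)        # -> torch.Size([2, 4, 3, 64, 64])
```

A real inpainting transformer would also take the hole masks as input and stack several such blocks with feed-forward layers; this single block only shows how the spatio-temporal token sequence is laid out.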