One-Shot Video Inpainting
- URL: http://arxiv.org/abs/2302.14362v1
- Date: Tue, 28 Feb 2023 07:30:36 GMT
- Title: One-Shot Video Inpainting
- Authors: Sangjin Lee, Suhwan Cho, Sangyoun Lee
- Abstract summary: We propose a unified pipeline for one-shot video inpainting (OSVI).
By jointly learning mask prediction and video completion in an end-to-end manner, the results can be optimal for the entire task.
Our method is more reliable because the predicted masks can be used as the network's internal guidance.
- Score: 5.7120338754738835
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Recently, removing objects from videos and filling in the erased regions
using deep video inpainting (VI) algorithms has attracted considerable
attention. Usually, a video sequence and object segmentation masks for all
frames are required as the input for this task. However, in real-world
applications, providing segmentation masks for all frames is quite difficult
and inefficient. Therefore, we deal with VI in a one-shot manner, which only
takes the initial frame's object mask as its input. Although we can achieve
that using naive combinations of video object segmentation (VOS) and VI
methods, they are sub-optimal and generally cause critical errors. To address
that, we propose a unified pipeline for one-shot video inpainting (OSVI). By
jointly learning mask prediction and video completion in an end-to-end manner,
the results can be optimal for the entire task instead of each separate module.
Additionally, unlike two-stage methods that use the predicted masks as
ground truth cues, our method is more reliable because the predicted masks can
be used as the network's internal guidance. On the synthesized datasets for
OSVI, our proposed method outperforms all others both quantitatively and
qualitatively.
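To make the unified pipeline concrete, here is a minimal, illustrative sketch of the joint mask-prediction-plus-completion loop described above. The module names, shapes, and toy convolutional sub-networks are hypothetical stand-ins, not the authors' implementation:

```python
import torch
import torch.nn as nn

class OSVIPipeline(nn.Module):
    """Toy joint mask-prediction + completion loop (illustrative only)."""

    def __init__(self, channels=64):
        super().__init__()
        # Placeholder sub-networks; the real ones are far more elaborate.
        self.encoder = nn.Conv2d(3, channels, 3, padding=1)
        self.mask_head = nn.Conv2d(channels, 1, 3, padding=1)
        self.completion = nn.Conv2d(channels + 1, 3, 3, padding=1)

    def forward(self, frames, first_mask):
        # frames: (T, 3, H, W); first_mask: (1, H, W), given for frame 0 only.
        outputs, masks = [], [first_mask.unsqueeze(0)]
        for t in range(frames.shape[0]):
            feat = self.encoder(frames[t:t + 1])
            # Frame 0 uses the given mask; later frames predict their own.
            mask = masks[0] if t == 0 else torch.sigmoid(self.mask_head(feat))
            masks.append(mask)
            # The soft mask acts as internal guidance for completion rather
            # than a hard ground-truth cue, unlike two-stage VOS+VI combos.
            outputs.append(self.completion(torch.cat([feat, mask], dim=1)))
        return torch.cat(outputs), torch.cat(masks[1:])
```

Because the mask and reconstruction losses backpropagate through the same network, the completion objective can shape the mask predictor and vice versa, which is the point of the unified formulation.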
Related papers
- Text-Guided Video Masked Autoencoder [12.321239366215426]
We introduce a novel text-guided masking algorithm (TGM) that masks the video regions with highest correspondence to paired captions.
We show that across existing masking algorithms, unifying MAE and masked video-text contrastive learning improves downstream performance compared to pure MAE.
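A rough sketch of the text-guided masking idea, assuming patch embeddings and a caption embedding are already available; the function name and shapes are hypothetical:

```python
import torch

def text_guided_mask(patch_tokens, text_embed, mask_ratio=0.75):
    # patch_tokens: (N, D) video patch embeddings; text_embed: (D,) caption.
    sim = torch.nn.functional.cosine_similarity(
        patch_tokens, text_embed.unsqueeze(0), dim=-1)        # (N,)
    num_masked = int(mask_ratio * patch_tokens.shape[0])
    masked_idx = sim.topk(num_masked).indices  # highest text correspondence
    mask = torch.zeros(patch_tokens.shape[0], dtype=torch.bool)
    mask[masked_idx] = True
    return mask  # True = masked (to be reconstructed by the MAE decoder)
```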
arXiv Detail & Related papers (2024-08-01T17:58:19Z)
- Lester: rotoscope animation through video object segmentation and tracking [0.0]
Lester is a novel method to automatically synthesize retro-style 2D animations from videos.
Video frames are processed with the Segment Anything Model (SAM) and the resulting masks are tracked through subsequent frames with DeAOT.
Results show that the method exhibits excellent temporal consistency and can correctly process videos with different poses and appearances.
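A sketch of the segmentation-then-tracking stage is shown below. The SAM calls follow the public segment-anything API; the DeAOT tracker wrapper is hypothetical, since its interface is repo-specific:

```python
import cv2
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

# Load SAM and build the automatic mask generator (public API).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

# SAM expects RGB uint8 images; OpenCV loads BGR.
first_frame = cv2.cvtColor(cv2.imread("frame_0000.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(first_frame)  # list of {"segmentation": ...}

# Tracking the masks through subsequent frames with DeAOT is repo-specific;
# a hypothetical wrapper might look like this:
# tracker = DeAOTTracker(checkpoint="deaot.pth")        # hypothetical
# tracker.init(first_frame, masks[0]["segmentation"])   # hypothetical
# for frame in remaining_frames:
#     mask = tracker.track(frame)                       # hypothetical
```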
arXiv Detail & Related papers (2024-02-15T11:15:54Z)
- Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video semantic segmentation (VSS) involves assigning a semantic label to each pixel in a video sequence.
We propose an efficient mask propagation framework for VSS, called MPVSS.
Our framework reduces up to 4x FLOPs compared to the per-frame Mask2Former baseline, with only up to 2% mIoU degradation on the Cityscapes validation set.
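One way to picture mask propagation is flow-based warping of key-frame masks to the frames in between; this is illustrative only (the actual framework propagates mask queries rather than warping logits as done here):

```python
import torch
import torch.nn.functional as F

def warp_mask(mask, flow):
    """Warp key-frame mask logits to the current frame along optical flow.

    mask: (1, C, H, W) per-class logits from the key frame.
    flow: (1, 2, H, W) key->current displacements, channels (dx, dy).
    """
    _, _, h, w = mask.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).float()      # (H, W, 2), x first
    coords = grid + flow[0].permute(1, 2, 0)          # follow the flow
    # Normalize pixel coordinates to [-1, 1] as grid_sample expects.
    coords[..., 0] = coords[..., 0] / (w - 1) * 2 - 1
    coords[..., 1] = coords[..., 1] / (h - 1) * 2 - 1
    return F.grid_sample(mask, coords.unsqueeze(0), align_corners=True)
```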
arXiv Detail & Related papers (2023-10-29T09:55:28Z)
- Siamese Masked Autoencoders [76.35448665609998]
We present Siamese Masked Autoencoders (SiamMAE) for learning visual correspondence from videos.
SiamMAE operates on pairs of randomly sampled video frames and asymmetrically masks them.
It outperforms state-of-the-art self-supervised methods on video object segmentation, pose keypoint propagation, and semantic part propagation tasks.
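A minimal sketch of the asymmetric masking step, assuming per-frame patch embeddings; names and the exact ratio are illustrative:

```python
import torch

def asymmetric_mask(past_tokens, future_tokens, future_ratio=0.95):
    # past_tokens, future_tokens: (N, D) patch embeddings of the two frames.
    # The past frame stays fully visible; the future frame is masked at a
    # very high ratio, forcing information to propagate across time.
    n = future_tokens.shape[0]
    keep = n - int(future_ratio * n)
    visible_idx = torch.randperm(n)[:keep]     # random tokens kept visible
    return past_tokens, future_tokens[visible_idx], visible_idx
```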
arXiv Detail & Related papers (2023-05-23T17:59:46Z) - Mask-Free Video Instance Segmentation [102.50936366583106]
Video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets.
We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state.
Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection.
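A hedged sketch of the one-to-many matching idea behind such a temporal consistency loss (not the official TK-Loss implementation; patch features and soft masks are assumed given):

```python
import torch

def tk_consistency(patches_t, patches_t1, mask_t, mask_t1, k=5):
    # patches_*: (N, D) flattened patch features; mask_*: (N,) soft masks.
    dists = torch.cdist(patches_t, patches_t1)   # (N, N) patch distances
    knn = dists.topk(k, largest=False).indices   # (N, K) one-to-many matches
    matched = mask_t1[knn]                       # (N, K) matched mask values
    # Penalize disagreement between each patch's mask and its K matches.
    return ((mask_t.unsqueeze(1) - matched) ** 2).mean()
```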
arXiv Detail & Related papers (2023-03-28T11:48:07Z) - Semi-Supervised Video Inpainting with Cycle Consistency Constraints [13.414206652584236]
We propose an end-to-end trainable framework consisting of a completion network and a mask prediction network.
Using the known mask, the two networks respectively complete the corrupted contents of the current frame and decide the regions of the next frame to be filled.
Our model is trained in a semi-supervised manner, but it can achieve comparable performance as fully-supervised methods.
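An illustrative loop for this setting, with hypothetical module names and no claim to match the paper's architecture:

```python
def semi_supervised_inpaint(frames, first_mask, complete_net, mask_net):
    # frames: list of (3, H, W) tensors; first_mask: (1, H, W), 1 = erase.
    # complete_net / mask_net are hypothetical stand-ins for the paper's
    # completion and mask prediction networks.
    mask, outputs = first_mask, []
    for cur, nxt in zip(frames, frames[1:] + [frames[-1]]):
        corrupted = cur * (1 - mask)            # erase the masked region
        filled = complete_net(corrupted, mask)  # complete the current frame
        outputs.append(filled)
        mask = mask_net(filled, nxt)            # next frame's fill region
    return outputs
```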
arXiv Detail & Related papers (2022-08-14T08:46:37Z)
- Occlusion-Aware Video Object Inpainting [72.38919601150175]
This paper presents occlusion-aware video object inpainting, which recovers both the complete shape and appearance for occluded objects in videos.
Our technical contribution, VOIN, jointly performs video object shape completion and occluded texture generation.
For more realistic results, VOIN is optimized using both T-PatchGAN and a new spatio-temporal attention-based multi-class discriminator.
arXiv Detail & Related papers (2021-08-15T15:46:57Z)
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains.
We also add an augmentation-free contrastive learning method to further capture global content.
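A sketch of block-wise masking over a spatio-temporal token grid; the grid and block sizes are arbitrary choices for illustration:

```python
import torch

def blockwise_mask(t=8, h=16, w=16, block=(2, 4, 4), ratio=0.5):
    # Masking contiguous spatio-temporal blocks prevents the model from
    # trivially copying a masked token from its immediate spatial or
    # temporal neighbors.
    mask = torch.zeros(t, h, w, dtype=torch.bool)
    while mask.float().mean() < ratio:
        # Pick a random block origin and mask the whole block.
        t0 = torch.randint(0, t - block[0] + 1, (1,)).item()
        y0 = torch.randint(0, h - block[1] + 1, (1,)).item()
        x0 = torch.randint(0, w - block[2] + 1, (1,)).item()
        mask[t0:t0 + block[0], y0:y0 + block[1], x0:x0 + block[2]] = True
    return mask  # True = masked token
```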
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
- MSN: Efficient Online Mask Selection Network for Video Instance Segmentation [7.208483056781188]
We present a novel solution for Video Instance Segmentation (VIS) that automatically generates instance-level segmentation masks along with object classes and tracks them in a video.
Our method improves the masks from the segmentation and propagation branches in an online manner using the Mask Selection Network (MSN).
Our method achieved a score of 49.1 mAP in the 2021 YouTube-VIS Challenge and ranked third among more than 30 global teams.
arXiv Detail & Related papers (2021-06-19T08:33:29Z)
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization performance, leading to state-of-the-art results in both the VOS and the more challenging tracking domains.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)