Semi-Supervised Video Inpainting with Cycle Consistency Constraints
- URL: http://arxiv.org/abs/2208.06807v1
- Date: Sun, 14 Aug 2022 08:46:37 GMT
- Title: Semi-Supervised Video Inpainting with Cycle Consistency Constraints
- Authors: Zhiliang Wu, Hanyu Xuan, Changchang Sun, Kang Zhang, Yan Yan
- Abstract summary: We propose an end-to-end trainable framework consisting of a completion network and a mask prediction network.
These networks complete the corrupted regions of the current frame using the known mask and predict the regions to be filled in the next frame, respectively.
Our model is trained in a semi-supervised manner, yet it achieves performance comparable to fully-supervised methods.
- Score: 13.414206652584236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning-based video inpainting has yielded promising results and gained
increasing attention from researchers. These methods generally assume that the
corrupted region masks of each frame are known and easily obtained. However,
annotating these masks is labor-intensive and expensive, which limits the
practical application of current methods. We therefore relax this assumption by
defining a new semi-supervised inpainting setting, in which the network must
complete the corrupted regions of the whole video given the annotated mask of
only one frame. Specifically, in this work, we propose an end-to-end trainable
framework consisting of a completion network and a mask prediction network,
which are designed to complete the corrupted regions of the current frame using
the known mask and to predict the regions to be filled in the next frame,
respectively. Besides, we introduce a cycle consistency loss to regularize the
training of these two networks. In this way, the completion network and the
mask prediction network constrain each other, so that the overall performance
of the trained model can be maximized. Furthermore, because of the prior
knowledge they naturally contain (e.g., corrupted contents and clear borders),
current video inpainting datasets are not suitable for semi-supervised video
inpainting. Thus, we create a new dataset by simulating the corrupted videos of
real-world scenarios. Extensive experimental results demonstrate the
superiority of our model in the video inpainting task. Remarkably, although our
model is trained in a semi-supervised manner, it achieves performance
comparable to fully-supervised methods.
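To make the two-network setup concrete, below is a minimal PyTorch-style sketch of one training step under our reading of the abstract. All names (CompletionNet, MaskPredNet, train_step), the tiny network bodies, the loss weight, and the specific form of the cycle term (re-corrupting the completed frame with the predicted mask and requiring the second completion to agree) are hypothetical illustrations, not the authors' implementation.

```python
# Illustrative sketch only; not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompletionNet(nn.Module):
    """Fills the masked region of a frame; a real model would be far deeper."""
    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 3, 3, padding=1),
        )

    def forward(self, frame, mask):
        # Zero out the corrupted pixels and append the mask as a fourth channel.
        x = torch.cat([frame * (1 - mask), mask], dim=1)
        return self.body(x)

class MaskPredNet(nn.Module):
    """Predicts the corrupted region of the next frame from the completed
    current frame and the raw next frame."""
    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(6, ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(ch, 1, 3, padding=1),
        )

    def forward(self, completed_cur, next_frame):
        return torch.sigmoid(self.body(torch.cat([completed_cur, next_frame], dim=1)))

comp_net, mask_net = CompletionNet(), MaskPredNet()
opt = torch.optim.Adam(list(comp_net.parameters()) + list(mask_net.parameters()), lr=1e-4)

def train_step(cur_frame, next_frame, known_mask, gt_cur):
    completed = comp_net(cur_frame, known_mask)
    rec_loss = F.l1_loss(completed, gt_cur)        # supervised only on the annotated frame

    pred_mask = mask_net(completed, next_frame)    # where to fill in frame t+1
    recompleted = comp_net(completed, pred_mask)   # close the cycle on frame t
    cycle_loss = F.l1_loss(recompleted, completed.detach())

    loss = rec_loss + 0.1 * cycle_loss             # 0.1 is an arbitrary illustrative weight
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# Smoke test with dummy tensors.
frame = torch.rand(1, 3, 64, 64)
mask = (torch.rand(1, 1, 64, 64) > 0.9).float()
train_step(frame, torch.rand(1, 3, 64, 64), mask, frame.clone())
```

The point of the cycle term in this sketch is that both networks appear in one differentiable loop, so gradients from the consistency loss flow through the predicted mask and couple the two networks, matching the abstract's claim that they constrain each other.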
Related papers
- Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval [19.61947785487129]
We propose Mask for Semantics Completion (MASCOT), which builds on semantic-based masked modeling.
MASCOT achieves state-of-the-art performance on four major text-video retrieval benchmarks.
arXiv Detail & Related papers (2023-05-13T12:31:37Z)
- One-Shot Video Inpainting [5.7120338754738835]
We propose a unified pipeline for one-shot video inpainting (OSVI).
By jointly learning mask prediction and video completion in an end-to-end manner, our method can produce results that are optimal for the entire task.
Our method is more reliable because the predicted masks can be used as the network's internal guidance.
arXiv Detail & Related papers (2023-02-28T07:30:36Z)
- Exploiting Shape Cues for Weakly Supervised Semantic Segmentation [15.791415215216029]
Weakly supervised semantic segmentation (WSSS) aims to produce pixel-wise class predictions with only image-level labels for training.
We propose to exploit shape information to supplement the texture-biased property of convolutional neural networks (CNNs).
We further refine the predictions in an online fashion with a novel refinement method that takes into account both the class and the color affinities.
arXiv Detail & Related papers (2022-08-08T17:25:31Z)
- Learning Prior Feature and Attention Enhanced Image Inpainting [63.21231753407192]
This paper incorporates the pre-training based Masked AutoEncoder (MAE) into the inpainting model.
We propose to use attention priors from MAE to make the inpainting model learn more long-distance dependencies between masked and unmasked regions.
arXiv Detail & Related papers (2022-08-03T04:32:53Z)
- What You See is What You Classify: Black Box Attributions [61.998683569022006]
We train a deep network, the Explainer, to predict attributions for a pre-trained black-box classifier, the Explanandum.
Unlike most existing approaches, ours is capable of directly generating very distinct class-specific masks.
We show that our attributions are superior to established methods both visually and quantitatively.
arXiv Detail & Related papers (2022-05-23T12:30:04Z)
- Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos.
We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks.
The additional data provides substantially better generalization, leading to state-of-the-art results in both VOS and the more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)
- Spatiotemporal Graph Neural Network based Mask Reconstruction for Video Object Segmentation [70.97625552643493]
This paper addresses the task of segmenting class-agnostic objects in a semi-supervised setting.
We propose a novel spatiotemporal graph neural network (TG-Net) which captures local contexts by utilizing all proposals.
arXiv Detail & Related papers (2020-12-10T07:57:44Z)
- Learning Joint Spatial-Temporal Transformations for Video Inpainting [58.939131620135235]
We propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting.
We simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss.
arXiv Detail & Related papers (2020-07-20T16:35:48Z)
- Self-Supervised Scene De-occlusion [186.89979151728636]
This paper investigates the problem of scene de-occlusion, which aims to recover the underlying occlusion ordering and complete the invisible parts of occluded objects.
We make the first attempt to address the problem through a novel and unified framework that recovers hidden scene structures without ordering or amodal annotations as supervision.
Based on PCNet-M and PCNet-C, we devise a novel inference scheme to accomplish scene de-occlusion, via progressive ordering recovery, amodal completion and content completion.
arXiv Detail & Related papers (2020-04-06T16:31:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.