Short-Term and Long-Term Context Aggregation Network for Video
Inpainting
- URL: http://arxiv.org/abs/2009.05721v1
- Date: Sat, 12 Sep 2020 03:50:56 GMT
- Title: Short-Term and Long-Term Context Aggregation Network for Video
Inpainting
- Authors: Ang Li, Shanshan Zhao, Xingjun Ma, Mingming Gong, Jianzhong Qi, Rui
Zhang, Dacheng Tao, Ramamohanarao Kotagiri
- Abstract summary: Video inpainting aims to restore missing regions of a video and has many applications such as video editing and object removal.
We present a novel context aggregation network to effectively exploit both short-term and long-term frame information for video inpainting.
Experiments show that it outperforms state-of-the-art methods with better inpainting results and fast inpainting speed.
- Score: 126.06302824297948
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video inpainting aims to restore missing regions of a video and has many
applications such as video editing and object removal. However, existing
methods either suffer from inaccurate short-term context aggregation or rarely
explore long-term frame information. In this work, we present a novel context
aggregation network to effectively exploit both short-term and long-term frame
information for video inpainting. In the encoding stage, we propose
boundary-aware short-term context aggregation, which aligns and aggregates,
from neighbor frames, local regions that are closely related to the boundary
context of missing regions into the target frame. Furthermore, we propose
dynamic long-term context aggregation to globally refine the feature map
generated in the encoding stage using long-term frame features, which are
dynamically updated throughout the inpainting process. Experiments show that the
proposed network outperforms state-of-the-art methods with better inpainting results
and fast inpainting speed.
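To make the two components of the abstract concrete, the following PyTorch sketch is a minimal illustration, not the authors' released code: the flow-based warping in `warp`, the dilated-mask boundary ring, the single attention layer standing in for the dynamic long-term memory, and the tensor shapes are all simplifying assumptions used only to show how short-term alignment and long-term refinement could be combined.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def warp(feat, flow):
    """Warp neighbor-frame features toward the target frame with a dense flow field.
    feat: (B, C, H, W); flow: (B, 2, H, W) in pixel offsets (x, y)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=feat.device),
                            torch.arange(W, device=feat.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W)
    coords = base + flow                                       # absolute sampling positions
    x = 2.0 * coords[:, 0] / (W - 1) - 1.0                     # normalize to [-1, 1]
    y = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((x, y), dim=-1)                         # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

class ContextAggregation(nn.Module):
    """Toy short-term + long-term aggregation: align a neighbor frame's features to the
    target frame, emphasize the region around the mask boundary, then refine the result
    with attention over a bank of long-term frame features."""
    def __init__(self, channels=64):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.attn = nn.MultiheadAttention(channels, num_heads=4, batch_first=True)

    def forward(self, target_feat, neighbor_feat, flow, mask, memory_feat):
        # short-term: align the neighbor frame and blend it in around the hole boundary
        aligned = warp(neighbor_feat, flow)
        boundary = F.max_pool2d(mask, 5, stride=1, padding=2) - mask   # ring just outside the hole
        short = self.fuse(torch.cat([target_feat, aligned * (mask + boundary)], dim=1))
        # long-term: globally refine with attention over long-term frame features
        B, C, H, W = short.shape
        q = short.flatten(2).transpose(1, 2)            # (B, H*W, C) queries
        kv = memory_feat.flatten(2).transpose(1, 2)     # (B, N, C) long-term memory tokens
        refined, _ = self.attn(q, kv, kv)
        return short + refined.transpose(1, 2).reshape(B, C, H, W)
```

In the paper the long-term features are themselves updated as frames get inpainted; here `memory_feat` is simply a fixed tensor supplied by the caller.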
Related papers
- Semantically Consistent Video Inpainting with Conditional Diffusion Models [16.42354856518832]
We present a framework for solving video inpainting problems with conditional video diffusion models.
We introduce inpainting-specific sampling schemes which capture crucial long-range dependencies in the context.
We devise a novel method for conditioning on the known pixels in incomplete frames.
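The abstract does not spell out the conditioning mechanism, so the sketch below shows only the standard "replacement" baseline for conditioning a diffusion sampler on known pixels (RePaint-style), not this paper's novel scheme; `denoiser`, its `(x, t)` interface, and `alphas_cumprod` are assumed placeholders.

```python
import torch

def inpaint_sample(denoiser, x_known, mask, alphas_cumprod, steps):
    """Replacement-style conditioning: at every reverse step the known pixels are
    overwritten with a correspondingly noised copy of the observed video, so only
    the masked region is synthesized.
    x_known: observed frames (B, C, T, H, W); mask: 1 where pixels are missing."""
    x = torch.randn_like(x_known)                        # start from pure noise
    for t in reversed(range(steps)):
        a_t = alphas_cumprod[t]
        # noise the observation to the current step's noise level
        known_t = a_t.sqrt() * x_known + (1 - a_t).sqrt() * torch.randn_like(x_known)
        x = mask * x + (1 - mask) * known_t              # keep known pixels consistent
        x = denoiser(x, t)                               # one reverse-diffusion step (assumed API)
    return (1 - mask) * x_known + mask * x               # paste back the exact observed pixels
```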
arXiv Detail & Related papers (2024-04-30T23:49:26Z)
- VidToMe: Video Token Merging for Zero-Shot Video Editing [100.79999871424931]
We propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
Our method improves temporal coherence and reduces memory consumption in self-attention computations.
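As a rough picture of what "merging self-attention tokens across frames" means, here is a simplified greedy variant, not VidToMe's actual bipartite-matching algorithm: current-frame tokens that are nearly identical to a previous-frame token are averaged into it instead of being processed again. The function name, `ratio` parameter, and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def merge_tokens_across_frames(prev_tokens, cur_tokens, ratio=0.5):
    """Greedy cross-frame token merging sketch.
    prev_tokens: (N, C) tokens kept from an earlier frame; cur_tokens: (M, C) tokens of
    the current frame. The most redundant current tokens are averaged into their nearest
    previous token, shrinking the sequence that self-attention has to process."""
    sim = F.normalize(cur_tokens, dim=-1) @ F.normalize(prev_tokens, dim=-1).T   # (M, N)
    best_sim, best_idx = sim.max(dim=-1)
    num_merge = int(ratio * cur_tokens.shape[0])
    merge_ids = best_sim.topk(num_merge).indices                 # current tokens to fold away
    keep_mask = torch.ones(cur_tokens.shape[0], dtype=torch.bool, device=cur_tokens.device)
    keep_mask[merge_ids] = False

    merged_prev = prev_tokens.clone()
    counts = torch.ones(prev_tokens.shape[0], 1, device=prev_tokens.device)
    merged_prev.index_add_(0, best_idx[merge_ids], cur_tokens[merge_ids])
    counts.index_add_(0, best_idx[merge_ids],
                      torch.ones(num_merge, 1, device=prev_tokens.device))
    merged_prev = merged_prev / counts                           # average merged tokens together
    return torch.cat([merged_prev, cur_tokens[keep_mask]], dim=0)
```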
arXiv Detail & Related papers (2023-12-17T09:05:56Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- Exemplar-based Video Colorization with Long-term Spatiotemporal Dependency [10.223719035434586]
Exemplar-based video colorization is an essential technique for applications like old movie restoration.
We propose an exemplar-based video colorization framework with long-term spatiotemporal dependency.
Our model can generate more colorful, realistic and stabilized results, especially for scenes where objects change greatly and irregularly.
arXiv Detail & Related papers (2023-03-27T10:45:00Z)
- Temporal Perceiving Video-Language Pre-training [112.1790287726804]
This work introduces a novel text-video localization pre-text task to enable fine-grained temporal and semantic alignment.
Specifically, text-video localization consists of moment retrieval, which predicts start and end boundaries in videos given the text description.
Our method connects the fine-grained frame representations with the word representations and implicitly distinguishes representations of different instances in the single modality.
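The pre-text task as summarized here, predicting start and end boundaries from fused frame and word representations, can be sketched with a generic moment-retrieval head; the cross-attention fusion and the two linear boundary heads below are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MomentRetrievalHead(nn.Module):
    """Toy moment-retrieval head: fuse frame features with word features through
    cross-attention, then predict start/end boundary distributions over frames."""
    def __init__(self, dim=256):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.start_head = nn.Linear(dim, 1)
        self.end_head = nn.Linear(dim, 1)

    def forward(self, frame_feats, word_feats):
        # frame_feats: (B, T, C) per-frame features; word_feats: (B, L, C) query tokens
        fused, _ = self.cross_attn(frame_feats, word_feats, word_feats)
        start_logits = self.start_head(fused).squeeze(-1)    # (B, T)
        end_logits = self.end_head(fused).squeeze(-1)        # (B, T)
        return start_logits, end_logits
```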
arXiv Detail & Related papers (2023-01-18T12:15:47Z)
- Generating Long Videos of Dynamic Scenes [66.56925105992472]
We present a video generation model that reproduces object motion, changes in camera viewpoint, and new content that arises over time.
A common failure case is for content to never change due to over-reliance on inductive biases to provide temporal consistency.
arXiv Detail & Related papers (2022-06-07T16:29:51Z)
- Spatial-Temporal Residual Aggregation for High Resolution Video Inpainting [14.035620730770528]
Recent learning-based inpainting algorithms have achieved compelling results for completing missing regions after removing undesired objects in videos.
We propose STRA-Net, a novel spatial-temporal residual aggregation framework for high resolution video inpainting.
Both the quantitative and qualitative evaluations show that we can produce more temporally coherent and visually appealing results than the state-of-the-art methods on inpainting high resolution videos.
arXiv Detail & Related papers (2021-11-05T15:50:31Z)
- Internal Video Inpainting by Implicit Long-range Propagation [39.89676105875726]
We propose a novel framework for video inpainting by adopting an internal learning strategy.
We show that this can be achieved implicitly by fitting a convolutional neural network to the known region.
We extend the proposed method to another challenging task: learning to remove an object from a video given a single object mask in only one frame of a 4K video.
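The summary describes internal learning: a network fitted only to the known region of the video being inpainted. Below is a deep-prior-style toy version of that idea; the actual paper additionally obtains long-range propagation and cross-frame consistency, which this sketch omits, and `internal_inpaint`, the tiny CNN, and the hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

def internal_inpaint(frames, masks, steps=2000, lr=1e-3):
    """Fit a small CNN to reproduce the known pixels of this one video; because the
    network is shared across frames and positions, its prediction inside the masked
    region serves as the inpainting.
    frames: (T, 3, H, W); masks: (T, 1, H, W) with 1 marking missing pixels."""
    net = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 3, 3, padding=1),
    )
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    inputs = frames * (1 - masks)                            # network only ever sees known content
    for _ in range(steps):
        out = net(inputs)
        loss = ((out - frames) * (1 - masks)).abs().mean()   # supervise on known pixels only
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        out = net(inputs)
    return frames * (1 - masks) + out * masks                # composite prediction into the holes
```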
arXiv Detail & Related papers (2021-08-04T08:56:28Z)
- Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG), which aims to identify the temporal boundary of a specific segment in an untrimmed video given a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism.
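A biaffine mechanism that scores every (start, end) pair at once can be sketched as a bilinear product between per-frame "start" and "end" projections; the exact parameterization below (hidden size, ReLU projections, masking of invalid pairs) is an assumption for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class BiaffinePairScorer(nn.Module):
    """Project frame features into 'start' and 'end' spaces and score every
    (start, end) pair in one bilinear product, yielding a (T, T) score map."""
    def __init__(self, dim=256, hidden=128):
        super().__init__()
        self.start_proj = nn.Linear(dim, hidden)
        self.end_proj = nn.Linear(dim, hidden)
        self.bilinear = nn.Parameter(torch.randn(hidden, hidden) * 0.01)

    def forward(self, frame_feats):
        # frame_feats: (B, T, C) query-conditioned frame features
        s = torch.relu(self.start_proj(frame_feats))          # (B, T, H) start representations
        e = torch.relu(self.end_proj(frame_feats))            # (B, T, H) end representations
        scores = torch.einsum("bih,hk,bjk->bij", s, self.bilinear, e)   # (B, T, T)
        T = frame_feats.shape[1]
        valid = torch.triu(torch.ones(T, T, dtype=torch.bool, device=scores.device))
        return scores.masked_fill(~valid, float("-inf"))      # only end >= start is a valid segment
```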
arXiv Detail & Related papers (2021-03-22T03:13:05Z)
- Coarse-Fine Networks for Temporal Activity Detection in Videos [45.03545172714305]
We introduce 'Coarse-Fine Networks', a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion.
We show that our method can outperform the state-of-the-art methods for action detection on public datasets with a significantly reduced compute and memory footprint.
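The "different abstractions of temporal resolution" can be pictured with a minimal two-stream module: a coarse stream that subsamples frames for long-range context and a fine stream at full frame rate, fused per frame. This is only a caricature of the idea, not the authors' actual Coarse-Fine Networks; the class name, strides, and 3D-convolution stems are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamTemporal(nn.Module):
    """Two temporal resolutions: a 'coarse' stream sees a temporally downsampled clip
    for long-range context, a 'fine' stream keeps every frame for precise boundaries,
    and the coarse features are upsampled and fused back in."""
    def __init__(self, channels=3, dim=64, coarse_stride=4):
        super().__init__()
        self.coarse_stride = coarse_stride
        self.coarse = nn.Conv3d(channels, dim, kernel_size=3, padding=1)
        self.fine = nn.Conv3d(channels, dim, kernel_size=3, padding=1)
        self.head = nn.Conv3d(2 * dim, dim, kernel_size=1)

    def forward(self, clip):
        # clip: (B, C, T, H, W)
        coarse_in = clip[:, :, :: self.coarse_stride]             # long span, few frames
        coarse = F.relu(self.coarse(coarse_in))
        coarse = F.interpolate(coarse, size=clip.shape[2:], mode="trilinear",
                               align_corners=False)               # back to full temporal length
        fine = F.relu(self.fine(clip))                            # full frame rate
        return self.head(torch.cat([coarse, fine], dim=1))        # fused per-frame features
```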
arXiv Detail & Related papers (2021-03-01T20:48:01Z)