Elevating Flow-Guided Video Inpainting with Reference Generation
- URL: http://arxiv.org/abs/2412.08975v1
- Date: Thu, 12 Dec 2024 06:13:00 GMT
- Title: Elevating Flow-Guided Video Inpainting with Reference Generation
- Authors: Suhwan Cho, Seoung Wug Oh, Sangyoun Lee, Joon-Young Lee
- Abstract summary: Video inpainting (VI) is a challenging task that requires effective propagation of observable content across frames while simultaneously generating new content not present in the original video.
We propose a robust and practical VI framework that leverages a large generative model for reference generation in combination with an advanced pixel propagation algorithm.
Our method not only significantly enhances frame-level quality for object removal but also synthesizes new content in the missing areas based on user-provided text prompts.
- Score: 50.03502211226332
- Abstract: Video inpainting (VI) is a challenging task that requires effective propagation of observable content across frames while simultaneously generating new content not present in the original video. In this study, we propose a robust and practical VI framework that leverages a large generative model for reference generation in combination with an advanced pixel propagation algorithm. Powered by a strong generative model, our method not only significantly enhances frame-level quality for object removal but also synthesizes new content in the missing areas based on user-provided text prompts. For pixel propagation, we introduce a one-shot pixel pulling method that effectively avoids error accumulation from repeated sampling while maintaining sub-pixel precision. To evaluate various VI methods in realistic scenarios, we also propose a high-quality VI benchmark, HQVI, comprising carefully generated videos using alpha matte composition. On public benchmarks and the HQVI dataset, our method demonstrates significantly higher visual quality and metric scores compared to existing solutions. Furthermore, it can process high-resolution videos exceeding 2K resolution with ease, underscoring its superiority for real-world applications.
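The abstract names one-shot pixel pulling but not its mechanics. As a rough illustration, here is a minimal PyTorch sketch of the general idea, chaining per-frame optical flows into one long-range flow and then sampling the reference frame exactly once at sub-pixel positions; the function names, tensor layout, and flow conventions are assumptions for illustration, not the authors' implementation.

```python
# Sketch of one-shot pixel pulling (assumed formulation, not the paper's code).
# flows[i] is the (2, H, W) flow field, in pixels, from frame t+i to t+i+1;
# chaining them gives the long-range flow from target frame t to reference
# frame r, so pixel values are resampled only ONCE at the end.
import torch
import torch.nn.functional as F

def _base_grid(H, W):
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32),
                            indexing="ij")
    return torch.stack([xs, ys])                      # (2, H, W), channel 0 = x

def _sample(field, points):
    # Bilinear lookup of `field` (C, H, W) at sub-pixel `points` (2, H, W).
    _, H, W = field.shape
    norm = torch.stack([points[0] / (W - 1) * 2 - 1,  # normalize to [-1, 1]
                        points[1] / (H - 1) * 2 - 1], dim=-1)
    return F.grid_sample(field[None], norm[None], mode="bilinear",
                         padding_mode="border", align_corners=True)[0]

def compose_flows(flows):
    # Chain flows: f_{t->k}(p) = f_{t->k-1}(p) + f_{k-1->k}(p + f_{t->k-1}(p)).
    # Only the smooth flow fields are interpolated here, never pixel values.
    _, H, W = flows[0].shape
    grid = _base_grid(H, W)
    total = flows[0].clone()
    for flow in flows[1:]:
        total = total + _sample(flow, grid + total)
    return total

def pull_pixels(reference, total_flow):
    # The single sub-pixel sampling step: pull colors from the reference frame.
    _, H, W = reference.shape
    return _sample(reference, _base_grid(H, W) + total_flow)
```

Because the image itself is warped only once, the interpolation blur that accumulates when frames are warped step by step never compounds, which is the error-accumulation problem the abstract refers to.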
Related papers
- CoordFlow: Coordinate Flow for Pixel-wise Neural Video Representation [11.364753833652182]
Implicit Neural Representation (INR) is a promising alternative to traditional transform-based methodologies.
We introduce CoordFlow, a novel pixel-wise INR for video compression.
It yields state-of-the-art results among pixel-wise INRs and performance on par with leading frame-wise techniques.
arXiv Detail & Related papers (2025-01-01T22:58:06Z)
- HyViLM: Enhancing Fine-Grained Recognition with a Hybrid Encoder for Vision-Language Models [15.128058747088222]
HyViLM is designed to process images of any resolution while retaining the overall context during encoding.
Under the same setting, our HyViLM outperforms state-of-the-art MLLMs in nine out of ten tasks.
arXiv Detail & Related papers (2024-12-11T13:41:21Z)
- VISION-XL: High Definition Video Inverse Problem Solver using Latent Image Diffusion Models [58.464465016269614]
In this paper, we propose a framework for solving high-definition video inverse problems using latent image diffusion models.
Our approach leverages latent-space diffusion models to achieve enhanced video quality and resolution.
Unlike previous methods, our approach supports multiple aspect ratios and delivers HD-resolution reconstructions in under 2.5 minutes on a single GPU.
arXiv Detail & Related papers (2024-11-29T08:10:49Z)
- Video Interpolation with Diffusion Models [54.06746595879689]
We present VIDIM, a generative model for video interpolation, which creates short videos given a start and end frame.
VIDIM uses cascaded diffusion models: it first generates the target video at low resolution, then generates the high-resolution video conditioned on the low-resolution result (data flow sketched below).
arXiv Detail & Related papers (2024-04-01T15:59:32Z)
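As an aside on the cascade pattern described in the VIDIM summary above, here is a toy sketch of the data flow only: a base model produces a low-resolution video from the endpoint frames, and a second stage upsamples it conditioned on that low-resolution result. The single 3D convolutions are stand-ins of my own; the real VIDIM stages are cascaded diffusion samplers.

```python
# Toy data-flow sketch of a two-stage cascade (stand-ins, not VIDIM's models).
import torch
import torch.nn as nn
import torch.nn.functional as F

base = nn.Conv3d(6, 3, kernel_size=3, padding=1)  # stand-in: low-res generator
sr   = nn.Conv3d(3, 3, kernel_size=3, padding=1)  # stand-in: super-res stage

start, end = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)

# Stage 1: condition on the start/end frames, broadcast over 9 time steps.
cond = torch.cat([start, end], dim=1)[:, :, None].expand(-1, -1, 9, -1, -1)
low_res = base(cond)                                    # (1, 3, 9, 64, 64)

# Stage 2: upsample spatially, then refine conditioned on the low-res video.
up = F.interpolate(low_res, scale_factor=(1, 4, 4), mode="trilinear",
                   align_corners=False)
video = sr(up)                                          # (1, 3, 9, 256, 256)
```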
- Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution [19.748048455806305]
We propose an efficient diffusion-based text-to-video super-resolution (SR) tuning approach.
We investigate different tuning approaches based on our inflated architecture and report trade-offs between computational costs and super-resolution quality.
arXiv Detail & Related papers (2024-01-18T22:25:16Z)
- VideoGen: A Reference-Guided Latent Diffusion Approach for High Definition Text-to-Video Generation [73.54366331493007]
VideoGen is a text-to-video generation approach that can generate a high-definition video with high frame fidelity and strong temporal consistency.
We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image with high content quality from the text prompt.
arXiv Detail & Related papers (2023-09-01T11:14:43Z)
- In-N-Out Generative Learning for Dense Unsupervised Video Segmentation [89.21483504654282]
In this paper, we focus on the unsupervised Video Object Segmentation (VOS) task, which learns visual correspondence from unlabeled videos.
We propose the In-aNd-Out (INO) generative learning from a purely generative perspective, which captures both high-level and fine-grained semantics.
Our INO outperforms previous state-of-the-art methods by significant margins.
arXiv Detail & Related papers (2022-03-29T07:56:21Z)
- High Fidelity Interactive Video Segmentation Using Tensor Decomposition, Boundary Loss, Convolutional Tessellations and Context Aware Skip Connections [0.0]
We provide a high fidelity deep learning algorithm (HyperSeg) for interactive video segmentation tasks.
Our model crucially processes and renders all image features in high resolution, without utilizing downsampling or pooling procedures.
Our work can be used across a broad range of application domains, including VFX pipelines and medical imaging disciplines.
arXiv Detail & Related papers (2020-11-23T18:21:42Z)
- Enhanced Quadratic Video Interpolation [56.54662568085176]
We propose an enhanced quadratic video interpolation (EQVI) model to handle more complicated scenes and motion patterns (the underlying quadratic motion model is sketched below).
To further boost the performance, we devise a novel multi-scale fusion network (MS-Fusion) which can be regarded as a learnable augmentation process.
The proposed EQVI model won first place in the AIM 2020 Video Temporal Super-Resolution Challenge.
arXiv Detail & Related papers (2020-09-10T02:31:50Z)
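Since the "quadratic" in EQVI refers to a concrete motion model, a short worked version may help. The snippet below reconstructs the constant-acceleration formula of the original QVI paper, which EQVI extends; it is a reference sketch, not EQVI's actual code.

```python
# Quadratic motion model from QVI (which EQVI builds on): each pixel follows
# x(t) = x0 + v*t + 0.5*a*t^2. Evaluating at t = -1 and t = +1:
#   f_{0->1}  =  v + 0.5*a      f_{0->-1} = -v + 0.5*a
# so  v = (f_{0->1} - f_{0->-1}) / 2   and   a = f_{0->1} + f_{0->-1}.
import torch

def quadratic_flow(flow_0_to_prev, flow_0_to_next, t):
    """Predict the flow from frame 0 to time t in (0, 1).

    Both inputs are (2, H, W) pixel-unit flow fields from the center frame
    to its previous/next neighbors. Setting flow_0_to_prev = -flow_0_to_next
    (zero acceleration) recovers plain linear interpolation, t * f_{0->1}.
    """
    v = (flow_0_to_next - flow_0_to_prev) / 2   # per-pixel velocity
    a = flow_0_to_next + flow_0_to_prev         # per-pixel acceleration
    return v * t + 0.5 * a * t ** 2
```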
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.