Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation
- URL: http://arxiv.org/abs/2409.01055v1
- Date: Mon, 2 Sep 2024 08:28:57 GMT
- Title: Follow-Your-Canvas: Higher-Resolution Video Outpainting with Extensive Content Generation
- Authors: Qihua Chen, Yue Ma, Hongfa Wang, Junkun Yuan, Wenzhe Zhao, Qi Tian, Hongmei Wang, Shaobo Min, Qifeng Chen, Wei Liu
- Abstract summary: This paper explores higher-resolution video outpainting with extensive content generation.
It builds upon two core designs: first, instead of employing the common practice of "single-shot" outpainting, we distribute the task across spatial windows and seamlessly merge them; second, the source video and its relative positional relation are injected into the generation process of each window.
It excels in large-scale video outpainting, e.g., from 512×512 to 1152×2048 (9×), while producing high-quality and aesthetically pleasing results.
- Score: 85.0621793883408
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: This paper explores higher-resolution video outpainting with extensive content generation. We point out common issues faced by existing methods when attempting to largely outpaint videos: the generation of low-quality content and limitations imposed by GPU memory. To address these challenges, we propose a diffusion-based method called Follow-Your-Canvas. It builds upon two core designs. First, instead of employing the common practice of "single-shot" outpainting, we distribute the task across spatial windows and seamlessly merge them. This allows us to outpaint videos of any size and resolution without being constrained by GPU memory. Second, the source video and its relative positional relation are injected into the generation process of each window, so that the generated spatial layout within each window harmonizes with the source video. Together, these two designs enable us to generate higher-resolution outpainting videos with rich content while keeping spatial and temporal consistency. Follow-Your-Canvas excels in large-scale video outpainting, e.g., from 512×512 to 1152×2048 (9×), while producing high-quality and aesthetically pleasing results. It achieves the best quantitative results across various resolution and scale setups. The code is released at https://github.com/mayuelala/FollowYourCanvas
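Because the abstract describes the method operationally, a minimal sketch of the window-and-merge strategy may help make the two designs concrete. This is an illustration under stated assumptions, not the paper's implementation: `outpaint_window` is a hypothetical stand-in for one diffusion pass, and the window size, stride, and Gaussian blending are placeholder choices.

```python
import numpy as np

def window_positions(total, win, stride):
    """Start offsets that tile `total` pixels with `win`-sized windows."""
    last = max(total - win, 0)
    positions = list(range(0, last, stride))
    positions.append(last)  # snap the final window to the canvas edge
    return positions

def gaussian_weight(win):
    """Center-peaked 2D map used to blend overlapping windows seamlessly."""
    g = np.exp(-np.linspace(-1.0, 1.0, win) ** 2 / 0.5)
    return np.outer(g, g)

def outpaint_window(y, x, win, source, rel_pos):
    """Hypothetical single diffusion pass over one window, conditioned on the
    source video and the window's position relative to it (design #2)."""
    T, _, _, C = source.shape
    return np.zeros((T, win, win, C), dtype=np.float32)  # placeholder output

def windowed_outpaint(source, canvas_hw, win=512, stride=384):
    """Design #1: cover the target canvas with overlapping windows, outpaint
    each independently, and merge with normalized Gaussian weights."""
    T, sh, sw, C = source.shape
    H, W = canvas_hw
    out = np.zeros((T, H, W, C), dtype=np.float32)
    acc = np.zeros((1, H, W, 1), dtype=np.float32)
    oy, ox = (H - sh) // 2, (W - sw) // 2   # source sits at the canvas center
    wmap = gaussian_weight(win)[None, :, :, None]
    for y in window_positions(H, win, stride):
        for x in window_positions(W, win, stride):
            rel_pos = (y - oy, x - ox)      # window offset w.r.t. the source
            patch = outpaint_window(y, x, win, source, rel_pos)
            out[:, y:y + win, x:x + win] += patch * wmap
            acc[:, y:y + win, x:x + win] += wmap
    return out / np.clip(acc, 1e-8, None)   # normalize the overlaps
```

With these placeholder settings, a 1152×2048 canvas is covered by a small fixed grid of 512-pixel windows, which is why peak GPU memory depends on the window size rather than the target resolution.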
Related papers
- OutDreamer: Video Outpainting with a Diffusion Transformer [37.512451098188635]
We introduce OutDreamer, a DiT-based video outpainting framework. We propose a mask-driven self-attention layer that dynamically integrates the given mask information. For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content.
arXiv Detail & Related papers (2025-06-27T15:08:54Z)
- Video Virtual Try-on with Conditional Diffusion Transformer Inpainter [27.150975905047968]
Video virtual try-on aims to fit a garment to a target person in consecutive video frames. The few recent diffusion-based video try-on methods converge on a similar solution. We propose ViTI (Video Try-on Inpainter), which formulates and implements video virtual try-on as a conditional video inpainting task.
arXiv Detail & Related papers (2025-06-26T13:56:27Z)
- MTV-Inpaint: Multi-Task Long Video Inpainting [30.963300199975656]
Video inpainting involves modifying local regions within a video, ensuring spatial and temporal consistency.
Recent advancements in text-to-video (T2V) diffusion models pave the way for text-guided video inpainting.
We propose MTV-Inpaint, a unified multi-task video inpainting framework capable of handling both traditional scene completion and novel object insertion tasks.
arXiv Detail & Related papers (2025-03-14T13:54:10Z)
- Representing Long Volumetric Video with Temporal Gaussian Hierarchy [80.51373034419379]
This paper aims to address the challenge of reconstructing long volumetric videos from multi-view RGB videos.
We propose a novel 4D representation, named Temporal Gaussian Hierarchy, to compactly model long volumetric videos.
This work is the first approach capable of efficiently handling minutes of volumetric video data while maintaining state-of-the-art rendering quality.
arXiv Detail & Related papers (2024-12-12T18:59:34Z)
- UniPaint: Unified Space-time Video Inpainting via Mixture-of-Experts [20.955898491009656]
UniPaint is a generative space-time framework that unifies spatial and temporal video inpainting.
It produces high-quality and aesthetically pleasing results, achieving the best performance across various tasks and scale setups.
arXiv Detail & Related papers (2024-12-09T09:45:14Z)
- CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer [55.515836117658985]
We present CogVideoX, a large-scale text-to-video generation model based on diffusion transformer.
It can generate 10-second continuous videos aligned with the text prompt, at a frame rate of 16 fps and a resolution of 768×1360 pixels.
arXiv Detail & Related papers (2024-08-12T11:47:11Z)
- Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis [69.83405335645305]
We argue that naively bringing advances of image models to the video generation domain reduces motion fidelity and visual quality, and impairs scalability.
In this work, we build Snap Video, a video-first model that systematically addresses these challenges.
We show that a U-Net, a workhorse behind image generation, scales poorly when generating videos, requiring significant computational overhead. Snap Video instead adopts a transformer-based architecture that scales efficiently.
This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity.
arXiv Detail & Related papers (2024-02-22T18:55:08Z)
- AVID: Any-Length Video Inpainting with Diffusion Model [30.860927136236374]
We introduce Any-Length Video Inpainting with Diffusion Model, dubbed AVID.
Our model is equipped with effective motion modules and adjustable structure guidance for fixed-length video inpainting.
Our experiments show that the model robustly handles various inpainting types across a range of video durations, with high quality.
arXiv Detail & Related papers (2023-12-06T18:56:14Z)
- Hierarchical Masked 3D Diffusion Model for Video Outpainting [20.738731220322176]
We introduce a masked 3D diffusion model for video outpainting.
The mask modeling allows us to use multiple guide frames to connect the results of multiple video clip inferences.
We also introduce a hybrid coarse-to-fine inference pipeline to alleviate the artifact accumulation problem.
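The guide-frame idea can be pictured with a simple sequential sketch, assuming a hypothetical `outpaint_clip` diffusion pass and illustrative clip/overlap sizes; the paper's actual pipeline is hierarchical and coarse-to-fine rather than strictly sequential.

```python
def outpaint_long_video(frames, outpaint_clip, clip_len=16, n_guides=2):
    """Outpaint a long video clip by clip; the last few outpainted frames of
    each clip are passed as guide frames to the next, so adjacent clip
    inferences stay connected."""
    out, guides = [], None
    step = clip_len - n_guides
    for start in range(0, len(frames), step):
        clip = frames[start:start + clip_len]
        if not clip:
            break
        result = outpaint_clip(clip, guide_frames=guides)  # one 3D diffusion pass
        out.extend(result if guides is None else result[n_guides:])
        guides = result[-n_guides:]  # carry guides into the next clip
    return out
```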
arXiv Detail & Related papers (2023-09-05T10:52:21Z)
- CoordFill: Efficient High-Resolution Image Inpainting via Parameterized Coordinate Querying [52.91778151771145]
In this paper, we try to break the resolution limitations of image inpainting for the first time, thanks to the recent development of continuous implicit representations.
Experiments show that the proposed method achieves real-time performance on 2048×2048 images using a single GTX 2080 Ti GPU.
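As a rough picture of the coordinate-querying idea, the sketch below queries a tiny coordinate MLP only at hole pixels, so the cost scales with the hole area rather than the full canvas. Note that CoordFill additionally predicts the MLP's parameters from a downsampled view of the image, which this assumed, simplified sketch omits.

```python
import torch
import torch.nn as nn

class CoordMLP(nn.Module):
    """Tiny continuous implicit representation: (x, y) in [-1, 1]^2 -> RGB."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, coords):  # coords: (N, 2)
        return self.net(coords)

def fill_holes(image, mask, mlp):
    """Query the implicit function only at masked pixels of an (H, W, 3)
    float image; `mask` is a boolean (H, W) tensor marking the hole."""
    H, W, _ = image.shape
    ys, xs = torch.nonzero(mask, as_tuple=True)
    coords = torch.stack([ys / (H - 1), xs / (W - 1)], dim=1) * 2 - 1
    out = image.clone()
    out[ys, xs] = mlp(coords.float())
    return out
```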
arXiv Detail & Related papers (2023-03-15T11:13:51Z)
- A Good Image Generator Is What You Need for High-Resolution Video Synthesis [73.82857768949651]
We present a framework that leverages contemporary image generators to render high-resolution videos.
We frame the video synthesis problem as discovering a trajectory in the latent space of a pre-trained and fixed image generator.
We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled.
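The trajectory formulation is easy to picture: a frozen image generator renders frames while a learned motion module proposes latent steps. In the sketch below, `image_G` and `motion_rnn` are hypothetical stand-ins for a pre-trained generator and the motion generator, and the residual-update rule is an illustrative assumption.

```python
import torch

def synthesize_video(image_G, motion_rnn, z0, num_frames=16):
    """Render a video as a trajectory z_1..z_T in the latent space of a
    pre-trained, fixed image generator; only the motion module is learned."""
    frames, z, h = [], z0, None
    for _ in range(num_frames):
        frames.append(image_G(z))      # frozen generator: latent -> frame
        dz, h = motion_rnn(z, h)       # learned motion proposes the next step
        z = z + dz                     # content stays tied to z0; motion in dz
    return torch.stack(frames, dim=1)  # (B, T, C, H, W)
```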
arXiv Detail & Related papers (2021-04-30T15:38:41Z)
- Deep Two-Stage High-Resolution Image Inpainting [0.0]
In this article, we propose a method that solves the problem of inpainting arbitrary-size images.
For this, we propose to use information from neighboring pixels by shifting the original image in four directions.
This approach can work with existing inpainting models, making them almost resolution independent without the need for retraining.
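One plausible reading of the four-direction shift, sketched below under stated assumptions (the one-pixel shift and channel stacking are illustrative): shifted copies hand each pixel its neighbors as extra channels, which an off-the-shelf inpainting model can consume at any resolution.

```python
import numpy as np

def shifted_stack(image, shift=1):
    """(H, W, C) -> (H, W, 5C): the image plus four shifted copies, so every
    pixel also carries its up/down/left/right neighbors as input channels."""
    up    = np.roll(image, -shift, axis=0)
    down  = np.roll(image,  shift, axis=0)
    left  = np.roll(image, -shift, axis=1)
    right = np.roll(image,  shift, axis=1)
    return np.concatenate([image, up, down, left, right], axis=2)
```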
arXiv Detail & Related papers (2021-04-27T20:32:21Z)
- DVI: Depth Guided Video Inpainting for Autonomous Driving [35.94330601020169]
We present an automatic video inpainting algorithm that can remove traffic agents from videos.
By building a dense 3D map from stitched point clouds, frames within a video are geometrically correlated.
We are the first to fuse multiple videos for video inpainting.
arXiv Detail & Related papers (2020-07-17T09:29:53Z)
- Very Long Natural Scenery Image Prediction by Outpainting [96.8509015981031]
Outpainting receives less attention due to two challenges.
The first challenge is how to keep spatial and content consistency between generated images and the original input.
The second challenge is how to maintain high quality in the generated results.
arXiv Detail & Related papers (2019-12-29T16:29:01Z)