Structure and Content-Guided Video Synthesis with Diffusion Models
- URL: http://arxiv.org/abs/2302.03011v1
- Date: Mon, 6 Feb 2023 18:50:23 GMT
- Title: Structure and Content-Guided Video Synthesis with Diffusion Models
- Authors: Patrick Esser, Johnathan Chiu, Parmida Atighehchian, Jonathan
Granskog, Anastasis Germanidis
- Abstract summary: We present a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output.
Our model is trained jointly on images and videos, which also exposes explicit control of temporal consistency through a novel guidance method.
- Score: 13.464501385061032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-guided generative diffusion models unlock powerful image creation and
editing tools. While these have been extended to video generation, current
approaches that edit the content of existing footage while retaining structure
require expensive re-training for every input or rely on error-prone
propagation of image edits across frames. In this work, we present a structure
and content-guided video diffusion model that edits videos based on visual or
textual descriptions of the desired output. Conflicts between user-provided
content edits and structure representations occur due to insufficient
disentanglement between the two aspects. As a solution, we show that training
on monocular depth estimates with varying levels of detail provides control
over structure and content fidelity. Our model is trained jointly on images and
videos, which also exposes explicit control of temporal consistency through a
novel guidance method. Our experiments demonstrate a wide variety of successes:
fine-grained control over output characteristics, customization based on a few
reference images, and a strong user preference for results produced by our model.
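To make the abstract's two levers concrete (structure conditioning on monocular depth blurred to a chosen level of detail, and guidance over temporal consistency), here is a minimal sketch. The pooling-based detail control and the classifier-free-style blend between per-frame and spatiotemporal noise predictions are assumptions about the mechanism, not the authors' released code.

```python
# Minimal sketch of the two controls described in the abstract; the exact
# formulations below are assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def structure_condition(depth: torch.Tensor, detail: float) -> torch.Tensor:
    """Blur/downsample a monocular depth map; lower `detail` keeps only coarse structure.

    depth:  (B, 1, H, W) depth estimates.
    detail: in (0, 1]; 1.0 keeps full resolution, smaller values remove detail.
    """
    _, _, h, w = depth.shape
    scale = max(int(round(1.0 / detail)), 1)   # assumed mapping from detail to pooling size
    coarse = F.avg_pool2d(depth, kernel_size=scale, stride=scale)
    return F.interpolate(coarse, size=(h, w), mode="bilinear", align_corners=False)

def temporally_guided_eps(eps_video: torch.Tensor,
                          eps_frames: torch.Tensor,
                          w_temporal: float) -> torch.Tensor:
    """Blend spatiotemporal and per-frame noise predictions, analogous to
    classifier-free guidance applied along the temporal axis.

    eps_video:  prediction with temporal layers active (frames processed jointly).
    eps_frames: prediction treating each frame independently.
    w_temporal: 0 falls back to per-frame predictions; values > 1 push toward
                stronger temporal consistency.
    """
    return eps_frames + w_temporal * (eps_video - eps_frames)
```

In use, one would sweep `detail` to trade structure fidelity against freedom for the content edit, and sweep `w_temporal` to trade per-frame quality against frame-to-frame smoothness.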
Related papers
- A Survey of Multimodal-Guided Image Editing with Text-to-Image Diffusion Models [117.77807994397784]
Image editing aims to modify a given synthetic or real image to meet users' specific requirements.
Recent significant advances in this field build on the development of text-to-image (T2I) diffusion models.
T2I-based image editing methods significantly enhance editing performance and offer a user-friendly interface for modifying content guided by multimodal inputs.
arXiv Detail & Related papers (2024-06-20T17:58:52Z)
- Unified Editing of Panorama, 3D Scenes, and Videos Through Disentangled Self-Attention Injection [60.47731445033151]
We propose a novel unified editing framework that combines the strengths of both approaches by utilizing only a basic 2D text-to-image (T2I) diffusion model.
Experimental results confirm that our method enables editing across diverse modalities including 3D scenes, videos, and panorama images.
arXiv Detail & Related papers (2024-05-27T04:44:36Z)
- VASE: Object-Centric Appearance and Shape Manipulation of Real Videos [108.60416277357712]
In this work, we introduce an object-centric framework designed both to control the object's appearance and, notably, to execute precise and explicit structural modifications on the object.
We build our framework on a pre-trained image-conditioned diffusion model, integrate layers to handle the temporal dimension, and propose training strategies and architectural modifications to enable shape control.
We evaluate our method on the image-driven video editing task, showing performance similar to the state of the art and showcasing novel shape-editing capabilities.
arXiv Detail & Related papers (2024-01-04T18:59:24Z)
- InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot Text-based Video Editing [27.661609140918916]
InFusion is a framework for zero-shot text-based video editing.
It supports editing of multiple concepts, with pixel-level control over each concept mentioned in the editing prompt.
Our framework is a low-cost alternative to one-shot tuned editing models, since it does not require training.
arXiv Detail & Related papers (2023-07-22T17:05:47Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
- Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts [116.05656635044357]
We propose a generic video editing framework called Make-A-Protagonist.
Specifically, we leverage multiple experts to parse the source video and the target visual and textual clues, and we propose a visual-textual-based video generation model.
Results demonstrate the versatile and remarkable editing capabilities of Make-A-Protagonist.
arXiv Detail & Related papers (2023-05-15T17:59:03Z)
- Pix2Video: Video Editing using Image Diffusion [43.07444438561277]
We investigate how to use pre-trained image models for text-guided video editing.
Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; the edits are then progressively propagated to the remaining frames via self-attention feature injection.
We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.
arXiv Detail & Related papers (2023-03-22T16:36:10Z)
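As an illustration of the anchor-frame step only, the sketch below runs a text-guided edit on a single frame with a publicly available depth-conditioned image diffusion model (Stable Diffusion 2 depth via the diffusers library) as a stand-in; the model choice, file paths, and prompt are our assumptions, and Pix2Video's propagation via self-attention feature injection is not shown.

```python
# Hedged sketch of depth-guided, text-driven editing of one anchor frame.
# Stands in for the "structure-guided image diffusion model" mentioned above;
# it is not Pix2Video's own model or code.
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    torch_dtype=torch.float16,
).to("cuda")

# Hypothetical input path; any RGB video frame works.
anchor_frame = Image.open("frames/frame_0000.png").convert("RGB")

# The pipeline estimates depth from the input frame and uses it as the
# structure condition; `strength` trades edit freedom against fidelity.
edited = pipe(
    prompt="a claymation version of the scene",
    image=anchor_frame,
    strength=0.7,
    num_inference_steps=50,
).images[0]
edited.save("edited_frame_0000.png")
```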