InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot
Text-based Video Editing
- URL: http://arxiv.org/abs/2308.00135v3
- Date: Thu, 10 Aug 2023 17:50:49 GMT
- Title: InFusion: Inject and Attention Fusion for Multi Concept Zero-Shot
Text-based Video Editing
- Authors: Anant Khandelwal
- Abstract summary: InFusion is a framework for zero-shot text-based video editing.
It supports editing of multiple concepts with pixel-level control over diverse concepts mentioned in the editing prompt.
Our framework is a low-cost alternative to one-shot tuned models for editing since it does not require training.
- Score: 27.661609140918916
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large text-to-image diffusion models have achieved remarkable success in
generating diverse, high-quality images. Additionally, these models have been
successfully leveraged to edit input images by just changing the text prompt.
But when these models are applied to videos, the main challenge is to ensure
temporal consistency and coherence across frames. In this paper, we propose
InFusion, a framework for zero-shot text-based video editing leveraging large
pre-trained image diffusion models. Our framework specifically supports editing
of multiple concepts with pixel-level control over diverse concepts mentioned
in the editing prompt. Specifically, we inject the difference in features
obtained with source and edit prompts from U-Net residual blocks of decoder
layers. When these are combined with injected attention features, it becomes
feasible to query the source contents and scale edited concepts along with the
injection of unedited parts. The editing is further controlled in a
fine-grained manner with mask extraction and attention fusion, which cut the
edited part from the source and paste it into the denoising pipeline for the
editing prompt. Our framework is a low-cost alternative to one-shot tuned
models for editing since it does not require training. We demonstrated complex
concept editing with a generalised image model (Stable Diffusion v1.5) using
LoRA. Adaptation is compatible with all the existing image diffusion
techniques. Extensive experimental results demonstrate the effectiveness of
our framework over existing methods in rendering high-quality and temporally
consistent videos.
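The abstract names two mechanisms: injecting the feature difference between the source-prompt and edit-prompt passes, and fusing attention under an extracted mask. The abstract gives no formulas, so the NumPy sketch below is only one plausible reading. The function names, the `scale` and `threshold` parameters, the min-max normalization, and the linear blend are all assumptions made for illustration, not the paper's implementation.

```python
import numpy as np


def inject_feature_difference(feat_src, feat_edit, scale=1.0):
    """Add the scaled (edit - source) feature difference to the source features.

    Assumption: scale=0 keeps the source features, scale=1 yields the edit features.
    """
    return feat_src + scale * (feat_edit - feat_src)


def extract_mask(cross_attn, threshold=0.5):
    """Binarize a cross-attention map after min-max normalization (assumed scheme)."""
    lo, hi = cross_attn.min(), cross_attn.max()
    norm = (cross_attn - lo) / (hi - lo + 1e-8)
    return (norm > threshold).astype(cross_attn.dtype)


def fuse_attention(attn_src, attn_edit, mask):
    """Use edit-prompt attention inside the mask and source attention elsewhere."""
    return mask * attn_edit + (1.0 - mask) * attn_src


# Toy 2x2 "feature maps" standing in for U-Net decoder activations.
feat_src = np.zeros((2, 2))
feat_edit = np.ones((2, 2))
injected = inject_feature_difference(feat_src, feat_edit, scale=0.5)

# Toy cross-attention map for the edited token: the right column is attended.
cross_attn = np.array([[0.1, 0.9], [0.2, 0.8]])
mask = extract_mask(cross_attn)

# Inside the mask the edit attention is kept; elsewhere the source attention is kept.
fused = fuse_attention(np.zeros((2, 2)), np.ones((2, 2)), mask)
```

In this toy setup the mask selects the right-hand column, so `fused` takes edit-prompt values there and source values in the left column, which mirrors the "cut from the source and paste into the editing pipeline" idea at a very coarse level.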
Related papers
- Enhancing Text-to-Image Editing via Hybrid Mask-Informed Fusion [61.42732844499658]
This paper systematically improves the text-guided image editing techniques based on diffusion models.
We incorporate human annotation as external knowledge to confine editing within a "Mask-informed" region.
arXiv Detail & Related papers (2024-05-24T07:53:59Z)
- DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing [66.43179841884098]
Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years.
We propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing.
Our method can efficiently achieve state-of-the-art performance on various fine-grained image editing tasks.
arXiv Detail & Related papers (2024-02-04T18:50:29Z)
- Forgedit: Text Guided Image Editing via Learning and Forgetting [17.26772361532044]
We design a novel text-guided image editing method named Forgedit.
First, we propose a vision-language joint optimization framework capable of reconstructing the original image in 30 seconds.
Then, we propose a novel vector projection mechanism in the text embedding space of diffusion models.
arXiv Detail & Related papers (2023-09-19T12:05:26Z)
- StableVideo: Text-driven Consistency-aware Diffusion Video Editing [24.50933856309234]
Diffusion-based methods can generate realistic images and videos, but they struggle to edit existing objects in a video while preserving their appearance over time.
This paper introduces temporal dependency to existing text-driven diffusion models, which allows them to generate consistent appearance for the edited objects.
We build up a text-driven video editing framework based on this mechanism, namely StableVideo, which can achieve consistency-aware video editing.
arXiv Detail & Related papers (2023-08-18T14:39:16Z)
- DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models [66.43179841884098]
We propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models.
Our method achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging.
arXiv Detail & Related papers (2023-07-05T16:43:56Z)
- Pix2Video: Video Editing using Image Diffusion [43.07444438561277]
We investigate how to use pre-trained image models for text-guided video editing.
Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame.
We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning.
arXiv Detail & Related papers (2023-03-22T16:36:10Z)
- FateZero: Fusing Attentions for Zero-shot Text-based Video Editing [104.27329655124299]
We propose FateZero, a zero-shot text-based editing method on real-world videos without per-prompt training or a user-specific mask.
Our method is the first one to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model.
arXiv Detail & Related papers (2023-03-16T17:51:13Z)
- Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing with the target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.