Lazy Diffusion Transformer for Interactive Image Editing
- URL: http://arxiv.org/abs/2404.12382v1
- Date: Thu, 18 Apr 2024 17:59:27 GMT
- Title: Lazy Diffusion Transformer for Interactive Image Editing
- Authors: Yotam Nitzan, Zongze Wu, Richard Zhang, Eli Shechtman, Daniel Cohen-Or, Taesung Park, Michaël Gharbi
- Abstract summary: We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently.
Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications.
- Score: 79.75128130739598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently. Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications using binary masks and text prompts. Our generator operates in two phases. First, a context encoder processes the current canvas and user mask to produce a compact global context tailored to the region to generate. Second, conditioned on this context, a diffusion-based transformer decoder synthesizes the masked pixels in a "lazy" fashion, i.e., it only generates the masked region. This contrasts with previous works that either regenerate the full canvas, wasting time and computation, or confine processing to a tight rectangular crop around the mask, ignoring the global image context altogether. Our decoder's runtime scales with the mask size, which is typically small, while our encoder introduces negligible overhead. We demonstrate that our approach is competitive with state-of-the-art inpainting methods in terms of quality and fidelity while providing a 10x speedup for typical user interactions, where the editing mask represents 10% of the image.
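The two-phase split is the heart of the method. The PyTorch sketch below is a minimal illustration written from the abstract alone: the module names (ContextEncoder, LazyDecoder), all shapes, and the toy denoising loop are our assumptions, not the authors' implementation. It shows the asymmetry the abstract describes: the encoder runs once per edit and emits a fixed-size context, while each decoder step touches only the masked tokens.
```python
# Minimal sketch of the two-phase LazyDiffusion pipeline, written from the
# abstract. Module names (ContextEncoder, LazyDecoder), shapes, and the toy
# denoising loop are illustrative assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Phase 1: compress the full canvas + mask into a compact, fixed-size context."""
    def __init__(self, patch=16, dim=256, ctx_tokens=64):
        super().__init__()
        self.embed = nn.Conv2d(4, dim, kernel_size=patch, stride=patch)  # RGB + mask channels
        self.ctx = nn.Parameter(torch.randn(1, ctx_tokens, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, canvas, mask):
        x = self.embed(torch.cat([canvas, mask], dim=1))   # (B, dim, H/p, W/p)
        tokens = x.flatten(2).transpose(1, 2)              # (B, N, dim), N = all patches
        ctx = self.ctx.expand(canvas.size(0), -1, -1)
        out, _ = self.attn(ctx, tokens, tokens)            # learned queries pool the canvas
        return out                                         # (B, ctx_tokens, dim)

class LazyDecoder(nn.Module):
    """Phase 2: denoise ONLY the masked tokens, conditioned on the global context."""
    def __init__(self, dim=256):
        super().__init__()
        layer = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerDecoder(layer, num_layers=4)
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, context):
        return self.head(self.blocks(noisy_tokens, context))  # cost scales with mask size

# Toy interaction: the encoder runs once per edit; each denoising step then
# touches 49 masked tokens instead of the canvas's 1024.
enc, dec = ContextEncoder(), LazyDecoder()
canvas = torch.randn(1, 3, 512, 512)
mask = torch.zeros(1, 1, 512, 512)
mask[..., :112, :112] = 1.0                                # edit ~5% of the image
context = enc(canvas, mask)
x = torch.randn(1, 49, 256)                                # latents for the masked patches only
for _ in range(4):                                         # stand-in for a real diffusion sampler
    x = x - 0.1 * dec(x, context)                          # toy update, not a real DDPM step
```
Under this structure, a mask covering 10% of the canvas means each denoising step processes roughly a tenth of the tokens a full-canvas model would, which is consistent with the reported 10x speedup for typical interactions.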
Related papers
- SegTalker: Segmentation-based Talking Face Generation with Mask-guided Local Editing [19.245228801339007]
We propose a novel framework called SegTalker to decouple lip movements and image textures.
We disentangle semantic regions of the image into style codes using a mask-guided encoder.
Ultimately, we inject the previously generated talking segmentation and style codes into a mask-guided StyleGAN to synthesize video frames.
arXiv Detail & Related papers (2024-09-05T15:11:40Z)
- Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis [63.757624792753205]
We present Zero-Painter, a framework for layout-conditional text-to-image synthesis.
Our method utilizes object masks and individual descriptions, coupled with a global text prompt, to generate images with high fidelity.
arXiv Detail & Related papers (2024-06-06T13:02:00Z)
- MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers [30.924202893340087]
State-of-the-art approaches to text-based video editing predominantly rely on diffusion models.
This paper breaks down the text-based video editing task into two stages.
First, we leverage a pre-trained text-to-image diffusion model to simultaneously edit a few keyframes in a zero-shot way.
Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers.
arXiv Detail & Related papers (2023-12-19T07:05:39Z)
- MaskSketch: Unpaired Structure-guided Masked Image Generation [56.88038469743742]
MaskSketch is an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling.
We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image.
Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure.
arXiv Detail & Related papers (2023-02-10T20:27:02Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
- MaskGIT: Masked Generative Image Transformer [49.074967597485475]
MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions.
Experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset.
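That one-line description is the core of MaskGIT's iterative parallel decoding. The sketch below, written from this summary, shows the sampling loop with a toy bidirectional transformer; the linear unmasking schedule and all sizes are simplifying assumptions (the paper's scheduler and architecture differ).
```python
# Minimal sketch of MaskGIT-style iterative parallel decoding, written from
# the summary above. The toy bidirectional transformer, the linear unmasking
# schedule (MaskGIT uses a cosine schedule), and all sizes are assumptions.
import torch
import torch.nn as nn

V, N, MASK, STEPS = 1024, 256, 1024, 8     # vocab size, image tokens, [MASK] id, steps
backbone = nn.Sequential(                  # stand-in for a bidirectional transformer
    nn.Embedding(V + 1, 128),
    nn.TransformerEncoder(nn.TransformerEncoderLayer(128, nhead=8, batch_first=True), 2),
)
head = nn.Linear(128, V)

tokens = torch.full((1, N), MASK)          # start from an all-masked canvas
for step in range(STEPS):
    conf, pred = head(backbone(tokens)).softmax(-1).max(-1)  # predict every position at once
    conf = conf.masked_fill(tokens != MASK, -1.0)            # only masked slots compete
    idx = conf.topk(N // STEPS, dim=-1).indices              # most confident masked positions
    tokens = tokens.scatter(1, idx, pred.gather(1, idx))     # commit those tokens, repeat
```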
arXiv Detail & Related papers (2022-02-08T23:54:06Z)
- SketchEdit: Mask-Free Local Image Manipulation with Partial Sketches [95.45728042499836]
We propose a new paradigm of sketch-based image manipulation: mask-free local image manipulation.
Our model automatically predicts the target modification region and encodes it into a structure style vector.
A generator then synthesizes the new image content based on the style vector and sketch.
arXiv Detail & Related papers (2021-11-30T02:42:31Z)
- Iterative Facial Image Inpainting using Cyclic Reverse Generator [0.913755431537592]
The Cyclic Reverse Generator (CRG) architecture provides an encoder-generator model.
We empirically observed that only a few iterations are sufficient to generate realistic images with the proposed model.
Our method allows applying sketch-based inpainting, using a variety of mask types, and producing multiple and diverse results.
arXiv Detail & Related papers (2021-01-18T12:19:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.