Related papers: Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt

Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt

URL: http://arxiv.org/abs/2404.05331v1
Date: Mon, 8 Apr 2024 09:18:32 GMT
Title: Mask-ControlNet: Higher-Quality Image Generation with An Additional Mask Prompt
Authors: Zhiqi Huang, Huixin Xiong, Haoyu Wang, Longguang Wang, Zhiheng Li,
Abstract summary: We develop a framework termed Mask-ControlNet by introducing an additional mask prompt. We show that the mask prompts enhance the controllability of the diffusion model to maintain higher fidelity to the reference image.
Score: 34.880386778058075
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Text-to-image generation has witnessed great progress, especially with the recent advancements in diffusion models. Since texts cannot provide detailed conditions like object appearance, reference images are usually leveraged for the control of objects in the generated images. However, existing methods still suffer limited accuracy when the relationship between the foreground and background is complicated. To address this issue, we develop a framework termed Mask-ControlNet by introducing an additional mask prompt. Specifically, we first employ large vision models to obtain masks to segment the objects of interest in the reference image. Then, the object images are employed as additional prompts to facilitate the diffusion model to better understand the relationship between foreground and background regions during image generation. Experiments show that the mask prompts enhance the controllability of the diffusion model to maintain higher fidelity to the reference image while achieving better image quality. Comparison with previous text-to-image generation methods demonstrates our method's superior quantitative and qualitative performance on the benchmark datasets.

Related papers

SeeDiff: Off-the-Shelf Seeded Mask Generation from Diffusion Models [6.0870128457015715]
We show that cross-attention alone provides very coarse object localization, which however can provide initial seeds.<n>We also observe that a simple-text-guided synthetic image often has a uniform background, which is easier to find correspondences.<n>Our proposed method, dubbed SeeDiff, generates high-quality masks off-the-shelf from Stable Diffusion.
arXiv Detail & Related papers (2025-07-26T05:44:00Z)
AccDiffusion: An Accurate Method for Higher-Resolution Image Generation [63.53163540340026]
We propose AccDiffusion, an accurate method for patch-wise higher-resolution image generation without training. An in-depth analysis in this paper reveals an identical text prompt for different patches causes repeated object generation. Our AccDiffusion, for the first time, proposes to decouple the vanilla image-content-aware prompt into a set of patch-content-aware prompts.
arXiv Detail & Related papers (2024-07-15T14:06:29Z)
MaskDiffusion: Boosting Text-to-Image Consistency with Conditional Mask [84.84034179136458]
A crucial factor leading to the text-image mismatch issue is the inadequate cross-modality relation learning. We propose an adaptive mask, which is conditioned on the attention maps and the prompt embeddings, to dynamically adjust the contribution of each text token to the image features. Our method, termed MaskDiffusion, is training-free and hot-pluggable for popular pre-trained diffusion models.
arXiv Detail & Related papers (2023-09-08T15:53:37Z)
DiffusePast: Diffusion-based Generative Replay for Class Incremental Semantic Segmentation [73.54038780856554]
Class Incremental Semantic (CISS) extends the traditional segmentation task by incrementally learning newly added classes. Previous work has introduced generative replay, which involves replaying old class samples generated from a pre-trained GAN. We propose DiffusePast, a novel framework featuring a diffusion-based generative replay module that generates semantically accurate images with more reliable masks guided by different instructions.
arXiv Detail & Related papers (2023-08-02T13:13:18Z)
Improving Masked Autoencoders by Learning Where to Mask [65.89510231743692]
Masked image modeling is a promising self-supervised learning method for visual data. We present AutoMAE, a framework that uses Gumbel-Softmax to interlink an adversarially-trained mask generator and a mask-guided image modeling process. In our experiments, AutoMAE is shown to provide effective pretraining models on standard self-supervised benchmarks and downstream tasks.
arXiv Detail & Related papers (2023-03-12T05:28:55Z)
MaskSketch: Unpaired Structure-guided Masked Image Generation [56.88038469743742]
MaskSketch is an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling. We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image. Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure.
arXiv Detail & Related papers (2023-02-10T20:27:02Z)
SmartBrush: Text and Shape Guided Object Inpainting with Diffusion Model [27.91089554671927]
Generic image inpainting aims to complete a corrupted image by borrowing surrounding information. By contrast, multi-modal inpainting provides more flexible and useful controls on the inpainted content. We propose a new diffusion-based model named SmartBrush for completing a missing region with an object using both text and shape-guidance.
arXiv Detail & Related papers (2022-12-09T18:36:13Z)
DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing. Our main contribution is able to automatically generate a mask highlighting regions of the input image that need to be edited.
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
GANSeg: Learning to Segment by Unsupervised Hierarchical Image Generation [16.900404701997502]
We propose a GAN-based approach that generates images conditioned on latent masks. We show that such mask-conditioned image generation can be learned faithfully when conditioning the masks in a hierarchical manner. It also lets us generate image-mask pairs for training a segmentation network, which outperforms the state-of-the-art unsupervised segmentation methods on established benchmarks.
arXiv Detail & Related papers (2021-12-02T07:57:56Z)

This list is automatically generated from the titles and abstracts of the papers in this site.