HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
- URL: http://arxiv.org/abs/2409.02919v3
- Date: Mon, 9 Sep 2024 09:11:28 GMT
- Title: HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts
- Authors: Xinyu Liu, Yingqing He, Lanqing Guo, Xiang Li, Bu Jin, Peng Li, Yan Li, Chi-Min Chan, Qifeng Chen, Wei Xue, Wenhan Luo, Qifeng Liu, Yike Guo,
- Abstract summary: HiPrompt is a tuning-free solution for higher-resolution image generation.
Hierarchical prompts offer both global and local guidance.
Generated images maintain coherent local and global semantics, structures, and textures with high definition.
- Score: 77.62320553269615
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The potential for higher-resolution image generation using pretrained diffusion models is immense, yet these models often struggle with object repetition and structural artifacts, especially when scaling to 4K resolution and higher. We find that the problem arises because a single prompt provides insufficient guidance for generation across multiple scales. In response, we propose HiPrompt, a new tuning-free solution that tackles these problems by introducing hierarchical prompts. The hierarchical prompts offer both global and local guidance. Specifically, the global guidance comes from the user input that describes the overall content, while the local guidance uses patch-wise descriptions from MLLMs to guide the regional structure and texture generation in detail. Furthermore, during the inverse denoising process, the generated noise is decomposed into low- and high-frequency spatial components. These components are conditioned on multiple prompt levels, including detailed patch-wise descriptions and broader image-level prompts, enabling prompt-guided denoising under hierarchical semantic guidance. This further allows the generation to focus more on local spatial regions and ensures the generated images maintain coherent local and global semantics, structures, and textures with high definition. Extensive experiments demonstrate that HiPrompt outperforms state-of-the-art methods in higher-resolution image generation, significantly reducing object repetition and enhancing structural quality.
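The frequency-decomposed conditioning described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes an FFT-based low-pass filter as the decomposition (the exact filter, cutoff, and how the two noise estimates are produced are assumptions), and it blends the low-frequency part of a globally-prompted noise estimate with the high-frequency part of a locally-prompted one.

```python
import numpy as np

def lowpass(x, cutoff=0.1):
    """Keep only low spatial frequencies of a 2D map via FFT masking."""
    f = np.fft.fftshift(np.fft.fft2(x))
    h, w = x.shape
    yy, xx = np.mgrid[-h // 2:h - h // 2, -w // 2:w - w // 2]
    mask = (np.hypot(yy / h, xx / w) <= cutoff).astype(float)
    return np.fft.ifft2(np.fft.ifftshift(f * mask)).real

def compose_noise(eps_global, eps_local, cutoff=0.1):
    """Low frequencies follow the image-level prompt; high frequencies
    follow the patch-wise MLLM prompt."""
    low = lowpass(eps_global, cutoff)
    high = eps_local - lowpass(eps_local, cutoff)
    return low + high

rng = np.random.default_rng(0)
eps_g = rng.standard_normal((64, 64))  # noise predicted under the global prompt
eps_l = rng.standard_normal((64, 64))  # noise predicted under a local patch prompt
eps = compose_noise(eps_g, eps_l)
```

Note that when both prompts produce the same estimate, the composition reduces to that estimate, so the decomposition is lossless.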
Related papers
- ResMaster: Mastering High-Resolution Image Generation via Structural and Fine-Grained Guidance [46.64836025290448]
ResMaster is a training-free method that empowers resolution-limited diffusion models to generate high-quality images beyond resolution restrictions.
It provides structural and fine-grained guidance for crafting high-resolution images on a patch-by-patch basis.
Experiments validate that ResMaster sets a new benchmark for high-resolution image generation and demonstrates promising efficiency.
arXiv Detail & Related papers (2024-06-24T09:28:21Z) - GLoD: Composing Global Contexts and Local Details in Image Generation [0.0]
Global-Local Diffusion (GLoD) is a novel framework which allows simultaneous control over the global contexts and the local details.
It assigns multiple global and local prompts to corresponding layers and composes their noises to guide a denoising process.
Our framework enables complex global-local compositions, conditioning objects in the global prompt with the local prompts while preserving other unspecified identities.
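The noise composition described in the GLoD blurb could be sketched roughly as below. This is a hypothetical simplification, assuming each local prompt comes with a region mask and that composition is a per-pixel blend of the local-prompt noise into the global-prompt noise; the paper's actual layer assignment is not reproduced here.

```python
import numpy as np

def compose_glod_noise(eps_global, locals_):
    """Blend per-region local-prompt noises onto the global-prompt noise.
    Each entry in `locals_` is (eps_local, mask) with mask values in [0, 1];
    unmasked pixels keep the global guidance, preserving unspecified content."""
    eps = eps_global.copy()
    for eps_local, mask in locals_:
        eps = mask * eps_local + (1.0 - mask) * eps
    return eps

rng = np.random.default_rng(1)
eps_g = rng.standard_normal((32, 32))
eps_l = rng.standard_normal((32, 32))
mask = np.zeros((32, 32))
mask[8:24, 8:24] = 1.0  # region governed by the local prompt
eps = compose_glod_noise(eps_g, [(eps_l, mask)])
```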
arXiv Detail & Related papers (2024-04-23T18:39:57Z) - ST-LDM: A Universal Framework for Text-Grounded Object Generation in Real Images [9.906943507715779]
We present a novel image editing scenario termed Text-grounded Object Generation (TOG)
We propose a universal framework ST-LDM based on Swin-Transformer.
Our model enhances the localization of attention mechanisms while preserving the generative capabilities inherent to diffusion models.
arXiv Detail & Related papers (2024-03-15T04:02:31Z) - CM-GAN: Image Inpainting with Cascaded Modulation GAN and Object-Aware Training [112.96224800952724]
We propose cascaded modulation GAN (CM-GAN) to generate plausible image structures when dealing with large holes in complex images.
In each decoder block, global modulation is first applied to perform coarse semantic-aware synthesis structure, then spatial modulation is applied on the output of global modulation to further adjust the feature map in a spatially adaptive fashion.
In addition, we design an object-aware training scheme to prevent the network from hallucinating new objects inside holes, fulfilling the needs of object removal tasks in real-world scenarios.
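The cascade of global then spatial modulation in each decoder block can be illustrated with a toy sketch. This is an assumption-laden outline, not CM-GAN's architecture: the scale/shift parameters here are passed in directly, whereas in the paper they would be predicted by learned layers from a global code and a spatial tensor.

```python
import numpy as np

def global_modulation(feat, gamma, beta):
    """Coarse, semantic-aware step: one scale/shift per channel,
    derived from a global code. feat: (C, H, W); gamma, beta: (C,)."""
    return gamma[:, None, None] * feat + beta[:, None, None]

def spatial_modulation(feat, gamma_map, beta_map):
    """Spatially adaptive step: a scale/shift per pixel. Maps: (C, H, W)."""
    return gamma_map * feat + beta_map

def cascaded_block(feat, g_gamma, g_beta, s_gamma, s_beta):
    coarse = global_modulation(feat, g_gamma, g_beta)
    return spatial_modulation(coarse, s_gamma, s_beta)

feat = np.ones((4, 8, 8))
out = cascaded_block(feat,
                     g_gamma=np.full(4, 2.0), g_beta=np.zeros(4),
                     s_gamma=np.ones((4, 8, 8)), s_beta=np.ones((4, 8, 8)))
```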
arXiv Detail & Related papers (2022-03-22T16:13:27Z) - PINs: Progressive Implicit Networks for Multi-Scale Neural Representations [68.73195473089324]
We propose a progressive positional encoding, exposing a hierarchical structure to incremental sets of frequency encodings.
Our model accurately reconstructs scenes with wide frequency bands and learns a scene representation at progressive level of detail.
Experiments on several 2D and 3D datasets show improvements in reconstruction accuracy, representational capacity and training speed compared to baselines.
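The progressive positional encoding idea can be sketched as Fourier features whose higher frequency bands are gated until a schedule unlocks them. This is a hedged illustration under assumed details (hard binary gates and power-of-two frequencies); the paper may use a soft ramp and different band spacing.

```python
import numpy as np

def progressive_encoding(x, num_bands, active_bands):
    """Fourier positional features where only the first `active_bands`
    frequency bands are exposed; the rest stay gated at zero until the
    training schedule unlocks them."""
    feats = []
    for k in range(num_bands):
        gate = 1.0 if k < active_bands else 0.0  # hard gate; a soft ramp also works
        feats.append(gate * np.sin((2.0 ** k) * np.pi * x))
        feats.append(gate * np.cos((2.0 ** k) * np.pi * x))
    return np.stack(feats, axis=-1)

x = np.linspace(0.0, 1.0, 5)
enc = progressive_encoding(x, num_bands=4, active_bands=2)  # shape (5, 8)
```

Raising `active_bands` over training exposes finer detail incrementally, giving the coarse-to-fine level of detail the blurb describes.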
arXiv Detail & Related papers (2022-02-09T20:33:37Z) - GIU-GANs: Global Information Utilization for Generative Adversarial Networks [3.3945834638760948]
In this paper, we propose a new GAN architecture called Involution Generative Adversarial Networks (GIU-GANs).
GIU-GANs leverages a brand new module called the Global Information Utilization (GIU) module, which integrates Squeeze-and-Excitation Networks (SENet) and involution.
Batch Normalization (BN) inevitably ignores the representation differences among noise samples drawn by the generator, and thus degrades the generated image quality.
arXiv Detail & Related papers (2022-01-25T17:17:15Z) - High-resolution Depth Maps Imaging via Attention-based Hierarchical Multi-modal Fusion [84.24973877109181]
We propose a novel attention-based hierarchical multi-modal fusion network for guided depth super-resolution (DSR).
We show that our approach outperforms state-of-the-art methods in terms of reconstruction accuracy, running speed and memory efficiency.
arXiv Detail & Related papers (2021-04-04T03:28:33Z) - Efficient texture-aware multi-GAN for image inpainting [5.33024001730262]
Recent inpainting methods based on generative adversarial networks (GANs) show remarkable improvements.
We propose a multi-GAN architecture improving both the performance and rendering efficiency.
arXiv Detail & Related papers (2020-09-30T14:58:03Z) - Local Class-Specific and Global Image-Level Generative Adversarial Networks for Semantic-Guided Scene Generation [135.4660201856059]
We consider learning the scene generation in a local context, and design a local class-specific generative network with semantic maps as a guidance.
To learn more discriminative class-specific feature representations for the local generation, a novel classification module is also proposed.
Experiments on two scene image generation tasks show superior generation performance of the proposed model.
arXiv Detail & Related papers (2019-12-27T16:14:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.