Creatively Upscaling Images with Global-Regional Priors
- URL: http://arxiv.org/abs/2505.16976v1
- Date: Thu, 22 May 2025 17:51:50 GMT
- Title: Creatively Upscaling Images with Global-Regional Priors
- Authors: Yurui Qian, Qi Cai, Yingwei Pan, Ting Yao, Tao Mei
- Abstract summary: C-Upscale is a new recipe for tuning-free image upscaling. It pivots on global-regional priors derived from the given global prompt and estimated regional prompts. It generates ultra-high-resolution images with higher visual fidelity and more creative regional details.
- Score: 98.24171965992916
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contemporary diffusion models show remarkable capability in text-to-image generation, while still being limited to restricted resolutions (e.g., 1,024 x 1,024). Recent advances enable tuning-free higher-resolution image generation by recycling pre-trained diffusion models and extending them via regional denoising or dilated sampling/convolutions. However, these models struggle to simultaneously preserve global semantic structure and produce creative regional details in higher-resolution images. To address this, we present C-Upscale, a new recipe for tuning-free image upscaling that pivots on global-regional priors derived from the given global prompt and regional prompts estimated via a Multimodal LLM. Technically, the low-frequency component of the low-resolution image is recognized as a global structure prior to encourage global semantic consistency in high-resolution generation. Next, we perform regional attention control to screen cross-attention between the global prompt and each region during regional denoising, yielding a regional attention prior that alleviates the object-repetition issue. The estimated regional prompts, which contain rich descriptive details, further act as a regional semantic prior to fuel the creativity of regional detail generation. Both quantitative and qualitative evaluations demonstrate that C-Upscale generates ultra-high-resolution images (e.g., 4,096 x 4,096 and 8,192 x 8,192) with higher visual fidelity and more creative regional details.
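The abstract's idea of taking the low-frequency component of the low-resolution image as a global structure prior can be illustrated with a simple FFT low-pass filter. This is only a hypothetical sketch of the concept, not the paper's implementation; the function name and the `keep_ratio` parameter are illustrative assumptions.

```python
import numpy as np

def low_frequency_prior(image, keep_ratio=0.25):
    """Extract the low-frequency component of a grayscale image via an
    FFT low-pass filter (an illustrative stand-in for a 'global
    structure prior'). keep_ratio is the fraction of the centered
    spectrum retained along each axis."""
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    mask = np.zeros((h, w))
    ch, cw = h // 2, w // 2
    rh, rw = int(h * keep_ratio / 2), int(w * keep_ratio / 2)
    mask[ch - rh:ch + rh, cw - rw:cw + rw] = 1.0  # keep only low bands
    return np.fft.ifft2(np.fft.ifftshift(spectrum * mask)).real

# A constant image carries only the DC (zero-frequency) component,
# so the low-pass prior reproduces it exactly.
img = np.full((64, 64), 0.5)
prior = low_frequency_prior(img)
```

During high-resolution denoising, such a prior would be compared against (or blended into) the low frequencies of the generated image to keep the global layout consistent.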
Related papers
- From Missing Pieces to Masterpieces: Image Completion with Context-Adaptive Diffusion [98.31811240195324]
ConFill is a novel framework that reduces discrepancies between generated and original images at each diffusion step. It outperforms current methods, setting a new benchmark in image completion.
arXiv Detail & Related papers (2025-04-19T13:40:46Z)
- Can Location Embeddings Enhance Super-Resolution of Satellite Imagery? [2.3020018305241337]
Publicly available satellite imagery, such as Sentinel-2, often lacks the spatial resolution required for accurate analysis of remote sensing tasks. We propose a novel super-resolution framework that enhances generalization by incorporating geographic context through location embeddings. We demonstrate the effectiveness of our method on the building segmentation task, showing significant improvements over state-of-the-art methods.
arXiv Detail & Related papers (2025-01-27T08:16:54Z)
- Region-Aware Text-to-Image Generation via Hard Binding and Soft Refinement [40.94329069897935]
We present RAG, a Regional-Aware text-to-image Generation method conditioned on regional descriptions for precise layout composition.
RAG achieves superior performance on attribute binding and object relationships compared with previous tuning-free methods.
arXiv Detail & Related papers (2024-11-10T18:45:41Z)
- HiPrompt: Tuning-free Higher-Resolution Generation with Hierarchical MLLM Prompts [77.62320553269615]
HiPrompt is a tuning-free solution for higher-resolution image generation.
Hierarchical prompts offer both global and local guidance.
Generated images maintain coherent local and global semantics, structures, and textures with high definition.
arXiv Detail & Related papers (2024-09-04T17:58:08Z)
- Zero-shot Text-guided Infinite Image Synthesis with LLM guidance [2.531998650341267]
There is a lack of text-image paired datasets with high resolution and contextual diversity. Expanding images based on text requires global coherence and rich local context understanding. We propose a novel approach utilizing Large Language Models (LLMs) for both global coherence and local context understanding.
arXiv Detail & Related papers (2024-07-17T15:10:01Z)
- Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting via Latent Optimization) is an optimization approach grounded in a novel semantic centralization loss and background preservation loss.
Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
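The latent-space search described above can be illustrated with a toy gradient-descent sketch. Everything here is an assumption for illustration: `decode` is a stand-in linear "decoder" in place of a real generative model, and only the background-preservation term of the loss (matching the known pixels outside the hole) is modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 4)) * 0.1  # frozen toy "decoder" weights

def decode(z):
    """Stand-in decoder mapping a 4-d latent to a 16-pixel 'image'."""
    return W @ z

def optimize_latent(target, mask, steps=2000, lr=0.5):
    """Gradient descent on z for the background-preservation loss
    0.5 * ||mask * (decode(z) - target)||^2; pixels where mask == 0
    (the hole) are left unconstrained."""
    z = np.zeros(4)
    for _ in range(steps):
        residual = mask * (decode(z) - target)
        z -= lr * (W.T @ residual)  # gradient of the loss w.r.t. z
    return z

target = decode(np.array([1.0, -2.0, 0.5, 0.0]))
mask = np.ones(16)
mask[4:8] = 0.0  # pretend pixels 4..7 are the hole to inpaint
z_hat = optimize_latent(target, mask)
err = np.abs(mask * (decode(z_hat) - target)).max()
```

In a real inpainting system the decoder is a frozen diffusion or VAE decoder, and a prompt-fidelity term (the semantic centralization loss, in PILOT's case) is optimized jointly with the background term.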
arXiv Detail & Related papers (2024-07-10T19:58:04Z)
- RegionGPT: Towards Region Understanding Vision Language Model [88.42271128373191]
RegionGPT (RGPT for short) is a novel framework designed for complex region-level captioning and understanding.
We develop an automated region caption data generation pipeline, enriching the training set with detailed region-level captions.
We demonstrate that a universal RGPT model can be effectively applied to a range of region-level tasks, significantly enhancing performance.
arXiv Detail & Related papers (2024-03-04T18:58:08Z)
- Efficient and Explicit Modelling of Image Hierarchies for Image Restoration [120.35246456398738]
We propose a mechanism to efficiently and explicitly model image hierarchies in the global, regional, and local range for image restoration.
Inspired by this, we propose anchored stripe self-attention, which achieves a good balance between the space and time complexity of self-attention.
Then we propose a new network architecture dubbed GRL to explicitly model image hierarchies in the Global, Regional, and Local range.
arXiv Detail & Related papers (2023-03-01T18:59:29Z)
- GIU-GANs: Global Information Utilization for Generative Adversarial Networks [3.3945834638760948]
In this paper, we propose new GANs called Involution Generative Adversarial Networks (GIU-GANs).
GIU-GANs leverage a new module called the Global Information Utilization (GIU) module, which integrates Squeeze-and-Excitation Networks (SENet) and involution.
Batch Normalization (BN) inevitably ignores the representation differences among noise sampled by the generator, and thus degrades the generated image quality.
arXiv Detail & Related papers (2022-01-25T17:17:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.