Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization
- URL: http://arxiv.org/abs/2511.19811v1
- Date: Tue, 25 Nov 2025 00:42:09 GMT
- Title: Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization
- Authors: Debin Meng, Chen Jin, Zheng Gao, Yanran Li, Ioannis Patras, Georgios Tzimiropoulos
- Abstract summary: We propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. In experiments on MS-COCO and three diffusion backbones, TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality.
- Score: 50.5332987313297
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image diversity remains a fundamental challenge for text-to-image diffusion models. Low-diversity models tend to generate repetitive outputs, increasing sampling redundancy and hindering both creative exploration and downstream applications. A primary cause is that generation often collapses toward a strong mode in the learned distribution. Existing attempts to improve diversity, such as noise resampling, prompt rewriting, or steering-based guidance, often still collapse to dominant modes or introduce distortions that degrade image quality. In light of this, we propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. At the same time, the prompt-level space provides a global semantic constraint that regulates distribution shifts, preventing quality degradation while maintaining high fidelity. Extensive experiments on MS-COCO and three diffusion backbones show that TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality. Code will be released upon acceptance.
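The abstract describes a two-level optimization: learnable parameters push token embeddings toward underrepresented regions, while a prompt-level constraint keeps the overall semantics fixed. The code is not yet released, so the sketch below is only a guess at that structure; `prompt_encoder`, the loss forms, and all weights are illustrative assumptions, not the authors' method.

```python
import torch
import torch.nn.functional as F

def tpso_style_perturb(token_emb, prompt_encoder, steps=50, lr=1e-2,
                       div_weight=1.0, sem_weight=10.0):
    """Hypothetical TPSO-style update: nudge token embeddings away from
    their originals (diversity) while keeping the pooled prompt-level
    embedding close to the original (semantic constraint)."""
    base = token_emb.detach()                 # (seq_len, dim) frozen originals
    ref = prompt_encoder(base).detach()       # global prompt-level embedding
    delta = torch.zeros_like(base, requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        emb = base + delta
        div_loss = -F.mse_loss(emb, base)     # reward leaving the strong mode
        sem_loss = 1.0 - F.cosine_similarity( # penalize semantic drift
            prompt_encoder(emb), ref, dim=-1).mean()
        loss = div_weight * div_loss + sem_weight * sem_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (base + delta).detach()            # conditioning for the diffusion model
```

Sampling several perturbations (e.g., seeding `delta` with small random noise per image) would then yield distinct but semantically faithful conditionings for the same prompt.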
Related papers
- Beyond the Dirac Delta: Mitigating Diversity Collapse in Reinforcement Fine-Tuning for Versatile Image Generation [51.305316234962554]
We propose DRIFT (DiveRsity-Incentivized Reinforcement Fine-Tuning for Versatile Image Generation), an innovative framework that systematically incentivizes output diversity throughout the on-policy fine-tuning process. DRIFT achieves superior performance on both task alignment and generation diversity, yielding a 9.08% to 43.46% increase in diversity at equivalent alignment levels and a 59.65…
arXiv Detail & Related papers (2026-01-18T13:25:43Z)
- DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO [50.89703227426486]
Reinforcement learning (RL) significantly improves image generation quality by comparing the relative performance of images generated within the same group. In the later stages of training, however, the model tends to produce homogenized outputs that lack creativity and visual diversity. This issue can be analyzed from both the reward-modeling and the generation-dynamics perspectives.
arXiv Detail & Related papers (2025-12-25T05:37:37Z)
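For context on the group-relative comparison that GRPO builds on: each image's reward is normalized against the other images generated for the same prompt. This is the standard GRPO advantage, not the diversity-aware variant the paper proposes.

```python
import torch

def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-6):
    """Standard GRPO advantage: score each sample relative to its group.
    rewards: (num_groups, group_size) reward per generated image."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)  # positive = above group average
```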
- DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation [22.400053095939402]
We introduce DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality. Specifically, we introduce an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling. To mitigate potential fidelity loss caused by distribution smoothing, we develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens.
arXiv Detail & Related papers (2025-12-02T16:54:36Z)
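The adaptive scaling idea can be roughed out as a confidence-dependent temperature on the binary token distribution; the rule below is a hypothetical stand-in for the paper's mechanism, which is not spelled out in the abstract.

```python
import torch

def scale_binary_logits(logits: torch.Tensor, max_temp: float = 2.0):
    """Flatten overly confident binary distributions before sampling.
    logits: (..., 2) unnormalized scores for each bitwise token."""
    probs = torch.softmax(logits, dim=-1)
    confidence = probs.max(dim=-1).values            # in [0.5, 1.0]
    # Illustrative rule: higher confidence -> higher temperature, so
    # sampling is smoothed exactly where mode collapse is most likely.
    temp = 1.0 + (max_temp - 1.0) * (confidence - 0.5) * 2.0
    return logits / temp.unsqueeze(-1)
```

The energy-based path search mentioned in the summary would then act on top of this, steering sampling away from the low-confidence tokens that the smoothing makes reachable.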
- DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models [23.12099227251494]
Visual autoregressive (VAR) models have emerged as strong competitors to diffusion and flow models for image generation, yet they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts. We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned VAR models at test time.
arXiv Detail & Related papers (2025-11-26T14:06:52Z)
- PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling [29.17316505041238]
Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but suppresses diversity, leading to repetitive and less creative outputs. We propose PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drift.
arXiv Detail & Related papers (2025-11-25T12:25:41Z)
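The sampling step suggested by the title and summary admits a short sketch; the component construction below (means taken from embeddings of equivalent rephrasings, uniform weights, isotropic noise) is an assumption, not the authors' implementation.

```python
import torch

def mog_sample_prompt(prompt_embs: torch.Tensor, sigma: float = 0.05):
    """Draw one prompt embedding from a Mixture-of-Gaussians whose
    components sit on embeddings of equivalent prompt rephrasings.
    prompt_embs: (K, seq_len, dim) stacked prompt embeddings."""
    k = torch.randint(len(prompt_embs), (1,)).item()  # uniform mixture weights
    return prompt_embs[k] + sigma * torch.randn_like(prompt_embs[k])
```

Each generation call then conditions on a fresh sample, trading a small, controlled amount of embedding noise for output diversity.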
- Hawk: Leveraging Spatial Context for Faster Autoregressive Text-to-Image Generation [87.00172597953228]
Speculative decoding has shown promise in accelerating text generation without compromising quality. We introduce Hawk, a new approach that harnesses the spatial structure of images to guide the speculative model toward more accurate and efficient predictions. Experimental results on multiple text-to-image benchmarks demonstrate a 1.71x speedup over standard AR models.
arXiv Detail & Related papers (2025-10-29T17:43:31Z)
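For background, speculative decoding follows a draft-then-verify loop; the sketch below shows the generic acceptance test for one drafted token (Hawk's contribution, spatial guidance of the draft model, is not reproduced here).

```python
import torch

def verify_drafted_token(draft_probs, target_probs, token: int):
    """Generic speculative-decoding acceptance for one drafted token.
    draft_probs, target_probs: (vocab,) distributions over the next token."""
    accept_prob = torch.clamp(target_probs[token] / draft_probs[token], max=1.0)
    if torch.rand(()) < accept_prob:
        return token                                   # keep the cheap draft
    # On rejection, resample from the residual distribution max(p - q, 0).
    residual = torch.clamp(target_probs - draft_probs, min=0.0)
    return torch.multinomial(residual / residual.sum(), 1).item()
```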
- ImageReFL: Balancing Quality and Diversity in Human-Aligned Diffusion Models [2.712399554918533]
Reward-based fine-tuning using models trained on human feedback improves alignment but often harms diversity, producing less varied outputs. First, we introduce combined generation, a novel sampling strategy that applies a reward-tuned diffusion model only in the later stages of the generation process. Second, we propose ImageReFL, a fine-tuning method that improves image diversity with minimal loss in quality by training on real images.
arXiv Detail & Related papers (2025-05-28T16:45:07Z)
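The combined-generation strategy is easy to express as a denoising loop that switches models partway through. A minimal sketch, assuming hypothetical `base_step`/`tuned_step` callables that each perform one denoising update:

```python
def combined_generation(x, base_step, tuned_step, num_steps=50, switch=0.7):
    """Base model early (preserves diversity), reward-tuned model for the
    final steps (restores quality and alignment)."""
    for i, t in enumerate(reversed(range(num_steps))):
        step_fn = base_step if i < int(num_steps * switch) else tuned_step
        x = step_fn(x, t)            # one denoising update at timestep t
    return x
```

The `switch` fraction controls the trade-off: switching later keeps more of the base model's variety, switching earlier leans harder on the reward-tuned model.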
- Boosting Generative Image Modeling via Joint Image-Feature Synthesis [15.133906625258797]
We introduce a novel generative image modeling framework that seamlessly bridges the gap by leveraging a diffusion model to jointly model low-level image latents and high-level semantic features. Our latent-semantic diffusion approach learns to generate coherent image-feature pairs from pure noise. By eliminating the need for complex distillation objectives, our unified design simplifies training and unlocks a powerful new inference strategy: Representation Guidance.
arXiv Detail & Related papers (2025-04-22T17:41:42Z)
- Exploring Representation-Aligned Latent Space for Better Generation [86.45670422239317]
We introduce ReaLS, which integrates semantic priors to improve generation performance. We show that the fundamental DiT and SiT models trained on ReaLS achieve a 15% improvement in FID. The enhanced semantic latent space also enables more perceptual downstream tasks, such as segmentation and depth estimation.
arXiv Detail & Related papers (2025-02-01T07:42:12Z)
- Coherent and Multi-modality Image Inpainting via Latent Space Optimization [61.99406669027195]
PILOT (inPainting vIa Latent OpTimization) is an optimization approach grounded in novel semantic centralization and background preservation losses.
Our method searches latent spaces capable of generating inpainted regions that exhibit high fidelity to user-provided prompts while maintaining coherence with the background.
arXiv Detail & Related papers (2024-07-10T19:58:04Z)
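Latent-space optimization of this kind typically descends on a weighted sum of the two losses over the inpainting latent. A minimal sketch, where `semantic_loss` and `background_loss` are hypothetical stand-ins for the paper's semantic centralization and background preservation terms:

```python
import torch

def optimize_inpainting_latent(z, semantic_loss, background_loss,
                               steps=100, lr=0.05, bg_weight=1.0):
    """Search the latent space for an inpainting latent that matches the
    prompt semantics while leaving the background coherent."""
    z = z.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        loss = semantic_loss(z) + bg_weight * background_loss(z)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z.detach()
```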
- Few-Shot Image Generation by Conditional Relaxing Diffusion Inversion [37.18537753482751]
Conditional Relaxing Diffusion Inversion (CRDI) is designed to enhance distribution diversity in synthetic image generation. Rather than relying on fine-tuning with only a few samples, CRDI focuses on reconstructing each target image instance and expanding diversity through few-shot learning.
arXiv Detail & Related papers (2024-07-09T21:58:26Z)
- DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior [70.46245698746874]
We present DiffBIR, a general restoration pipeline that can handle different blind image restoration tasks. DiffBIR decouples the blind image restoration problem into two stages: 1) degradation removal: removing image-independent content; 2) information regeneration: generating the lost image content. In the first stage, we use restoration modules to remove degradations and obtain high-fidelity restored results. For the second stage, we propose IRControlNet, which leverages the generative ability of latent diffusion models to generate realistic details.
arXiv Detail & Related papers (2023-08-29T07:11:52Z)
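The two-stage decoupling amounts to a simple composition; a schematic sketch, with `remove_degradation` and `regenerate_details` as hypothetical stand-ins for the restoration module and the IRControlNet-guided diffusion stage:

```python
def diffbir_pipeline(lq_image, remove_degradation, regenerate_details):
    """Two-stage blind restoration: strip degradations first, then let a
    generative prior fill in realistic detail conditioned on the result."""
    restored = remove_degradation(lq_image)   # stage 1: degradation removal
    return regenerate_details(restored)       # stage 2: information regeneration
```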
- SinDiffusion: Learning a Diffusion Model from a Single Natural Image [159.4285444680301]
We present SinDiffusion, which leverages denoising diffusion models to capture the internal distribution of patches within a single natural image. It is based on two core designs. First, SinDiffusion is trained with a single model at a single scale, instead of multiple models with progressive growing of scales. Second, we identify that the patch-level receptive field of the diffusion network is crucial and effective for capturing the image's patch statistics.
arXiv Detail & Related papers (2022-11-22T18:00:03Z)
- Auto-regressive Image Synthesis with Integrated Quantization [55.51231796778219]
This paper presents a versatile framework for conditional image generation that incorporates the inductive bias of CNNs and the powerful sequence modeling of auto-regression. Our method achieves superior performance in diverse image generation compared with the state of the art.
arXiv Detail & Related papers (2022-07-21T22:19:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.