PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling
- URL: http://arxiv.org/abs/2511.20251v1
- Date: Tue, 25 Nov 2025 12:25:41 GMT
- Title: PromptMoG: Enhancing Diversity in Long-Prompt Image Generation via Prompt Embedding Mixture-of-Gaussian Sampling
- Authors: Bo-Kai Ruan, Teng-Fang Hsiao, Ling Lo, Yi-Lun Wu, Hong-Han Shuai,
- Abstract summary: Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but suppresses diversity, leading to repetitive and less creative outputs. We propose PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drift.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in text-to-image (T2I) generation have achieved remarkable visual outcomes through large-scale rectified flow models. However, how these models behave under long prompts remains underexplored. Long prompts encode rich content, spatial, and stylistic information that enhances fidelity but often suppresses diversity, leading to repetitive and less creative outputs. In this work, we systematically study this fidelity-diversity dilemma and reveal that state-of-the-art models exhibit a clear drop in diversity as prompt length increases. To enable consistent evaluation, we introduce LPD-Bench, a benchmark designed for assessing both fidelity and diversity in long-prompt generation. Building on our analysis, we develop a theoretical framework that increases sampling entropy through prompt reformulation and propose a training-free method, PromptMoG, which samples prompt embeddings from a Mixture-of-Gaussians in the embedding space to enhance diversity while preserving semantics. Extensive experiments on four state-of-the-art models, SD3.5-Large, Flux.1-Krea-Dev, CogView4, and Qwen-Image, demonstrate that PromptMoG consistently improves long-prompt generation diversity without semantic drift.
Related papers
- DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO [50.89703227426486]
Reinforcement learning (RL) significantly improves image generation quality by comparing the relative performance of images generated within the same group. In the later stages of training, however, the model tends to produce homogenized outputs that lack creativity and visual diversity. This issue can be analyzed from both reward-modeling and generation-dynamics perspectives.
arXiv Detail & Related papers (2025-12-25T05:37:37Z) - DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models [23.12099227251494]
We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive (VAR) models at test time. VAR models have emerged as strong competitors to diffusion and flow models for image generation, yet they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts.
arXiv Detail & Related papers (2025-11-26T14:06:52Z) - Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization [50.5332987313297]
We propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. In experiments on MS-COCO and three diffusion backbones, TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality.
arXiv Detail & Related papers (2025-11-25T00:42:09Z) - Evolve to Inspire: Novelty Search for Diverse Image Generation [6.040326113136291]
We introduce WANDER, a novelty search-based approach to generating diverse sets of images from a single input prompt. We employ a Large Language Model (LLM) for semantic evolution of prompts and CLIP embeddings to quantify novelty. We additionally apply emitters to guide the search into distinct regions of the prompt space and demonstrate that they boost the diversity of the generated images.
arXiv Detail & Related papers (2025-11-01T19:58:07Z) - Open Multimodal Retrieval-Augmented Factual Image Generation [86.34546873830152]
We introduce ORIG, an agentic open multimodal retrieval-augmented framework for Factual Image Generation (FIG). ORIG iteratively retrieves and filters multimodal evidence from the web and incrementally integrates the refined knowledge into enriched prompts to guide generation. Experiments demonstrate that ORIG substantially improves factual consistency and overall image quality over strong baselines.
arXiv Detail & Related papers (2025-10-26T04:13:31Z) - Towards Better & Faster Autoregressive Image Generation: From the Perspective of Entropy [23.573364375818553]
This work revisits the sampling issues in current autoregressive (AR) image generation models. We identify that image tokens, unlike text tokens, exhibit lower information density and non-uniform spatial distribution. We present an entropy-informed decoding strategy that facilitates higher autoregressive generation quality with faster synthesis speed.
arXiv Detail & Related papers (2025-10-10T05:26:11Z) - Lavida-O: Elastic Large Masked Diffusion Models for Unified Multimodal Understanding and Generation [63.50827603618498]
We propose Lavida-O, a unified Masked Diffusion Model (MDM) for multimodal understanding and generation. Lavida-O presents a single framework that enables image-level understanding, object grounding, image editing, and high-resolution text-to-image synthesis. Lavida-O achieves state-of-the-art performance on a wide range of benchmarks including RefCOCO object grounding, GenEval text-to-image generation, and ImgEdit image editing.
arXiv Detail & Related papers (2025-09-23T17:05:46Z) - SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation [0.6554326244334868]
We propose a novel framework that explicitly integrates diversity, faithfulness, and label clarity into the augmentation process. Our approach employs saliency-guided mixing and a fine-tuned diffusion model to preserve foreground semantics, enrich background diversity, and ensure label consistency.
arXiv Detail & Related papers (2025-05-17T03:51:18Z) - Exploring Representation-Aligned Latent Space for Better Generation [86.45670422239317]
We introduce ReaLS, which integrates semantic priors to improve generation performance. We show that fundamental DiT and SiT models trained on ReaLS achieve a 15% improvement in the FID metric. The enhanced semantic latent space also enables more perceptual downstream tasks, such as segmentation and depth estimation.
arXiv Detail & Related papers (2025-02-01T07:42:12Z) - Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling [49.41822427811098]
We present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregressive latent priors.
Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables.
We show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its capability to effectively control and direct the image generation process.
arXiv Detail & Related papers (2024-05-31T17:41:11Z) - Auto-regressive Image Synthesis with Integrated Quantization [55.51231796778219]
This paper presents a versatile framework for conditional image generation.
It incorporates the inductive bias of CNNs and powerful sequence modeling of auto-regression.
Our method achieves superior diverse image generation performance as compared with the state-of-the-art.
arXiv Detail & Related papers (2022-07-21T22:19:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.