GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation
- URL: http://arxiv.org/abs/2602.17200v1
- Date: Thu, 19 Feb 2026 09:41:32 GMT
- Title: GASS: Geometry-Aware Spherical Sampling for Disentangled Diversity Enhancement in Text-to-Image Generation
- Authors: Ye Zhu, Kaleb S. Newman, Johannes F. Lutzeyer, Adriana Romero-Soriano, Michal Drozdzal, Olga Russakovsky
- Abstract summary: Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. We introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Our experiments on different frozen T2I backbones and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
- Score: 32.63174739701972
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Despite high semantic alignment, modern text-to-image (T2I) generative models still struggle to synthesize diverse images from a given prompt. This lack of diversity not only restricts user choice, but also risks amplifying societal biases. In this work, we enhance T2I diversity through a geometric lens. Unlike most existing methods that rely primarily on entropy-based guidance to increase sample dissimilarity, we introduce Geometry-Aware Spherical Sampling (GASS) to enhance diversity by explicitly controlling both prompt-dependent and prompt-independent sources of variation. Specifically, we decompose the diversity measure in CLIP embeddings using two orthogonal directions: the text embedding, which captures semantic variation related to the prompt, and an identified orthogonal direction that captures prompt-independent variation (e.g., backgrounds). Based on this decomposition, GASS increases the geometric projection spread of generated image embeddings along both axes and guides the T2I sampling process via expanded predictions along the generation trajectory. Our experiments on different frozen T2I backbones (U-Net and DiT, diffusion and flow) and benchmarks demonstrate the effectiveness of disentangled diversity enhancement with minimal impact on image fidelity and semantic alignment.
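Taking the abstract at face value, the decomposition admits a compact sketch: project unit-normalized CLIP image embeddings onto the prompt's text embedding, take a principal direction of the residual as the prompt-independent axis, and measure the spread of projections along each axis. The NumPy sketch below is an illustrative reconstruction under those assumptions; the function name `projection_spread` and the use of SVD to pick the orthogonal direction are mine, not the paper's.

```python
import numpy as np

def projection_spread(image_embs: np.ndarray, text_emb: np.ndarray):
    """Spread of image embeddings along the prompt axis and one
    prompt-independent orthogonal axis (illustrative reconstruction)."""
    t = text_emb / np.linalg.norm(text_emb)
    sem = image_embs @ t                          # scalar projections onto the prompt axis
    residual = image_embs - np.outer(sem, t)      # component orthogonal to the prompt
    centered = residual - residual.mean(axis=0)   # rows remain exactly orthogonal to t
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    ortho_axis = vt[0]                            # principal prompt-independent direction
    ind = residual @ ortho_axis                   # projections onto that direction
    return sem.var(), ind.var()                   # "projection spread" as variance per axis

# Toy usage with random stand-ins for CLIP embeddings.
rng = np.random.default_rng(0)
imgs = rng.normal(size=(16, 512))
imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)   # unit-normalized, like CLIP outputs
txt = rng.normal(size=512)
print(projection_spread(imgs, txt))
```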
Related papers
- DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation [22.400053095939402]
We introduce DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality. Specifically, we design an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling. To mitigate potential fidelity loss caused by distribution smoothing, we develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens.
arXiv Detail & Related papers (2025-12-02T16:54:36Z)
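The DiverseAR summary names an adaptive logits scaling mechanism but not its formula, so the sketch below only illustrates one plausible form of confidence-adaptive smoothing for a binary output distribution; the temperature rule and the `alpha` parameter are assumptions, not the paper's design.

```python
import numpy as np

def sample_bit(logit: float, base_temp: float = 1.0, alpha: float = 2.0,
               rng=np.random.default_rng()) -> int:
    """Sample one binary token with confidence-adaptive smoothing.

    Assumed rule: temperature grows with the model's confidence,
    flattening near-deterministic bits so sampling admits more
    diverse generation paths.
    """
    p = 1.0 / (1.0 + np.exp(-logit))       # model probability of bit = 1
    confidence = abs(p - 0.5) * 2.0        # 0 (uncertain) .. 1 (certain)
    temp = base_temp * (1.0 + alpha * confidence)
    p_smooth = 1.0 / (1.0 + np.exp(-logit / temp))
    return int(rng.random() < p_smooth)

bits = [sample_bit(4.0) for _ in range(10)]   # a confident bit now occasionally flips
```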
- Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization [50.5332987313297]
We propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. In experiments on MS-COCO and three diffusion backbones, TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality.
arXiv Detail & Related papers (2025-11-25T00:42:09Z)
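A minimal sketch of the TPSO idea as summarized above, assuming a learnable offset on the prompt's token embeddings trained with a repulsion term between sample features plus a penalty that keeps the offset small; `fake_generator` stands in for a frozen diffusion backbone, and the loss weights are illustrative rather than the authors' setup.

```python
import torch

# Stand-in for a frozen T2I backbone: maps (perturbed) prompt embeddings
# plus noise to sample features. Purely illustrative.
def fake_generator(prompt_embs: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    return torch.tanh(noise * prompt_embs.mean(dim=0))

prompt_embs = torch.randn(8, 64)                # 8 prompt tokens, 64-dim embeddings
delta = torch.zeros_like(prompt_embs, requires_grad=True)   # the learnable offset
opt = torch.optim.Adam([delta], lr=1e-2)

for _ in range(50):
    noise = torch.randn(4, 64)                  # a small batch of samples
    feats = fake_generator(prompt_embs + delta, noise)
    pairwise = torch.cdist(feats, feats)
    diversity = -pairwise.sum() / (feats.shape[0] * (feats.shape[0] - 1))  # push samples apart
    fidelity = delta.pow(2).sum()               # keep the offset (and semantics) near the prompt
    loss = diversity + 0.1 * fidelity
    opt.zero_grad(); loss.backward(); opt.step()
```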
- Diverse Text-to-Image Generation via Contrastive Noise Optimization [60.48914865049489]
Text-to-image (T2I) diffusion models have demonstrated impressive performance in generating high-fidelity images. Existing approaches typically optimize intermediate latents or text conditions during inference. We introduce Contrastive Noise Optimization, a simple yet effective method that addresses the diversity issue from a distinct perspective.
arXiv Detail & Related papers (2025-10-04T13:51:32Z)
- Guiding Diffusion with Deep Geometric Moments: Balancing Fidelity and Variation [35.428991756584935]
We introduce Deep Geometric Moments (DGM) as a novel form of guidance that encapsulates the subject's visual features and nuances through a learned geometric prior. Our experiments demonstrate that DGM effectively balances control and diversity in diffusion-based image generation, allowing a flexible control mechanism for steering the diffusion process.
arXiv Detail & Related papers (2025-05-18T16:19:27Z)
- MegaSR: Mining Customized Semantics and Expressive Guidance for Image Super-Resolution [76.30559905769859]
MegaSR mines customized block-wise semantics and expressive guidance for diffusion-based ISR. We experimentally identify HED edge maps, depth maps, and segmentation maps as the most expressive guidance. Extensive experiments demonstrate the superiority of MegaSR in terms of semantic richness and structural consistency.
arXiv Detail & Related papers (2025-03-11T07:00:20Z)
- Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack [51.16384207202798]
Vision-language pre-training models are vulnerable to multimodal adversarial examples (AEs).
Previous approaches augment image-text pairs to enhance diversity within the adversarial example generation process.
We propose sampling from adversarial evolution triangles composed of clean, historical, and current adversarial examples to enhance adversarial diversity.
arXiv Detail & Related papers (2024-11-04T23:07:51Z)
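The evolution-triangle summary has a natural reading as drawing random barycentric combinations of the clean, historical, and current adversarial examples; the sketch below implements that plain reading (uniform sampling over the triangle via a Dirichlet draw) and is not necessarily the authors' exact sampler.

```python
import numpy as np

def sample_from_triangle(clean: np.ndarray, historical: np.ndarray,
                         current: np.ndarray, rng=np.random.default_rng()):
    """Draw a point uniformly from the triangle spanned by a clean example
    and two adversarial examples, via random barycentric coordinates.
    Dirichlet(1, 1, 1) is uniform on the 2-simplex, hence on the triangle."""
    w = rng.dirichlet(np.ones(3))   # w >= 0 and w.sum() == 1
    return w[0] * clean + w[1] * historical + w[2] * current
```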
- Diverse Diffusion: Enhancing Image Diversity in Text-to-Image Generation [0.0]
We introduce Diverse Diffusion, a method for boosting image diversity beyond gender and ethnicity.
Our approach contributes to the creation of more inclusive and representative AI-generated art.
arXiv Detail & Related papers (2023-10-19T08:48:23Z)
- Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations [61.132408427908175]
Zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain. With only a single representative text feature instead of real images to guide adaptation, the synthesized images gradually lose diversity.
We propose a novel method to find semantic variations of the target text in the CLIP space.
arXiv Detail & Related papers (2023-08-21T08:12:28Z)
- DiverGAN: An Efficient and Effective Single-Stage Framework for Diverse Text-to-Image Generation [7.781425222538382]
DiverGAN is a framework to generate diverse, plausible and semantically consistent images according to a natural-language description.
DiverGAN adopts two novel word-level attention modules, i.e., a channel-attention module (CAM) and a pixel-attention module (PAM).
Conditional Adaptive Instance-Layer Normalization (CAdaILN) is introduced to enable the linguistic cues from the sentence embedding to flexibly manipulate the amount of change in shape and texture.
arXiv Detail & Related papers (2021-11-17T17:59:56Z)
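CAdaILN is only named in the DiverGAN summary, but the name suggests a conditional variant of adaptive instance-layer normalization (AdaILN, introduced in U-GAT-IT) with scale and shift predicted from the sentence embedding. The PyTorch sketch below is that guess, not DiverGAN's verified design; the gating parameter `rho` and the two linear heads are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CAdaILN(nn.Module):
    """Sketch of Conditional Adaptive Instance-Layer Normalization.

    Blends instance-normalized and layer-normalized features with a
    learnable per-channel gate, then applies a scale and shift predicted
    from the sentence embedding. Reconstructed from a one-line description;
    details are assumptions, not DiverGAN's exact module.
    """
    def __init__(self, num_channels: int, sent_dim: int):
        super().__init__()
        self.rho = nn.Parameter(torch.full((1, num_channels, 1, 1), 0.5))
        self.to_gamma = nn.Linear(sent_dim, num_channels)
        self.to_beta = nn.Linear(sent_dim, num_channels)

    def forward(self, x: torch.Tensor, sent_emb: torch.Tensor) -> torch.Tensor:
        inst = F.instance_norm(x)             # per-channel, per-sample statistics
        layer = F.layer_norm(x, x.shape[1:])  # per-sample statistics over C, H, W
        mixed = self.rho * inst + (1 - self.rho) * layer
        gamma = self.to_gamma(sent_emb)[:, :, None, None]
        beta = self.to_beta(sent_emb)[:, :, None, None]
        return gamma * mixed + beta

# Toy usage: 2 images, 32 channels, conditioned on a 64-dim sentence embedding.
out = CAdaILN(32, 64)(torch.randn(2, 32, 8, 8), torch.randn(2, 64))
```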
- Rethinking conditional GAN training: An approach using geometrically structured latent manifolds [58.07468272236356]
Conditional GANs (cGANs) suffer from critical drawbacks such as a lack of diversity in generated outputs.
We propose a novel training mechanism that increases both the diversity and the visual quality of a vanilla cGAN.
arXiv Detail & Related papers (2020-11-25T22:54:11Z)