DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models
- URL: http://arxiv.org/abs/2511.21415v1
- Date: Wed, 26 Nov 2025 14:06:52 GMT
- Title: DiverseVAR: Balancing Diversity and Quality of Next-Scale Visual Autoregressive Models
- Authors: Mingue Park, Prin Phunyaphibarn, Phillip Y. Lee, Minhyuk Sung
- Abstract summary: We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive (VAR) models at test time. VAR models have emerged as strong competitors to diffusion and flow models for image generation, yet they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts.
- Score: 23.12099227251494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce DiverseVAR, a framework that enhances the diversity of text-conditioned visual autoregressive models (VAR) at test time without requiring retraining, fine-tuning, or substantial computational overhead. While VAR models have recently emerged as strong competitors to diffusion and flow models for image generation, they suffer from a critical limitation in diversity, often producing nearly identical images even for simple prompts. This issue has largely gone unnoticed amid the predominant focus on image quality. We address this limitation at test time in two stages. First, inspired by diversity enhancement techniques in diffusion models, we propose injecting noise into the text embedding. This introduces a trade-off between diversity and image quality: as diversity increases, the image quality sharply declines. To preserve quality, we propose scale-travel: a novel latent refinement technique inspired by time-travel strategies in diffusion models. Specifically, we use a multi-scale autoencoder to extract coarse-scale tokens that enable us to resume generation at intermediate stages. Extensive experiments show that combining text-embedding noise injection with our scale-travel refinement significantly enhances diversity while minimizing image-quality degradation, achieving a new Pareto frontier in the diversity-quality trade-off.
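The two test-time stages described in the abstract can be summarized in pseudocode. Below is a minimal sketch, assuming a generic next-scale VAR interface; `text_encoder`, `var_model.generate`, `multiscale_encode`, and `resume_from_scale` are hypothetical placeholders rather than the authors' actual API, and the choice of conditioning during the refinement stage is an illustrative assumption.
```python
import torch

def generate_diverse(var_model, text_encoder, prompt, noise_scale=0.1,
                     travel_scale=4, seed=None):
    """Test-time diversity boost for a next-scale VAR model (sketch).

    Stage 1: perturb the text embedding with Gaussian noise to spread
    the conditioning signal across nearby semantics.
    Stage 2 ("scale-travel"): re-encode the intermediate image into
    coarse-scale tokens and resume generation from that scale so the
    finer scales can repair quality lost to the noise.
    """
    if seed is not None:
        torch.manual_seed(seed)

    # Stage 1: noise injection into the text embedding.
    cond = text_encoder(prompt)                        # (1, L, D) conditioning
    cond = cond + noise_scale * torch.randn_like(cond)

    # Full coarse-to-fine generation with the perturbed condition.
    image = var_model.generate(cond)

    # Stage 2: scale-travel refinement. Extract coarse-scale tokens with a
    # multi-scale autoencoder, then resume generation from an intermediate
    # scale; reusing the perturbed condition here is an assumption of this
    # sketch, not a detail stated in the abstract.
    coarse_tokens = var_model.multiscale_encode(image, up_to_scale=travel_scale)
    refined = var_model.resume_from_scale(
        tokens=coarse_tokens,
        start_scale=travel_scale,
        cond=cond,
    )
    return refined
```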
Related papers
- DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO [50.89703227426486]
Reinforcement learning (RL) improves image generation quality significantly by comparing the relative performance of images generated within the same group. In the later stages of training, the model tends to produce homogenized outputs, lacking creativity and visual diversity. This issue can be analyzed from both reward-modeling and generation-dynamics perspectives.
arXiv Detail & Related papers (2025-12-25T05:37:37Z)
- DiverseAR: Boosting Diversity in Bitwise Autoregressive Image Generation [22.400053095939402]
We introduce DiverseAR, a principled and effective method that enhances image diversity without sacrificing visual quality. Specifically, we introduce an adaptive logits distribution scaling mechanism that dynamically adjusts the sharpness of the binary output distribution during sampling. To mitigate potential fidelity loss caused by distribution smoothing, we develop an energy-based generation path search algorithm that avoids sampling low-confidence tokens.
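A minimal sketch of the kind of adaptive sharpness scaling described above for bitwise sampling; the per-bit temperature schedule used here is an illustrative assumption, not the paper's exact formula.
```python
import torch

def sample_bits(bit_logits, tau_min=0.8, tau_max=1.5):
    """Adaptive sharpness scaling for binary (bitwise) token sampling (sketch).

    bit_logits: (..., K) logits of K independent Bernoulli bits.
    Confident bits (|logit| large) are sampled with a higher temperature to
    flatten the distribution and encourage diversity; uncertain bits keep a
    near-greedy temperature to protect fidelity. The exact schedule is an
    illustrative assumption.
    """
    confidence = torch.sigmoid(bit_logits.abs())             # in (0.5, 1)
    tau = tau_min + (tau_max - tau_min) * (confidence - 0.5) * 2
    probs = torch.sigmoid(bit_logits / tau)
    return torch.bernoulli(probs)

bits = sample_bits(torch.randn(2, 16))   # toy example: 2 tokens x 16 bits
```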
arXiv Detail & Related papers (2025-12-02T16:54:36Z)
- Training-Free Generation of Diverse and High-Fidelity Images via Prompt Semantic Space Optimization [50.5332987313297]
We propose Token-Prompt embedding Space Optimization (TPSO), a training-free and model-agnostic module. TPSO introduces learnable parameters to explore underrepresented regions of the token embedding space, reducing the tendency of the model to repeatedly generate samples from strong modes of the learned distribution. In experiments on MS-COCO and three diffusion backbones, TPSO significantly enhances generative diversity, improving baseline performance from 1.10 to 4.18 points, without sacrificing image quality.
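A rough sketch of the idea behind TPSO: optimize a small learnable offset in the prompt embedding space at inference time, pushing away from the model's dominant mode while staying close to the original embedding. The `generate_fn` feature extractor and its differentiability are assumptions of this sketch, not the paper's implementation.
```python
import torch

def optimize_prompt_offset(generate_fn, prompt_emb, steps=20, lr=1e-2, reg=0.1):
    """Training-free prompt-embedding-space exploration (sketch).

    generate_fn(emb) is assumed to map a prompt embedding to a differentiable
    image feature (e.g., via a frozen backbone). A learnable offset is pushed
    away from the baseline generation while a penalty keeps it close to the
    original embedding so semantics are preserved.
    """
    offset = torch.zeros_like(prompt_emb, requires_grad=True)
    opt = torch.optim.Adam([offset], lr=lr)
    with torch.no_grad():
        baseline = generate_fn(prompt_emb)         # feature of the "mode" sample
    for _ in range(steps):
        feat = generate_fn(prompt_emb + offset)
        sim = torch.nn.functional.cosine_similarity(
            feat.flatten(), baseline.flatten(), dim=0)
        loss = sim + reg * offset.pow(2).mean()    # diversify, stay on-prompt
        opt.zero_grad()
        loss.backward()
        opt.step()
    return (prompt_emb + offset).detach()
```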
arXiv Detail & Related papers (2025-11-25T00:42:09Z)
- Diversity Has Always Been There in Your Visual Autoregressive Models [78.27363151940996]
Visual Autoregressive (VAR) models have recently garnered significant attention for their innovative next-scale prediction paradigm. Despite their efficiency, VAR models often suffer from diversity collapse, analogous to that observed in few-step distilled diffusion models. We introduce Diverse VAR, a simple yet effective approach that restores the generative diversity of VAR models without requiring any additional training.
arXiv Detail & Related papers (2025-11-21T09:24:09Z)
- ImageReFL: Balancing Quality and Diversity in Human-Aligned Diffusion Models [2.712399554918533]
Reward-based fine-tuning using models trained on human feedback improves alignment but often harms diversity, producing less varied outputs. First, we introduce combined generation, a novel sampling strategy that applies a reward-tuned diffusion model only in the later stages of the generation process. Second, we propose ImageReFL, a fine-tuning method that improves image diversity with minimal loss in quality by training on real images.
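A minimal sketch of the combined-generation schedule described above, assuming a generic one-step denoiser interface `denoise_step(x, t)`; the switch point and interface are illustrative, not ImageReFL's actual code.
```python
import torch

@torch.no_grad()
def combined_generation(base_model, reward_tuned_model, x_T, timesteps,
                        switch_frac=0.3):
    """Combined generation (sketch): the base diffusion model handles early
    denoising steps (global structure, diversity) and a reward-tuned model
    takes over only for the final fraction of steps (fidelity / preference).
    """
    n = len(timesteps)
    switch_at = int((1.0 - switch_frac) * n)   # index where we swap models
    x = x_T
    for i, t in enumerate(timesteps):
        model = base_model if i < switch_at else reward_tuned_model
        x = model.denoise_step(x, t)           # assumed single-step denoiser
    return x
```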
arXiv Detail & Related papers (2025-05-28T16:45:07Z)
- SGD-Mix: Enhancing Domain-Specific Image Classification with Label-Preserving Data Augmentation [0.6554326244334868]
We propose a novel framework that explicitly integrates diversity, faithfulness, and label clarity into the augmentation process. Our approach employs saliency-guided mixing and a fine-tuned diffusion model to preserve foreground semantics, enrich background diversity, and ensure label consistency.
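A toy sketch of saliency-guided mixing, assuming a saliency map is already available from any off-the-shelf detector; SGD-Mix's full pipeline additionally involves a fine-tuned diffusion model, which is omitted here.
```python
import torch

def saliency_guided_mix(foreground_img, background_img, saliency_map):
    """Saliency-guided mixing (sketch). The saliency map (values in [0, 1])
    keeps the label-defining foreground from the source image and fills the
    rest with a new, e.g. diffusion-generated, background.
    Shapes: images (C, H, W), map (1, H, W).
    """
    mask = saliency_map.clamp(0, 1)
    return mask * foreground_img + (1 - mask) * background_img

# Toy usage with random tensors standing in for real images.
mixed = saliency_guided_mix(torch.rand(3, 64, 64), torch.rand(3, 64, 64),
                            torch.rand(1, 64, 64))
```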
arXiv Detail & Related papers (2025-05-17T03:51:18Z)
- Unified Multimodal Discrete Diffusion [78.48930545306654]
Multimodal generative models that can understand and generate across multiple modalities are dominated by autoregressive (AR) approaches. We explore discrete diffusion models as a unified generative formulation in the joint text and image domain. We present the first Unified Multimodal Discrete Diffusion (UniDisc) model, which is capable of jointly understanding and generating text and images.
arXiv Detail & Related papers (2025-03-26T17:59:51Z)
- HypDAE: Hyperbolic Diffusion Autoencoders for Hierarchical Few-shot Image Generation [32.16985870309231]
Few-shot image generation aims to generate diverse and high-quality images for an unseen class given only a few examples in that class. We propose Hyperbolic Diffusion Autoencoders (HypDAE), a novel approach that operates in hyperbolic space to capture hierarchical relationships among images from seen categories.
arXiv Detail & Related papers (2024-11-27T00:45:51Z)
- MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling [64.09238330331195]
We propose a novel Multi-Modal Auto-Regressive (MMAR) probabilistic modeling framework. Unlike discretization-based methods, MMAR takes in continuous-valued image tokens to avoid information loss in an efficient way. We also propose a theoretically proven technique that addresses the numerical stability issue and a training strategy that balances the generation and understanding task goals.
arXiv Detail & Related papers (2024-10-14T17:57:18Z)
- Kaleido Diffusion: Improving Conditional Diffusion Models with Autoregressive Latent Modeling [49.41822427811098]
We present Kaleido, a novel approach that enhances the diversity of samples by incorporating autoregressive latent priors.
Kaleido integrates an autoregressive language model that encodes the original caption and generates latent variables.
We show that Kaleido adheres closely to the guidance provided by the generated latent variables, demonstrating its capability to effectively control and direct the image generation process.
arXiv Detail & Related papers (2024-05-31T17:41:11Z)
- Diverse Diffusion: Enhancing Image Diversity in Text-to-Image Generation [0.0]
We introduce Diverse Diffusion, a method for boosting image diversity beyond gender and ethnicity.
Our approach contributes to the creation of more inclusive and representative AI-generated art.
arXiv Detail & Related papers (2023-10-19T08:48:23Z)
- Effective Data Augmentation With Diffusion Models [45.18188726287581]
We address the lack of diversity in data augmentation with image-to-image transformations parameterized by pre-trained text-to-image diffusion models. Our method edits images to change their semantics using an off-the-shelf diffusion model, and generalizes to novel visual concepts from a few labelled examples. We evaluate our approach on few-shot image classification tasks and on a real-world weed recognition task, and observe an improvement in accuracy in tested domains.
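A minimal sketch of diffusion-based augmentation via image-to-image editing using an off-the-shelf pipeline; the model id, prompt template, and strength value are illustrative choices, not the paper's exact setup.
```python
# pip install diffusers transformers torch pillow  (assumed environment)
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

def augment(image_path, class_name, n_variants=4, strength=0.5):
    """Edit a labelled image with an off-the-shelf img2img pipeline so its
    semantics stay tied to the class prompt while appearance varies (sketch)."""
    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")
    src = Image.open(image_path).convert("RGB").resize((512, 512))
    prompt = f"a photo of a {class_name}"   # illustrative prompt template
    return [pipe(prompt=prompt, image=src, strength=strength).images[0]
            for _ in range(n_variants)]
```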
arXiv Detail & Related papers (2023-02-07T20:42:28Z)
- Auto-regressive Image Synthesis with Integrated Quantization [55.51231796778219]
This paper presents a versatile framework for conditional image generation.
It incorporates the inductive bias of CNNs with the powerful sequence modeling of auto-regression.
Our method achieves superior image generation diversity compared with the state of the art.
arXiv Detail & Related papers (2022-07-21T22:19:17Z)