Jointly Reinforcing Diversity and Quality in Language Model Generations
- URL: http://arxiv.org/abs/2509.02534v1
- Date: Tue, 02 Sep 2025 17:38:47 GMT
- Title: Jointly Reinforcing Diversity and Quality in Language Model Generations
- Authors: Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, Tianlu Wang
- Abstract summary: Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity.
- Score: 64.72289248044514
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and reduces the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variations. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs that are simultaneously of higher quality and novelty. In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests itself as higher-quality responses.
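The abstract describes combining a diversity signal with a quality reward during online RL. As a rough sketch of that combination, the snippet below scores each response in a sampled group by its quality times its dissimilarity to the rest of the group. The multiplicative combination, the cosine-similarity diversity measure, and the function name are illustrative assumptions; DARLING itself measures diversity with a learned partition function rather than raw embedding similarity.

```python
import numpy as np

def combined_rewards(quality, embeddings):
    """Diversity-weighted rewards for one prompt's sampled responses.

    quality:    (n,) quality scores for n responses.
    embeddings: (n, d) semantic embeddings of those responses.
    Illustrative assumption: reward_i = quality_i * diversity_i, where
    diversity_i is the mean cosine dissimilarity to the other responses.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T                      # pairwise cosine similarity
    n = len(quality)
    # Mean similarity of each response to the *other* responses.
    mean_sim = (sim.sum(axis=1) - 1.0) / max(n - 1, 1)
    diversity = 1.0 - mean_sim               # higher when responses differ
    return quality * diversity
```

Under this toy combination, two near-duplicate responses split their diversity credit, so a quality-only optimum that collapses onto one mode is no longer the reward optimum.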
Related papers
- DPWriter: Reinforcement Learning with Diverse Planning Branching for Creative Writing [78.70918589095639]
Reinforcement learning (RL)-based enhancement of large language models (LLMs) often leads to reduced output diversity. This paper proposes an RL framework structured around a semi-structured long Chain-of-Thought (CoT). We introduce a Diverse Planning Branching method that strategically introduces divergence at the planning phase based on diversity variation.
arXiv Detail & Related papers (2026-01-14T16:30:20Z) - DiverseGRPO: Mitigating Mode Collapse in Image Generation via Diversity-Aware GRPO [50.89703227426486]
Reinforcement learning (RL) improves image generation quality significantly by comparing the relative performance of images generated within the same group. In the later stages of training, the model tends to produce homogenized outputs, lacking creativity and visual diversity. This issue can be analyzed from both reward modeling and generation dynamics perspectives.
arXiv Detail & Related papers (2025-12-25T05:37:37Z) - Optimizing Diversity and Quality through Base-Aligned Model Collaboration [49.59542918674004]
We propose Base-Aligned Model Collaboration (BACo) to optimize diversity and quality. BACo employs routing strategies that determine, at each token, from which model to decode. BACo achieves both high diversity and quality post hoc within a single pass, while offering strong controllability.
arXiv Detail & Related papers (2025-11-07T19:00:01Z) - G2: Guided Generation for Enhanced Output Diversity in LLMs [22.52615993477612]
Large Language Models (LLMs) have demonstrated exceptional performance across diverse natural language processing tasks. However, LLMs exhibit a critical limitation in output diversity, often generating highly similar content across multiple attempts. We propose Guide-to-Generation (G2), a training-free plug-and-play method that enhances output diversity while preserving generation quality.
arXiv Detail & Related papers (2025-11-01T07:13:28Z) - Post-training Large Language Models for Diverse High-Quality Responses [32.92680825196664]
Reinforcement learning (RL) has emerged as a popular method for post-training large language models (LLMs). We propose a novel training method named DQO (Diversity Quality Optimization) based on determinantal point processes (DPPs). Our approach samples and embeds a group of responses for each prompt, then uses the determinant of a kernel-based similarity matrix to measure diversity as the volume spanned by the embeddings of these responses.
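The determinant-as-volume idea in the DQO summary can be sketched as follows: build a similarity kernel over the group's embeddings and take its log-determinant, which shrinks toward negative infinity as responses become near-duplicates. The RBF kernel choice and the jitter term are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def dpp_diversity(embeddings, eps=1e-6):
    """Log-volume diversity of a response group via a DPP-style kernel.

    embeddings: (n, d) embeddings of n responses to the same prompt.
    The Gram matrix's determinant measures the volume spanned by the
    embeddings; duplicated responses make the matrix near-singular.
    Assumptions: RBF similarity kernel, `eps` jitter for stability.
    """
    sq_dists = ((embeddings[:, None, :] - embeddings[None, :, :]) ** 2).sum(-1)
    kernel = np.exp(-0.5 * sq_dists)            # RBF similarity matrix
    kernel += eps * np.eye(len(embeddings))     # numerical stabilizer
    sign, logdet = np.linalg.slogdet(kernel)
    return logdet
```

A well-spread group yields a kernel close to the identity (log-determinant near zero), while a collapsed group yields a kernel close to the all-ones matrix (log-determinant strongly negative), which is the gradient signal a DPP-based objective exploits.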
arXiv Detail & Related papers (2025-09-05T03:47:06Z) - Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs [102.48588475875749]
We introduce Generative Self-Refinement (GSR), a novel parallel test-time scaling framework. GSR generates a set of candidate responses in parallel and then performs self-refinement to synthesize a new superior solution. We show that our method achieves state-of-the-art performance across five mathematical benchmarks.
arXiv Detail & Related papers (2025-08-27T06:51:48Z) - Evaluating the Diversity and Quality of LLM Generated Content [72.84945252821908]
We introduce a framework for measuring effective semantic diversity, i.e., diversity among outputs that meet quality thresholds. Although preference-tuned models exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models. These findings have important implications for applications that require diverse yet high-quality outputs.
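The notion of "effective semantic diversity" above, diversity computed only over outputs that clear a quality bar, can be sketched as a threshold filter followed by a pairwise dissimilarity average. The cosine-distance measure and the function name are illustrative assumptions about the framework, not its exact definition.

```python
import numpy as np

def effective_semantic_diversity(quality, embeddings, threshold):
    """Mean pairwise cosine dissimilarity among outputs above a quality bar.

    quality:    (n,) quality scores.
    embeddings: (n, d) semantic embeddings.
    Assumption: diversity is averaged only over responses whose quality
    meets `threshold`; fewer than two survivors count as zero diversity.
    """
    kept = embeddings[quality >= threshold]
    n = len(kept)
    if n < 2:
        return 0.0
    unit = kept / np.linalg.norm(kept, axis=1, keepdims=True)
    sim = unit @ unit.T
    # Average the off-diagonal dissimilarities (diagonal contributes zero).
    return float((1.0 - sim).sum() / (n * (n - 1)))
```

This construction explains the paper's headline finding: a model can lose lexical variety yet gain effective diversity, because duplicated low-quality outputs never enter the average.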
arXiv Detail & Related papers (2025-04-16T23:02:23Z) - Modifying Large Language Model Post-Training for Diverse Creative Writing [12.872333448726595]
In creative writing generation, we investigate post-training approaches to promote both output diversity and quality. Our core idea is to include deviation in the training objective to facilitate learning from rare high-quality instances. Our best model, with 8B parameters, achieves diversity on par with a human-created dataset while matching the output quality of the best instruction-tuned models.
arXiv Detail & Related papers (2025-03-21T13:21:45Z) - Generate, Discriminate, Evolve: Enhancing Context Faithfulness via Fine-Grained Sentence-Level Self-Evolution [61.80716438091887]
GenDiE (Generate, Discriminate, Evolve) is a novel self-evolving framework that enhances context faithfulness through fine-grained sentence-level optimization. By treating each sentence in a response as an independent optimization unit, GenDiE effectively addresses the limitations of previous approaches. Experiments on ASQA (in-domain LFQA) and ConFiQA datasets demonstrate that GenDiE surpasses various baselines in both faithfulness and correctness.
arXiv Detail & Related papers (2025-03-03T16:08:33Z) - Diversity-Rewarded CFG Distillation [62.08448835625036]
We introduce diversity-rewarded CFG distillation, a novel finetuning procedure that distills the strengths of CFG while addressing its limitations.
Our approach optimises two training objectives: (1) a distillation objective, encouraging the model alone (without CFG) to imitate the CFG-augmented predictions, and (2) an RL objective with a diversity reward, promoting the generation of diverse outputs for a given prompt.
arXiv Detail & Related papers (2024-10-08T14:40:51Z) - Improving Diversity of Commonsense Generation by Large Language Models via In-Context Learning [28.654890118684957]
Generative Commonsense Reasoning (GCR) requires a model to reason about a situation using commonsense knowledge.
The diversity of the generation is equally important because it reflects the model's ability to use a range of commonsense knowledge facts.
We propose a simple method that diversifies the LLM generations, while preserving their quality.
arXiv Detail & Related papers (2024-04-25T17:52:39Z) - Quality Diversity through Human Feedback: Towards Open-Ended Diversity-Driven Optimization [13.436983663467938]
This paper introduces Quality Diversity through Human Feedback (QDHF), a novel approach that progressively infers diversity metrics from human judgments of similarity among solutions.
Empirical studies show that QDHF significantly outperforms state-of-the-art methods in automatic diversity discovery.
In open-ended generative tasks, QDHF substantially enhances the diversity of text-to-image generation from a diffusion model.
arXiv Detail & Related papers (2023-10-18T16:46:16Z) - SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
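SUNRISE's second ingredient, selecting actions by the highest upper-confidence bound over a Q-ensemble, can be sketched as a mean-plus-scaled-standard-deviation score. The mean + λ·std form follows the abstract's description; the function signatures and the λ parameter name are illustrative assumptions.

```python
import numpy as np

def ucb_action(q_ensemble, state, candidate_actions, lam=1.0):
    """Pick the action with the highest upper-confidence bound.

    q_ensemble: list of Q-functions q(state, action) -> float.
    The ensemble's disagreement (std of Q-estimates) acts as an
    optimism bonus that drives exploration toward uncertain actions.
    """
    scores = []
    for a in candidate_actions:
        qs = np.array([q(state, a) for q in q_ensemble])
        scores.append(qs.mean() + lam * qs.std())  # optimism under uncertainty
    return candidate_actions[int(np.argmax(scores))]
```

When the ensemble agrees, the bonus vanishes and the rule reduces to greedy action selection; disagreement inflates the score, so poorly understood actions get tried first.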
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.