Diverse, not Short: A Length-Controlled Self-Learning Framework for Improving Response Diversity of Language Models
- URL: http://arxiv.org/abs/2505.16245v2
- Date: Mon, 26 May 2025 17:21:01 GMT
- Title: Diverse, not Short: A Length-Controlled Self-Learning Framework for Improving Response Diversity of Language Models
- Authors: Vijeta Deshpande, Debasmita Ghose, John D. Patterson, Roger Beaty, Anna Rumshisky,
- Abstract summary: We show that common diversity metrics, and even reward models used for preference optimization, systematically bias models toward shorter outputs.<n>We introduce Diverse, not Short (Diverse-NS), a length-controlled self-learning framework that improves response diversity while maintaining length parity.
- Score: 8.023589594229914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diverse language model responses are crucial for creative generation, open-ended tasks, and self-improvement training. We show that common diversity metrics, and even reward models used for preference optimization, systematically bias models toward shorter outputs, limiting expressiveness. To address this, we introduce Diverse, not Short (Diverse-NS), a length-controlled self-learning framework that improves response diversity while maintaining length parity. By generating and filtering preference data that balances diversity, quality, and length, Diverse-NS enables effective training using only 3,000 preference pairs. Applied to LLaMA-3.1-8B and the Olmo-2 family, Diverse-NS substantially enhances lexical and semantic diversity. We show consistent improvement in diversity with minor reduction or gains in response quality on four creative generation tasks: Divergent Associations, Persona Generation, Alternate Uses, and Creative Writing. Surprisingly, experiments with the Olmo-2 model family (7B, and 13B) show that smaller models like Olmo-2-7B can serve as effective "diversity teachers" for larger models. By explicitly addressing length bias, our method efficiently pushes models toward more diverse and expressive outputs.
Related papers
- Mind the Gap: Conformative Decoding to Improve Output Diversity of Instruction-Tuned Large Language Models [0.0]
This paper investigates the diversity gap'' for a writing prompt narrative generation task.<n>Results show significant decreases in diversity due to instruction-tuning.<n>We present a new decoding strategy, conformative decoding, which guides an instruct model using its more diverse base model to reintroduce output diversity.
arXiv Detail & Related papers (2025-07-28T16:04:25Z) - Evaluating the Diversity and Quality of LLM Generated Content [72.84945252821908]
We introduce a framework for measuring effective semantic diversity--diversity among outputs that meet quality thresholds.<n>Although preference-tuned models exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models.<n>These findings have important implications for applications that require diverse yet high-quality outputs.
arXiv Detail & Related papers (2025-04-16T23:02:23Z) - NoveltyBench: Evaluating Language Models for Humanlike Diversity [21.6078675947446]
NoveltyBench is a benchmark designed to evaluate the ability of language models to produce multiple distinct and high-quality outputs.<n>We evaluate 20 leading language models and find that current state-of-the-art systems generate significantly less diversity than human writers.
arXiv Detail & Related papers (2025-04-07T16:14:23Z) - Modifying Large Language Model Post-Training for Diverse Creative Writing [12.872333448726595]
In creative writing generation, we investigate post-training approaches to promote both output diversity and quality.<n>Our core idea is to include deviation in the training objective to facilitate learning from rare high-quality instances.<n>Our best model with 8B parameters could achieve on-par diversity as a human-created dataset while having output quality similar to the best instruction-tuned models.
arXiv Detail & Related papers (2025-03-21T13:21:45Z) - Improving Diversity of Commonsense Generation by Large Language Models via In-Context Learning [28.654890118684957]
Generative Commonsense Reasoning (GCR) requires a model to reason about a situation using commonsense knowledge.
The diversity of the generation is equally important because it reflects the model's ability to use a range of commonsense knowledge facts.
We propose a simple method that diversifies the LLM generations, while preserving their quality.
arXiv Detail & Related papers (2024-04-25T17:52:39Z) - UniDiff: Advancing Vision-Language Models with Generative and
Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC)
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Multimodal Knowledge Alignment with Reinforcement Learning [103.68816413817372]
ESPER extends language-only zero-shot models to unseen multimodal tasks, like image and audio captioning.
Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision.
Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of zero-shot tasks.
arXiv Detail & Related papers (2022-05-25T10:12:17Z) - Scalable and Efficient MoE Training for Multitask Multilingual Models [55.987536562357086]
We develop a system capable of scaling MoE models efficiently to trillions of parameters.
We also present new training methods to improve MoE sample efficiency and leverage expert pruning strategy to improve time efficiency.
A model trained with 10 billion parameters on 50 languages can achieve state-of-the-art performance in Machine Translation (MT) and multilingual natural language generation tasks.
arXiv Detail & Related papers (2021-09-22T00:57:46Z) - Specializing Multilingual Language Models: An Empirical Study [50.7526245872855]
Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks.
For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data.
arXiv Detail & Related papers (2021-06-16T18:13:55Z) - AvgOut: A Simple Output-Probability Measure to Eliminate Dull Responses [97.50616524350123]
We build dialogue models that are dynamically aware of what utterances or tokens are dull without any feature-engineering.
The first model, MinAvgOut, directly maximizes the diversity score through the output distributions of each batch.
The second model, Label Fine-Tuning (LFT), prepends to the source sequence a label continuously scaled by the diversity score to control the diversity level.
The third model, RL, adopts Reinforcement Learning and treats the diversity score as a reward signal.
arXiv Detail & Related papers (2020-01-15T18:32:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.