Related papers: A Penalty Goes a Long Way: Measuring Lexical Diversity in Synthetic Texts Under Prompt-Influenced Length Variations

A Penalty Goes a Long Way: Measuring Lexical Diversity in Synthetic Texts Under Prompt-Influenced Length Variations

URL: http://arxiv.org/abs/2507.15092v1
Date: Sun, 20 Jul 2025 19:14:43 GMT
Title: A Penalty Goes a Long Way: Measuring Lexical Diversity in Synthetic Texts Under Prompt-Influenced Length Variations
Authors: Vijeta Deshpande, Ishita Dasgupta, Uttaran Bhattacharya, Somdeb Sarkhel, Saayan Mitra, Anna Rumshisky,
Abstract summary: Penalty-Adjusted Type-Token Ratio (PATTR) is a diversity metric robust to length variations.<n>We generate a large synthetic corpus of over 20M words using seven models from the LLaMA, OLMo, and Phi families.
Score: 21.27593629875137
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Synthetic text generated by Large Language Models (LLMs) is increasingly used for further training and improvement of LLMs. Diversity is crucial for the effectiveness of synthetic data, and researchers rely on prompt engineering to improve diversity. However, the impact of prompt variations on response text length, and, more importantly, the consequential effect on lexical diversity measurements, remain underexplored. In this work, we propose Penalty-Adjusted Type-Token Ratio (PATTR), a diversity metric robust to length variations. We generate a large synthetic corpus of over 20M words using seven models from the LLaMA, OLMo, and Phi families, focusing on a creative writing task of video script generation, where diversity is crucial. We evaluate per-response lexical diversity using PATTR and compare it against existing metrics of Moving-Average TTR (MATTR) and Compression Ratio (CR). Our analysis highlights how text length variations introduce biases favoring shorter responses. Unlike existing metrics, PATTR explicitly considers the task-specific target response length ($L_T$) to effectively mitigate length biases. We further demonstrate the utility of PATTR in filtering the top-10/100/1,000 most lexically diverse responses, showing that it consistently outperforms MATTR and CR by yielding on par or better diversity with high adherence to $L_T$.

Related papers

Measuring diversity of synthetic prompts and data generated with fine-grained persona prompting [2.773884499834578]
We measure the diversity of persona-driven synthetically generated prompts and responses with a suite of lexical diversity and redundancy metrics.<n>We find that synthetic prompts are significantly less diverse than human-written ones.<n>While persona-prompting does improve lexical diversity (especially with larger models), fine-grained detail in personas doesn't increase diversity noticeably.
arXiv Detail & Related papers (2025-05-23T02:00:00Z)
Evaluating the Diversity and Quality of LLM Generated Content [72.84945252821908]
We introduce a framework for measuring effective semantic diversity--diversity among outputs that meet quality thresholds.<n>Although preference-tuned models exhibit reduced lexical and syntactic diversity, they produce greater effective semantic diversity than SFT or base models.<n>These findings have important implications for applications that require diverse yet high-quality outputs.
arXiv Detail & Related papers (2025-04-16T23:02:23Z)
Improving Linguistic Diversity of Large Language Models with Possibility Exploration Fine-Tuning [23.456302461693053]
Possibility Exploration Fine-Tuning (PEFT) is a task-agnostic framework that enhances the text diversity of Large Language Models (LLMs) without increasing latency or computational cost.<n>PEFT significantly enhances the diversity of LLM outputs, as evidenced by lower similarity between candidate responses.<n>It can also notably reduce demographic bias in dialogue systems.
arXiv Detail & Related papers (2024-12-04T14:23:16Z)
Balancing Diversity and Risk in LLM Sampling: How to Select Your Method and Parameter for Open-Ended Text Generation [60.493180081319785]
We propose a systematic way to estimate the capacity of a truncation sampling method by considering the trade-off between diversity and risk at each decoding step.<n>Our work offers a comprehensive comparison of existing truncation sampling methods and serves as a practical user guideline for their parameter selection.
arXiv Detail & Related papers (2024-08-24T14:14:32Z)
GANPrompt: Enhancing Robustness in LLM-Based Recommendations with GAN-Enhanced Diversity Prompts [15.920623515602038]
Large Language Models (LLMs) are highly susceptible to the influence of prompt words.<n>This paper proposes GANPrompt, a multi-dimensional LLMs prompt diversity framework based on Generative Adversarial Networks (GANs)<n>The framework enhances the model's adaptability and stability to diverse prompts by integrating GANs generation techniques with the deep semantic understanding capabilities of LLMs.
arXiv Detail & Related papers (2024-08-19T03:13:20Z)
SynthesizRR: Generating Diverse Datasets with Retrieval Augmentation [55.2480439325792]
We study the synthesis of six datasets, covering topic classification, sentiment analysis, tone detection, and humor. We find that SynthesizRR greatly improves lexical and semantic diversity, similarity to human-written text, and distillation performance.
arXiv Detail & Related papers (2024-05-16T12:22:41Z)
Scaling Data Diversity for Fine-Tuning Language Models in Human Alignment [84.32768080422349]
Alignment with human preference prevents large language models from generating misleading or toxic content. We propose a new formulation of prompt diversity, implying a linear correlation with the final performance of LLMs after fine-tuning.
arXiv Detail & Related papers (2024-03-17T07:08:55Z)
Standardizing the Measurement of Text Diversity: A Tool and a Comparative Analysis of Scores [28.431348662950743]
We release Python package for measuring and extracting repetition in text.<n>We build a platform based on diversity for users to interactively explore repetition in text.
arXiv Detail & Related papers (2024-03-01T14:23:12Z)
Exploring Diversity in Back Translation for Low-Resource Machine Translation [85.03257601325183]
Back translation is one of the most widely used methods for improving the performance of neural machine translation systems. Recent research has sought to enhance the effectiveness of this method by increasing the 'diversity' of the generated translations. This work puts forward a more nuanced framework for understanding diversity in training data, splitting it into lexical diversity and syntactic diversity.
arXiv Detail & Related papers (2022-06-01T15:21:16Z)
Exploring Dimensionality Reduction Techniques in Multilingual Transformers [64.78260098263489]
This paper gives a comprehensive account of the impact of dimensional reduction techniques on the performance of state-of-the-art multilingual Siamese Transformers. It shows that it is possible to achieve an average reduction in the number of dimensions of $91.58% pm 2.59%$ and $54.65% pm 32.20%$, respectively.
arXiv Detail & Related papers (2022-04-18T17:20:55Z)
MGD-GAN: Text-to-Pedestrian generation through Multi-Grained Discrimination [96.91091607251526]
We propose the Multi-Grained Discrimination enhanced Generative Adversarial Network, that capitalizes a human-part-based Discriminator and a self-cross-attended Discriminator. A fine-grained word-level attention mechanism is employed in the HPD module to enforce diversified appearance and vivid details. The substantial improvement over the various metrics demonstrates the efficacy of MGD-GAN on the text-to-pedestrian synthesis scenario.
arXiv Detail & Related papers (2020-10-02T12:24:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.