Standardizing the Measurement of Text Diversity: A Tool and a
Comparative Analysis of Scores
- URL: http://arxiv.org/abs/2403.00553v1
- Date: Fri, 1 Mar 2024 14:23:12 GMT
- Title: Standardizing the Measurement of Text Diversity: A Tool and a
Comparative Analysis of Scores
- Authors: Chantal Shaib, Joe Barrow, Jiuding Sun, Alexa F. Siu, Byron C.
Wallace, Ani Nenkova
- Abstract summary: We find that compression algorithms capture information similar to what is measured by slow-to-compute $n$-gram overlap scores.
The applicability of scores extends beyond analysis of generative models.
- Score: 30.12630686473324
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The diversity across outputs generated by large language models shapes the
perception of their quality and utility. Prompt leaks, templated answer
structure, and canned responses across different interactions are readily
noticed by people, but there is no standard score to measure this aspect of
model behavior. In this work we empirically investigate diversity scores on
English texts. We find that computationally efficient compression algorithms
capture information similar to what is measured by slow-to-compute $n$-gram
overlap homogeneity scores. Further, a combination of measures (compression
ratios, self-repetition of long $n$-grams, Self-BLEU, and BERTScore) is
sufficient to report, as they are only weakly correlated with one another. The
applicability of scores extends beyond analysis of generative models; for
example, we highlight applications on instruction-tuning datasets and
human-produced texts. We release a diversity score package to facilitate
research and invite consistency across reports.
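As a rough, self-contained sketch of two of the measures named above (compression ratio and self-repetition of long $n$-grams), one could compute something like the following; the function names and exact definitions here are illustrative and are not the released package's API.
```python
import zlib
from collections import Counter

def compression_ratio(texts):
    """Size of the concatenated texts divided by their zlib-compressed size.
    More redundancy across outputs compresses better, so a higher ratio
    signals lower diversity."""
    raw = " ".join(texts).encode("utf-8")
    return len(raw) / len(zlib.compress(raw))

def self_repetition_rate(texts, n=4):
    """Fraction of each text's n-grams that also occur in at least one
    *other* text in the collection: a rough self-repetition signal."""
    ngram_sets = []
    for text in texts:
        toks = text.split()
        ngram_sets.append({tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)})
    counts = Counter(g for s in ngram_sets for g in s)
    repeated = total = 0
    for s in ngram_sets:
        for g in s:
            total += 1
            if counts[g] > 1:
                repeated += 1
    return repeated / total if total else 0.0

outputs = ["The answer is 42.", "The answer is 7.", "It depends on the context."]
print(compression_ratio(outputs), self_repetition_rate(outputs, n=3))
```
Compression-based scoring needs only one pass over the concatenated outputs, which is why it is so much cheaper than pairwise $n$-gram overlap comparisons.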
Related papers
- Using Similarity to Evaluate Factual Consistency in Summaries [2.7595794227140056]
Abstractive summarisers generate fluent summaries, but the factuality of the generated text is not guaranteed.
We propose a new zero-shot factuality evaluation metric, Sentence-BERTScore (SBERTScore), which compares sentences between the summary and the source document.
Our experiments indicate that each technique has different strengths, with SBERTScore particularly effective in identifying correct summaries.
arXiv Detail & Related papers (2024-09-23T15:02:38Z)
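A loose sketch of the sentence-level comparison described in the SBERTScore entry above, assuming the sentence-transformers library; the model choice and the best-match-then-average aggregation are illustrative rather than the authors' exact definition.
```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def sbertscore_sketch(summary_sentences, source_sentences):
    # Embed both sets of sentences and score each summary sentence by its
    # best cosine match against the source sentences, then average.
    summ = model.encode(summary_sentences, convert_to_tensor=True)
    src = model.encode(source_sentences, convert_to_tensor=True)
    sims = util.cos_sim(summ, src)       # shape: [num_summary, num_source]
    best = sims.max(dim=1).values        # best-supporting source sentence
    return best.mean().item()            # higher -> summary better supported

score = sbertscore_sketch(
    ["The company reported record profits."],
    ["Profits hit a record high this quarter.", "Shares fell slightly."],
)
```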
Long-Span Question-Answering: Automatic Question Generation and QA-System Ranking via Side-by-Side Evaluation [65.16137964758612]
We explore the use of long-context capabilities in large language models to create synthetic reading comprehension data from entire books.
Our objective is to test the capabilities of LLMs to analyze, understand, and reason over problems that require a detailed comprehension of long spans of text.
arXiv Detail & Related papers (2024-05-31T20:15:10Z)
Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate the faithfulness of machine-generated text by computing the longest noncontiguous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we finetune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z)
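The LSS above is produced by a finetuned model; purely as an illustration of the underlying idea, a non-learned approximation could take the longest (possibly noncontiguous) subsequence of claim tokens that appears, in order, in the context, i.e. a longest common subsequence.
```python
def lcs_tokens(claim: str, context: str) -> list[str]:
    """Longest common token subsequence between claim and context:
    a crude stand-in for the supported part of the claim."""
    a, b = claim.split(), context.split()
    # dp[i][j] = LCS length of a[:i] and b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    # Backtrack to recover the supported tokens.
    out, i, j = [], len(a), len(b)
    while i and j:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1]); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]

claim = "the cat sat on the red mat"
context = "yesterday the cat quietly sat on the mat"
supported = lcs_tokens(claim, context)
faithfulness = len(supported) / len(claim.split())  # fraction of claim supported
```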
Short Answer Grading Using One-shot Prompting and Text Similarity Scoring Model [2.14986347364539]
We developed an automated short answer grading model that provided both analytic scores and holistic scores.
The accuracy and quadratic weighted kappa of our model were 0.67 and 0.71 on a subset of the publicly available ASAG dataset.
arXiv Detail & Related papers (2023-05-29T22:05:29Z)
Enriching language models with graph-based context information to better understand textual data [0.15469452301122172]
We experimentally demonstrate that adding graph-based context to a BERT model enhances its performance on an example classification task.
Specifically, on the PubMed dataset, we observed a reduction in error from 8.51% to 7.96% while increasing the number of parameters by just 1.6%.
arXiv Detail & Related papers (2023-05-10T10:57:21Z)
Query Expansion Using Contextual Clue Sampling with Language Models [69.51976926838232]
We propose a combination of an effective filtering strategy and fusion of the retrieved documents based on the generation probability of each context.
Our lexical-matching-based approach achieves similar top-5/top-20 retrieval accuracy and higher top-100 accuracy compared with the well-established dense retrieval model DPR.
For end-to-end QA, the reader model also benefits from our method and achieves the highest Exact-Match score against several competitive baselines.
arXiv Detail & Related papers (2022-10-13T15:18:04Z)
How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation [3.434197496862117]
We conduct a large-scale investigation of various methods for summary coherence modelling on an even playing field.
We introduce two novel analysis measures, intra-system correlation and bias matrices, that help identify biases in coherence measures and provide robustness against system-level confounders.
While none of the currently available automatic coherence measures can assign reliable coherence scores to system summaries across all evaluation metrics, large-scale language models show promising results, provided that fine-tuning accounts for the need to generalize across different summary lengths.
arXiv Detail & Related papers (2022-09-14T09:42:19Z)
SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate the limitations of token-level matching metrics.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences.
Our results show that, with a model-based matching function, the system-level correlations of our proposed metric outperform those of all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
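A simplified illustration of sentence-level soft matching in the spirit of the SMART entry above, using plain token-overlap F1 as the sentence matching function; the metric's actual matching functions and aggregation differ, and all names here are illustrative.
```python
def token_f1(s1: str, s2: str) -> float:
    # Simple soft-match function between two sentences (token-overlap F1).
    a, b = set(s1.lower().split()), set(s2.lower().split())
    if not a or not b:
        return 0.0
    overlap = len(a & b)
    p, r = overlap / len(a), overlap / len(b)
    return 2 * p * r / (p + r) if p + r else 0.0

def soft_sentence_f1(candidate_sents, reference_sents):
    # Precision: each candidate sentence scored against its best reference sentence.
    prec = sum(max(token_f1(c, r) for r in reference_sents)
               for c in candidate_sents) / len(candidate_sents)
    # Recall: each reference sentence scored against its best candidate sentence.
    rec = sum(max(token_f1(r, c) for c in candidate_sents)
              for r in reference_sents) / len(reference_sents)
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

print(soft_sentence_f1(
    ["The cat sat on the mat.", "It was raining."],
    ["A cat sat on a mat.", "Rain fell all day."],
))
```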
AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that detect samples causing oversensitivity and overstability with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation [85.32991360774447]
Natural language generation (NLG) spans a broad range of tasks, each of which serves specific objectives.
We propose a unifying perspective based on the nature of information change in NLG tasks.
We develop a family of interpretable metrics that are suitable for evaluating key aspects of different NLG tasks.
arXiv Detail & Related papers (2021-09-14T01:00:42Z)