On the Usefulness of Embeddings, Clusters and Strings for Text Generator
Evaluation
- URL: http://arxiv.org/abs/2205.16001v4
- Date: Thu, 29 Jun 2023 15:08:38 GMT
- Title: On the Usefulness of Embeddings, Clusters and Strings for Text Generator
Evaluation
- Authors: Tiago Pimentel, Clara Meister, Ryan Cotterell
- Abstract summary: Mauve measures an information-theoretic divergence between two probability distributions over strings.
We show that Mauve was right for the wrong reasons, and that its newly proposed divergence is not necessary for its high performance.
We conclude that -- by encoding syntactic- and coherence-level features of text, while ignoring surface-level features -- such cluster-based substitutes for string distributions may simply be better for evaluating state-of-the-art language generators.
- Score: 86.19634542434711
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A good automatic evaluation metric for language generation ideally correlates
highly with human judgements of text quality. Yet, there is a dearth of such
metrics, which inhibits the rapid and efficient progress of language
generators. One exception is the recently proposed Mauve. In theory, Mauve
measures an information-theoretic divergence between two probability
distributions over strings: one representing the language generator under
evaluation; the other representing the true natural language distribution.
Mauve's authors argue that its success comes from the qualitative properties of
their proposed divergence. Yet in practice, as this divergence is uncomputable,
Mauve approximates it by measuring the divergence between multinomial
distributions over clusters instead, where cluster assignments are attained by
grouping strings based on a pre-trained language model's embeddings. As we
show, however, this is not a tight approximation -- in either theory or
practice. This begs the question: why does Mauve work so well? In this work, we
show that Mauve was right for the wrong reasons, and that its newly proposed
divergence is not necessary for its high performance. In fact, classical
divergences paired with its proposed cluster-based approximation may actually
serve as better evaluation metrics. We finish the paper with a probing
analysis; this analysis leads us to conclude that -- by encoding syntactic- and
coherence-level features of text, while ignoring surface-level features -- such
cluster-based substitutes for string distributions may simply be better for
evaluating state-of-the-art language generators.
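
To make the approximation concrete, below is a minimal sketch (not the authors' released implementation) of a cluster-based divergence of this kind: both text samples are embedded, the pooled embeddings are clustered, and a classical divergence is taken between the resulting multinomial histograms over clusters. The embedding source, the cluster count, and the use of plain KL are illustrative assumptions; random vectors stand in for real pre-trained LM embeddings.

```python
# Hedged sketch of a cluster-based divergence between human and model text.
# Not the official Mauve implementation: embeddings, k, and the divergence
# (plain KL) are illustrative placeholders.
import numpy as np
from sklearn.cluster import KMeans

def cluster_histograms(human_emb, model_emb, k=50, seed=0):
    """Jointly cluster both samples' embeddings and return multinomial
    histograms over the clusters for the human and model samples."""
    pooled = np.vstack([human_emb, model_emb])
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(pooled)
    h_labels, m_labels = labels[: len(human_emb)], labels[len(human_emb):]
    # Add-one smoothing so both histograms have full support over all clusters.
    p = np.bincount(h_labels, minlength=k) + 1.0
    q = np.bincount(m_labels, minlength=k) + 1.0
    return p / p.sum(), q / q.sum()

def kl(p, q):
    """Classical KL divergence between two multinomials, in nats."""
    return float(np.sum(p * np.log(p / q)))

# Usage: in practice the embeddings would come from a pre-trained LM
# (e.g. mean-pooled hidden states); random vectors stand in for them here.
rng = np.random.default_rng(0)
human_emb = rng.normal(size=(500, 768))
model_emb = rng.normal(loc=0.1, size=(500, 768))
p, q = cluster_histograms(human_emb, model_emb)
print(f"KL(human || model) over clusters: {kl(p, q):.4f}")
```

Swapping the divergence in the last step corresponds to the "classical divergences paired with the cluster-based approximation" comparison made in the abstract; the clustering step is what discards surface-level detail while retaining higher-level structure.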
Related papers
- Uncertainty in Language Models: Assessment through Rank-Calibration [65.10149293133846]
Language Models (LMs) have shown promising performance in natural language generation.
It is crucial to correctly quantify their uncertainty in responding to given inputs.
We develop a novel and practical framework, termed $Rank$-$Calibration$, to assess uncertainty and confidence measures for LMs.
arXiv Detail & Related papers (2024-04-04T02:31:05Z)
- Closing the Curious Case of Neural Text Degeneration [91.22954750742183]
We provide a theoretical explanation for the effectiveness of the truncation sampling.
We show that we can leverage a known source of model errors, the softmax bottleneck, to prove that certain tokens have nonzero true probability.
Our evaluations show that our method outperforms its threshold-based counterparts for low-entropy text generation.
arXiv Detail & Related papers (2023-10-02T23:16:25Z)
- Probabilistic Method of Measuring Linguistic Productivity [0.0]
I propose a new way of measuring linguistic productivity that objectively assesses the ability of an affix to be used to coin new complex words.
Token frequency does not dominate the productivity measure but naturally influences the sampling of bases.
A corpus-based approach and randomised design ensure that true neologisms and words coined long ago have equal chances to be selected.
arXiv Detail & Related papers (2023-08-24T08:36:28Z)
- On the Efficacy of Sampling Adapters [82.5941326570812]
We propose a unified framework for understanding sampling adapters.
We argue that the shift they enforce can be viewed as a trade-off between precision and recall.
We find that several precision-emphasizing measures indeed indicate that sampling adapters can lead to probability distributions more aligned with the true distribution.
arXiv Detail & Related papers (2023-07-07T17:59:12Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores; a toy version of this check is sketched after this list.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- Typical Decoding for Natural Language Generation [76.69397802617064]
We study why high-probability texts can be dull or repetitive.
We show that typical sampling offers competitive performance in terms of quality.
arXiv Detail & Related papers (2022-02-01T18:58:45Z)
- Distributional Discrepancy: A Metric for Unconditional Text Generation [6.6159481812419045]
The purpose of unconditional text generation is to train a model with real sentences, then generate novel sentences of the same quality and diversity as the training data.
A novel metric of distributional discrepancy (DD) is designed to evaluate generators based on the discrepancy between the generated and real training sentences.
DD is significantly better than the three existing metrics for ranking these generative models.
arXiv Detail & Related papers (2020-05-04T05:53:34Z)
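
As a closing illustration, the robustness check described in the Blind Spots entry above can be sketched as follows: inject synthetic errors into a set of texts and verify that a candidate metric's score drops. This is a toy version under stated assumptions; `metric_fn` is a placeholder for any text-quality metric, and the two perturbations are illustrative rather than the paper's actual error taxonomy.

```python
# Toy sketch of the "synthesize errors, check for a commensurate score drop"
# robustness analysis. `metric_fn` is a placeholder for any text-quality
# metric; the perturbations below are illustrative assumptions.
import random

def drop_words(text, rate=0.2, seed=0):
    """Synthetic error: randomly delete a fraction of the words."""
    rng = random.Random(seed)
    words = text.split()
    kept = [w for w in words if rng.random() > rate]
    return " ".join(kept) if kept else text

def shuffle_words(text, seed=0):
    """Synthetic error: scramble word order, destroying coherence."""
    rng = random.Random(seed)
    words = text.split()
    rng.shuffle(words)
    return " ".join(words)

def sensitivity_check(texts, metric_fn):
    """Score clean vs. corrupted texts; a trustworthy metric should
    assign a clearly lower score to the corrupted versions."""
    clean = metric_fn(texts)
    corrupted = metric_fn([shuffle_words(drop_words(t)) for t in texts])
    return {"clean": clean, "corrupted": corrupted, "drop": clean - corrupted}

# Dummy usage with a toy "metric" (mean word count), just to show the flow:
toy_metric = lambda texts: sum(len(t.split()) for t in texts) / len(texts)
print(sensitivity_check(["the quick brown fox jumps over the lazy dog"] * 3, toy_metric))
```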