MAUVE: Human-Machine Divergence Curves for Evaluating Open-Ended Text
Generation
- URL: http://arxiv.org/abs/2102.01454v1
- Date: Tue, 2 Feb 2021 11:59:28 GMT
- Title: MAUVE: Human-Machine Divergence Curves for Evaluating Open-Ended Text
Generation
- Authors: Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun,
Yejin Choi, Zaid Harchaoui
- Abstract summary: We propose MAUVE -- a metric for open-ended text generation.
We present experiments across two open-ended generation tasks in the web text domain and the story domain.
- Score: 41.360219974284114
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite major advances in open-ended text generation, there has been limited
progress in designing evaluation metrics for this task. We propose MAUVE -- a
metric for open-ended text generation, which directly compares the distribution
of machine-generated text to that of human language. MAUVE measures the mean
area under the divergence curve for the two distributions, exploring the
trade-off between two types of errors: those arising from parts of the human
distribution that the model distribution approximates well, and those it does
not. We present experiments across two open-ended generation tasks in the web
text domain and the story domain, and a variety of decoding algorithms and
model sizes. Our results show that evaluation under MAUVE indeed reflects the
more natural behavior with respect to model size, compared to prior metrics.
MAUVE's ordering of the decoding algorithms also agrees with that of generation
perplexity, the most widely used metric in open-ended text generation; however,
MAUVE presents a more principled evaluation metric for the task as it considers
both model and human text.
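For a concrete sense of how the metric works, here is a minimal sketch of the curve-and-area computation on two discrete histograms. It assumes the human and model text distributions have already been reduced to histograms over a shared discrete support (the paper obtains such histograms by embedding text samples with a language model and quantizing them); the function names, the scaling constant c = 5, and the toy histograms are illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def kl(p, q, eps=1e-12):
    """Kullback-Leibler divergence KL(p || q) between two discrete histograms."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

def mauve_style_score(p_human, q_model, c=5.0, num_mixtures=100):
    """Area under the divergence curve between histograms P (human) and Q (model).

    The curve is traced by mixtures R_lam = lam*P + (1 - lam)*Q for lam in (0, 1),
    each mapped to the point (exp(-c*KL(Q||R_lam)), exp(-c*KL(P||R_lam))).
    """
    # Anchor the curve on the two axes so the area is well defined.
    xs, ys = [0.0, 1.0], [1.0, 0.0]
    for lam in np.linspace(1e-6, 1.0 - 1e-6, num_mixtures):
        r = lam * p_human + (1.0 - lam) * q_model
        xs.append(np.exp(-c * kl(q_model, r)))
        ys.append(np.exp(-c * kl(p_human, r)))
    order = np.argsort(xs)
    xs, ys = np.asarray(xs)[order], np.asarray(ys)[order]
    # Trapezoidal rule over the sorted curve points.
    return float(np.sum(0.5 * (ys[1:] + ys[:-1]) * (xs[1:] - xs[:-1])))

# Toy check: a model histogram close to the human one scores near 1,
# while a badly mismatched one scores much lower.
p = np.array([0.25, 0.25, 0.25, 0.25])
print(mauve_style_score(p, np.array([0.26, 0.25, 0.25, 0.24])))  # close to 1
print(mauve_style_score(p, np.array([0.70, 0.20, 0.05, 0.05])))  # well below 1
```

For real use on raw text, the authors' released package (mauve-text) handles the embedding and quantization steps internally; the sketch above only mirrors the final divergence-curve step on pre-computed histograms.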
Related papers
- Exploring Precision and Recall to assess the quality and diversity of LLMs [82.21278402856079]
We introduce a novel evaluation framework for Large Language Models (LLMs) such as Llama-2 and Mistral.
This approach allows for a nuanced assessment of the quality and diversity of generated text without the need for aligned corpora.
arXiv Detail & Related papers (2024-02-16T13:53:26Z)
- Unlocking Structure Measuring: Introducing PDD, an Automatic Metric for Positional Discourse Coherence [39.065349875944634]
We present a novel metric designed to quantify the discourse divergence between two long-form articles.
Our metric aligns more closely with human preferences and GPT-4 coherence evaluation, outperforming existing evaluation methods.
arXiv Detail & Related papers (2024-02-15T18:23:39Z)
- Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z)
- Open-Domain Text Evaluation via Contrastive Distribution Methods [75.59039812868681]
We introduce a novel method for evaluating open-domain text generation called Contrastive Distribution Methods (CDM).
Our experiments on coherence evaluation for multi-turn dialogue and commonsense evaluation for controllable generation demonstrate CDM's superior correlation with human judgments.
arXiv Detail & Related papers (2023-06-20T20:37:54Z)
- MISMATCH: Fine-grained Evaluation of Machine-generated Text with Mismatch Error Types [68.76742370525234]
We propose a new evaluation scheme to model human judgments in 7 NLP tasks, based on the fine-grained mismatches between a pair of texts.
Inspired by recent efforts toward fine-grained evaluation in several NLP tasks, we introduce a set of 13 mismatch error types.
We show that the mismatch errors between sentence pairs on held-out datasets from the 7 NLP tasks align well with human evaluation.
arXiv Detail & Related papers (2023-06-18T01:38:53Z)
- MAUVE Scores for Generative Models: Theory and Practice [95.86006777961182]
We present MAUVE, a family of comparison measures between pairs of distributions such as those encountered in the generative modeling of text or images.
We find that MAUVE can quantify the gaps between the distributions of human-written text and those of modern neural language models.
We demonstrate in the vision domain that MAUVE can identify known properties of generated images on par with or better than existing metrics.
arXiv Detail & Related papers (2022-12-30T07:37:40Z)
- Distributional Discrepancy: A Metric for Unconditional Text Generation [6.6159481812419045]
The purpose of unconditional text generation is to train a model with real sentences, then generate novel sentences of the same quality and diversity as the training data.
A novel metric of distributional discrepancy (DD) is designed to evaluate generators based on the discrepancy between the generated and real training sentences.
DD is significantly better than three existing metrics at ranking generative models.
arXiv Detail & Related papers (2020-05-04T05:53:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.