Perception Score, A Learned Metric for Open-ended Text Generation Evaluation
- URL: http://arxiv.org/abs/2008.03082v2
- Date: Tue, 18 Aug 2020 23:25:52 GMT
- Title: Perception Score, A Learned Metric for Open-ended Text Generation Evaluation
- Authors: Jing Gu, Qingyang Wu, Zhou Yu
- Abstract summary: We propose a novel and powerful learning-based evaluation metric: Perception Score.
The method measures the overall quality of the generation and scores it holistically, rather than focusing on a single evaluation criterion such as word overlap.
- Score: 62.7690450616204
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic evaluation for open-ended natural language generation tasks remains
a challenge. Existing metrics such as BLEU show a low correlation with human
judgment. We propose a novel and powerful learning-based evaluation metric:
Perception Score. The method measures the overall quality of the generation
and scores it holistically, rather than focusing on a single evaluation
criterion such as word overlap. Moreover, it also reports the amount of
uncertainty in its evaluation result. By taking this uncertainty into account,
Perception Score gives a more accurate evaluation of the generation system.
Perception Score achieves state-of-the-art results on two conditional
generation tasks and two unconditional generation tasks.
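The abstract does not include an implementation here; the following is a minimal sketch of one way such a learned, uncertainty-aware metric could be structured, assuming a pretrained encoder with a scalar regression head trained on human quality ratings and Monte Carlo dropout for the uncertainty estimate. The model choice, head, and aggregation are illustrative assumptions, not the authors' actual Perception Score implementation.

```python
# Minimal sketch (not the authors' code): an uncertainty-aware learned metric.
# Assumptions: a BERT encoder with a scalar regression head trained on human
# quality ratings; Monte Carlo dropout approximates the score uncertainty.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class LearnedQualityMetric(nn.Module):
    def __init__(self, encoder_name: str = "bert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(encoder_name)
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.head = nn.Sequential(
            nn.Dropout(0.1),                       # kept active for MC dropout
            nn.Linear(self.encoder.config.hidden_size, 1),
            nn.Sigmoid(),                          # quality score in [0, 1]
        )

    def forward(self, context: str, generation: str) -> torch.Tensor:
        batch = self.tokenizer(context, generation, return_tensors="pt",
                               truncation=True, max_length=512)
        hidden = self.encoder(**batch).last_hidden_state[:, 0]  # [CLS] vector
        return self.head(hidden).squeeze(-1)

    @torch.no_grad()
    def score_with_uncertainty(self, context: str, generation: str,
                               n_samples: int = 20):
        """Return (mean score, std) over stochastic forward passes."""
        self.train()  # keep dropout active to sample from the predictive dist.
        samples = torch.stack([self(context, generation)
                               for _ in range(n_samples)])
        return samples.mean().item(), samples.std().item()

# Usage: a high std flags generations whose automatic score should be
# trusted less, which is the role the abstract assigns to the uncertainty.
metric = LearnedQualityMetric()
score, uncertainty = metric.score_with_uncertainty(
    "How was your weekend?", "It was great, I went hiking with friends.")
print(f"score={score:.3f}  uncertainty={uncertainty:.3f}")
```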
Related papers
- Erasing Conceptual Knowledge from Language Models [24.63143961814566]
Erasure of Language Memory (ELM) is a concept-erasure method evaluated under a paradigm centered on innocence, seamlessness, and specificity.
ELM employs targeted low-rank updates to alter output distributions for erased concepts.
We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks.
arXiv Detail & Related papers (2024-10-03T17:59:30Z) - DecompEval: Evaluating Generated Texts as Unsupervised Decomposed Question Answering [95.89707479748161]
Existing evaluation metrics for natural language generation (NLG) tasks face challenges in generalization ability and interpretability.
We propose a metric called DecompEval that formulates NLG evaluation as an instruction-style question answering task.
We decompose our devised instruction-style question about the quality of generated texts into subquestions that measure the quality of each sentence.
The subquestions with their answers generated by PLMs are then recomposed as evidence to obtain the evaluation result.
arXiv Detail & Related papers (2023-07-13T16:16:51Z) - INSTRUCTSCORE: Explainable Text Generation Evaluation with Finegrained Feedback [80.57617091714448]
We present InstructScore, an explainable evaluation metric for text generation.
We fine-tune a text evaluation metric based on LLaMA, producing a score for generated text and a human-readable diagnostic report.
arXiv Detail & Related papers (2023-05-23T17:27:22Z) - On the Effectiveness of Automated Metrics for Text Generation Systems [4.661309379738428]
We propose a theory that incorporates different sources of uncertainty, such as imperfect automated metrics and insufficiently sized test sets.
The theory has practical applications, such as determining the number of samples needed to reliably distinguish the performance of a set of text generation systems (a rough sample-size sketch appears after this list).
arXiv Detail & Related papers (2022-10-24T08:15:28Z) - Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on the Word Embeddings Association Test (WEAT) and the Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image captioning and text summarization tasks (a minimal sketch of the WEAT effect size appears after this list).
arXiv Detail & Related papers (2022-10-17T08:55:26Z) - Re-evaluating Evaluation in Text Summarization [77.4601291738445]
We re-evaluate the evaluation method for text summarization using top-scoring system outputs.
We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
arXiv Detail & Related papers (2020-10-14T13:58:53Z) - Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics [64.88815792555451]
We show that current methods for judging metrics are highly sensitive to the translations used for assessment.
We develop a method for thresholding performance improvement under an automatic metric against human judgements.
arXiv Detail & Related papers (2020-06-11T09:12:53Z)
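As a rough illustration of the sample-size question raised in the entry on the effectiveness of automated metrics, the sketch below estimates how many test examples a paired comparison needs before a small metric gap between two systems becomes statistically detectable. The effect size, noise level, and normal-approximation test are illustrative assumptions, not that paper's actual theory.

```python
# Sketch: how many test examples are needed so that a paired comparison under
# a noisy automatic metric reliably ranks two text generation systems?
# The per-example gap and std below are illustrative assumptions.
import math
from scipy.stats import norm

def samples_needed(mean_delta: float, delta_std: float,
                   alpha: float = 0.05, power: float = 0.95) -> int:
    """Paired-test sample size: n such that a true per-example metric gap of
    `mean_delta` (per-example std `delta_std`) is detected with the requested
    power at two-sided significance level `alpha` (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(power)
    return math.ceil(((z_alpha + z_beta) * delta_std / mean_delta) ** 2)

# E.g., a 0.5-point average metric gap with a per-example std of 6 points:
print(samples_needed(mean_delta=0.5, delta_std=6.0))  # -> 1872 examples
```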
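The social-biases entry above builds on the Word Embeddings Association Test. As a minimal, generic sketch (not that paper's full meta-evaluation protocol), the snippet below computes the standard WEAT effect size from precomputed embeddings, with random vectors standing in for real word or sentence embeddings.

```python
# Sketch of the standard WEAT effect size (Caliskan et al., 2017), shown here
# to illustrate the kind of association test referenced above.
# Inputs are placeholder embeddings; in practice they would come from the
# embedding model or evaluation metric under audit.
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B) -> float:
    """s(w, A, B): mean similarity of w to attribute set A minus to B."""
    return np.mean([cosine(w, a) for a in A]) - np.mean([cosine(w, b) for b in B])

def weat_effect_size(X, Y, A, B) -> float:
    """Effect size d of the differential association of target sets X, Y with
    attribute sets A, B; larger |d| indicates a stronger measured bias."""
    s_X = [association(x, A, B) for x in X]
    s_Y = [association(y, A, B) for y in Y]
    return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

# Toy usage with random vectors standing in for word/sentence embeddings.
rng = np.random.default_rng(0)
X, Y, A, B = (rng.normal(size=(8, 300)) for _ in range(4))
print(f"WEAT effect size: {weat_effect_size(X, Y, A, B):.3f}")
```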
This list is automatically generated from the titles and abstracts of the papers in this site.