Revisiting the Evaluation Metrics of Paraphrase Generation
- URL: http://arxiv.org/abs/2202.08479v1
- Date: Thu, 17 Feb 2022 07:18:54 GMT
- Title: Revisiting the Evaluation Metrics of Paraphrase Generation
- Authors: Lingfeng Shen, Haiyun Jiang, Lemao Liu, Shuming Shi
- Abstract summary: Most existing paraphrase generation models use reference-based metrics to evaluate their generated paraphrases.
This paper proposes BBScore, a reference-free metric that can reflect the generated paraphrase's quality.
- Score: 35.6803390044542
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Paraphrase generation is an important NLP task that has achieved significant
progress recently. However, one crucial problem is overlooked: how should the
quality of a paraphrase be evaluated? Most existing paraphrase generation models
use reference-based metrics (e.g., BLEU) borrowed from neural machine translation
(NMT) to evaluate their generated paraphrases. The reliability of such metrics has
hardly been examined, and they are only plausible when a standard reference exists.
Therefore, this paper first answers a fundamental question: are existing metrics
reliable for paraphrase generation? We present two conclusions that contradict
conventional wisdom in paraphrase generation: (1) existing metrics align poorly
with human annotations in both system-level and segment-level paraphrase
evaluation, and (2) reference-free metrics outperform reference-based metrics,
indicating that standard references are unnecessary for evaluating paraphrase
quality. These empirical findings expose the lack of a reliable automatic
evaluation metric. Therefore, this paper proposes BBScore, a reference-free
metric that reflects the quality of a generated paraphrase. BBScore consists of
two sub-metrics, the S3C score and SelfBLEU, which correspond to the two criteria
of paraphrase evaluation: semantic preservation and diversity. By combining the
two sub-metrics, BBScore significantly outperforms existing paraphrase evaluation
metrics.
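Since the abstract describes BBScore only at a high level, the minimal sketch below makes the idea concrete. The exact S3C formulation and the combination rule are not given above, so this sketch substitutes a sentence-embedding cosine similarity for the S3C score and a simple weighted sum for the combination; the model name, weight, and helper functions are illustrative assumptions, not the paper's implementation.

```python
# Hedged sketch of a BBScore-style reference-free paraphrase metric.
# Assumed dependencies: sentence-transformers, sacrebleu.
from sacrebleu.metrics import BLEU
from sentence_transformers import SentenceTransformer, util

_bleu = BLEU(effective_order=True)
_encoder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative choice

def semantic_preservation(source: str, paraphrase: str) -> float:
    """Stand-in for the S3C score: cosine similarity of sentence embeddings."""
    emb = _encoder.encode([source, paraphrase], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

def self_bleu(source: str, paraphrase: str) -> float:
    """Self-BLEU of the paraphrase against its own source, scaled to 0-1.
    High values mean the paraphrase copies the source, i.e. low diversity."""
    return _bleu.sentence_score(paraphrase, [source]).score / 100.0

def bbscore_sketch(source: str, paraphrase: str, alpha: float = 0.5) -> float:
    """Reward semantic preservation, penalize copying the source wording."""
    return (alpha * semantic_preservation(source, paraphrase)
            + (1 - alpha) * (1.0 - self_bleu(source, paraphrase)))

if __name__ == "__main__":
    src = "The committee approved the new budget yesterday."
    good = "Yesterday, the new budget was approved by the committee."
    copy = "The committee approved the new budget yesterday."
    print(bbscore_sketch(src, good), bbscore_sketch(src, copy))
```

In this sketch, a paraphrase that merely copies its source scores high on semantic preservation but also high on Self-BLEU, so its combined score drops; that is the intuition behind pairing semantic preservation with diversity as the two evaluation criteria.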
Related papers
- Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics [4.881135687863645]
We introduce a reference-free metric that correlates well with human-evaluated relevance, while being very cheap to compute.
We show that this metric can also be used alongside reference-based metrics to improve their robustness in low quality reference settings.
arXiv Detail & Related papers (2024-10-08T11:09:25Z) - Reference-based Metrics Disprove Themselves in Question Generation [17.83616985138126]
We find that using human-written references cannot guarantee the effectiveness of reference-based metrics.
A good metric is expected to grade a human-validated question no worse than generated questions.
We propose a reference-free metric consisting of multi-dimensional criteria such as naturalness, answerability, and complexity.
arXiv Detail & Related papers (2024-03-18T20:47:10Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited
Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations (see the multi-reference scoring sketch after this list).
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by increasing the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z) - KPEval: Towards Fine-Grained Semantic-Based Keyphrase Evaluation [69.57018875757622]
We propose KPEval, a comprehensive evaluation framework consisting of four critical aspects: reference agreement, faithfulness, diversity, and utility.
Using KPEval, we re-evaluate 23 keyphrase systems and discover that established model comparison results have blind spots.
arXiv Detail & Related papers (2023-03-27T17:45:38Z) - On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z) - Understanding Metrics for Paraphrasing [13.268278150775]
We propose a novel metric $ROUGE_P$ to measure the quality of paraphrases along the dimensions of adequacy, novelty and fluency.
We look at paraphrase model fine-tuning and generation from the lens of metrics to gain a deeper understanding of what it takes to generate and evaluate a good paraphrase.
arXiv Detail & Related papers (2022-05-26T03:03:16Z) - Language Model Augmented Relevance Score [2.8314622515446835]
Language Model Augmented Relevance Score (MARS) is a new context-aware metric for NLG evaluation.
MARS uses off-the-shelf language models, guided by reinforcement learning, to create augmented references that consider both the generation context and available human references.
arXiv Detail & Related papers (2021-08-19T03:59:23Z) - REAM$\sharp$: An Enhancement Approach to Reference-based Evaluation
Metrics for Open-domain Dialog Generation [63.46331073232526]
We present an enhancement approach to Reference-based EvAluation Metrics for open-domain dialogue systems.
A prediction model is designed to estimate the reliability of the given reference set.
We show how its predicted results can be helpful to augment the reference set, and thus improve the reliability of the metric.
arXiv Detail & Related papers (2021-05-30T10:04:13Z)
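The two multi-reference entries above ("Towards Multiple References Era" and "Not All Metrics Are Guilty") both argue that n-gram metrics such as BLEU and chrF behave better when scored against several references. A minimal sketch of multi-reference scoring with sacrebleu follows; the example sentences and reference set are invented for illustration and do not reproduce the reference-expansion methods of those papers.

```python
# Hedged sketch: scoring one hypothesis against one vs. several references
# with standard n-gram metrics. Example data is made up for illustration.
from sacrebleu.metrics import BLEU, CHRF

bleu = BLEU(effective_order=True)
chrf = CHRF()

hypothesis = "The new budget was approved by the committee yesterday."
single_ref = ["The committee approved the new budget yesterday."]
multi_refs = single_ref + [
    "Yesterday the committee signed off on the new budget.",
    "The committee gave the new budget its approval yesterday.",
]

# With a single reference, a valid rewording can score poorly; references
# covering more phrasings typically raise the score and, per the papers
# above, improve correlation with human judgments.
for refs in (single_ref, multi_refs):
    print(len(refs), "ref(s):",
          round(bleu.sentence_score(hypothesis, refs).score, 1), "BLEU /",
          round(chrf.sentence_score(hypothesis, refs).score, 1), "chrF")
```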
This list is automatically generated from the titles and abstracts of the papers in this site.