Understanding Metrics for Paraphrasing
- URL: http://arxiv.org/abs/2205.13119v1
- Date: Thu, 26 May 2022 03:03:16 GMT
- Title: Understanding Metrics for Paraphrasing
- Authors: Omkar Patil, Rahul Singh and Tarun Joshi
- Abstract summary: We propose a novel metric $ROUGE_P$ to measure the quality of paraphrases along the dimensions of adequacy, novelty and fluency.
We look at paraphrase model fine-tuning and generation from the lens of metrics to gain a deeper understanding of what it takes to generate and evaluate a good paraphrase.
- Score: 13.268278150775
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Paraphrase generation is a difficult problem. This is not only because of the
limitations in text generation capabilities but also due to the lack of a
proper definition of what qualifies as a paraphrase and of corresponding metrics
to measure how good it is. Metrics for the evaluation of paraphrasing quality are an
ongoing research problem. Most of the existing metrics in use, having been
borrowed from other tasks, do not capture the complete essence of a good
paraphrase, and often fail at borderline cases. In this work, we propose a
novel metric $ROUGE_P$ to measure the quality of paraphrases along the
dimensions of adequacy, novelty and fluency. We also provide empirical evidence
to show that the current natural language generation metrics are insufficient
to measure these desired properties of a good paraphrase. We look at paraphrase
model fine-tuning and generation from the lens of metrics to gain a deeper
understanding of what it takes to generate and evaluate a good paraphrase.
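As a rough illustration of how ROUGE-derived components can capture adequacy and novelty, here is a minimal Python sketch. It is not the paper's definition of $ROUGE_P$: the unigram-F1 components, the omission of a fluency term, and the geometric-mean combination are assumptions made only for demonstration.
```python
# Illustrative sketch only: the actual ROUGE_P formulation is given in the paper;
# the component proxies and the geometric-mean combination here are assumptions.
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram ROUGE-1 F1 between two whitespace-tokenized strings."""
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def paraphrase_score(source: str, candidate: str, reference: str) -> float:
    """Toy adequacy/novelty combination in the spirit of ROUGE_P (not the paper's formula)."""
    adequacy = rouge1_f1(candidate, reference)    # content preserved w.r.t. the reference
    novelty = 1.0 - rouge1_f1(candidate, source)  # lexical divergence from the source
    # Fluency is omitted here; the paper treats it as a separate dimension.
    return (adequacy * novelty) ** 0.5            # geometric mean (assumption)


if __name__ == "__main__":
    src = "The cat sat on the mat."
    ref = "A cat was sitting on the mat."
    print(paraphrase_score(src, "The feline rested on the rug.", ref))
    print(paraphrase_score(src, "The cat sat on the mat.", ref))  # verbatim copy: zero novelty
```
The verbatim-copy case shows why a single borrowed metric is insufficient: it scores perfectly on adequacy while contributing nothing as a paraphrase.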
Related papers
- Uncertainty in Language Models: Assessment through Rank-Calibration [65.10149293133846]
Language Models (LMs) have shown promising performance in natural language generation.
It is crucial to correctly quantify their uncertainty in responding to given inputs.
We develop a novel and practical framework, termed $Rank$-$Calibration$, to assess uncertainty and confidence measures for LMs.
arXiv Detail & Related papers (2024-04-04T02:31:05Z)
- Cobra Effect in Reference-Free Image Captioning Metrics [58.438648377314436]
A proliferation of reference-free methods, leveraging visual-language pre-trained models (VLMs), has emerged.
In this paper, we study if there are any deficiencies in reference-free metrics.
We employ GPT-4V as an evaluative tool to assess generated sentences and the result reveals that our approach achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2024-02-18T12:36:23Z)
- Improving Metrics for Speech Translation [1.2891210250935146]
We introduce Parallel Paraphrasing ($\text{Para}_\text{both}$), an augmentation method for translation metrics making use of automatic paraphrasing of both the reference and hypothesis.
We show that we are able to significantly improve the correlation with human quality perception if our method is applied to commonly used metrics.
arXiv Detail & Related papers (2023-05-22T11:01:38Z)
- On the Blind Spots of Model-Based Evaluation Metrics for Text Generation [79.01422521024834]
We explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics.
We design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores.
Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics.
arXiv Detail & Related papers (2022-12-20T06:24:25Z)
- On the Limitations of Reference-Free Evaluations of Generated Text [64.81682222169113]
We show that reference-free metrics are inherently biased and limited in their ability to evaluate generated text.
We argue that they should not be used to measure progress on tasks like machine translation or summarization.
arXiv Detail & Related papers (2022-10-22T22:12:06Z)
- Embarrassingly Easy Document-Level MT Metrics: How to Convert Any Pretrained Metric Into a Document-Level Metric [15.646714712131148]
We present a method for extending pretrained metrics to incorporate context at the document level.
We show that the extended metrics outperform their sentence-level counterparts in about 85% of the tested conditions.
Our experimental results support our initial hypothesis and show that a simple extension of the metrics permits them to take advantage of context.
arXiv Detail & Related papers (2022-09-27T19:42:22Z)
- SMART: Sentences as Basic Units for Text Evaluation [48.5999587529085]
In this paper, we introduce a new metric called SMART to mitigate such limitations.
We treat sentences as basic units of matching instead of tokens, and use a sentence matching function to soft-match candidate and reference sentences (a simplified sketch of this idea appears after this list).
Our results show that system-level correlations of our proposed metric with a model-based matching function outperforms all competing metrics.
arXiv Detail & Related papers (2022-08-01T17:58:05Z)
- Revisiting the Evaluation Metrics of Paraphrase Generation [35.6803390044542]
Most existing paraphrase generation models use reference-based metrics to evaluate their generated paraphrases.
This paper proposes BBScore, a reference-free metric that can reflect the generated paraphrase's quality.
arXiv Detail & Related papers (2022-02-17T07:18:54Z)
- InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation [27.129551973093008]
InfoLM is a family of untrained metrics that can be viewed as string-based metrics.
This family of metrics also makes use of information measures allowing the adaptation of InfoLM to various evaluation criteria.
arXiv Detail & Related papers (2021-12-02T20:09:29Z)
- When Does Translation Require Context? A Data-driven, Multilingual Exploration [71.43817945875433]
Proper handling of discourse significantly contributes to the quality of machine translation (MT).
Recent works in context-aware MT attempt to target a small set of discourse phenomena during evaluation.
We develop the Multilingual Discourse-Aware benchmark, a series of taggers that identify and evaluate model performance on discourse phenomena.
arXiv Detail & Related papers (2021-09-15T17:29:30Z)
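As noted in the SMART entry above, a minimal sketch of sentence-level soft matching follows. The token-overlap similarity used here is a simplified stand-in for the model-based sentence matching functions described in that paper, and the greedy precision/recall aggregation is an assumption for illustration, not the official implementation.
```python
# Simplified sketch of sentence-level soft matching in the spirit of SMART.
# token_f1 stands in for the model-based matchers used in the paper.
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Naive sentence splitter on ., !, ? boundaries."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s.strip()]


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two sentences (stand-in matching function)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ca & cb).values())
    if overlap == 0:
        return 0.0
    p, r = overlap / sum(ca.values()), overlap / sum(cb.values())
    return 2 * p * r / (p + r)


def soft_sentence_score(candidate: str, reference: str) -> float:
    """Soft precision/recall over sentences: each sentence is matched to its
    best counterpart in the other text, then both directions combine into an F1."""
    cand_sents, ref_sents = split_sentences(candidate), split_sentences(reference)
    if not cand_sents or not ref_sents:
        return 0.0
    precision = sum(max(token_f1(c, r) for r in ref_sents) for c in cand_sents) / len(cand_sents)
    recall = sum(max(token_f1(r, c) for c in cand_sents) for r in ref_sents) / len(ref_sents)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    ref = "The metric compares sentences. It rewards faithful summaries."
    cand = "Sentences are compared by the metric. Faithful summaries are rewarded."
    print(round(soft_sentence_score(cand, ref), 3))
```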
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.