Language Model Augmented Relevance Score
- URL: http://arxiv.org/abs/2108.08485v1
- Date: Thu, 19 Aug 2021 03:59:23 GMT
- Title: Language Model Augmented Relevance Score
- Authors: Ruibo Liu, Jason Wei, Soroush Vosoughi
- Abstract summary: Language Model Augmented Relevance Score (MARS) is a new context-aware metric for NLG evaluation.
MARS uses off-the-shelf language models, guided by reinforcement learning, to create augmented references that consider both the generation context and available human references.
- Score: 2.8314622515446835
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Although automated metrics are commonly used to evaluate NLG systems, they
often correlate poorly with human judgements. Newer metrics such as BERTScore
have addressed many weaknesses in prior metrics such as BLEU and ROUGE, which
rely on n-gram matching. These newer methods, however, are still limited in
that they do not consider the generation context, so they cannot properly
reward generated text that is correct but deviates from the given reference.
In this paper, we propose Language Model Augmented Relevance Score (MARS), a
new context-aware metric for NLG evaluation. MARS leverages off-the-shelf
language models, guided by reinforcement learning, to create augmented
references that consider both the generation context and available human
references, which are then used as additional references to score generated
text. Compared with seven existing metrics in three common NLG tasks, MARS not
only achieves higher correlation with human reference judgements, but also
differentiates well-formed candidates from adversarial samples to a larger
degree.
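To make the idea in the abstract concrete, here is a minimal illustrative sketch, not the authors' implementation: it scores a candidate against the union of the human references and LM-augmented references and keeps the best match, so that correct text which deviates from the given reference can still be rewarded. It assumes the `transformers` and `bert_score` libraries, uses plain GPT-2 sampling in place of the paper's RL-guided augmentation, and substitutes BERTScore F1 for the paper's scoring function; all function names and hyperparameters are illustrative.
```python
# Illustrative sketch only (not the MARS implementation): score a candidate
# against human references plus LM-augmented references and keep the best
# match. Plain GPT-2 sampling stands in for the RL-guided augmentation
# described in the paper.
from transformers import pipeline
from bert_score import score as bert_score

generator = pipeline("text-generation", model="gpt2")

def augmented_references(context, n=3):
    """Sample n continuations of the generation context as extra references."""
    outputs = generator(context, max_new_tokens=30, num_return_sequences=n,
                        do_sample=True, pad_token_id=50256)
    # The pipeline returns the prompt plus continuation; keep the continuation.
    return [o["generated_text"][len(context):].strip() for o in outputs]

def context_aware_score(candidate, context, human_refs):
    refs = human_refs + augmented_references(context)
    # Score the candidate against every reference and keep the best F1.
    _, _, f1 = bert_score([candidate] * len(refs), refs, lang="en")
    return f1.max().item()

if __name__ == "__main__":
    ctx = "Q: Where should I go for a quiet weekend?\nA:"
    print(context_aware_score("A small cabin by the lake.", ctx,
                              ["Somewhere calm, like a lakeside cottage."]))
```
Taking the maximum over the reference pool is what allows a correct but differently worded candidate to be rewarded: it only needs to match one augmented reference well, rather than the single human reference.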
Related papers
- Is Reference Necessary in the Evaluation of NLG Systems? When and Where? [58.52957222172377]
We show that reference-free metrics exhibit a higher correlation with human judgment and greater sensitivity to deficiencies in language quality.
Our study can provide insight into the appropriate application of automatic metrics and the impact of metric choice on evaluation performance.
arXiv Detail & Related papers (2024-03-21T10:31:11Z) - Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really Need Reference? [3.2528685897001455]
This paper adapts an existing unsupervised technique for measuring text fluency without the need for any reference.
Our approach leverages various word embeddings and trains language models using Recurrent Neural Network (RNN) architectures.
To assess the performance of the models, we conduct a comparative analysis across 10 Indic languages.
arXiv Detail & Related papers (2023-12-03T20:09:23Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - Not All Metrics Are Guilty: Improving NLG Evaluation by Diversifying References [123.39034752499076]
Div-Ref is a method to enhance evaluation benchmarks by enriching the number of references.
We conduct experiments to empirically demonstrate that diversifying the expression of reference can significantly enhance the correlation between automatic evaluation and human evaluation.
arXiv Detail & Related papers (2023-05-24T11:53:29Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework of using large language models with chain-of-thoughts (CoT) and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin (see the correlation sketch after this list).
arXiv Detail & Related papers (2023-03-29T12:46:54Z) - Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards, bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
arXiv Detail & Related papers (2021-12-08T06:34:58Z) - UNION: An Unreferenced Metric for Evaluating Open-ended Story Generation [92.42032403795879]
UNION is a learnable unreferenced metric for evaluating open-ended story generation.
It is trained to distinguish human-written stories from negative samples and recover the perturbation in negative stories.
Experiments on two story datasets demonstrate that UNION is a reliable measure for evaluating the quality of generated stories.
arXiv Detail & Related papers (2020-09-16T11:01:46Z) - BLEU might be Guilty but References are not Innocent [34.817010352734]
We study different methods to collect references and compare their value in automated evaluation.
Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task.
Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for Back-translation and APE augmented MT output.
arXiv Detail & Related papers (2020-04-13T16:49:09Z)
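Several of the papers above evaluate metrics by their rank correlation with human judgments, often using multiple references per candidate. The toy sketch below uses hypothetical data and assumes the `sacrebleu` and `scipy` packages; it scores each candidate against two references and reports the Spearman correlation between the metric scores and human ratings.
```python
# Hypothetical mini-experiment: score each candidate against several
# references, then report the rank correlation between the metric scores and
# human ratings on the same items.
import sacrebleu
from scipy.stats import spearmanr

items = [
    # (candidate, [reference 1, reference 2], human rating on a 1-5 scale)
    ("The cat sat on the mat.",
     ["A cat was sitting on the mat.", "The cat is on the mat."], 4.5),
    ("The feline rested upon the rug.",
     ["A cat was sitting on the mat.", "The cat is on the mat."], 4.0),
    ("Dogs dislike rain.",
     ["A cat was sitting on the mat.", "The cat is on the mat."], 1.0),
]

metric_scores = [sacrebleu.sentence_bleu(cand, refs).score
                 for cand, refs, _ in items]
human_ratings = [rating for _, _, rating in items]

rho, p_value = spearmanr(metric_scores, human_ratings)
print(f"per-item BLEU: {[round(s, 1) for s in metric_scores]}")
print(f"Spearman rho vs. human ratings: {rho:.3f} (p = {p_value:.3f})")
```
Note that the second candidate is a correct paraphrase with almost no n-gram overlap, so its BLEU score is near zero despite a high human rating; this is exactly the failure mode that the context-aware and multi-reference metrics above try to address.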
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.