RoMe: A Robust Metric for Evaluating Natural Language Generation
- URL: http://arxiv.org/abs/2203.09183v1
- Date: Thu, 17 Mar 2022 09:07:39 GMT
- Title: RoMe: A Robust Metric for Evaluating Natural Language Generation
- Authors: Md Rashad Al Hasan Rony, Liubov Kovriguina, Debanjan Chaudhuri,
Ricardo Usbeck, Jens Lehmann
- Abstract summary: We propose an automatic evaluation metric incorporating several core aspects of natural language understanding.
Our proposed metric, RoMe, is trained on language features such as semantic similarity combined with tree edit distance and grammatical acceptability.
Empirical results suggest that RoMe has a stronger correlation to human judgment over state-of-the-art metrics in evaluating system-generated sentences.
- Score: 7.594468763029502
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating Natural Language Generation (NLG) systems is a challenging task.
Firstly, the metric should ensure that the generated hypothesis reflects the
reference's semantics. Secondly, it should consider the grammatical quality of
the generated sentence. Thirdly, it should be robust enough to handle various
surface forms of the generated sentence. Thus, an effective evaluation metric
has to be multifaceted. In this paper, we propose an automatic evaluation
metric incorporating several core aspects of natural language understanding
(language competence, syntactic and semantic variation). Our proposed metric,
RoMe, is trained on language features such as semantic similarity combined with
tree edit distance and grammatical acceptability, using a self-supervised
neural network to assess the overall quality of the generated sentence.
Moreover, we perform an extensive robustness analysis of the state-of-the-art
methods and RoMe. Empirical results suggest that RoMe has a stronger
correlation to human judgment over state-of-the-art metrics in evaluating
system-generated sentences across several NLG tasks.
Related papers
- Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization [13.458891794688551]
We assess evaluation metrics for generation both n-gram-based and neural based to evaluate their effectiveness across languages and tasks.<n>Our findings highlight the sensitivity of evaluation metrics to the language type.
arXiv Detail & Related papers (2025-07-11T06:44:52Z) - FUSE : A Ridge and Random Forest-Based Metric for Evaluating MT in Indigenous Languages [2.377892000761193]
This paper presents the winning submission of the RaaVa team to the Americas 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine Translation.
We introduce Feature-Union Scorer (FUSE) for Evaluation, FUSE integrates Ridge regression and Gradient Boosting to model translation quality.
Results show that FUSE consistently achieves higher Pearson and Spearman correlations with human judgments.
arXiv Detail & Related papers (2025-03-28T06:58:55Z) - Unsupervised Approach to Evaluate Sentence-Level Fluency: Do We Really
Need Reference? [3.2528685897001455]
This paper adapts an existing unsupervised technique for measuring text fluency without the need for any reference.
Our approach leverages various word embeddings and trains language models using Recurrent Neural Network (RNN) architectures.
To assess the performance of the models, we conduct a comparative analysis across 10 Indic languages.
arXiv Detail & Related papers (2023-12-03T20:09:23Z) - Large Language Models are Diverse Role-Players for Summarization
Evaluation [82.31575622685902]
A document summary's quality can be assessed by human annotators on various criteria, both objective ones like grammar and correctness, and subjective ones like informativeness, succinctness, and appeal.
Most of the automatic evaluation methods like BLUE/ROUGE may be not able to adequately capture the above dimensions.
We propose a new evaluation framework based on LLMs, which provides a comprehensive evaluation framework by comparing generated text and reference text from both objective and subjective aspects.
arXiv Detail & Related papers (2023-03-27T10:40:59Z) - The Glass Ceiling of Automatic Evaluation in Natural Language Generation [60.59732704936083]
We take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics.
Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans.
arXiv Detail & Related papers (2022-08-31T01:13:46Z) - BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation [16.81712151903078]
Natural language processing (NLP) systems are increasingly trained to generate open-ended text.
Different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others.
Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics) to make research into new metrics itself easier to evaluate.
arXiv Detail & Related papers (2021-10-18T10:03:19Z) - Evaluating the Morphosyntactic Well-formedness of Generated Texts [88.20502652494521]
We propose L'AMBRE -- a metric to evaluate the morphosyntactic well-formedness of text.
We show the effectiveness of our metric on the task of machine translation through a diachronic study of systems translating into morphologically-rich languages.
arXiv Detail & Related papers (2021-03-30T18:02:58Z) - TextFlint: Unified Multilingual Robustness Evaluation Toolkit for
Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint)
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z) - Curious Case of Language Generation Evaluation Metrics: A Cautionary
Tale [52.663117551150954]
A few popular metrics remain as the de facto metrics to evaluate tasks such as image captioning and machine translation.
This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them.
In this paper, we urge the community for more careful consideration of how they automatically evaluate their models.
arXiv Detail & Related papers (2020-10-26T13:57:20Z) - GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z) - GRUEN for Evaluating Linguistic Quality of Generated Text [17.234442722611803]
We propose GRUEN for evaluating Grammaticality, non-Redundancy, focUs, structure and coherENce of generated text.
GRUEN utilizes a BERT-based model and a class of syntactic, semantic, and contextual features to examine the system output.
arXiv Detail & Related papers (2020-10-06T05:59:25Z) - Automating Text Naturalness Evaluation of NLG Systems [0.0]
We present an attempt to automate the evaluation of text naturalness.
Instead of relying on human participants for scoring or labeling the text samples, we propose to automate the process.
We analyze the text probability fractions and observe how they are influenced by the size of the generative and discriminative models involved in the process.
arXiv Detail & Related papers (2020-06-23T18:48:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.