Mark-Evaluate: Assessing Language Generation using Population Estimation Methods
- URL: http://arxiv.org/abs/2010.04606v1
- Date: Fri, 9 Oct 2020 14:31:53 GMT
- Title: Mark-Evaluate: Assessing Language Generation using Population Estimation Methods
- Authors: Gonçalo Mordido and Christoph Meinel
- Abstract summary: We propose a family of metrics to assess language generation derived from population estimation methods widely used in ecology.
In synthetic experiments, our family of methods is sensitive to drops in quality and diversity.
Our methods show a higher correlation to human evaluation than existing metrics on several challenging tasks.
- Score: 6.307450687141434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a family of metrics to assess language generation derived from
population estimation methods widely used in ecology. More specifically, we use
mark-recapture and maximum-likelihood methods that have been applied over the
past several decades to estimate the size of closed populations in the wild. We
propose three novel metrics: ME$_\text{Petersen}$ and ME$_\text{CAPTURE}$,
which retrieve a single-valued assessment, and ME$_\text{Schnabel}$, which
returns a double-valued metric to assess the evaluation set in terms of quality
and diversity, separately. In synthetic experiments, our family of methods is
sensitive to drops in quality and diversity. Moreover, our methods show a
higher correlation to human evaluation than existing metrics on several
challenging tasks, namely unconditional language generation, machine
translation, and text summarization.
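As background, here is a minimal Python sketch of the classical Lincoln-Petersen estimator (with the Chapman correction) and the Schnabel estimator from the mark-recapture literature that ME$_\text{Petersen}$ and ME$_\text{Schnabel}$ are named after. How the paper defines "marking" and "recapture" over sets of generated and reference samples is specific to the method and is not reproduced here; the function names and toy numbers are illustrative only.

```python
def petersen_estimate(n_marked: int, n_captured: int, n_recaptured: int) -> float:
    """Lincoln-Petersen estimate of a closed population size.

    n_marked: individuals marked and released in the first sampling occasion (M).
    n_captured: individuals caught in the second occasion (C).
    n_recaptured: marked individuals among those caught in the second occasion (R).
    Uses the Chapman correction, which also avoids division by zero when R = 0.
    """
    return (n_marked + 1) * (n_captured + 1) / (n_recaptured + 1) - 1


def schnabel_estimate(captures: list[int], recaptures: list[int]) -> float:
    """Schnabel estimate over several sampling occasions.

    captures[t]: number of individuals caught at occasion t (C_t).
    recaptures[t]: previously marked individuals among them (R_t).
    Newly caught individuals are assumed to be marked before the next occasion.
    Implements N-hat = sum_t(C_t * M_t) / (sum_t R_t + 1).
    """
    marked_so_far = 0       # M_t: marked individuals in the population before occasion t
    numerator = 0.0
    for c_t, r_t in zip(captures, recaptures):
        numerator += c_t * marked_so_far
        marked_so_far += c_t - r_t   # unmarked captures get marked
    return numerator / (sum(recaptures) + 1)


# Toy example: two occasions of 50 captures each, 10 recaptures on the second.
print(petersen_estimate(n_marked=50, n_captured=50, n_recaptured=10))   # ~235
print(schnabel_estimate(captures=[50, 50], recaptures=[0, 10]))         # ~227
```

ME$_\text{CAPTURE}$ presumably draws on the maximum-likelihood closed-population models of the CAPTURE family mentioned in the abstract; those are omitted from this sketch.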
Related papers
- Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization [13.458891794688551]
We assess both n-gram-based and neural evaluation metrics for generation to evaluate their effectiveness across languages and tasks. Our findings highlight the sensitivity of evaluation metrics to language type.
arXiv Detail & Related papers (2025-07-11T06:44:52Z) - Evaluating the Evaluation of Diversity in Commonsense Generation [28.654890118684957]
We conduct a systematic meta-evaluation of diversity metrics for commonsense generation. We find that form-based diversity metrics tend to consistently overestimate the diversity in sentence sets. We show that content-based diversity evaluation metrics consistently outperform their form-based counterparts.
arXiv Detail & Related papers (2025-05-31T11:18:26Z) - FUSE : A Ridge and Random Forest-Based Metric for Evaluating MT in Indigenous Languages [2.377892000761193]
This paper presents the winning submission of the RaaVa team to the Americas 2025 Shared Task 3 on Automatic Evaluation Metrics for Machine Translation.
We introduce the Feature-Union Scorer (FUSE) for Evaluation, which integrates Ridge regression and Gradient Boosting to model translation quality.
Results show that FUSE consistently achieves higher Pearson and Spearman correlations with human judgments.
arXiv Detail & Related papers (2025-03-28T06:58:55Z) - Towards Better Open-Ended Text Generation: A Multicriteria Evaluation Framework [0.1979158763744267]
Open-ended text generation has become a prominent task in natural language processing.
Decoding methods often excel in some metrics while underperforming in others.
We present novel ranking strategies within this multicriteria framework.
arXiv Detail & Related papers (2024-10-24T11:32:01Z) - Language Model Decoding as Direct Metrics Optimization [87.68281625776282]
Current decoding methods struggle to generate texts that align with human texts across different aspects.
In this work, we frame decoding from a language model as an optimization problem with the goal of strictly matching the expected performance with human texts.
We prove that this induced distribution is guaranteed to improve the perplexity on human texts, which suggests a better approximation to the underlying distribution of human texts.
arXiv Detail & Related papers (2023-10-02T09:35:27Z) - Evaluation of Faithfulness Using the Longest Supported Subsequence [52.27522262537075]
We introduce a novel approach to evaluate the faithfulness of machine-generated text by computing the longest non-continuous subsequence of the claim that is supported by the context.
Using a new human-annotated dataset, we fine-tune a model to generate the Longest Supported Subsequence (LSS).
Our proposed metric demonstrates an 18% enhancement over the prevailing state-of-the-art metric for faithfulness on our dataset.
arXiv Detail & Related papers (2023-08-23T14:18:44Z) - Learning Evaluation Models from Large Language Models for Sequence Generation [61.8421748792555]
We propose a three-stage evaluation model training method that utilizes large language models to generate labeled data for model-based metric development. Experimental results on the SummEval benchmark demonstrate that CSEM can effectively train an evaluation model without human-labeled data.
arXiv Detail & Related papers (2023-08-08T16:41:16Z) - SEAHORSE: A Multilingual, Multifaceted Dataset for Summarization Evaluation [52.186343500576214]
We introduce SEAHORSE, a dataset for multilingual, multifaceted summarization evaluation.
SEAHORSE consists of 96K summaries with human ratings along 6 dimensions of text quality.
We show that metrics trained with SEAHORSE achieve strong performance on the out-of-domain meta-evaluation benchmarks TRUE and mFACE.
arXiv Detail & Related papers (2023-05-22T16:25:07Z) - Social Biases in Automatic Evaluation Metrics for NLG [53.76118154594404]
We propose an evaluation method based on Word Embeddings Association Test (WEAT) and Sentence Embeddings Association Test (SEAT) to quantify social biases in evaluation metrics.
We construct gender-swapped meta-evaluation datasets to explore the potential impact of gender bias in image caption and text summarization tasks.
arXiv Detail & Related papers (2022-10-17T08:55:26Z) - A Multilingual Perspective Towards the Evaluation of Attribution Methods in Natural Language Inference [28.949004915740776]
We present a multilingual approach for evaluating attribution methods for the Natural Language Inference (NLI) task in terms of faithfulness and plausibility.
First, we introduce a novel cross-lingual strategy to measure faithfulness based on word alignments, which eliminates the drawbacks of erasure-based evaluations.
We then perform a comprehensive evaluation of attribution methods, considering different output mechanisms and aggregation methods.
arXiv Detail & Related papers (2022-04-11T22:11:05Z) - RoMe: A Robust Metric for Evaluating Natural Language Generation [7.594468763029502]
We propose an automatic evaluation metric incorporating several core aspects of natural language understanding.
Our proposed metric, RoMe, is trained on language features such as semantic similarity combined with tree edit distance and grammatical acceptability.
Empirical results suggest that RoMe correlates more strongly with human judgment than state-of-the-art metrics when evaluating system-generated sentences.
arXiv Detail & Related papers (2022-03-17T09:07:39Z) - Dynamic Human Evaluation for Relative Model Comparisons [8.843915018287476]
We present a dynamic approach to measure the required number of human annotations when evaluating generated outputs in relative comparison settings.
We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods to decide the better model in a simulation and a crowdsourcing case study.
arXiv Detail & Related papers (2021-12-15T11:32:13Z) - Bidimensional Leaderboards: Generate and Evaluate Language Hand in Hand [117.62186420147563]
We propose a generalization of leaderboards, bidimensional leaderboards (Billboards).
Unlike conventional unidimensional leaderboards that sort submitted systems by predetermined metrics, a Billboard accepts both generators and evaluation metrics as competing entries.
We demonstrate that a linear ensemble of a few diverse metrics sometimes substantially outperforms existing metrics in isolation.
arXiv Detail & Related papers (2021-12-08T06:34:58Z) - InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation [27.129551973093008]
InfoLM is a family of untrained metrics that can be viewed as string-based metrics.
This family of metrics also makes use of information measures allowing the adaptation of InfoLM to various evaluation criteria.
arXiv Detail & Related papers (2021-12-02T20:09:29Z) - A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods [60.04142561088524]
We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are.
Although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
arXiv Detail & Related papers (2021-03-31T18:28:14Z) - TextFlint: Unified Multilingual Robustness Evaluation Toolkit for Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint).
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z) - Informed Sampling for Diversity in Concept-to-Text NLG [8.883733362171034]
We propose an Imitation Learning approach to explore the level of diversity that a language generation model can reliably produce.
Specifically, we augment the decoding process with a meta-classifier trained to distinguish which words at any given timestep will lead to high-quality output.
arXiv Detail & Related papers (2020-04-29T17:43:24Z)
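Several of the entries above, like the abstract itself, use correlation with human judgments (Pearson and Spearman) as the meta-evaluation criterion. A minimal sketch of that computation, assuming paired per-example metric scores and human ratings and using SciPy; the scores below are made up for illustration:

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical paired scores: one automatic-metric score and one human rating
# per generated text. Real meta-evaluations pair scores at the segment or system level.
metric_scores = [0.41, 0.55, 0.38, 0.72, 0.64, 0.50]
human_ratings = [2.0, 3.5, 1.5, 4.5, 4.0, 3.0]

pearson_r, pearson_p = pearsonr(metric_scores, human_ratings)
spearman_rho, spearman_p = spearmanr(metric_scores, human_ratings)

print(f"Pearson r = {pearson_r:.3f} (p = {pearson_p:.3f})")
print(f"Spearman rho = {spearman_rho:.3f} (p = {spearman_p:.3f})")
```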
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.