ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
- URL: http://arxiv.org/abs/2504.02106v1
- Date: Wed, 02 Apr 2025 20:11:45 GMT
- Title: ContrastScore: Towards Higher Quality, Less Biased, More Efficient Evaluation Metrics with Contrastive Evaluation
- Authors: Xiao Wang, Daniil Larionov, Siwei Wu, Yiqi Liu, Steffen Eger, Nafise Sadat Moosavi, Chenghua Lin
- Abstract summary: We introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines.
- Score: 38.49478115331574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Evaluating the quality of generated text automatically remains a significant challenge. Conventional reference-based metrics have been shown to exhibit relatively weak correlation with human evaluations. Recent research advocates the use of large language models (LLMs) as source-based metrics for natural language generation (NLG) assessment. While promising, LLM-based metrics, particularly those using smaller models, still fall short in aligning with human judgments. In this work, we introduce ContrastScore, a contrastive evaluation metric designed to enable higher-quality, less biased, and more efficient assessment of generated text. We evaluate ContrastScore on two NLG tasks: machine translation and summarization. Experimental results show that ContrastScore consistently achieves stronger correlation with human judgments than both single-model and ensemble-based baselines. Notably, ContrastScore based on Qwen 3B and 0.5B even outperforms Qwen 7B, despite having only half as many parameters, demonstrating its efficiency. Furthermore, it effectively mitigates common evaluation biases such as length and likelihood preferences, resulting in more robust automatic evaluation.
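The abstract does not spell out the scoring formula, but the general idea of contrasting a stronger model's judgment against a weaker model's can be sketched as follows. The function name, the `beta` contrast weight, and the per-token log-probability inputs are illustrative assumptions, not the paper's actual definition.

```python
def contrastive_score(logprobs_strong, logprobs_weak, beta=0.5):
    """Hypothetical contrastive metric: average per-token log-probability
    of a candidate text under a stronger model, penalized by the weaker
    model's log-probability. `beta` is an assumed contrast weight."""
    assert len(logprobs_strong) == len(logprobs_weak)
    contrast = [ls - beta * lw for ls, lw in zip(logprobs_strong, logprobs_weak)]
    # Length-normalize so longer candidates are not trivially favored.
    return sum(contrast) / len(contrast)

# Toy per-token log-probabilities for one candidate translation.
strong = [-0.2, -0.5, -0.1]
weak = [-1.0, -1.2, -0.9]
print(round(contrastive_score(strong, weak), 2))  # → 0.25
```

Subtracting the weaker model's score rewards tokens the stronger model finds likely but the weaker one does not, which is one plausible way such a metric could also dampen the shared likelihood bias of both models.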
Related papers
- Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation [57.380464382910375]
We show that the choice of feedback protocol can significantly affect evaluation reliability and induce systematic biases.
In particular, we show that pairwise evaluation protocols are more vulnerable to distracted evaluation.
arXiv Detail & Related papers (2025-04-20T19:05:59Z) - Towards Understanding the Robustness of LLM-based Evaluations under Perturbations [9.944512689015998]
Large Language Models (LLMs) can serve as automatic evaluators for non-standardized metrics in summarization and dialog-based tasks. We conduct experiments across multiple prompting strategies to examine how LLMs fare as quality evaluators when compared with human judgments.
arXiv Detail & Related papers (2024-12-12T13:31:58Z) - Aligning Model Evaluations with Human Preferences: Mitigating Token Count Bias in Language Model Assessments [2.1370543868467275]
This follow-up paper explores methods to align Large Language Model evaluator preferences with human evaluations.
We employed Bayesian statistics and a t-test to quantify this bias and developed a recalibration procedure to adjust the GPTScorer.
Our recalibration substantially improves the alignment of the LLM evaluator with human evaluations across multiple use cases.
arXiv Detail & Related papers (2024-07-05T09:26:40Z) - The Comparative Trap: Pairwise Comparisons Amplifies Biased Preferences of LLM Evaluators [31.520403357740317]
Large language models (LLMs) are increasingly used as evaluators for natural language generation tasks.
LLMs display biased preferences, such as favoring verbosity and authoritative tones.
We introduce PRePair, which integrates pointwise reasoning within a pairwise framework.
arXiv Detail & Related papers (2024-06-18T06:43:04Z) - DCR-Consistency: Divide-Conquer-Reasoning for Consistency Evaluation and Improvement of Large Language Models [4.953092503184905]
This work proposes DCR, an automated framework for evaluating and improving the consistency of texts generated by Large Language Models (LLMs).
We introduce an automatic metric converter (AMC) that translates the output of the divide-conquer evaluator (DCE) into an interpretable numeric score.
Our approach also reduces output inconsistencies by nearly 90%, showing promise for effective hallucination mitigation.
arXiv Detail & Related papers (2024-01-04T08:34:16Z) - What is the Best Automated Metric for Text to Motion Generation? [19.71712698183703]
There is growing interest in generating skeleton-based human motions from natural language descriptions.
Human evaluation is the ultimate accuracy measure for this task, and automated metrics should correlate well with human quality judgments.
This paper systematically studies which metrics best align with human evaluations and proposes new metrics that align even better.
arXiv Detail & Related papers (2023-09-19T01:59:54Z) - Towards Multiple References Era -- Addressing Data Leakage and Limited Reference Diversity in NLG Evaluation [55.92852268168816]
N-gram matching-based evaluation metrics, such as BLEU and chrF, are widely utilized across a range of natural language generation (NLG) tasks.
Recent studies have revealed a weak correlation between these matching-based metrics and human evaluations.
We propose to utilize multiple references to enhance the consistency between these metrics and human evaluations.
arXiv Detail & Related papers (2023-08-06T14:49:26Z) - LLM Comparative Assessment: Zero-shot NLG Evaluation through Pairwise Comparisons using Large Language Models [55.60306377044225]
Large language models (LLMs) have enabled impressive zero-shot capabilities across various natural language tasks.
This paper explores two options for exploiting the emergent abilities of LLMs for zero-shot NLG assessment.
For moderate-sized open-source LLMs, such as FlanT5 and Llama2-chat, comparative assessment is superior to prompt scoring.
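A minimal sketch of the pairwise comparative-assessment idea, run over both candidate orderings so that position bias averages out. The prompt wording and the toy `judge` stand-in for an LLM are illustrative assumptions, not the paper's exact setup.

```python
def comparative_assessment(llm, context, cand_a, cand_b):
    """Zero-shot pairwise NLG evaluation sketch: ask the model which
    candidate is better, in both orders, to mitigate position bias.
    `llm` is any callable mapping a prompt to "A" or "B"."""
    prompt = ("Context: {ctx}\n"
              "Candidate A: {a}\n"
              "Candidate B: {b}\n"
              "Which candidate is better? Answer A or B.")
    first = llm(prompt.format(ctx=context, a=cand_a, b=cand_b))
    second = llm(prompt.format(ctx=context, a=cand_b, b=cand_a))
    # cand_a wins a round if preferred in that ordering.
    wins_a = (first == "A") + (second == "B")
    return "A" if wins_a == 2 else "B" if wins_a == 0 else "tie"

def judge(prompt):
    # Toy LLM stand-in: always prefers the shorter candidate.
    a = prompt.split("Candidate A: ")[1].split("\n")[0]
    b = prompt.split("Candidate B: ")[1].split("\n")[0]
    return "A" if len(a) <= len(b) else "B"

print(comparative_assessment(judge, "summarize", "short", "much longer text"))  # → A
```

Running both orderings doubles the number of model calls, which is the usual cost of debiasing pairwise protocols this way.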
arXiv Detail & Related papers (2023-07-15T22:02:12Z) - G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment [64.01972723692587]
We present G-Eval, a framework that uses large language models with chain-of-thought (CoT) prompting and a form-filling paradigm to assess the quality of NLG outputs.
We show that G-Eval with GPT-4 as the backbone model achieves a Spearman correlation of 0.514 with human judgments on the summarization task, outperforming all previous methods by a large margin.
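One detail of G-Eval not summarized above is probability-weighted rating aggregation: instead of taking the single most likely rating token, the final score is the expectation over the ratings weighted by the probability the model assigns to each. The dictionary input shape here is an assumed simplification.

```python
def expected_score(score_probs):
    """Probability-weighted rating in the spirit of G-Eval: weight each
    possible rating (e.g. 1-5) by the probability the model assigns to
    its token, yielding a fine-grained continuous score."""
    total = sum(score_probs.values())
    # Normalize in case the rating tokens don't capture all probability mass.
    return sum(s * p for s, p in score_probs.items()) / total

# Toy probability distribution over the rating tokens "1".."5".
print(round(expected_score({1: 0.05, 2: 0.1, 3: 0.2, 4: 0.4, 5: 0.25}), 2))  # → 3.7
```

This turns a coarse 5-point scale into a continuous score, which tends to produce fewer ties when ranking candidate outputs.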
arXiv Detail & Related papers (2023-03-29T12:46:54Z) - Toward Human-Like Evaluation for Natural Language Generation with Error Analysis [93.34894810865364]
Recent studies show that considering both major errors (e.g. mistranslated tokens) and minor errors can produce high-quality human judgments.
This inspires us to approach the final goal of the evaluation metrics (human-like evaluations) by automatic error analysis.
We augment BARTScore with human-like error analysis strategies, yielding BARTScore++, whose final score combines evaluations of both major and minor errors.
arXiv Detail & Related papers (2022-12-20T11:36:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.