Related papers: Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates

Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates

URL: http://arxiv.org/abs/2408.12263v1
Date: Thu, 22 Aug 2024 10:00:20 GMT
Title: Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates
Authors: Yusuke Sakai, Adam Nohejl, Jiangnan Hang, Hidetaka Kamigaito, Taro Watanabe,
Abstract summary: The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts. It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates.
Score: 24.46103924394483
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts, which leads to unfair evaluation and comparison of NLU performance. Moreover, evaluation designed for specific prompts is inappropriate for instruction tuning, which aims to perform well with any prompt. It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates. In this study, we provide English and Japanese cross-lingual datasets for evaluating the NLU performance of LLMs, which include multiple instruction templates for fair evaluation of each task, along with regular expressions to constrain the output format. Furthermore, we propose the Sharpe score as an evaluation metric that takes into account the variance in scores between templates. Comprehensive analysis of English and Japanese LLMs reveals that the high variance among templates has a significant impact on the fair evaluation of LLMs.

Related papers

Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles [32.121191446326876]
We identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks.<n>Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance.
arXiv Detail & Related papers (2025-07-29T18:59:09Z)
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utilities in real-world applications. Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z)
When LLMs Struggle: Reference-less Translation Evaluation for Low-resource Languages [9.138590152838754]
Segment-level quality estimation (QE) is a challenging cross-lingual language understanding task. We comprehensively evaluate large language models (LLMs) in zero/few-shot scenarios. Our results indicate that prompt-based approaches are outperformed by the encoder-based fine-tuned QE models.
arXiv Detail & Related papers (2025-01-08T12:54:05Z)
Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach. This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets. We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
arXiv Detail & Related papers (2024-11-07T10:31:31Z)
ProverbEval: Exploring LLM Evaluation Challenges for Low-resource Language Understanding [15.93642619347214]
We introduce proverbeval, LLM evaluation benchmark for low-resource languages. Native language proverb descriptions significantly improve tasks such as proverb generation. monolingual evaluations consistently outperformed their cross-lingual counterparts in generation tasks.
arXiv Detail & Related papers (2024-11-07T06:34:48Z)
Evaluation of Instruction-Following Ability for Large Language Models on Story-Ending Generation [2.4889060833127665]
In this paper, we focus on evaluating the instruction-following ability of Large Language Models (LLMs) in the context of story-ending generation. We propose an automatic evaluation pipeline that utilizes a machine reading comprehension (MRC) model to determine whether the generated story-ending reflects instruction.
arXiv Detail & Related papers (2024-06-24T06:53:36Z)
DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators. The question of how reliable these evaluators are has emerged as a crucial research question. We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
Language Models can Evaluate Themselves via Probability Discrepancy [38.54454263880133]
We propose a new self-evaluation method ProbDiff for assessing the efficacy of various Large Language Models (LLMs) It uniquely utilizes the LLMs being tested to compute the probability discrepancy between the initial response and its revised versions. Our findings reveal that ProbDiff achieves results on par with those obtained from evaluations based on GPT-4.
arXiv Detail & Related papers (2024-05-17T03:50:28Z)
RepEval: Effective Text Evaluation with LLM Representation [55.26340302485898]
RepEval is a metric that leverages the projection of Large Language Models (LLMs) representations for evaluation. Our work underscores the richness of information regarding text quality embedded within LLM representations, offering insights for the development of new metrics.
arXiv Detail & Related papers (2024-04-30T13:50:55Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
We propose a new evaluation method, SQC-Score. Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
SemScore: Automated Evaluation of Instruction-Tuned LLMs based on Semantic Textual Similarity [3.3162484539136416]
We propose a simple but remarkably effective evaluation metric called SemScore. We compare model outputs to gold target responses using semantic textual similarity (STS) We find that our proposed SemScore metric outperforms all other, in many cases more complex, evaluation metrics in terms of correlation to human evaluation.
arXiv Detail & Related papers (2024-01-30T14:52:50Z)
Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization [132.25202059478065]
We benchmark large language models (LLMs) on instruction controllable text summarization. Our study reveals that instruction controllable text summarization remains a challenging task for LLMs.
arXiv Detail & Related papers (2023-11-15T18:25:26Z)
Evaluating Large Language Models at Evaluating Instruction Following [54.49567482594617]
We introduce a challenging meta-evaluation benchmark, LLMBar, designed to test the ability of an LLM evaluator in discerning instruction-following outputs. We discover that different evaluators exhibit distinct performance on LLMBar and even the highest-scoring ones have substantial room for improvement.
arXiv Detail & Related papers (2023-10-11T16:38:11Z)
Evaluating the Performance of Large Language Models on GAOKAO Benchmark [53.663757126289795]
This paper introduces GAOKAO-Bench, an intuitive benchmark that employs questions from the Chinese GAOKAO examination as test samples. With human evaluation, we obtain the converted total score of LLMs, including GPT-4, ChatGPT and ERNIE-Bot. We also use LLMs to grade the subjective questions, and find that model scores achieve a moderate level of consistency with human scores.
arXiv Detail & Related papers (2023-05-21T14:39:28Z)

This list is automatically generated from the titles and abstracts of the papers in this site.