Related papers: Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks

URL: http://arxiv.org/abs/2404.16966v2
Date: Wed, 5 Jun 2024 20:14:15 GMT
Title: Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
Authors: Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono
Abstract summary: The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, and (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.
Score: 2.1899189033259305
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Benchmarks have emerged as the central approach for evaluating Large Language Models (LLMs). The research community often relies on a model's average performance across the test prompts of a benchmark to evaluate the model's performance. This is consistent with the assumption that the test prompts within a benchmark represent a random sample from a real-world distribution of interest. We note that this is generally not the case; instead, we hold that the distribution of interest varies according to the specific use case. We find that (1) the correlation in model performance across test prompts is non-random, (2) accounting for correlations across test prompts can change model rankings on major benchmarks, and (3) explanatory factors for these correlations include semantic similarity and common LLM failure points.

Related papers

Statistical Hypothesis Testing for Auditing Robustness in Language Models [49.1574468325115]
We introduce distribution-based perturbation analysis, a framework that reformulates perturbation analysis as a frequentist hypothesis testing problem. We construct empirical null and alternative output distributions within a low-dimensional semantic similarity space via Monte Carlo sampling. We show how we can quantify response changes, measure true/false positive rates, and evaluate alignment with reference models.
arXiv Detail & Related papers (2025-06-09T17:11:07Z)
Beyond the Singular: The Essential Role of Multiple Generations in Effective Benchmark Evaluation and Analysis [10.133537818749291]
Large language models (LLMs) have demonstrated significant utilities in real-world applications. Benchmark evaluations are crucial for assessing the capabilities of LLMs.
arXiv Detail & Related papers (2025-02-13T03:43:33Z)
Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models [69.38024658668887]
The current evaluation method for event extraction relies on token-level exact match. We propose RAEE, an automatic evaluation framework that accurately assesses event extraction results at the semantic level instead of the token level.
arXiv Detail & Related papers (2024-10-12T07:54:01Z)
The Mismeasure of Man and Models: Evaluating Allocational Harms in Large Language Models [22.75594773147521]
We introduce Rank-Allocation-Based Bias Index (RABBI), a model-agnostic bias measure that assesses potential allocational harms arising from biases in large language models (LLMs). Our results reveal that commonly-used bias metrics based on average performance gap and distribution distance fail to reliably capture group disparities in allocation outcomes. Our work highlights the need to account for how models are used in contexts with limited resource constraints.
arXiv Detail & Related papers (2024-08-02T14:13:06Z)
PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models [41.772263447213234]
Large language models (LLMs) are known to be trained on vast amounts of data, which may unintentionally or intentionally include data from commonly used benchmarks. This inclusion can lead to cheatingly high scores on model leaderboards, yet result in disappointing performance in real-world applications. We introduce PaCoST, a Paired Confidence Significance Testing method, to effectively detect benchmark contamination in LLMs.
arXiv Detail & Related papers (2024-06-26T13:12:40Z)
Assessing Model Generalization in Vicinity [34.86022681163714]
This paper evaluates the generalization ability of classification models on out-of-distribution test sets without depending on ground truth labels. We propose incorporating responses from neighboring test samples into the correctness assessment of each individual sample. The resulting scores are then averaged across all test samples to provide a holistic indication of model accuracy.
arXiv Detail & Related papers (2024-06-13T15:58:37Z)
Generalization Ability of Feature-based Performance Prediction Models: A Statistical Analysis across Benchmarks [5.170967632369504]
We compare the statistical similarity between the problem collections with the accuracy of performance prediction models based on exploratory landscape analysis features. We observe that there is a positive correlation between these two measures. Specifically, when the difference between the high-dimensional feature value distributions of the training and testing suites is not statistically significant, the model tends to generalize well.
arXiv Detail & Related papers (2024-05-20T12:39:24Z)
Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles in subjective question correction, we propose a new evaluation method, SQC-Score. Results on three information extraction tasks show that human annotators prefer SQC-Score over the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
Investigating Data Contamination in Modern Benchmarks for Large Language Models [27.479260572913724]
Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs. We study data contamination by proposing two methods tailored for both open-source and proprietary LLMs. We find that certain commercial LLMs could surprisingly guess the missing option in various test sets.
arXiv Detail & Related papers (2023-11-16T11:03:04Z)
Aggregation Weighting of Federated Learning via Generalization Bound Estimation [65.8630966842025]
Federated Learning (FL) typically aggregates client model parameters using a weighting approach determined by sample proportions. We replace the aforementioned weighting method with a new strategy that considers the generalization bounds of each local model.
arXiv Detail & Related papers (2023-11-10T08:50:28Z)
Don't Make Your LLM an Evaluation Benchmark Cheater [142.24553056600627]
Large language models (LLMs) have greatly advanced the frontiers of artificial intelligence, attaining remarkable improvement in model capacity. To assess the model performance, a typical approach is to construct evaluation benchmarks for measuring the ability level of LLMs. We discuss the potential risk and impact of inappropriately using evaluation benchmarks and misleadingly interpreting the evaluation results.
arXiv Detail & Related papers (2023-11-03T14:59:54Z)
GREAT Score: Global Robustness Evaluation of Adversarial Perturbation using Generative Models [60.48306899271866]
We present a new framework, called GREAT Score, for global robustness evaluation of adversarial perturbation using generative models. We show high correlation and significantly reduced cost of GREAT Score when compared to the attack-based model ranking on RobustBench. GREAT Score can be used for remote auditing of privacy-sensitive black-box models.
arXiv Detail & Related papers (2023-04-19T14:58:27Z)
A comprehensive comparative evaluation and analysis of Distributional Semantic Models [61.41800660636555]
We perform a comprehensive evaluation of type distributional vectors, either produced by static DSMs or obtained by averaging the contextualized vectors generated by BERT. The results show that the alleged superiority of predict-based models is more apparent than real, and surely not ubiquitous. We borrow from cognitive neuroscience the methodology of Representational Similarity Analysis (RSA) to inspect the semantic spaces generated by distributional models.
arXiv Detail & Related papers (2021-05-20T15:18:06Z)
LOGAN: Local Group Bias Detection by Clustering [86.38331353310114]
We argue that evaluating bias at the corpus level is not enough for understanding how biases are embedded in a model. We propose LOGAN, a new bias detection technique based on clustering. Experiments on toxicity classification and object classification tasks show that LOGAN identifies bias in a local region.
arXiv Detail & Related papers (2020-10-06T16:42:51Z)
Achieving Equalized Odds by Resampling Sensitive Attributes [13.114114427206678]
We present a flexible framework for learning predictive models that approximately satisfy the equalized odds notion of fairness. A differentiable functional is used as a penalty driving the model parameters towards equalized odds. We develop a formal hypothesis test to detect whether a prediction rule violates this property, the first such test in the literature.
arXiv Detail & Related papers (2020-06-08T00:18:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.