What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks
- URL: http://arxiv.org/abs/2504.07825v1
- Date: Thu, 10 Apr 2025 15:01:46 GMT
- Title: What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks
- Authors: Pavel Chizhov, Mattia Nee, Pierre-Carl Langlais, Ivan P. Yamshchikov
- Abstract summary: We show that one of the most widely used benchmarks for evaluating common-sense reasoning, HellaSwag, has severe construct validity issues. We argue that this benchmark does not accurately measure common-sense reasoning and, therefore, should not be used for evaluation in its current state.
- Score: 8.012203293561196
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Common-sense reasoning is a key language model capability because it encapsulates not just specific factual knowledge but general language and world understanding. Measuring common-sense reasoning is therefore crucial for language models of different sizes and applications. One of the most widely used benchmarks for evaluating such capabilities is HellaSwag; however, in this paper, we show that it has severe construct validity issues. These issues range from basic ungrammaticality and numerous typos to misleading prompts and equally correct options. Furthermore, we show that if models are evaluated only on answer texts, or with "Lorem ipsum dolor..." instead of the question, more than 65% of model predictions remain the same, and this cannot be attributed merely to contamination. Since benchmark scores are an essential part of model selection in both research and commercial applications, these validity issues can have severe consequences. In particular, because benchmark scores are routinely taken at face value, inadequate evaluation leads to ill-informed decisions about models. We thoroughly investigate the critical validity issues posed by HellaSwag and illustrate them with evaluations of generative language models of different sizes. We argue that the benchmark does not accurately measure common-sense reasoning and therefore should not be used for evaluation in its current state. Based on our findings, we propose requirements that future common-sense reasoning benchmarks should meet. In addition, we release GoldenSwag, a corrected subset of HellaSwag that we believe enables acceptable common-sense reasoning evaluation.
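The context-ablation check described in the abstract is straightforward to reproduce in spirit. Below is a minimal sketch (not the authors' released code) of that comparison: each HellaSwag ending is scored by length-normalized log-likelihood under (a) the real context, (b) a "Lorem ipsum" placeholder, and (c) no context at all, and the chosen options are compared. The model name `gpt2`, the Hugging Face `hellaswag` dataset fields (`ctx`, `endings`), the BOS fallback for the empty-context case, and the length-normalized scoring convention are illustrative assumptions, not details taken from the paper.

```python
# Sketch of the context-ablation comparison: do a model's HellaSwag predictions
# change when the context is replaced by filler text or dropped entirely?
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"  # placeholder model; any causal LM could be substituted
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def ending_logprob(context: str, ending: str) -> float:
    """Length-normalized log-likelihood of `ending` given `context`."""
    # Fall back to the BOS token when scoring endings with no context at all.
    ctx_ids = tok(context).input_ids if context else [tok.bos_token_id]
    end_ids = tok(" " + ending).input_ids
    ids = torch.tensor([ctx_ids + end_ids])
    with torch.no_grad():
        logits = model(ids).logits
    logprobs = torch.log_softmax(logits[0, :-1], dim=-1)  # position i predicts token i+1
    start = len(ctx_ids) - 1
    token_lp = [logprobs[i, ids[0, i + 1]].item() for i in range(start, ids.shape[1] - 1)]
    return sum(token_lp) / len(token_lp)

def predict(context: str, endings: list[str]) -> int:
    """Index of the ending with the highest normalized log-likelihood."""
    return max(range(len(endings)), key=lambda i: ending_logprob(context, endings[i]))

LOREM = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
data = load_dataset("hellaswag", split="validation").select(range(200))  # small slice

same_lorem = same_no_ctx = 0
for ex in data:
    with_context = predict(ex["ctx"], ex["endings"])
    same_lorem += with_context == predict(LOREM, ex["endings"])
    same_no_ctx += with_context == predict("", ex["endings"])

n = len(data)
print(f"agreement with Lorem-ipsum context: {same_lorem / n:.1%}")
print(f"agreement with answer-only scoring: {same_no_ctx / n:.1%}")
```

If the benchmark measured reasoning over the context, agreement between the full and ablated conditions should be low; the abstract reports that more than 65% of predictions remain the same.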
Related papers
- KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language [2.594684920405059]
We present KOFFVQA, a general-purpose free-form visual question answering benchmark in the Korean language. Our benchmark consists of 275 carefully crafted questions each paired with an image and grading criteria. We experimentally verify that our method of using pre-existing grading criteria for evaluation is much more reliable than existing methods.
arXiv Detail & Related papers (2025-03-31T05:04:25Z) - Reliable and Efficient Amortized Model-based Evaluation [57.6469531082784]
The average score across a wide range of benchmarks provides a signal that helps guide the use of language models in practice. A popular way to lower the cost is to compute the average score on a subset of the benchmark. This approach often yields an unreliable measure of LM performance because the average score is confounded with the difficulty of the questions in the subset. We train a model that predicts question difficulty from its content, enabling reliable measurement at a fraction of the cost.
arXiv Detail & Related papers (2025-03-17T16:15:02Z) - Do Large Language Model Benchmarks Test Reliability? [66.1783478365998]
We investigate how well current benchmarks quantify model reliability. Motivated by this gap in the evaluation of reliability, we propose the concept of platinum benchmarks. We evaluate a wide range of models on these platinum benchmarks and find that, indeed, frontier LLMs still exhibit failures on simple tasks.
arXiv Detail & Related papers (2025-02-05T18:58:19Z) - A Critical Review of Causal Reasoning Benchmarks for Large Language Models [2.1311710788645617]
We present a comprehensive overview of LLM benchmarks for causality.
We derive a set of criteria that a useful benchmark or set of benchmarks should aim to satisfy.
arXiv Detail & Related papers (2024-07-10T20:11:51Z) - The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.31327813151208]
BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate nine distinct capabilities of LMs across 77 diverse tasks. A key feature of the BiGGen Bench is its use of instance-specific evaluation criteria, closely mirroring the nuanced discernment of human evaluation.
arXiv Detail & Related papers (2024-06-09T12:30:30Z) - Bias in Language Models: Beyond Trick Tests and Toward RUTEd Evaluation [49.3814117521631]
Standard benchmarks of bias and fairness in large language models (LLMs) measure the association between social attributes implied in user prompts and short responses.
We develop analogous RUTEd evaluations from three contexts of real-world use.
We find that standard bias metrics have no significant correlation with the more realistic bias metrics.
arXiv Detail & Related papers (2024-02-20T01:49:15Z) - Dialectical language model evaluation: An initial appraisal of the commonsense spatial reasoning abilities of LLMs [10.453404263936335]
We explore an alternative dialectical evaluation of language models for commonsense reasoning.
The goal of this kind of evaluation is not to obtain an aggregate performance value but to find failures and map the boundaries of the system.
In this paper we conduct some qualitative investigations of this kind of evaluation for the particular case of spatial reasoning.
arXiv Detail & Related papers (2023-04-22T06:28:46Z) - Holistic Evaluation of Language Models [183.94891340168175]
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood.
We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models.
arXiv Detail & Related papers (2022-11-16T18:51:34Z) - Characteristics of Harmful Text: Towards Rigorous Benchmarking of Language Models [32.960462266615096]
Large language models produce human-like text that drives a growing number of applications.
Recent literature and, increasingly, real world observations have demonstrated that these models can generate language that is toxic, biased, untruthful or otherwise harmful.
We outline six ways of characterizing harmful text which merit explicit consideration when designing new benchmarks.
arXiv Detail & Related papers (2022-06-16T17:28:01Z) - AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite getting trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z) - Do Fine-tuned Commonsense Language Models Really Generalize? [8.591839265985412]
We study the generalization issue in detail by designing and conducting a rigorous scientific study.
We find clear evidence that fine-tuned commonsense language models still do not generalize well, even with moderate changes to the experimental setup.
arXiv Detail & Related papers (2020-11-18T08:52:49Z)