Changing Answer Order Can Decrease MMLU Accuracy
- URL: http://arxiv.org/abs/2406.19470v2
- Date: Mon, 11 Nov 2024 02:27:54 GMT
- Title: Changing Answer Order Can Decrease MMLU Accuracy
- Authors: Vipul Gupta, David Pantoja, Candace Ross, Adina Williams, Megan Ung
- Abstract summary: We investigate the robustness of the accuracy measurement on a widely used multiple choice question answering dataset, MMLU.
When shuffling the answer label contents, we find that all explored models decrease in accuracy on MMLU, but not every model is equally sensitive.
- Score: 18.774650080306944
- Abstract: As large language models (LLMs) have grown in prevalence, particular benchmarks have become essential for the evaluation of these models and for understanding model capabilities. Most commonly, we use test accuracy averaged across multiple subtasks in order to rank models on leaderboards, to determine which model is best for our purposes. In this paper, we investigate the robustness of the accuracy measurement on a widely used multiple choice question answering dataset, MMLU. When shuffling the answer label contents, we find that all explored models decrease in accuracy on MMLU, but not every model is equally sensitive. These findings suggest a possible adjustment to the standard practice of leaderboard testing, where we additionally consider the percentage of examples each model answers correctly by random chance.
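As a rough illustration of the protocol in the abstract, the sketch below shuffles the answer contents of each MMLU-style item while keeping the option labels fixed, then recomputes accuracy; `model_predict` is a hypothetical stand-in for any LLM inference call and is not part of the paper's code.

```python
import random

def shuffle_choices(choices, answer_idx, rng):
    """Shuffle the contents attached to the fixed option labels A-D."""
    order = list(range(len(choices)))
    rng.shuffle(order)
    shuffled = [choices[i] for i in order]
    return shuffled, order.index(answer_idx)  # where the gold content moved

def shuffled_accuracy(dataset, model_predict, seed=0):
    """dataset: list of (question, choices, gold_index) triples."""
    rng = random.Random(seed)
    labels = "ABCD"
    correct = 0
    for question, choices, gold in dataset:
        choices, gold = shuffle_choices(choices, gold, rng)
        pred = model_predict(question, list(zip(labels, choices)))  # returns "A".."D"
        correct += pred == labels[gold]
    return correct / len(dataset)
```

Comparing this number against unshuffled accuracy gives the kind of sensitivity the paper measures.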
Related papers
- Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models [16.34646723046073]
Video Language Models (VLMs) are designed to answer complex video-focused questions.
Current benchmarks fail to capture the full reasoning capabilities of VLMs due to selection bias.
This study is the first focused investigation of selection bias in video-to-text LLM-powered models.
arXiv Detail & Related papers (2024-10-18T07:52:22Z)
- Precise Model Benchmarking with Only a Few Observations [6.092112060364272]
We propose an empirical Bayes (EB) estimator that balances direct and regression estimates for each subgroup separately.
EB consistently provides more precise estimates of the LLM performance compared to the direct and regression approaches.
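For intuition, a minimal sketch of an empirical-Bayes shrinkage of this kind is given below: each subgroup's direct accuracy is pulled toward a regression-based prediction, with weights set by the precision of the direct estimate. The variable names and the plug-in variance `tau2` are illustrative assumptions, not the paper's exact estimator.

```python
import numpy as np

def eb_estimates(direct, n, regression, tau2):
    """direct: per-subgroup accuracies; n: subgroup sizes;
    regression: per-subgroup predictions from covariates;
    tau2: variance of true accuracies around the regression fit (assumed given)."""
    direct = np.asarray(direct, dtype=float)
    n = np.asarray(n, dtype=float)
    regression = np.asarray(regression, dtype=float)
    sigma2 = direct * (1 - direct) / n  # sampling variance of each direct mean
    w = tau2 / (tau2 + sigma2)          # more data -> trust the direct estimate more
    return w * direct + (1 - w) * regression
```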
arXiv Detail & Related papers (2024-10-07T17:26:31Z)
- Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales [0.0]
We present a new model for multiple choice data, the monotone multiple choice (MMC) model, which we fit using autoencoders.
We demonstrate empirically that the MMC model outperforms the traditional nominal response IRT model in terms of fit.
arXiv Detail & Related papers (2024-10-02T12:33:16Z)
- Assessing Model Generalization in Vicinity [34.86022681163714]
This paper evaluates the generalization ability of classification models on out-of-distribution test sets without depending on ground truth labels.
We propose incorporating responses from neighboring test samples into the correctness assessment of each individual sample.
The resulting scores are then averaged across all test samples to provide a holistic indication of model accuracy.
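A minimal sketch of that neighborhood idea, assuming precomputed test-sample embeddings and a per-sample confidence proxy; the cosine-similarity neighbors and the 50/50 blend below are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def vicinity_score(features, scores, k=5):
    """features: (N, D) embeddings of test samples;
    scores: (N,) per-sample correctness proxies (e.g. model confidence)."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T                         # cosine similarity between all pairs
    np.fill_diagonal(sim, -np.inf)        # exclude the sample itself
    nn = np.argsort(-sim, axis=1)[:, :k]  # indices of the k nearest neighbors
    per_sample = 0.5 * scores + 0.5 * scores[nn].mean(axis=1)
    return per_sample.mean()              # holistic indication of model accuracy
```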
arXiv Detail & Related papers (2024-06-13T15:58:37Z)
- OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations.
We identify and review the varying factors in evaluation practices adopted by the community.
OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions.
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
- Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
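A simplified sketch of prior debiasing in this spirit: estimate the model's prior over option IDs by averaging its option probabilities across content permutations on a few estimation samples, then divide it out at test time. `get_option_probs` is a hypothetical interface, and the plain averaging here simplifies the method described in the paper.

```python
import itertools
import numpy as np

def estimate_id_prior(get_option_probs, samples):
    """samples: a few (question, choices) pairs used only for prior estimation."""
    priors = []
    for question, choices in samples:
        for perm in itertools.permutations(range(len(choices))):
            shuffled = [choices[i] for i in perm]
            priors.append(get_option_probs(question, shuffled))  # probs over option IDs
    return np.mean(priors, axis=0)

def debiased_prediction(get_option_probs, question, choices, prior):
    probs = get_option_probs(question, choices)
    return int(np.argmax(probs / prior))  # remove the prior bias over option IDs
```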
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
- Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking [30.155625852894797]
We propose a browser-based benchmarking tool for researchers and challenge organizers.
Our tool helps test generalization capabilities of models across multiple datasets.
Interactive filtering facilitates discovery of problematic behavior.
arXiv Detail & Related papers (2021-10-11T11:08:35Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best by learning both the content and the structure of the task, but it suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- One for More: Selecting Generalizable Samples for Generalizable ReID Model [92.40951770273972]
This paper proposes a one-for-more training objective that takes the generalization ability of selected samples as a loss function.
Our proposed one-for-more based sampler can be seamlessly integrated into the ReID training framework.
arXiv Detail & Related papers (2020-12-10T06:37:09Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
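A minimal sketch of confidence-weighted prototype refinement, the idea these summaries build on; here the confidences come from a fixed softmax over distances, whereas the paper meta-learns them, so the temperature and distance metric below are placeholder assumptions.

```python
import numpy as np

def refine_prototypes(prototypes, queries, temp=1.0):
    """prototypes: (C, D) class means from the support set;
    queries: (Q, D) unlabeled query embeddings."""
    d = ((queries[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # (Q, C)
    logits = -d / temp
    conf = np.exp(logits - logits.max(axis=1, keepdims=True))
    conf /= conf.sum(axis=1, keepdims=True)      # soft assignment per query
    weighted = conf.T @ queries                  # confidence-weighted query mass, (C, D)
    mass = conf.sum(axis=0)[:, None]             # total confidence per class, (C, 1)
    return (prototypes + weighted) / (1 + mass)  # blend with original prototypes
```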
arXiv Detail & Related papers (2020-01-22T14:39:28Z)
- ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.