Changing Answer Order Can Decrease MMLU Accuracy
- URL: http://arxiv.org/abs/2406.19470v2
- Date: Mon, 11 Nov 2024 02:27:54 GMT
- Title: Changing Answer Order Can Decrease MMLU Accuracy
- Authors: Vipul Gupta, David Pantoja, Candace Ross, Adina Williams, Megan Ung
- Abstract summary: We investigate the robustness of the accuracy measurement on a widely used multiple choice question answering dataset, MMLU.
When shuffling the answer label contents, we find that all explored models decrease in accuracy on MMLU, but not every model is equally sensitive.
- Score: 18.774650080306944
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: As large language models (LLMs) have grown in prevalence, particular benchmarks have become essential for the evaluation of these models and for understanding model capabilities. Most commonly, we use test accuracy averaged across multiple subtasks in order to rank models on leaderboards, to determine which model is best for our purposes. In this paper, we investigate the robustness of the accuracy measurement on a widely used multiple choice question answering dataset, MMLU. When shuffling the answer label contents, we find that all explored models decrease in accuracy on MMLU, but not every model is equally sensitive. These findings suggest a possible adjustment to the standard practice of leaderboard testing, where we additionally consider the percentage of examples each model answers correctly by random chance.
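The shuffling described in the abstract can be illustrated with a minimal sketch. This is not the authors' code. The `Question` type, the helper names, and the toy `always_first` model are hypothetical, introduced only to show how answer contents can be permuted under fixed labels and how accuracy can be compared before and after.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    """Hypothetical container for one multiple-choice item."""
    stem: str
    choices: tuple  # answer texts in label order (A, B, C, D)
    answer: int     # index of the correct choice


def shuffle_choices(q, rng):
    """Return a copy of q whose answer texts are permuted under fixed labels."""
    order = list(range(len(q.choices)))
    rng.shuffle(order)
    new_choices = tuple(q.choices[i] for i in order)
    # The correct text now sits wherever its old index landed in `order`.
    return Question(q.stem, new_choices, order.index(q.answer))


def accuracy(model, questions):
    """Fraction of questions for which model(q) returns the correct index."""
    return sum(model(q) == q.answer for q in questions) / len(questions)


def always_first(q):
    """Toy position-biased 'model' (an assumption): always picks label A."""
    return 0


rng = random.Random(0)  # fixed seed so the shuffle is reproducible
original = [Question(f"q{i}", ("w", "x", "y", "z"), 0) for i in range(100)]
shuffled = [shuffle_choices(q, rng) for q in original]

print("original:", accuracy(always_first, original))
print("shuffled:", accuracy(always_first, shuffled))
```

In this toy setup the correct answer always sits under label A, so the position-biased model scores perfectly on the original order. After shuffling, its accuracy depends only on how often the correct text happens to land under A, which is the kind of gap the paper measures on real models.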
Related papers
- ABCD: All Biases Come Disguised [4.603755953026689]
Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice.
We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels.
We show that this protocol substantially improves the robustness to answer permutations, reducing mean accuracy variance $3\times$ with only a minimal decrease in the mean model's performance.
arXiv Detail & Related papers (2026-02-19T15:12:33Z) - Don't Always Pick the Highest-Performing Model: An Information Theoretic View of LLM Ensemble Selection [8.266188814122605]
Large language models (LLMs) are often ensembled together to improve overall reliability and robustness, but in practice models are strongly correlated.
We formulate budgeted ensemble selection as maximizing the mutual information between the true label and predictions of the selected models.
Motivated by these, we propose a simple greedy mutual-information selection algorithm that estimates the required information terms directly from data.
arXiv Detail & Related papers (2026-02-08T15:05:22Z) - The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks [0.013048920509133805]
We present a study on known, open-source LLMs responding to 10 repetitions of questions from the benchmarks MMLU-Redux and MedQA.
Results show that the number of questions that can be answered consistently varies considerably among models.
Results for medium-sized models seem to indicate much higher levels of answer consistency.
arXiv Detail & Related papers (2025-09-05T17:31:14Z) - Ranked from Within: Ranking Large Multimodal Models for Visual Question Answering Without Labels [64.94853276821992]
Large multimodal models (LMMs) are increasingly deployed across diverse applications.
Traditional evaluation methods are largely dataset-centric, relying on fixed, labeled datasets and supervised metrics.
We explore unsupervised model ranking for LMMs by leveraging their uncertainty signals, such as softmax probabilities.
arXiv Detail & Related papers (2024-12-09T13:05:43Z) - Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models [16.34646723046073]
Video Language Models (VLMs) are designed to answer complex video-focused questions.
Current benchmarks fail to capture the full reasoning capabilities of VLMs due to selection bias.
This study is the first focused investigation of selection bias in video-to-text LLM-powered models.
arXiv Detail & Related papers (2024-10-18T07:52:22Z) - Precise Model Benchmarking with Only a Few Observations [6.092112060364272]
We propose an empirical Bayes (EB) estimator that balances direct and regression estimates for each subgroup separately.
EB consistently provides more precise estimates of the LLM performance compared to the direct and regression approaches.
arXiv Detail & Related papers (2024-10-07T17:26:31Z) - Introducing Flexible Monotone Multiple Choice Item Response Theory Models and Bit Scales [0.0]
We present a new model for multiple choice data, the monotone multiple choice (MMC) model, which we fit using autoencoders.
We demonstrate empirically that the MMC model outperforms the traditional nominal response IRT model in terms of fit.
arXiv Detail & Related papers (2024-10-02T12:33:16Z) - Assessing Model Generalization in Vicinity [34.86022681163714]
This paper evaluates the generalization ability of classification models on out-of-distribution test sets without depending on ground truth labels.
We propose incorporating responses from neighboring test samples into the correctness assessment of each individual sample.
The resulting scores are then averaged across all test samples to provide a holistic indication of model accuracy.
arXiv Detail & Related papers (2024-06-13T15:58:37Z) - OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations.
We identify and review the varying factors in evaluation practices adopted by the community.
OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions.
arXiv Detail & Related papers (2024-06-12T17:37:09Z) - Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z) - Beyond Accuracy: A Consolidated Tool for Visual Question Answering Benchmarking [30.155625852894797]
We propose a browser-based benchmarking tool for researchers and challenge organizers.
Our tool helps test generalization capabilities of models across multiple datasets.
Interactive filtering facilitates discovery of problematic behavior.
arXiv Detail & Related papers (2021-10-11T11:08:35Z) - Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
arXiv Detail & Related papers (2021-09-07T03:13:06Z) - One for More: Selecting Generalizable Samples for Generalizable ReID Model [92.40951770273972]
This paper proposes a one-for-more training objective that takes the generalization ability of selected samples as a loss function.
Our proposed one-for-more based sampler can be seamlessly integrated into the ReID training framework.
arXiv Detail & Related papers (2020-12-10T06:37:09Z) - Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z) - ManyModalQA: Modality Disambiguation and QA over Diverse Inputs [73.93607719921945]
We present a new multimodal question answering challenge, ManyModalQA, in which an agent must answer a question by considering three distinct modalities.
We collect our data by scraping Wikipedia and then utilize crowdsourcing to collect question-answer pairs.
arXiv Detail & Related papers (2020-01-22T14:39:28Z)