Large Language Models Sensitivity to The Order of Options in
Multiple-Choice Questions
- URL: http://arxiv.org/abs/2308.11483v1
- Date: Tue, 22 Aug 2023 14:54:59 GMT
- Title: Large Language Models Sensitivity to The Order of Options in
Multiple-Choice Questions
- Authors: Pouya Pezeshkpour, Estevam Hruschka
- Abstract summary: Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks.
Previous works have shown that these models are sensitive to prompt wording, as well as to few-shot demonstrations and their order.
This paper investigates the sensitivity of LLMs to the order of options in multiple-choice questions.
- Score: 5.187383020960245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in
various NLP tasks. However, previous works have shown that these models are
sensitive to prompt wording, as well as to few-shot demonstrations and their
order, posing challenges to the fair assessment of these models. As these
models become more powerful, it becomes imperative to understand and address
these limitations. In this paper, we focus on LLMs' robustness on the task of
multiple-choice questions -- a commonly adopted task for studying the
reasoning and fact-retrieving capabilities of LLMs. Investigating the
sensitivity of LLMs to the order of options in multiple-choice questions, we
demonstrate a considerable performance gap of approximately 13% to 75% on
different benchmarks when answer options are reordered, even when using
demonstrations in a few-shot setting. Through a detailed analysis, we
conjecture that this sensitivity arises when LLMs are uncertain about the
prediction between the top-2/3 choices, and that specific option placements
may favor certain predictions among those top choices depending on the
question, a behavior caused by positional bias. We also identify patterns in
the top-2 choices that amplify or mitigate the model's bias toward option
placement. We find that, to amplify bias, the optimal strategy is to position
the top two choices as the first and last options; conversely, to mitigate
bias, we recommend placing these choices in adjacent positions. To validate
our conjecture, we conduct various experiments and adopt two approaches to
calibrate LLMs' predictions, leading to improvements of up to 8 percentage
points across different models and benchmarks.
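The measurement behind the reported gap is straightforward to reproduce in outline: present the same question with its options in different orders, track the predicted answer by content rather than by option letter, and compare accuracy across orderings. The sketch below is a minimal illustration of that protocol under stated assumptions, not the paper's evaluation harness; query_llm is a hypothetical stand-in for an actual model call, and the majority vote over orderings is just one simple way to average away positional bias, not necessarily one of the two calibration approaches used in the paper.

```python
import itertools
import random
from collections import Counter

LETTERS = "ABCDEFGH"

def format_prompt(question, options):
    """Render an MCQ prompt with lettered options (A., B., C., ...)."""
    lines = [question] + [f"{LETTERS[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def query_llm(prompt):
    """Hypothetical model call; replace with a real API or local model.
    Here it guesses a random valid letter so the sketch runs end to end."""
    n_options = sum(1 for line in prompt.split("\n") if line[1:2] == ".")
    return random.choice(LETTERS[:n_options])

def order_sensitivity(question, options, gold, max_orderings=6):
    """Ask the same question under several option orderings and report
    (accuracy across orderings, majority-vote answer by content)."""
    all_orders = list(itertools.permutations(options))
    orderings = random.sample(all_orders, k=min(max_orderings, len(all_orders)))
    picked_contents, correct = [], 0
    for ordering in orderings:
        letter = query_llm(format_prompt(question, ordering))
        content = ordering[LETTERS.index(letter)]
        picked_contents.append(content)
        correct += content == gold
    majority = Counter(picked_contents).most_common(1)[0][0]
    return correct / len(orderings), majority

if __name__ == "__main__":
    acc, majority = order_sensitivity(
        question="What is the capital of France?",
        options=["Berlin", "Paris", "Madrid", "Rome"],
        gold="Paris",
    )
    print(f"accuracy across orderings: {acc:.2f}, majority vote: {majority}")
```

Aggregated over a benchmark, the spread between the best-case and worst-case orderings is the kind of gap the 13% to 75% figure refers to; a model with no positional bias would score the same under every ordering.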
Related papers
- Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes.
We find that the majority of disagreements stand in opposition to standard reward modeling approaches.
We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
- Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models [16.34646723046073]
Video Language Models (VLMs) are designed to answer complex video-focused questions.
Current benchmarks fail to capture the full reasoning capabilities of VLMs due to selection bias.
This study is the first focused investigation of selection bias in video-to-text LLM-powered models.
arXiv Detail & Related papers (2024-10-18T07:52:22Z)
- LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
- Grade Score: Quantifying LLM Performance in Option Selection [0.0]
"Grade Score" is a novel metric designed to evaluate the consistency and fairness of Large Language Models (LLMs)
The Grade Score combines Entropy, which measures order bias, and Mode Frequency, which assesses choice stability.
The study explores techniques such as prompt engineering and option sampling strategies to optimize the Grade Score.
arXiv Detail & Related papers (2024-06-17T19:29:39Z)
- Deep Bayesian Active Learning for Preference Modeling in Large Language Models [84.817400962262]
We propose the Bayesian Active Learner for Preference Modeling (BAL-PM).
BAL-PM requires 33% to 68% fewer preference labels in two popular human preference datasets and exceeds previous Bayesian acquisition policies.
arXiv Detail & Related papers (2024-06-14T13:32:43Z)
- Unveiling Selection Biases: Exploring Order and Token Sensitivity in Large Language Models [24.300350113903768]
We investigate "selection biases" in Large Language Models (LLMs)
We quantify the impact of these biases through an extensive empirical analysis across multiple models and tasks.
We propose mitigation strategies to enhance model performance.
arXiv Detail & Related papers (2024-06-05T07:16:51Z)
- Strengthened Symbol Binding Makes Large Language Models Reliable Multiple-Choice Selectors [11.470005425117371]
Multiple-Choice Questions (MCQs) constitute a critical area of research in the study of Large Language Models (LLMs).
We introduce an efficient Supervised Fine-Tuning algorithm for MCQs, termed Point-wise Intelligent Feedback (PIF).
arXiv Detail & Related papers (2024-06-03T06:20:12Z)
- Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs [56.526095828316386]
We propose a novel framework for adaptation with self-evaluation to improve the selective prediction performance of large language models (LLMs).
We evaluate our method on a variety of question-answering (QA) datasets and show that it outperforms state-of-the-art selective prediction methods.
arXiv Detail & Related papers (2023-10-18T03:34:59Z)
- Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution; a simplified sketch of this decomposition appears after this list.
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
- True Few-Shot Learning with Language Models [78.42578316883271]
We evaluate the few-shot ability of LMs when held-out examples are unavailable.
Our findings suggest that prior work significantly overestimated the true few-shot ability of LMs.
arXiv Detail & Related papers (2021-05-24T17:55:51Z)
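The Grade Score entry above describes a metric built from entropy over the picked option IDs (order bias) and the mode frequency of the picked answer (choice stability). The summary does not give the exact combination rule, so the equal-weight average in the sketch below is an assumption made purely to make the two components concrete, not the published formula.

```python
import math
from collections import Counter

def toy_grade_score(picked_ids, picked_contents, n_options):
    """Toy Grade-Score-style metric for one question asked under many shuffles.

    picked_ids:      option slots (0..n_options-1) chosen each time; a model with
                     no position bias spreads these out -> high normalized entropy.
    picked_contents: answer text chosen each time; a stable model keeps picking
                     the same content -> high mode frequency.

    The published Grade Score combines these two signals; the 50/50 average here
    is an illustrative assumption, not the paper's formula.
    """
    n = len(picked_ids)
    counts = Counter(picked_ids)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    norm_entropy = entropy / math.log2(n_options)                  # 1.0 = no ID bias
    mode_freq = Counter(picked_contents).most_common(1)[0][1] / n  # 1.0 = stable
    return 0.5 * norm_entropy + 0.5 * mode_freq

# Eight shuffles of a 4-option question: the model always answers "Paris"
# (stable content) while "Paris" lands in different slots (no position bias).
print(toy_grade_score([0, 2, 1, 3, 2, 0, 3, 1], ["Paris"] * 8, n_options=4))  # 1.0
```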
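PriDe, in the "Large Language Models Are Not Robust Multiple Choice Selectors" entry, is described as separating the model's prior bias for option IDs from the overall prediction distribution. One way to read that decomposition is observed ≈ prior(ID) × debiased(content at ID); the snippet below applies that reading at inference time for a single question. It is a sketch of the idea under that assumed factorization, not the paper's estimation procedure, and the way the ID prior is obtained (e.g. by averaging choices over questions with permuted options) is only suggested in a comment, not implemented.

```python
def debias_with_id_prior(observed_probs, id_prior):
    """Remove a content-independent option-ID prior from one question's prediction.

    observed_probs: model probability of answering with each option ID (A, B, ...).
    id_prior:       the model's standing preference for each ID, e.g. estimated by
                    averaging its choices over questions with permuted options
                    (hypothetical estimation step, not shown here).

    Assumes observed is proportional to prior(ID) * debiased(content at ID), so
    dividing by the prior and renormalizing yields a content-level distribution.
    """
    scores = [obs / max(prior, 1e-9) for obs, prior in zip(observed_probs, id_prior)]
    total = sum(scores)
    return [s / total for s in scores]

# The model leans toward slot A regardless of content ...
id_prior = [0.40, 0.25, 0.20, 0.15]
observed = [0.38, 0.30, 0.18, 0.14]  # raw prediction: slot A narrowly wins
print([round(p, 3) for p in debias_with_id_prior(observed, id_prior)])
# -> slot B wins once the ID prior is removed (about 0.301 vs 0.238 for A)
```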
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.