Large Language Models Are Not Robust Multiple Choice Selectors
- URL: http://arxiv.org/abs/2309.03882v4
- Date: Thu, 22 Feb 2024 01:40:35 GMT
- Title: Large Language Models Are Not Robust Multiple Choice Selectors
- Authors: Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang
- Abstract summary: Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
- Score: 117.72712117510953
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiple choice questions (MCQs) serve as a common yet important task format
in the evaluation of large language models (LLMs). This work shows that modern
LLMs are vulnerable to option position changes in MCQs due to their inherent
"selection bias", namely, they prefer to select specific option IDs as answers
(like "Option A"). Through extensive empirical analyses with 20 LLMs on three
benchmarks, we pinpoint that this behavioral bias primarily stems from LLMs'
token bias, where the model a priori assigns more probabilistic mass to
specific option ID tokens (e.g., A/B/C/D) when predicting answers from the
option IDs. To mitigate selection bias, we propose a label-free, inference-time
debiasing method, called PriDe, which separates the model's prior bias for
option IDs from the overall prediction distribution. PriDe first estimates the
prior by permuting option contents on a small number of test samples, and
then applies the estimated prior to debias the remaining samples. We
demonstrate that it achieves interpretable and transferable debiasing with high
computational efficiency. We hope this work can draw broader research attention
to the bias and robustness of modern LLMs.
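The debiasing idea described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the prior over option IDs is estimated by averaging the model's option-ID probabilities in log space across permutations of option contents, and that debiasing then divides the observed prediction distribution by that prior and renormalizes. The toy probability values are invented for illustration.

```python
import numpy as np

def estimate_prior(perm_probs):
    """Estimate the model's prior bias over option IDs.

    perm_probs: K x n array; row k holds the model's probabilities over
    the n option IDs for the k-th permutation of option contents.
    Averaging in log space over permutations cancels content-dependent
    effects, leaving the model's preference for the IDs themselves.
    """
    log_p = np.log(np.asarray(perm_probs, dtype=float))
    prior = np.exp(log_p.mean(axis=0))
    return prior / prior.sum()

def debias(observed, prior):
    """Divide out the ID prior and renormalize the prediction."""
    scores = np.asarray(observed, dtype=float) / prior
    return scores / scores.sum()

# Toy example: two permutations of the same question suggest the model
# systematically inflates option "A".
prior = estimate_prior([[0.50, 0.25, 0.25],
                        [0.40, 0.40, 0.20]])
debiased = debias([0.45, 0.35, 0.20], prior)
```

After dividing out the prior, the toy model's preferred answer shifts from "A" to "B": the raw win for "A" was an artifact of the ID bias, not of the option content.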
Related papers
- Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models [16.34646723046073]
Video Language Models (VLMs) are designed to answer complex video-focused questions.
Current benchmarks fail to capture the full reasoning capabilities of VLMs due to selection bias.
This study is the first focused investigation of selection bias in video-to-text LLM-powered models.
arXiv Detail & Related papers (2024-10-18T07:52:22Z)
- Mitigating Selection Bias with Node Pruning and Auxiliary Options [11.835002896308545]
Large language models (LLMs) often show unwarranted preference for certain choice options when responding to multiple-choice questions.
Previous solutions utilized debiasing methods to adjust the model's input and/or output.
Our work, in contrast, investigates the model's internal representation of the selection bias.
arXiv Detail & Related papers (2024-09-27T15:53:54Z)
- Differentiating Choices via Commonality for Multiple-Choice Question Answering [54.04315943420376]
In multiple-choice question answering, the answer choices themselves can provide valuable clues for identifying the right answer.
Existing models often rank each choice separately, overlooking the context provided by other choices.
We propose a novel model by differentiating choices through identifying and eliminating their commonality, called DCQA.
arXiv Detail & Related papers (2024-08-21T12:05:21Z)
- Evaluating Nuanced Bias in Large Language Model Free Response Answers [8.775925011558995]
We identify several kinds of nuanced bias in free text that cannot be identified by multiple choice tests.
We present a semi-automated pipeline for detecting these types of bias by first eliminating answers that can be automatically classified as unbiased.
arXiv Detail & Related papers (2024-07-11T19:58:13Z)
- LLM-Select: Feature Selection with Large Language Models [64.5099482021597]
Large language models (LLMs) are capable of selecting the most predictive features, with performance rivaling the standard tools of data science.
Our findings suggest that LLMs may be useful not only for selecting the best features for training but also for deciding which features to collect in the first place.
arXiv Detail & Related papers (2024-07-02T22:23:40Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Pseudo Label Selection is a Decision Problem [0.0]
Pseudo-Labeling is a simple and effective approach to semi-supervised learning.
It requires criteria that guide the selection of pseudo-labeled data.
Overfitting can be propagated to the final model by choosing instances with overconfident but wrong predictions.
arXiv Detail & Related papers (2023-09-25T07:48:02Z)
- Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions [5.187383020960245]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks.
Previous works have shown that these models are sensitive to prompt wording, as well as to few-shot demonstrations and their order.
This paper investigates the sensitivity of LLMs to the order of options in multiple-choice questions.
arXiv Detail & Related papers (2023-08-22T14:54:59Z)
- Improving Selective Visual Question Answering by Learning from Your Peers [74.20167944693424]
Visual Question Answering (VQA) models can have difficulties abstaining from answering when they are wrong.
We propose Learning from Your Peers (LYP) approach for training multimodal selection functions for making abstention decisions.
Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model.
arXiv Detail & Related papers (2023-06-14T21:22:01Z)
- True Few-Shot Learning with Language Models [78.42578316883271]
We evaluate the few-shot ability of LMs when held-out examples are unavailable.
Our findings suggest that prior work significantly overestimated the true few-shot ability of LMs.
arXiv Detail & Related papers (2021-05-24T17:55:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.