Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?
- URL: http://arxiv.org/abs/2402.12483v2
- Date: Fri, 7 Jun 2024 23:11:14 GMT
- Title: Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question?
- Authors: Nishant Balepur, Abhilasha Ravichander, Rachel Rudinger
- Abstract summary: We probe if large language models (LLMs) can perform multiple-choice question answering (MCQA) with choices-only prompts.
This prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain.
We conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference.
- Score: 15.308093827770474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple-choice question answering (MCQA) is often used to evaluate large language models (LLMs). To see if MCQA assesses LLMs as intended, we probe if LLMs can perform MCQA with choices-only prompts, where models must select the correct answer only from the choices. In three MCQA datasets and four LLMs, this prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain. To help explain this behavior, we conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference. Our key findings are threefold. First, we find no evidence that the choices-only accuracy stems from memorization alone. Second, priors over individual choices do not fully explain choices-only accuracy, hinting that LLMs use the group dynamics of choices. Third, LLMs have some ability to infer a relevant question from choices, and surprisingly can sometimes even match the original question. Inferring the original question is an impressive reasoning strategy, but it cannot fully explain the high choices-only accuracy of LLMs in MCQA. Thus, while LLMs are not fully incapable of reasoning in MCQA, we still advocate for the use of stronger baselines in MCQA benchmarks, the design of robust MCQA datasets for fair evaluations, and further efforts to explain LLM decision-making.
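The probing setup described in the abstract is easy to reproduce in outline. Below is a minimal sketch in Python, assuming a hypothetical query_llm(prompt) callable that returns the model's chosen option letter; the prompt wording and the helper are illustrative, not the paper's exact protocol.

```python
from collections import Counter

LETTERS = "ABCD"

def full_prompt(question: str, choices: list[str]) -> str:
    """Standard MCQA prompt: the question plus lettered choices."""
    opts = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    return f"Question: {question}\n{opts}\nAnswer with the letter of the correct choice."

def choices_only_prompt(choices: list[str]) -> str:
    """Choices-only prompt: the question is withheld entirely."""
    opts = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    return f"Choices:\n{opts}\nAnswer with the letter of the correct choice."

def majority_baseline(gold_letters: list[str]) -> float:
    """Accuracy of always predicting the most frequent gold letter."""
    top_count = Counter(gold_letters).most_common(1)[0][1]
    return top_count / len(gold_letters)

def choices_only_accuracy(dataset, query_llm) -> float:
    """dataset: iterable of (question, choices, gold_letter) triples."""
    correct = sum(query_llm(choices_only_prompt(c)) == g for _, c, g in dataset)
    return correct / len(dataset)
```

Comparing choices_only_accuracy against majority_baseline for each dataset-model pair mirrors the 11/12 comparison reported in the abstract.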
Related papers
- Are LLMs Aware that Some Questions are not Open-ended? [58.93124686141781]
We study whether Large Language Models are aware that some questions have limited answers and need to respond more deterministically.
The lack of question awareness in LLMs leads to two phenomena: (1) responses that are too casual for non-open-ended questions, and (2) responses that are too boring for open-ended questions.
arXiv Detail & Related papers (2024-10-01T06:07:00Z)
- Differentiating Choices via Commonality for Multiple-Choice Question Answering [54.04315943420376]
In multiple-choice question answering, the choices themselves can provide valuable clues for selecting the right answer.
Existing models often rank each choice separately, overlooking the context provided by other choices.
We propose a novel model, DCQA, that differentiates choices by identifying and eliminating their commonality.
arXiv Detail & Related papers (2024-08-21T12:05:21Z)
- Is Your Large Language Model Knowledgeable or a Choices-Only Cheater? [16.384333600053342]
Recent work shows that large language models (LLMs) can answer multiple-choice questions using only the choices.
We use a contrast set that probes if LLMs over-rely on choices-only shortcuts in MCQA.
After validating our contrast set, we test 12 LLMs and find that these models do not exhibit reliance on choices-only shortcuts when given both the question and the choices.
arXiv Detail & Related papers (2024-07-02T07:06:53Z)
- Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena [23.264049073539663]
Multiple-choice questions (MCQs) are frequently used to assess large language models (LLMs).
LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to a priori unbalanced probabilities.
This work aims to tackle these significant difficulties, and establish a new LLM evaluation benchmark through entirely open-style questions.
arXiv Detail & Related papers (2024-06-11T17:59:47Z)
- Perception of Knowledge Boundary for Large Language Models through Semi-open-ended Question Answering [67.94354589215637]
Large Language Models (LLMs) are widely used for knowledge-seeking yet suffer from hallucinations.
In this paper, we perceive the LLMs' knowledge boundary (KB) with semi-open-ended questions (SoeQ).
We find that GPT-4 performs poorly on SoeQ and is often unaware of its KB.
Our auxiliary model, LLaMA-2-13B, is effective in discovering more ambiguous answers.
arXiv Detail & Related papers (2024-05-23T10:00:14Z)
- Can multiple-choice questions really be useful in detecting the abilities of LLMs? [15.756543037102256]
Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs).
The misalignment between the task and the evaluation method demands a thoughtful analysis of MCQ's efficacy.
We evaluate nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English.
arXiv Detail & Related papers (2024-03-26T14:43:48Z)
- Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs [52.42505579545893]
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought explanations alongside answers.
We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT.
arXiv Detail & Related papers (2024-02-17T05:22:56Z)
- Beyond the Answers: Reviewing the Rationality of Multiple Choice Question Answering for the Evaluation of Large Language Models [29.202758753639078]
This study investigates the limitations of Multiple Choice Question Answering (MCQA) as an evaluation method for Large Language Models (LLMs).
We propose a dataset augmentation method for Multiple-Choice Questions (MCQs), MCQA+, that more accurately reflects model performance.
arXiv Detail & Related papers (2024-02-02T12:07:00Z)
- Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
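The one-line description of PriDe above compresses the method considerably. As a loose, hypothetical illustration of debiasing by removing a prior over option IDs (a sketch of the idea, not the published PriDe procedure), one could estimate the prior by averaging the model's predicted distribution over IDs across cyclic permutations of the option contents, then divide it out of new predictions:

```python
import numpy as np

def estimate_id_prior(predict_probs, question, choices):
    """Average the model's distribution over option IDs across cyclic
    permutations of the option contents, so content effects wash out and
    an estimate of the prior over IDs remains. predict_probs is a
    stand-in callable returning [P(A), P(B), P(C), P(D)]."""
    n = len(choices)
    all_probs = []
    for shift in range(n):
        permuted = choices[shift:] + choices[:shift]
        all_probs.append(np.asarray(predict_probs(question, permuted)))
    prior = np.mean(all_probs, axis=0)
    return prior / prior.sum()

def debias(observed_probs, prior):
    """Divide out the estimated ID prior and renormalize."""
    adjusted = np.asarray(observed_probs) / prior
    return adjusted / adjusted.sum()
```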
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
- Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
- Leveraging Large Language Models for Multiple Choice Question Answering [6.198523595657983]
We show that a model with high multiple-choice symbol binding (MCSB) ability performs much better with the natural approach of presenting the question and all answer options together than with the traditional cloze-style approach of scoring each option separately (the two prompting styles are sketched below).
arXiv Detail & Related papers (2022-10-22T05:04:54Z)
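For the last entry above, the contrast between the "natural" and "traditional" prompting styles can be sketched concretely. The helpers llm_choose_letter (returns an option letter) and llm_logprob (returns a completion's log-probability) are hypothetical stand-ins, not an actual API:

```python
def natural_mcqa(question, choices, llm_choose_letter):
    """'Natural' multiple-choice prompting: show all options at once and ask
    the model to reply with a single letter (this requires symbol binding)."""
    letters = "ABCDEFGH"[: len(choices)]
    opts = "\n".join(f"{letter}. {c}" for letter, c in zip(letters, choices))
    prompt = f"{question}\n{opts}\nAnswer:"
    return letters.index(llm_choose_letter(prompt))

def cloze_mcqa(question, choices, llm_logprob):
    """Traditional cloze-style scoring: score each option's text as a
    continuation of the question and pick the highest-scoring option."""
    scores = [llm_logprob(f"{question} {choice}") for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])
```

Both functions return the index of the selected option; the cited finding is that models with high MCSB ability do markedly better with the first style.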