Is Your Large Language Model Knowledgeable or a Choices-Only Cheater?
- URL: http://arxiv.org/abs/2407.01992v1
- Date: Tue, 2 Jul 2024 07:06:53 GMT
- Title: Is Your Large Language Model Knowledgeable or a Choices-Only Cheater?
- Authors: Nishant Balepur, Rachel Rudinger
- Abstract summary: Recent work shows that large language models (LLMs) can answer multiple-choice questions using only the choices.
We use a contrast set that probes if LLMs over-rely on choices-only shortcuts in MCQA.
After validating our contrast set, we test 12 LLMs, finding that these models do not exhibit reliance on choices-only shortcuts when given both the question and choices.
- Score: 16.384333600053342
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent work shows that large language models (LLMs) can answer multiple-choice questions using only the choices, but does this mean that MCQA leaderboard rankings of LLMs are largely influenced by abilities in choices-only settings? To answer this, we use a contrast set that probes if LLMs over-rely on choices-only shortcuts in MCQA. While previous works build contrast sets via expensive human annotations or model-generated data which can be biased, we employ graph mining to extract contrast sets from existing MCQA datasets. We use our method on UnifiedQA, a group of six commonsense reasoning datasets with high choices-only accuracy, to build an 820-question contrast set. After validating our contrast set, we test 12 LLMs, finding that these models do not exhibit reliance on choices-only shortcuts when given both the question and choices. Thus, despite the susceptibility of MCQA to high choices-only accuracy, we argue that LLMs are not obtaining high ranks on MCQA leaderboards just due to their ability to exploit choices-only shortcuts.
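To make the full-prompt versus choices-only comparison concrete, below is a minimal sketch of how such a probe can be run with an off-the-shelf causal LM. This is not the authors' code; the model name, prompt templates, and example question are placeholders, and real evaluations would aggregate accuracy over a full dataset.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder; the paper evaluates 12 larger LLMs
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
lm = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
lm.eval()

def pick_option(prompt, labels=("A", "B", "C", "D")):
    """Return the option letter whose token gets the highest next-token logit."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = lm(ids).logits[0, -1]  # distribution over the next token
    scores = {lab: next_logits[tok.encode(" " + lab)[0]].item() for lab in labels}
    return max(scores, key=scores.get)

question = "What do people usually carry an umbrella for?"
choices = ["staying dry in the rain", "cooking dinner", "paying taxes", "sleeping"]
option_block = "\n".join(f"{lab}. {c}" for lab, c in zip("ABCD", choices))

full_prompt = f"Question: {question}\n{option_block}\nAnswer:"
choices_only_prompt = f"{option_block}\nAnswer:"  # the question is withheld

print("full prompt  ->", pick_option(full_prompt))
print("choices only ->", pick_option(choices_only_prompt))
```

Comparing accuracy under the two prompts across a dataset is the basic signal the abstract refers to: high choices-only accuracy indicates that the choices alone carry exploitable shortcuts.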
Related papers
- Listening to the Wise Few: Select-and-Copy Attention Heads for Multiple-Choice QA [19.78468832417275]
We introduce new scores that better capture and reveal the model's underlying knowledge.
Based on these scores, our method improves knowledge extraction, yielding up to 16% gain for LLaMA2-7B.
The accuracy on a simple synthetic dataset, where the model explicitly knows the right answer, increases by almost 60%.
arXiv Detail & Related papers (2024-10-03T09:53:48Z)
- Differentiating Choices via Commonality for Multiple-Choice Question Answering [54.04315943420376]
In multiple-choice question answering, the answer choices themselves can provide valuable clues for choosing the right answer.
Existing models often rank each choice separately, overlooking the context provided by other choices.
We propose a novel model, DCQA, which differentiates choices by identifying and eliminating their commonality.
arXiv Detail & Related papers (2024-08-21T12:05:21Z)
- Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena [23.264049073539663]
Multiple-choice questions (MCQs) are frequently used to assess large language models (LLMs).
LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to unbalanced prior probabilities over those IDs.
This work aims to tackle these significant difficulties, and establish a new LLM evaluation benchmark through entirely open-style questions.
arXiv Detail & Related papers (2024-06-11T17:59:47Z)
- Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? [15.308093827770474]
We probe if large language models (LLMs) can perform multiple-choice question answering (MCQA) with choices-only prompts.
This prompt bests a majority baseline in 11/12 cases, with up to 0.33 accuracy gain.
We conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference.
arXiv Detail & Related papers (2024-02-19T19:38:58Z)
- Test-Time Self-Adaptive Small Language Models for Question Answering [63.91013329169796]
We show and investigate the capabilities of smaller self-adaptive LMs using only unlabeled test data.
Our proposed self-adaption strategy demonstrates significant performance improvements on benchmark QA datasets.
arXiv Detail & Related papers (2023-10-20T06:49:32Z)
- Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
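As a rough illustration of this kind of inference-time debiasing, here is a simplified toy sketch (with made-up probabilities, and not PriDe's actual estimator): the prior preference for option IDs is estimated by averaging the model's predictions across permutations of the option contents, and is then divided out of new predictions.

```python
import numpy as np

# Toy observed P(option ID | question) for one question under four cyclic
# permutations of the same answer contents (rows: permutations, cols: IDs A-D).
obs = np.array([
    [0.46, 0.22, 0.18, 0.14],
    [0.40, 0.28, 0.18, 0.14],
    [0.44, 0.24, 0.17, 0.15],
    [0.43, 0.25, 0.18, 0.14],
])

# With no ID bias, each content would win equally often across permutations,
# so the column means give a (simplified) estimate of the prior over IDs.
id_prior = obs.mean(axis=0)
id_prior /= id_prior.sum()

def debias(p):
    """Divide out the estimated ID prior and renormalize."""
    q = p / id_prior
    return q / q.sum()

raw_prediction = np.array([0.42, 0.26, 0.18, 0.14])  # a new, biased prediction
print("estimated ID prior :", np.round(id_prior, 3))
print("debiased prediction:", np.round(debias(raw_prediction), 3))
```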
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena [76.21004582932268]
We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases.
We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform.
arXiv Detail & Related papers (2023-06-09T05:55:52Z)
- Leveraging Large Language Models for Multiple Choice Question Answering [6.198523595657983]
We show that a model with high multiple-choice symbol binding (MCSB) ability, i.e., the ability to directly output the symbol of its chosen option, performs much better with this natural prompting approach than with the traditional approach of scoring each choice separately.
arXiv Detail & Related papers (2022-10-22T05:04:54Z)
- Language Prior Is Not the Only Shortcut: A Benchmark for Shortcut Learning in VQA [53.45074798673808]
VQA models are prone to learn the shortcut solution formed by dataset biases rather than the intended solution.
We propose a new dataset that considers varying types of shortcuts by constructing different distribution shifts in multiple OOD test sets.
Our benchmark provides a more rigorous and comprehensive testbed for shortcut learning in VQA.
arXiv Detail & Related papers (2022-10-10T13:39:08Z)
- True Few-Shot Learning with Language Models [78.42578316883271]
We evaluate the few-shot ability of LMs when held-out examples are unavailable.
Our findings suggest that prior work significantly overestimated the true few-shot ability of LMs.
arXiv Detail & Related papers (2021-05-24T17:55:51Z)