Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
- URL: http://arxiv.org/abs/2502.14127v1
- Date: Wed, 19 Feb 2025 22:11:52 GMT
- Title: Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above
- Authors: Nishant Balepur, Rachel Rudinger, Jordan Lee Boyd-Graber
- Abstract summary: Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing.
We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge.
- Score: 14.5781090243416
- Abstract: Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing, but we argue for its reform. We first reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We instead advocate for generative formats based on human testing, where LLMs construct and explain answers, better capturing user needs and knowledge while remaining easy to score. We then show that even when MCQA is a useful format, its datasets suffer from leakage, unanswerability, shortcuts, and saturation. For each issue, we give fixes from education, such as rubrics to guide MCQ writing, scoring methods to bridle guessing, and Item Response Theory to build harder MCQs. Lastly, we discuss LLM errors in MCQA (robustness, biases, and unfaithful explanations), showing how our prior solutions better measure or address these issues. While we do not need to desert MCQA, we encourage more effort in refining the task based on educational testing, thereby advancing evaluations.
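The education-based fixes named in the abstract, formula scoring to bridle guessing and Item Response Theory (IRT) to calibrate item difficulty, have standard textbook forms. The sketch below is a minimal illustration of those standard forms, not the paper's implementation; the four-option example, the 2PL parameterization, and all names and numbers are assumptions chosen here only for concreteness.

```python
import math

def formula_score(num_correct: int, num_wrong: int, num_options: int) -> float:
    """Classic formula scoring from educational testing: each wrong answer
    costs 1/(k-1) points, so blind guessing has an expected score of zero."""
    return num_correct - num_wrong / (num_options - 1)

def irt_2pl(ability: float, difficulty: float, discrimination: float) -> float:
    """2-parameter-logistic IRT: probability that an examinee with the given
    ability answers an item with the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

# Hypothetical example: an LLM answers 80 of 100 four-option MCQs correctly.
print(formula_score(num_correct=80, num_wrong=20, num_options=4))  # 73.33...

# Items that an average-ability examinee (ability = 0) almost always gets
# right add little signal; IRT exposes them as too easy, i.e. saturated.
for difficulty in (-2.0, 0.0, 2.0):
    p = irt_2pl(ability=0.0, difficulty=difficulty, discrimination=1.0)
    print(f"difficulty={difficulty:+.1f} -> P(correct)={p:.2f}")
```

In practice the IRT parameters are fit from many responses rather than set by hand; they are fixed here only to show how the item characteristic curve separates easy, moderate, and hard items.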
Related papers
- Affordably Fine-tuned LLMs Provide Better Answers to Course-specific MCQs [0.9217021281095907]
We study how Large Language Models (LLMs) answer multiple-choice questions (MCQs) with respect to hardware constraints and refinement techniques.
We explore this space by using generic pre-trained LLMs to answer 162 undergraduate-level MCQs from a Programming Languages (PL) course.
arXiv Detail & Related papers (2025-01-10T11:44:35Z) - Reverse Question Answering: Can an LLM Write a Question so Hard (or Bad) that it Can't Answer? [24.614521528699093]
We test reverse question answering (RQA): for an input answer, give a question with that answer.
By finding question and answer types that lead to RQA errors, we suggest improvements for LLM reasoning.
arXiv Detail & Related papers (2024-10-20T21:17:49Z) - Are LLMs Aware that Some Questions are not Open-ended? [58.93124686141781]
We study whether Large Language Models are aware that some questions have limited answers and need to respond more deterministically.
The lack of question awareness in LLMs leads to two phenomena: (1) responding too casually to non-open-ended questions, or (2) responding too blandly to open-ended questions.
arXiv Detail & Related papers (2024-10-01T06:07:00Z) - Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena [23.264049073539663]
Multiple-choice questions (MCQs) are frequently used to assess large language models (LLMs).
LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to inherent biases from a priori unbalanced probabilities (a minimal probe of this bias is sketched after this list).
This work aims to tackle these significant difficulties and establish a new LLM evaluation benchmark through entirely open-style questions.
arXiv Detail & Related papers (2024-06-11T17:59:47Z) - Can multiple-choice questions really be useful in detecting the abilities of LLMs? [15.756543037102256]
Multiple-choice questions (MCQs) are widely used in the evaluation of large language models (LLMs).
The misalignment between the task and the evaluation method demands a thoughtful analysis of MCQ's efficacy.
We evaluate nine LLMs on four question-answering (QA) datasets in two languages: Chinese and English.
arXiv Detail & Related papers (2024-03-26T14:43:48Z) - Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? [15.308093827770474]
We probe whether large language models (LLMs) can perform multiple-choice question answering (MCQA) with choices-only prompts (a minimal choices-only probe is also sketched after this list).
Choices-only prompting bests a majority baseline in 11/12 cases, with accuracy gains of up to 0.33.
We conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference.
arXiv Detail & Related papers (2024-02-19T19:38:58Z) - Rephrase and Respond: Let Large Language Models Ask Better Questions for Themselves [57.974103113675795]
We present a method named 'Rephrase and Respond' (RaR), which allows Large Language Models to rephrase and expand questions posed by humans.
RaR serves as a simple yet effective prompting method for improving performance.
We show that RaR is complementary to the popular Chain-of-Thought (CoT) methods, both theoretically and empirically.
arXiv Detail & Related papers (2023-11-07T18:43:34Z) - FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z) - Won't Get Fooled Again: Answering Questions with False Premises [79.8761549830075]
Pre-trained language models (PLMs) have shown unprecedented potential in various fields.
PLMs tend to be easily deceived by tricky questions such as "How many eyes does the sun have?"
We find that the PLMs already possess the knowledge required to rebut such questions.
arXiv Detail & Related papers (2023-07-05T16:09:21Z) - Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z) - Unsupervised Multiple Choices Question Answering: Start Learning from Basic Knowledge [75.7135212362517]
We study the possibility of almost unsupervised Multiple Choices Question Answering (MCQA).
The proposed method is shown to outperform the baseline approaches on RACE and is even comparable with some supervised learning approaches on MC500.
arXiv Detail & Related papers (2020-10-21T13:44:35Z)
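Two of the dataset issues surveyed above, option-ID bias and choices-only shortcuts, can be probed without any benchmark infrastructure. The sketch below is an illustration under assumptions: `ask_model` is a hypothetical stand-in for whatever LLM call is available, and the cyclic-permutation and choices-only prompts are generic probes written here, not the exact protocols of the cited papers.

```python
from collections import Counter
from typing import Callable, List

LETTERS = "ABCD"

def format_mcq(question: str, options: List[str]) -> str:
    """Render an MCQ prompt; pass an empty question for a choices-only probe."""
    lines = [question] if question else []
    lines += [f"{LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def probe_option_bias(question: str, options: List[str], gold: str,
                      ask_model: Callable[[str], str]) -> Counter:
    """Cycle the gold answer through every letter slot. A content-driven model
    tracks the gold answer; a position-biased model keeps picking one letter."""
    picks = Counter()
    for shift in range(len(options)):
        rotated = options[shift:] + options[:shift]
        reply = ask_model(format_mcq(question, rotated)).strip()[:1].upper()
        gold_letter = LETTERS[rotated.index(gold)]
        picks["correct" if reply == gold_letter else f"chose {reply}"] += 1
    return picks

# Hypothetical stand-in model that always answers "A" (pure position bias).
def always_a(prompt: str) -> str:
    return "A"

print(probe_option_bias("Which planet is largest?",
                        ["Jupiter", "Mars", "Venus", "Mercury"],
                        gold="Jupiter", ask_model=always_a))
# Counter({'chose A': 3, 'correct': 1})

# Choices-only probe: drop the question entirely. If accuracy stays well above
# 1/len(options), the dataset likely contains question-free shortcuts.
print(format_mcq("", ["Jupiter", "Mars", "Venus", "Mercury"]))
```

Such probes are cheap sanity checks to run before trusting raw MCQA accuracy.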