Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers
- URL: http://arxiv.org/abs/2510.07761v1
- Date: Thu, 09 Oct 2025 04:00:09 GMT
- Title: Test-Time Reasoners Are Strategic Multiple-Choice Test-Takers
- Authors: Nishant Balepur, Atrey Desai, Rachel Rudinger
- Abstract summary: Large language models (LLMs) now give reasoning before answering, excelling in tasks like multiple-choice question answering (MCQA). Yet, work finds LLMs sans reasoning succeed in MCQA without using the question, i.e., choices-only. To study these strategies, reasoning LLMs solve MCQs given full and choices-only inputs; test-time reasoning often boosts accuracy on full inputs and boosts choices-only accuracy about half the time. While possibly due to shallow shortcuts, choices-only success is barely affected by the length of reasoning traces.
- Score: 27.30313753837339
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) now give reasoning before answering, excelling in tasks like multiple-choice question answering (MCQA). Yet, a concern is that LLMs do not solve MCQs as intended: prior work finds that LLMs without reasoning succeed in MCQA without using the question, i.e., from the choices only. Such partial-input success is often deemed problematic, but reasoning traces could reveal whether these strategies are truly shallow in choices-only settings. To study these strategies, we have reasoning LLMs solve MCQs given both full and choices-only inputs; test-time reasoning often boosts accuracy on full inputs and boosts choices-only accuracy about half the time. While this could stem from shallow shortcuts, choices-only success is barely affected by the length of reasoning traces, and after finding that traces pass faithfulness tests, we show they use less problematic strategies, such as inferring the missing question. In all, we challenge claims that partial-input success is always a flaw, and we discuss how reasoning traces could separate problematic data from less problematic reasoning.
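The abstract's core experimental contrast is between a full-input MCQA prompt and a choices-only probe that withholds the question. Below is a minimal sketch of those two input conditions; the prompt wording and function names are illustrative assumptions, not the paper's actual templates or code.

```python
# Minimal sketch of the two evaluation conditions described in the
# abstract. Prompt wording and names are illustrative assumptions,
# not the paper's actual templates.

LETTERS = "ABCD"

def full_input_prompt(question: str, choices: list[str]) -> str:
    """Standard MCQA input: the question plus lettered answer choices."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("Reason step by step, then answer with a single letter.")
    return "\n".join(lines)

def choices_only_prompt(choices: list[str]) -> str:
    """Partial-input probe: the question is withheld; only choices remain."""
    lines = [f"{LETTERS[i]}. {c}" for i, c in enumerate(choices)]
    lines.append("The question is hidden. Reason step by step, then answer "
                 "with a single letter.")
    return "\n".join(lines)

if __name__ == "__main__":
    choices = ["Berlin", "Paris", "Madrid", "Rome"]
    print(full_input_prompt("What is the capital of France?", choices))
    print("---")
    print(choices_only_prompt(choices))
```

Comparing a model's accuracy on the two prompts over the same question set is what separates genuine question use from choices-only success.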
Related papers
- More Bias, Less Bias: BiasPrompting for Enhanced Multiple-Choice Question Answering [53.09478307383865]
We introduce BiasPrompting, a novel inference framework for large language models (LLMs). It guides LLMs to generate and critically evaluate reasoning across all plausible answer options before reaching a final prediction, and it demonstrates significant improvements on five widely used multiple-choice question answering benchmarks.
arXiv Detail & Related papers (2025-11-25T09:01:08Z)
- MMReason: An Open-Ended Multi-Modal Multi-Step Reasoning Benchmark for MLLMs Toward AGI [59.196131618912005]
Reasoning plays a crucial role in advancing Multimodal Large Language Models (MLLMs). Existing MLLM benchmarks often fall short in precisely and comprehensively evaluating long-chain reasoning abilities. We introduce MMReason, a new benchmark designed to precisely and comprehensively evaluate MLLM long-chain reasoning capability.
arXiv Detail & Related papers (2025-06-30T07:14:38Z)
- QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks? [15.390695446510405]
Real-world queries are often underspecified and only solvable by acquiring missing information. We present QuestBench, a set of underspecified reasoning tasks solvable by asking at most one question. Analysis shows that the ability to solve well-specified reasoning problems is not sufficient for success on our benchmark.
arXiv Detail & Related papers (2025-03-28T17:58:40Z)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
Multiple-Choice Question Answering (MCQA) is widely used to evaluate Large Language Models (LLMs). We show that multiple factors can significantly impact the reported performance of LLMs. We analyze whether existing answer extraction methods are aligned with human judgment.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- Empowering LLMs with Logical Reasoning: A Comprehensive Survey [49.91445266392609]
Large language models (LLMs) have achieved remarkable successes on various tasks. Recent studies have found that there are still significant challenges to the logical reasoning abilities of LLMs.
arXiv Detail & Related papers (2025-02-21T18:20:35Z)
- Which of These Best Describes Multiple Choice Evaluation with LLMs? A) Forced B) Flawed C) Fixable D) All of the Above [14.5781090243416]
Multiple choice question answering (MCQA) is popular for LLM evaluation due to its simplicity and human-like testing. We reveal flaws in MCQA's format, as it struggles to: 1) test generation/subjectivity; 2) match LLM use cases; and 3) fully test knowledge. We advocate for generative formats based on human testing, where LLMs construct and explain answers, better capturing user needs and knowledge while remaining easy to score.
arXiv Detail & Related papers (2025-02-19T22:11:52Z)
- On Memorization of Large Language Models in Logical Reasoning [70.94164038947078]
Large language models (LLMs) achieve good performance on challenging reasoning benchmarks, yet could also make basic reasoning mistakes. One hypothesis is that the increasingly high and nearly saturated performance could be due to the memorization of similar problems. We show that fine-tuning leads to heavy memorization, but it also consistently improves generalization performance.
arXiv Detail & Related papers (2024-10-30T15:31:54Z)
- Can LLMs Reason in the Wild with Programs? [20.47557047823847]
We introduce the task of reasoning in the wild, where an LLM is tasked to solve a reasoning problem of unknown type.
We create a large tactic-guided trajectory dataset containing detailed solutions to a diverse set of reasoning problems.
In experiments, we highlight that existing LLMs fail significantly on problems with ambiguous and mixed scope.
arXiv Detail & Related papers (2024-06-19T18:26:19Z)
- Open-LLM-Leaderboard: From Multi-choice to Open-style Questions for LLMs Evaluation, Benchmark, and Arena [23.264049073539663]
Multiple-choice questions (MCQ) are frequently used to assess large language models (LLMs). LLMs may inherently favor certain answer choice IDs, such as A/B/C/D, due to a priori unbalanced probabilities. This work aims to tackle these difficulties and establish a new LLM evaluation benchmark built entirely from open-style questions.
arXiv Detail & Related papers (2024-06-11T17:59:47Z)
- Artifacts or Abduction: How Do LLMs Answer Multiple-Choice Questions Without the Question? [15.308093827770474]
We probe if large language models (LLMs) can perform multiple-choice question answering (MCQA) with choices-only prompts.
This prompt bests a majority baseline in 11/12 cases, with accuracy gains of up to 0.33.
We conduct an in-depth, black-box analysis on memorization, choice dynamics, and question inference.
arXiv Detail & Related papers (2024-02-19T19:38:58Z)
- DCR: Divide-and-Conquer Reasoning for Multi-choice Question Answering with LLMs [9.561022942046279]
We propose Divide and Conquer Reasoning (DCR) to enhance the reasoning capability of large language models (LLMs). We first categorize questions into two subsets based on a confidence score ($\mathcal{CS}$), which is estimated from the statistical frequency of generated answers; a minimal sketch of such a score follows the list below.
arXiv Detail & Related papers (2024-01-10T14:38:46Z)
- SelfCheck: Using LLMs to Zero-Shot Check Their Own Step-by-Step Reasoning [55.76083560152823]
SelfCheck is a general-purpose zero-shot verification schema for recognizing errors in step-by-step reasoning.
We test SelfCheck on three datasets (GSM8K, MathQA, and MATH) and find that it successfully recognizes errors and, in turn, increases final answer accuracies.
arXiv Detail & Related papers (2023-08-01T10:31:36Z)
- Large Language Models are Better Reasoners with Self-Verification [48.534270563880845]
Large language models (LLMs) have shown strong reasoning ability in several natural language processing tasks.
LLMs with chain of thought (CoT) prompting require multi-step prompting and multi-token prediction, which is highly sensitive to individual mistakes.
We propose and prove that LLMs also have similar self-verification abilities.
arXiv Detail & Related papers (2022-12-19T15:51:52Z)
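As referenced in the DCR entry above, the following is a minimal sketch of a frequency-based confidence score of the kind that entry describes: $\mathcal{CS}$ is estimated from how often sampled answers agree. The threshold value and function names are assumptions for illustration, not DCR's actual implementation.

```python
# Hedged sketch of a frequency-based confidence score, as described in
# the DCR entry above. Threshold and names are illustrative assumptions.
from collections import Counter

def confidence_score(sampled_answers: list[str]) -> float:
    """CS = share of samples agreeing with the most common answer."""
    counts = Counter(sampled_answers)
    return counts.most_common(1)[0][1] / len(sampled_answers)

def split_by_confidence(samples_per_question: dict[str, list[str]],
                        threshold: float = 0.8):
    """Partition questions into high- and low-confidence subsets."""
    high, low = [], []
    for question, samples in samples_per_question.items():
        (high if confidence_score(samples) >= threshold else low).append(question)
    return high, low

# Example: five sampled answers for one question -> CS = 0.8.
print(confidence_score(["B", "B", "B", "A", "B"]))
```

Questions whose sampled answers mostly agree are treated as high-confidence and answered directly, while the rest are routed to a more careful reasoning procedure.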