Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning
- URL: http://arxiv.org/abs/2510.14773v1
- Date: Thu, 16 Oct 2025 15:09:22 GMT
- Title: Finding Answers in Thought Matters: Revisiting Evaluation on Large Language Models with Reasoning
- Authors: Hwiyeol Jo, Joosung Lee, Jaehone Lee, Sang-Woo Lee, Joonsuk Park, Kang Min Yoo
- Abstract summary: We propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness.
- Score: 23.867629719024325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluating generative models, such as large language models (LLMs), commonly involves question-answering tasks where the final answer is selected based on probability of answer choices. On the other hand, for models requiring reasoning, the method of answer extraction plays a critical role. Our research reveals that the performance of reasoning models and their final answer distributions are highly sensitive to the answer extraction algorithm employed. In order to mitigate this, we propose a basic framework: Answer Regeneration. The method uses an additional model inference, providing the prior input and output prefaced by the prompt "Answer:". The final answer is then selected or extracted from the regenerated output. We show that this extraction-rule-agnostic approach exhibits improved performance and enhanced robustness. Furthermore, we have applied this framework to general math problems and open-ended question answering tasks. Our analysis and this framework could offer more reliable results for model evaluation.
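The two-pass idea described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate` is a hypothetical callable standing in for any LLM inference call, the function name `regenerate_answer`, the prompt layout, and the choice labels are assumptions, and the regex-based selection is only one possible way to read the answer from the regenerated output.

```python
import re
from typing import Callable, Optional, Sequence


def regenerate_answer(
    generate: Callable[[str], str],  # hypothetical: maps a prompt to model text
    question: str,
    reasoning_output: str,
    choices: Sequence[str] = ("A", "B", "C", "D"),  # assumed label set
) -> Optional[str]:
    """Run a second inference on prior input + prior output + "Answer:"."""
    # Build the regeneration prompt: the original input, the model's own
    # reasoning output, then the "Answer:" cue mentioned in the abstract.
    prompt = f"{question}\n{reasoning_output}\nAnswer:"
    regenerated = generate(prompt)

    # Pick the first standalone choice label in the regenerated text.
    # Simplification: a real evaluation might compare label probabilities.
    pattern = r"\b(" + "|".join(re.escape(c) for c in choices) + r")\b"
    match = re.search(pattern, regenerated)
    return match.group(1) if match else None


def fake_model(prompt: str) -> str:
    """Stand-in for a real model, used only to make the sketch runnable."""
    return " B" if prompt.endswith("Answer:") else "I am not sure."


question = "Q: What is 2 + 2? (A) 3 (B) 4 (C) 5 (D) 6"
trace = "Let me think. 2 + 2 equals 4, which is option (B)."
print(regenerate_answer(fake_model, question, trace))
```

Because the second pass yields a short, answer-only continuation, the selection step no longer has to parse the full reasoning trace, which is the sense in which the approach is described as extraction-rule-agnostic. Note that this simple pattern would also match a capital "A" used as an article, so it is only suitable for terse regenerated outputs.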
Related papers
- Answer-Consistent Chain-of-thought Reinforcement Learning For Multi-modal Large Language Models [33.398631680508814]
We propose Answer-Consistent Reinforcement Learning that modifies the GRPO algorithm with an auxiliary consistency check. We design a consistency-verification reward that grants a high reward only if both the original and the post-shuffle answers agree and are correct. We evaluate ACRE on challenging Video Reasoning benchmarks and multimodal math reasoning benchmarks, achieving an average 2.2% and 1.5% improvement.
arXiv Detail & Related papers (2025-10-11T08:32:52Z)
- First Try Matters: Revisiting the Role of Reflection in Reasoning Models [66.39546876232512]
We focus on reflective behaviours where the model has already produced an answer but continues reflecting before finalizing its output. Our analysis reveals that reflections are predominantly confirmatory and rarely alter the model's initial answer. We propose a question-aware early-stopping method that enhances inference-time token efficiency by stopping the reasoning process once a few plausible candidate answers are generated.
arXiv Detail & Related papers (2025-10-09T14:57:10Z)
- Beyond the Last Answer: Your Reasoning Trace Uncovers More than You Think [51.0691253204425]
We analyze intermediate reasoning steps, termed subthoughts, to answer two questions: Does the final answer reliably represent the model's optimal conclusion? Our approach involves segmenting a reasoning trace into sequential subthoughts based on linguistic cues. We find that aggregating these answers by selecting the most frequent one (the mode) often yields significantly higher accuracy compared to relying solely on the answer derived from the original complete trace.
arXiv Detail & Related papers (2025-04-29T12:39:07Z)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
Multiple-Choice Question Answering (MCQA) is widely used to evaluate Large Language Models (LLMs). We show that multiple factors can significantly impact the reported performance of LLMs. We analyze whether existing answer extraction methods are aligned with human judgment.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering [33.2921120857455]
We show that increasing compute budget at inference time helps models answer more questions correctly. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk.
arXiv Detail & Related papers (2025-02-19T18:58:31Z)
- PRefLexOR: Preference-based Recursive Language Modeling for Exploratory Optimization of Reasoning and Agentic Thinking [0.0]
PRefLexOR combines preference optimization with concepts from Reinforcement Learning to enable models to self-teach.
We focus on applications in biological materials science and demonstrate the method in a variety of case studies.
arXiv Detail & Related papers (2024-10-16T08:46:26Z)
- Answering Ambiguous Questions via Iterative Prompting [84.3426020642704]
In open-domain question answering, due to the ambiguity of questions, multiple plausible answers may exist.
One approach is to directly predict all valid answers, but this can struggle with balancing relevance and diversity.
We present AmbigPrompt to address the imperfections of existing approaches to answering ambiguous questions.
arXiv Detail & Related papers (2023-07-08T04:32:17Z)
- Improving Passage Retrieval with Zero-Shot Question Generation [109.11542468380331]
We propose a simple and effective re-ranking method for improving passage retrieval in open question answering.
The re-ranker re-scores retrieved passages with a zero-shot question generation model, which uses a pre-trained language model to compute the probability of the input question conditioned on a retrieved passage.
arXiv Detail & Related papers (2022-04-15T14:51:41Z)
- Generative Context Pair Selection for Multi-hop Question Answering [60.74354009152721]
We propose a generative context selection model for multi-hop question answering.
Our proposed generative passage selection model has a better performance (4.9% higher than baseline) on adversarial held-out set.
arXiv Detail & Related papers (2021-04-18T07:00:48Z)
- Cooperative Learning of Zero-Shot Machine Reading Comprehension [9.868221447090855]
We propose a cooperative, self-play learning model for question generation and answering.
We can train question generation and answering models on any textual corpora without annotation.
Our model outperforms the state-of-the-art pretrained language models on standard question answering benchmarks.
arXiv Detail & Related papers (2021-03-12T18:22:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.