Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs
- URL: http://arxiv.org/abs/2509.15020v1
- Date: Thu, 18 Sep 2025 14:47:58 GMT
- Title: Mind the Gap: A Closer Look at Tokenization for Multiple-Choice Question Answering with LLMs
- Authors: Mario Sanz-Guerrero, Minh Duc Bui, Katharina von der Wense
- Abstract summary: There is no consensus on how to tokenize the space following the colon, often overlooked as a trivial choice. Surprisingly, we are able to recommend one specific strategy -- tokenizing the space together with the answer letter. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When evaluating large language models (LLMs) with multiple-choice question answering (MCQA), it is common to end the prompt with the string "Answer:" to facilitate automated answer extraction via next-token probabilities. However, there is no consensus on how to tokenize the space following the colon, often overlooked as a trivial choice. In this paper, we uncover accuracy differences of up to 11% due to this (seemingly irrelevant) tokenization variation as well as reshuffled model rankings, raising concerns about the reliability of LLM comparisons in prior work. Surprisingly, we are able to recommend one specific strategy -- tokenizing the space together with the answer letter -- as we observe consistent and statistically significant performance improvements. Additionally, it improves model calibration, enhancing the reliability of the model's confidence estimates. Our findings underscore the importance of careful evaluation design and highlight the need for standardized, transparent evaluation protocols to ensure reliable and comparable results.
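The answer-extraction setup described in the abstract can be sketched as follows. This is a minimal illustration with made-up log-probabilities and a hypothetical `extract_answer` helper; no real model or tokenizer is involved. It only shows how the two tokenization strategies change which candidate strings are scored when reading the answer from next-token probabilities after "Answer:".

```python
import math

# Illustrative next-token log-probabilities after a prompt ending in "Answer:"
# (made-up numbers, not taken from any real model).
logprobs = {" A": -0.7, " B": -1.5, " C": -2.0, " D": -2.3,
            "A": -3.0, "B": -3.2, "C": -3.5, "D": -3.6}

def extract_answer(logprobs, letters="ABCD", space_with_letter=True):
    # space_with_letter=True scores the tokens " A", " B", ... -- the space is
    # tokenized together with the answer letter, the strategy the paper
    # recommends. False leaves the space in the prompt ("Answer: ") and
    # scores the bare letters "A", "B", ... instead.
    prefix = " " if space_with_letter else ""
    return max(letters, key=lambda l: logprobs.get(prefix + l, -math.inf))

print(extract_answer(logprobs, space_with_letter=True))   # scores " A" vs " B" ...
print(extract_answer(logprobs, space_with_letter=False))  # scores "A" vs "B" ...
```

With real models the two strategies score different token sequences, which is what produces the accuracy gaps the paper reports.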
Related papers
- Sample Smart, Not Hard: Correctness-First Decoding for Better Reasoning in LLMs [72.82403830490084]
We argue that the decoding rule should be calibrated by correctness, not confidence alone. We propose simple strategies that achieve this goal: Greedy-Threshold makes sampling greedy at very low confidence steps. Together, our findings challenge prevailing assumptions about decoding under uncertainty and show gains across math and general reasoning benchmarks.
arXiv Detail & Related papers (2025-10-07T14:46:12Z)
- Reference-Free Rating of LLM Responses via Latent Information [53.463883683503106]
We study the common practice of asking a judge model to assign Likert-scale scores to free-text responses. We then propose and evaluate Latent Judges, which derive scalar ratings from internal model signals. Across a broad suite of pairwise and single-rating benchmarks, latent methods match or surpass standard prompting.
arXiv Detail & Related papers (2025-09-29T12:15:52Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types. We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- SCOPE: Stochastic and Counterbiased Option Placement for Evaluating Large Language Models [0.27309692684728604]
Large Language Models (LLMs) can achieve inflated scores on multiple-choice tasks by exploiting inherent biases in option positions or labels. This study introduces SCOPE, an evaluation framework designed to measure and mitigate such selection bias in a dataset-independent manner.
arXiv Detail & Related papers (2025-07-24T08:28:17Z)
- Cautious Next Token Prediction [62.74127603725369]
We propose a new training-free decoding strategy, dubbed Cautious Next Token Prediction (CNTP). In the decoding process, if the model has comparatively high prediction entropy at a certain step, we sample multiple trials starting from that step independently and stop when encountering any punctuation. We show that our proposed CNTP approach outperforms existing standard decoding strategies consistently by a clear margin.
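The entropy-based branching described in the summary above can be illustrated with a toy sketch. The `should_branch` helper and its threshold are hypothetical stand-ins, not the paper's actual procedure; the point is only that the decision to sample multiple trials is gated on the entropy of the predicted distribution.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def should_branch(probs, threshold=1.0):
    # Toy decision rule in the spirit of the CNTP summary: branch into
    # multiple sampled trials only when the next-token distribution is
    # high-entropy; otherwise decode greedily. The threshold is illustrative.
    return entropy(probs) > threshold

print(should_branch([0.9, 0.05, 0.05]))        # confident step -> False
print(should_branch([0.25, 0.25, 0.25, 0.25]))  # uncertain step -> True
```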
arXiv Detail & Related papers (2025-07-03T05:49:18Z)
- Improving LLM First-Token Predictions in Multiple-Choice Question Answering via Prefilling Attack [44.205352310633174]
Large Language Models (LLMs) are increasingly evaluated on multiple-choice question answering (MCQA) tasks. We propose a solution: the *prefilling attack*, a structured natural-language prefix (e.g., "*The correct option is:*") prepended to the model output. Our findings suggest that prefilling is a simple, robust, and low-cost method to enhance the reliability of FTP-based evaluation in multiple-choice settings.
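A minimal sketch of the prefilling idea summarized above, assuming a plain text-prompt setup; the `build_mcqa_prompt` helper and the exact prefix placement are illustrative, not the paper's implementation. The prefix is placed at the start of the model's turn so that the first generated token is the option letter itself.

```python
def build_mcqa_prompt(question, options, prefill="The correct option is: "):
    # Assemble the question and lettered options into a single prompt, and
    # return the natural-language prefix to be prepended to the model output.
    lines = [question]
    lines += [f"{letter}. {text}" for letter, text in options.items()]
    prompt = "\n".join(lines)
    return prompt, prefill

prompt, prefill = build_mcqa_prompt(
    "What is 2 + 2?", {"A": "3", "B": "4", "C": "5"})
# The model would be asked to continue: prompt + "\n" + prefill,
# and answer extraction reads the option letter from the first predicted token.
```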
arXiv Detail & Related papers (2025-05-21T09:58:38Z)
- Language Model Uncertainty Quantification with Attention Chain [9.093726246465117]
A large language model's (LLM) predictive uncertainty is crucial for judging the reliability of its answers. We propose UQAC, an efficient method that narrows the reasoning space to a tractable size for marginalization. We validate UQAC on multiple reasoning benchmarks with advanced open-source LLMs.
arXiv Detail & Related papers (2025-03-24T21:43:47Z)
- On Verbalized Confidence Scores for LLMs [25.160810008907397]
Uncertainty quantification for large language models (LLMs) can establish more human trust in their responses. This work focuses on asking the LLM itself to verbalize its uncertainty with a confidence score as part of its output tokens. We assess the reliability of verbalized confidence scores with respect to different datasets, models, and prompt methods.
arXiv Detail & Related papers (2024-12-19T11:10:36Z)
- Decompose and Compare Consistency: Measuring VLMs' Answer Reliability via Task-Decomposition Consistency Comparison [22.438863942925973]
We propose Decompose and Compare Consistency (DeCC) for reliability measurement.
By comparing the consistency between the direct answer, generated using the VLM's internal reasoning process, and the answers obtained through task decomposition, DeCC measures the reliability of the VLM's direct answer.
arXiv Detail & Related papers (2024-07-10T17:00:29Z)
- Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode.
We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
- Multicalibration for Confidence Scoring in LLMs [6.948522445499497]
This paper proposes the use of "multicalibration" to yield interpretable and reliable confidence scores for outputs generated by large language models (LLMs).
We show how to form groupings for prompt/completion pairs that are correlated with the probability of correctness via two techniques: clustering within an embedding space, and "self-annotation".
We show how our techniques can yield confidence scores that provide substantial improvements in fine-grained measures of both calibration and accuracy compared to existing methods.
arXiv Detail & Related papers (2024-04-06T17:33:37Z)
- Self-Evaluation Improves Selective Generation in Large Language Models [54.003992911447696]
We reformulate open-ended generation tasks into token-level prediction tasks.
We instruct an LLM to self-evaluate its answers.
We benchmark a range of scoring methods based on self-evaluation.
arXiv Detail & Related papers (2023-12-14T19:09:22Z)
- Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.