The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks
- URL: http://arxiv.org/abs/2509.09705v1
- Date: Fri, 05 Sep 2025 17:31:14 GMT
- Title: The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks
- Authors: Claudio Pinhanez, Paulo Cavalin, Cassia Sanctos, Marcelo Grave, Yago Primerano
- Abstract summary: We present a study on well-known, open-source LLMs responding to 10 repetitions of questions from the benchmarks MMLU-Redux and MedQA. Results show that the number of questions that can be answered consistently varies considerably among models. Results for medium-sized models indicate much higher levels of answer consistency.
- Score: 0.013048920509133805
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This work explores the consistency of small LLMs (2B-8B parameters) in answering the same question multiple times. We present a study on well-known, open-source LLMs responding to 10 repetitions of questions from the multiple-choice benchmarks MMLU-Redux and MedQA, considering different inference temperatures, small vs. medium (50B-80B) models, finetuned vs. base models, and other parameters. We also look into the effects of requiring multi-trial answer consistency on accuracy and the trade-offs involved in deciding which model best provides both. To support those studies, we propose new analytical and graphical tools. Results show that the number of questions that can be answered consistently varies considerably among models but is typically in the 50%-80% range for small models at low inference temperatures. Also, accuracy among consistent answers correlates reasonably well with overall accuracy. Results for medium-sized models indicate much higher levels of answer consistency.
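The repetition-trial consistency measure described in the abstract can be sketched in a few lines. This is a minimal illustration, assuming "consistent" means an identical answer in every trial; the function name and example data are hypothetical, not taken from the paper.

```python
def consistency_rate(answers_per_question):
    """Fraction of questions whose answer is identical across all
    repetition trials (one inner list of trial answers per question)."""
    consistent = [a for a in answers_per_question if len(set(a)) == 1]
    return len(consistent) / len(answers_per_question)

# 4 hypothetical questions, 10 trials each (multiple-choice answer letters)
trials = [
    ["A"] * 10,            # consistent across all trials
    ["B"] * 9 + ["C"],     # flips once
    ["D"] * 10,            # consistent across all trials
    ["A", "B"] * 5,        # alternates every trial
]
print(consistency_rate(trials))  # → 0.5
```

Accuracy restricted to the consistent subset can then be compared against overall accuracy, which is the correlation the abstract reports.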
Related papers
- ABCD: All Biases Come Disguised [4.603755953026689]
Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels. We show that this protocol substantially improves robustness to answer permutations, reducing mean accuracy variance $3\times$ with only a minimal decrease in mean model performance.
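One plausible reading of this label-replacement protocol is sketched below; the bullet-marker format and the `relabel` name are assumptions for illustration, not the paper's actual implementation.

```python
def relabel(question, options):
    """Render an MCQ with uniform, unordered bullet markers instead of
    ordered labels like A/B/C/D (a hypothetical label format)."""
    return "\n".join([question] + [f"* {opt}" for opt in options])

print(relabel("2 + 2 = ?", ["3", "4", "5"]))
# → 2 + 2 = ?
#   * 3
#   * 4
#   * 5
```

With unordered markers, a model's answer would be scored by matching the option text rather than a positional label, so permuting the options no longer changes what the model must output.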
arXiv Detail & Related papers (2026-02-19T15:12:33Z) - Uncertainty-Aware Answer Selection for Improved Reasoning in Multi-LLM Systems [55.6590601898194]
Large Language Models (LLMs) have demonstrated exceptional capabilities, yet selecting the most reliable response from multiple LLMs remains a challenge. Existing approaches often depend on costly external verifiers, human evaluators, or self-consistency techniques that require multiple samples from a single model. We propose a principled, novel, and computationally efficient method to select the best response from multiple different LLMs using a calibrated log-likelihood score.
arXiv Detail & Related papers (2025-09-30T01:25:19Z) - Reasoning Models are Test Exploiters: Rethinking Multiple-Choice [12.317748510370238]
Large Language Models (LLMs) are asked to choose among a fixed set of choices. Multiple-choice question answering (MCQA) is considered a good proxy for the downstream performance of models. This paper investigates the extent to which this trend continues to hold for state-of-the-art reasoning models.
arXiv Detail & Related papers (2025-07-21T07:49:32Z) - Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
Multiple-Choice Question Answering (MCQA) is widely used to evaluate Large Language Models (LLMs). We show that multiple factors can significantly impact the reported performance of LLMs. We analyze whether existing answer extraction methods are aligned with human judgment.
arXiv Detail & Related papers (2025-03-19T08:45:03Z) - Evaluating Binary Decision Biases in Large Language Models: Implications for Fair Agent-Based Financial Simulations [15.379345372327375]
Large Language Models (LLMs) are increasingly being used to simulate human-like decision making in agent-based financial market models. We test three state-of-the-art GPT models for bias using two model sampling approaches: one-shot and few-shot API queries.
arXiv Detail & Related papers (2025-01-20T10:36:51Z) - DiverseAgentEntropy: Quantifying Black-Box LLM Uncertainty through Diverse Perspectives and Multi-Agent Interaction [53.803276766404494]
Existing methods, which gauge a model's uncertainty through evaluating self-consistency in responses to the original query, do not always capture true uncertainty. We propose a novel method, DiverseAgentEntropy, for evaluating a model's uncertainty using multi-agent interaction. Our method offers a more accurate prediction of the model's reliability and further detects hallucinations, outperforming other self-consistency-based methods.
arXiv Detail & Related papers (2024-12-12T18:52:40Z) - Ranked from Within: Ranking Large Multimodal Models Without Labels [73.96543593298426]
We show that uncertainty scores derived from softmax distributions provide a robust basis for ranking models across various tasks. This facilitates the ranking of LMMs on unlabeled data, providing a practical approach for selecting models for diverse target domains without requiring manual annotation.
arXiv Detail & Related papers (2024-12-09T13:05:43Z) - Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models [16.252597615544317]
Video Language Models (VLMs) are designed to answer complex video-focused questions. Current benchmarks fail to capture the full reasoning capabilities of VLMs due to selection bias. This study is the first focused investigation of selection bias in video-to-text LLM-powered models.
arXiv Detail & Related papers (2024-10-18T07:52:22Z) - Uncertainty Estimation of Large Language Models in Medical Question Answering [60.72223137560633]
Large Language Models (LLMs) show promise for natural language generation in healthcare, but risk hallucinating factually incorrect information.
We benchmark popular uncertainty estimation (UE) methods with different model sizes on medical question-answering datasets.
Our results show that current approaches generally perform poorly in this domain, highlighting the challenge of UE for medical applications.
arXiv Detail & Related papers (2024-07-11T16:51:33Z) - Changing Answer Order Can Decrease MMLU Accuracy [18.774650080306944]
We investigate the robustness of the accuracy measurement on a widely used multiple choice question answering dataset, MMLU.
When shuffling the answer label contents, we find that all explored models decrease in accuracy on MMLU, but not every model is equally sensitive.
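The shuffling setup this paper describes can be sketched as permuting the contents under each answer label while tracking where the correct answer lands; the function name and example are hypothetical, for illustration only.

```python
import random

def shuffle_contents(options, correct_idx, rng):
    """Permute which content sits under each answer label, returning
    the shuffled options and the new index of the correct answer."""
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(correct_idx)

rng = random.Random(0)
opts = ["Paris", "Rome", "Berlin", "Madrid"]
shuffled, new_idx = shuffle_contents(opts, 0, rng)
print(shuffled, shuffled[new_idx])  # the correct content tracked to its new slot
```

A robustness check along these lines re-scores the model on each permutation; a model that relies on label position rather than content will lose accuracy under the shuffle.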
arXiv Detail & Related papers (2024-06-27T18:21:32Z) - Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment [82.60594940370919]
We propose the FlipFlop experiment to study the multi-turn behavior of Large Language Models (LLMs). We show that models flip their answers on average 46% of the time and that all models see a deterioration of accuracy between their first and final prediction, with an average drop of 17% (the FlipFlop effect). We conduct finetuning experiments on an open-source LLM and find that finetuning on synthetically created data can mitigate the effect, reducing performance deterioration by 60%, but cannot resolve sycophantic behavior entirely.
arXiv Detail & Related papers (2023-11-14T23:40:22Z) - Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM).
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs.
arXiv Detail & Related papers (2022-10-23T01:33:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.