"My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
- URL: http://arxiv.org/abs/2402.14499v2
- Date: Thu, 4 Jul 2024 12:51:29 GMT
- Title: "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models
- Authors: Xinpeng Wang, Bolei Ma, Chengzhi Hu, Leon Weber-Genzel, Paul Röttger, Frauke Kreuter, Dirk Hovy, Barbara Plank
- Abstract summary: The open-ended nature of language generation makes the evaluation of large language models (LLMs) challenging.
One common evaluation approach uses multiple-choice questions (MCQ) to limit the response space.
We evaluate how aligned first-token evaluation is with the text output along several dimensions.
- Score: 40.867655189493924
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The open-ended nature of language generation makes the evaluation of autoregressive large language models (LLMs) challenging. One common evaluation approach uses multiple-choice questions (MCQ) to limit the response space. The model is then evaluated by ranking the candidate answers by the log probability of the first token prediction. However, first tokens may not consistently reflect the final response output, due to models' diverse response styles, such as starting with "Sure" or refusing to answer. Consequently, MCQ evaluation is not indicative of model behaviour when interacting with users. But by how much? We evaluate how aligned first-token evaluation is with the text output along several dimensions, namely final option choice, refusal rate, choice distribution and robustness under prompt perturbation. Our results show that the two approaches are severely misaligned on all dimensions, reaching mismatch rates over 60%. Models heavily fine-tuned on conversational or safety data are especially impacted. Crucially, models remain misaligned even when we increasingly constrain prompts, i.e., force them to start with an option letter or example template. Our findings i) underscore the importance of inspecting the text output as well and ii) caution against relying solely on first-token evaluation.
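The contrast between the two evaluation styles is easy to make concrete. Below is a minimal sketch using Hugging Face transformers; the model name, prompt, and regex answer parser are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of the two MCQ evaluation styles compared in the abstract.
# Model name, prompt template, and answer parser are illustrative assumptions.
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any instruction-tuned LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = (
    "Question: Which planet is closest to the Sun?\n"
    "A. Venus\nB. Mercury\nC. Earth\nD. Mars\n"
    "Answer:"
)
options = ["A", "B", "C", "D"]
inputs = tokenizer(prompt, return_tensors="pt")

# First-token evaluation: rank options by the log probability the model
# assigns to each option letter as the very next token.
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]
log_probs = next_token_logits.log_softmax(dim=-1)
# Note: whether a letter tokenizes with or without a leading space is a
# known pitfall of this evaluation style.
option_ids = [tokenizer.encode(o, add_special_tokens=False)[0] for o in options]
first_token_choice = options[int(torch.argmax(log_probs[option_ids]))]

# Text-answer evaluation: generate a full response and parse the letter out
# of it, e.g. "My answer is C" -> "C". Refusals parse to None.
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=False)
text = tokenizer.decode(output_ids[0, inputs["input_ids"].shape[1]:],
                        skip_special_tokens=True)
match = re.search(r"\b([ABCD])\b", text)
text_choice = match.group(1) if match else None

# The paper's finding: these two choices frequently disagree.
print(first_token_choice, text_choice)
```

Note that the text-answer side depends on a parser, and refusals or free-form openings such as "Sure, ..." are exactly where the two readings diverge.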
Related papers
- Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations [85.81295563405433]
Language model users often issue underspecified queries, where the context in which a query was issued is not made explicit.
We present contextualized evaluations, a protocol that synthetically constructs context surrounding an under-specified query and provides it during evaluation.
We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping win rates between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts.
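As a rough illustration of the protocol (a sketch under assumptions, not the paper's implementation), one might synthesize the missing context with one LLM call and hand it to a pairwise evaluator in a second; the `complete` helper and both prompts below are hypothetical.

```python
# Rough sketch: synthesize context for an underspecified query, then
# provide it to a pairwise evaluator. The `complete` helper and both
# prompts are hypothetical, not the paper's.
def complete(prompt: str) -> str:
    """Hypothetical LLM completion call (e.g. an API client wrapper)."""
    raise NotImplementedError

def contextualized_eval(query: str, response_a: str, response_b: str) -> str:
    # 1) Synthetically construct context the query leaves unstated.
    context = complete(
        f"Write plausible clarifying questions and answers for this "
        f"underspecified query:\n{query}"
    )
    # 2) Provide that context during evaluation of the two responses.
    return complete(
        f"Context:\n{context}\n\nQuery: {query}\n"
        f"Response A: {response_a}\nResponse B: {response_b}\n"
        "Given the context, which response serves the user better? Answer A or B."
    )
```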
arXiv Detail & Related papers (2024-11-11T18:58:38Z)
- Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think [27.595110330513567]
We show that the text answers are more robust to question perturbations than the first token probabilities.
Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.
arXiv Detail & Related papers (2024-04-12T10:36:15Z)
- What if you said that differently?: How Explanation Formats Affect Human Feedback Efficacy and User Perception [53.4840989321394]
We analyze the effect of rationales generated by QA models to support their answers.
We present users with incorrect answers and corresponding rationales in various formats.
We measure the effectiveness of this feedback in patching these rationales through in-context learning.
arXiv Detail & Related papers (2023-11-16T04:26:32Z)
- Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models [32.843361525236965]
We analyze the effect of sparse feedback on the alignment and evaluation of large language models.
We find that preferences from ratings and rankings significantly disagree, with a 60% disagreement rate for both human and AI annotators.
Our findings shed light on critical gaps in methods for evaluating the real-world utility of language models.
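To make the comparison concrete, here is a minimal sketch of measuring such disagreement, assuming each item carries two independent ratings plus a ranking-based preference; the data layout is an invented illustration, not the paper's annotation schema.

```python
# Minimal sketch: compare ratings-derived and rankings-derived preferences.
# The data layout is an illustrative assumption, not the paper's schema.
def ratings_preference(score_a: float, score_b: float) -> str:
    # Convert two independently assigned ratings into a pairwise preference.
    if score_a > score_b:
        return "a"
    if score_b > score_a:
        return "b"
    return "tie"

def disagreement_rate(items) -> float:
    # items: list of (score_a, score_b, ranking_pref), ranking_pref in {"a","b","tie"}
    conflicts = sum(
        1 for score_a, score_b, ranking_pref in items
        if ratings_preference(score_a, score_b) != ranking_pref
    )
    return conflicts / len(items)

# Hypothetical annotations: the same pairs judged by rating and by ranking.
print(disagreement_rate([(4, 3, "a"), (5, 5, "b"), (2, 4, "a")]))  # -> 0.666...
```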
arXiv Detail & Related papers (2023-08-30T07:35:32Z)
- I'm Afraid I Can't Do That: Predicting Prompt Refusal in Black-Box Generative Language Models [0.0]
We characterize ChatGPT's refusal behavior using a black-box attack.
We map several different kinds of responses to a binary of compliance or refusal.
We train a prompt classifier to predict whether ChatGPT will refuse a question, without seeing ChatGPT's response.
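A hedged sketch of that idea, with a hypothetical two-example dataset and an off-the-shelf scikit-learn pipeline standing in for whatever classifier the paper actually trains:

```python
# Minimal sketch: predict refusal from the prompt alone, before querying
# the model. The training data and the TF-IDF + logistic regression
# pipeline are illustrative assumptions, not the paper's setup.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labels: 1 if the black-box model refused the prompt.
prompts = ["How do I bake sourdough bread?", "How do I hotwire a car?"]
refused = [0, 1]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(prompts, refused)

# Predict refusal for a new prompt without ever seeing the model's response.
print(clf.predict(["How do I pick a lock?"]))
```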
arXiv Detail & Related papers (2023-06-06T05:50:58Z)
- Large Language Models are not Fair Evaluators [60.27164804083752]
We find that the quality ranking of candidate responses can be easily hacked by altering their order of appearance in the context.
This manipulation allows us to skew the evaluation result, making one model appear considerably superior to the other.
We propose a framework with three simple yet effective strategies to mitigate this issue.
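One mitigation in this spirit (a sketch of the general idea, not necessarily the paper's three strategies) is to query the evaluator under both candidate orderings and accept a verdict only when it survives the swap; `call_judge` below is a hypothetical wrapper around the evaluator LLM.

```python
# Sketch: accept an evaluator verdict only if it is stable under swapping
# the order in which the two candidate answers are presented.
def call_judge(first: str, second: str) -> str:
    """Hypothetical evaluator-LLM wrapper; returns '1' or '2' for the
    better of the two answers as presented."""
    raise NotImplementedError

def position_robust_verdict(answer_a: str, answer_b: str) -> str:
    verdict_ab = call_judge(answer_a, answer_b)    # answer_a shown first
    verdict_ba = call_judge(answer_b, answer_a)    # order swapped
    verdict_ba = {"1": "2", "2": "1"}[verdict_ba]  # map back to original labels
    if verdict_ab == verdict_ba:
        return {"1": "answer_a", "2": "answer_b"}[verdict_ab]
    return "tie"  # the verdict flipped with position, so trust neither
```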
arXiv Detail & Related papers (2023-05-29T07:41:03Z)
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim to answer a question using its relevant paragraph and the question-answer pairs from earlier turns in the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our model, Answer Selection-based realistic Conversation Question Answering, on two standard ConvQA datasets.
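A minimal sketch of the filtering step, assuming each history entry carries a calibrated confidence and an uncertainty score; the field layout and thresholds are illustrative, not the paper's values.

```python
# Minimal sketch: keep only history answers whose calibrated confidence is
# high and whose uncertainty is low. Field layout and thresholds are
# illustrative assumptions, not the paper's values.
from typing import List, Tuple

HistoryItem = Tuple[str, str, float, float]  # (question, answer, confidence, uncertainty)

def filter_history(history: List[HistoryItem],
                   min_confidence: float = 0.7,
                   max_uncertainty: float = 0.3) -> List[Tuple[str, str]]:
    return [(q, a) for q, a, conf, unc in history
            if conf >= min_confidence and unc <= max_uncertainty]

# Example: the second answer is dropped as likely inaccurate.
history = [("Q1?", "A1", 0.9, 0.1), ("Q2?", "A2", 0.4, 0.6)]
print(filter_history(history))  # -> [('Q1?', 'A1')]
```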
arXiv Detail & Related papers (2023-02-10T09:42:07Z)
- A Semantic-based Method for Unsupervised Commonsense Question Answering [40.18557352036813]
Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data.
We present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering.
arXiv Detail & Related papers (2021-05-31T08:21:52Z)
- Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)