Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think
- URL: http://arxiv.org/abs/2404.08382v2
- Date: Tue, 20 Aug 2024 08:07:49 GMT
- Title: Look at the Text: Instruction-Tuned Language Models are More Robust Multiple Choice Selectors than You Think
- Authors: Xinpeng Wang, Chengzhi Hu, Bolei Ma, Paul Röttger, Barbara Plank
- Abstract summary: We show that the text answers are more robust to question perturbations than the first token probabilities.
Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.
- Score: 27.595110330513567
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple choice questions (MCQs) are commonly used to evaluate the capabilities of large language models (LLMs). One common way to evaluate the model response is to rank the candidate answers based on the log probability of the first token prediction. An alternative way is to examine the text output. Prior work has shown that first token probabilities lack robustness to changes in MCQ phrasing, and that first token probabilities do not match text answers for instruction-tuned models. Therefore, in this paper, we investigate the robustness of text answers. We show that the text answers are more robust to question perturbations than the first token probabilities, when the first token answers mismatch the text answers. The difference in robustness increases as the mismatch rate becomes greater. As the mismatch reaches over 50%, the text answer is more robust to option order changes than the debiased first token probabilities using state-of-the-art debiasing methods such as PriDe. Our findings provide further evidence for the benefits of text answer evaluation over first token probability evaluation.
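The two evaluation styles contrasted in the abstract, and the mismatch rate between them, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the function names are hypothetical, the regular expression is a naive stand-in for whatever text-answer extraction the authors actually use, and the log probabilities and responses are made-up toy data.

```python
import re
from typing import Dict, List, Optional, Sequence


def first_token_answer(logprobs: Dict[str, float]) -> str:
    """Rank option labels by first-token log probability and return the top one."""
    return max(logprobs, key=logprobs.get)


def text_answer(response: str, labels: Sequence[str] = ("A", "B", "C", "D")) -> Optional[str]:
    """Naively pull the first standalone option label out of a free-text response.

    Hypothetical heuristic: it would misread an article such as "A" at the
    start of a sentence, so it is only suitable for toy inputs.
    """
    match = re.search(r"\b([" + "".join(labels) + r"])\b", response)
    return match.group(1) if match else None


def mismatch_rate(first_token: List[Optional[str]], text: List[Optional[str]]) -> float:
    """Fraction of questions where the two evaluation styles pick different options."""
    pairs = list(zip(first_token, text))
    return sum(a != b for a, b in pairs) / len(pairs)


# Toy data (invented): per-question first-token log probabilities and text outputs.
toy_logprobs = [
    {"A": -0.2, "B": -1.9, "C": -2.5, "D": -3.0},
    {"A": -0.4, "B": -1.5, "C": -2.0, "D": -2.2},
]
toy_responses = ["Sure! The answer is C.", "The correct option is A."]

ft_answers = [first_token_answer(lp) for lp in toy_logprobs]
txt_answers = [text_answer(r) for r in toy_responses]
rate = mismatch_rate(ft_answers, txt_answers)
print(ft_answers, txt_answers, rate)
```

In the first toy question the response opens with "Sure", so the most likely first token need not be the option the text finally selects; this is the kind of disagreement the mismatch rate counts.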
Related papers
- Bayesian inference to improve quality of Retrieval Augmented Generation [0.21756081703276]
Retrieval Augmented Generation or RAG is the most popular pattern for modern Large Language Model or LLM applications.
Bayes' theorem relates the conditional probability of a hypothesis given evidence to its prior probability.
We propose that estimating the likelihood that a text chunk yields a quality answer, and combining it with the prior probability of the chunk's quality, can improve the overall quality of responses from RAG systems.
arXiv Detail & Related papers (2024-08-12T08:54:32Z)
- "My Answer is C": First-Token Probabilities Do Not Match Text Answers in Instruction-Tuned Language Models [40.867655189493924]
The open-ended nature of language generation makes evaluation of large language models (LLMs) challenging.
One common evaluation approach uses multiple-choice questions (MCQ) to limit the response space.
We evaluate how aligned first-token evaluation is with the text output along several dimensions.
arXiv Detail & Related papers (2024-02-22T12:47:33Z)
- Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
- Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference [72.61732440246954]
Large pre-trained language models often lack logical consistency across test inputs.
We propose a framework, ConCoRD, for boosting the consistency and accuracy of pre-trained NLP models.
We show that ConCoRD consistently boosts accuracy and consistency of off-the-shelf closed-book QA and VQA models.
arXiv Detail & Related papers (2022-11-21T21:58:30Z)
- More Than Words: Towards Better Quality Interpretations of Text Classifiers [16.66535643383862]
We show that token-based interpretability, while being a convenient first choice given the input interfaces of the ML models, is not the most effective one in all situations.
We show that higher-level feature attributions offer several advantages: 1) they are more robust as measured by the randomization tests, 2) they lead to lower variability when using approximation-based methods like SHAP, and 3) they are more intelligible to humans in situations where the linguistic coherence resides at a higher level.
arXiv Detail & Related papers (2021-12-23T10:18:50Z)
- Search Methods for Sufficient, Socially-Aligned Feature Importance Explanations with In-Distribution Counterfactuals [72.00815192668193]
Feature importance (FI) estimates are a popular form of explanation, and they are commonly created and evaluated by computing the change in model confidence caused by removing certain input features at test time.
We study several under-explored dimensions of FI-based explanations, providing conceptual and empirical improvements for this form of explanation.
arXiv Detail & Related papers (2021-06-01T20:36:48Z)
- A Semantic-based Method for Unsupervised Commonsense Question Answering [40.18557352036813]
Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data.
We present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering.
arXiv Detail & Related papers (2021-05-31T08:21:52Z)
- Double Perturbation: On the Robustness of Robustness and Counterfactual Bias Evaluation [109.06060143938052]
We propose a "double perturbation" framework to uncover model weaknesses beyond the test dataset.
We apply this framework to study two perturbation-based approaches that are used to analyze models' robustness and counterfactual bias in English.
arXiv Detail & Related papers (2021-04-12T06:57:36Z)
- Selective Question Answering under Domain Shift [90.021577320085]
Abstention policies based solely on the model's softmax probabilities fare poorly, since models are overconfident on out-of-domain inputs.
We train a calibrator to identify inputs on which the QA model errs, and abstain when it predicts an error is likely.
Our method answers 56% of questions while maintaining 80% accuracy; in contrast, directly using the model's probabilities only answers 48% at 80% accuracy.
arXiv Detail & Related papers (2020-06-16T19:13:21Z)
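
The "answers 56% of questions while maintaining 80% accuracy" figure in the last entry above is a coverage-at-accuracy measurement. A minimal sketch of how such a number could be computed is given below, assuming each question has a confidence score and a correctness flag; the function name and the toy values are hypothetical and are not taken from that paper.

```python
from typing import List


def coverage_at_accuracy(confidences: List[float], correct: List[bool], target: float = 0.8) -> float:
    """Largest fraction of questions that can be answered, taking the most
    confident first, while accuracy on the answered ones stays >= target."""
    # Answer in order of decreasing confidence; abstain on everything after the cut.
    ranked = sorted(zip(confidences, correct), key=lambda pair: pair[0], reverse=True)
    best = 0.0
    hits = 0
    for k, (_, is_correct) in enumerate(ranked, start=1):
        hits += int(is_correct)
        if hits / k >= target:
            best = k / len(ranked)
    return best


# Toy data (invented): the two least confident answers are wrong.
toy_confidences = [0.95, 0.90, 0.80, 0.70, 0.60]
toy_correct = [True, True, True, False, False]
coverage = coverage_at_accuracy(toy_confidences, toy_correct, target=0.8)
print(coverage)
```

Swapping raw softmax probabilities for a calibrator's scores changes only the `confidences` input, which is why the two approaches can be compared at the same accuracy target.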