Increasing Probability Mass on Answer Choices Does Not Always Improve
Accuracy
- URL: http://arxiv.org/abs/2305.14596v2
- Date: Tue, 31 Oct 2023 22:07:10 GMT
- Title: Increasing Probability Mass on Answer Choices Does Not Always Improve
Accuracy
- Authors: Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord, Peter Clark,
Ashish Sabharwal
- Abstract summary: Spreading probability mass across multiple surface forms with identical meaning is thought to cause an underestimation of a model's true performance.
We propose a mathematical formalism for SFC which allows us to quantify and bound its impact for the first time.
We identify a simple method for reducing it -- namely, increasing probability mass on the given answer choices by a) including them in the prompt and b) using in-context learning with even just one example.
- Score: 60.18632773935895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When pretrained language models (LMs) are applied to discriminative tasks
such as multiple-choice questions, they place probability mass on vocabulary
tokens that aren't among the given answer choices. Spreading probability mass
across multiple surface forms with identical meaning (such as "bath" and
"bathtub") is thought to cause an underestimation of a model's true
performance, referred to as the "surface form competition" (SFC) hypothesis.
This has motivated the introduction of various probability normalization
methods. However, many core questions remain unanswered. How do we measure SFC?
Are there direct ways of reducing it, and does doing so improve task
performance?
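The SFC hypothesis above can be illustrated with a toy example: when probability mass is split across synonymous surface forms, a distractor can win under per-string argmax even though the shared meaning carries the most total mass. All token names and probabilities below are invented for illustration.

```python
# Toy illustration of the surface form competition (SFC) hypothesis.
# Hypothetical next-token distribution from a language model:
probs = {
    "bath": 0.25,
    "bathtub": 0.20,   # same meaning as "bath"
    "shower": 0.30,    # a distractor surface form
    "kitchen": 0.10,
    "garden": 0.15,
}

# Per-string argmax: the distractor "shower" wins (0.30 is the largest
# single-string probability) ...
argmax_string = max(probs, key=probs.get)

# ... even though the *meaning* {bath, bathtub} carries the most mass.
synonyms = {"bath": "BATH", "bathtub": "BATH", "shower": "SHOWER",
            "kitchen": "KITCHEN", "garden": "GARDEN"}
meaning_mass = {}
for token, p in probs.items():
    meaning = synonyms[token]
    meaning_mass[meaning] = meaning_mass.get(meaning, 0.0) + p
argmax_meaning = max(meaning_mass, key=meaning_mass.get)

print(argmax_string)   # "shower" wins under string-level argmax
print(argmax_meaning)  # "BATH" wins once synonym mass is aggregated
```

Under this view, string-level scoring underestimates the model: the model "prefers" the bath meaning (0.45 total mass) but loses to the single surface form "shower".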
We propose a mathematical formalism for SFC which allows us to quantify and
bound its impact for the first time. We identify a simple method for reducing
it -- namely, increasing probability mass on the given answer choices by a)
including them in the prompt and b) using in-context learning with even just
one example. We show this method eliminates the impact of SFC in the majority
of instances. Our experiments on three diverse datasets and six LMs reveal
several additional surprising findings. For example, both normalization and
prompting methods for reducing SFC can be ineffective or even detrimental to
task performance for some LMs. We conclude with practical insights for
effectively prompting LMs for multiple-choice tasks.
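The quantity the abstract centers on, probability mass placed on the given answer choices versus mass that leaks to off-choice tokens, can be sketched as follows. This is a minimal sketch with a hypothetical distribution and hypothetical helper names, not the paper's actual formalism or implementation.

```python
# Sketch: total probability mass on the answer choices, and answer
# selection restricted to the choice set. Values are hypothetical.

def mass_on_choices(probs, choices):
    """Total probability the model assigns to the given answer choices."""
    return sum(probs.get(c, 0.0) for c in choices)

def pick_answer(probs, choices):
    """Argmax over the given choices only; equivalent to renormalizing
    the distribution onto the choice set before taking the argmax."""
    return max(choices, key=lambda c: probs.get(c, 0.0))

# Hypothetical next-token distribution; note that mass also leaks to
# off-choice tokens like "the" and "a".
probs = {"Paris": 0.40, "London": 0.15, "the": 0.25, "a": 0.20}
choices = ["Paris", "London", "Berlin"]

print(mass_on_choices(probs, choices))  # 0.55: 45% of mass is off-choice
print(pick_answer(probs, choices))      # "Paris"
```

The paper's proposed intervention, listing the choices in the prompt and adding an in-context example, aims to push `mass_on_choices` toward 1, so that scoring restricted to the choices and scoring over the full vocabulary agree.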
Related papers
- SimpleStrat: Diversifying Language Model Generation with Stratification [26.933029655072488]
Prior approaches rely on increasing temperature to increase diversity.
We show this approach produces lower quality individual generations as temperature increases.
We propose SimpleStrat, an alternative approach that uses the language model itself to partition the space into strata.
arXiv Detail & Related papers (2024-10-11T17:54:14Z)
- FSM: A Finite State Machine Based Zero-Shot Prompting Paradigm for Multi-Hop Question Answering [26.398873686905063]
Large Language Models (LLMs) with chain-of-thought (CoT) prompting have demonstrated impressive abilities on simple natural language inference tasks.
We propose a prompting method, Finite State Machine (FSM) to enhance the reasoning capabilities of LLM for complex tasks.
arXiv Detail & Related papers (2024-07-03T10:01:01Z)
- CASE: Commonsense-Augmented Score with an Expanded Answer Space [13.915710684653174]
We propose CASE, a Commonsense-Augmented Score with an Expanded Answer Space.
CASE addresses the limitation by assigning importance weights to individual words based on their semantic relations to other words in the input.
We then also follow prior work in expanding the answer space by generating lexically-divergent answers that are conceptually-similar to the choices.
arXiv Detail & Related papers (2023-11-03T03:15:26Z)
- Test-Time Self-Adaptive Small Language Models for Question Answering [63.91013329169796]
We show and investigate the capabilities of smaller self-adaptive LMs, only with unlabeled test data.
Our proposed self-adaption strategy demonstrates significant performance improvements on benchmark QA datasets.
arXiv Detail & Related papers (2023-10-20T06:49:32Z)
- Momentum Contrastive Pre-training for Question Answering [54.57078061878619]
MCROSS introduces a momentum contrastive learning framework to align the answer probability between cloze-like and natural query-passage sample pairs.
Our method achieves noticeable improvement compared with all baselines in both supervised and zero-shot scenarios.
arXiv Detail & Related papers (2022-12-12T08:28:22Z)
- Can Q-learning solve Multi Armed Bantids? [0.0]
We show that current reinforcement learning algorithms are not capable of solving Multi-Armed-Bandit problems.
This stems from variance differences between policies, which causes two problems.
We propose the Adaptive Symmetric Reward Noising (ASRN) method, which equalizes the reward variance across different policies.
arXiv Detail & Related papers (2021-10-21T07:08:30Z) - An Investigation of Replay-based Approaches for Continual Learning [79.0660895390689]
Continual learning (CL) is a major challenge of machine learning (ML) and describes the ability to learn several tasks sequentially without catastrophic forgetting (CF)
Several solution classes have been proposed, of which so-called replay-based approaches seem very promising due to their simplicity and robustness.
We empirically investigate replay-based approaches of continual learning and assess their potential for applications.
arXiv Detail & Related papers (2021-08-15T15:05:02Z)
- A Semantic-based Method for Unsupervised Commonsense Question Answering [40.18557352036813]
Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data.
We present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering.
arXiv Detail & Related papers (2021-05-31T08:21:52Z)
- MS-Ranker: Accumulating Evidence from Potentially Correct Candidates for Answer Selection [59.95429407899612]
We propose a novel reinforcement learning based multi-step ranking model, named MS-Ranker.
We explicitly consider the potential correctness of candidates and update the evidence with a gating mechanism.
Our model significantly outperforms existing methods that do not rely on external resources.
arXiv Detail & Related papers (2020-10-10T10:36:58Z)
- L2R2: Leveraging Ranking for Abductive Reasoning [65.40375542988416]
The abductive natural language inference task (αNLI) is proposed to evaluate the abductive reasoning ability of a learning system.
A novel L2R2 approach is proposed under the learning-to-rank framework.
Experiments on the ART dataset reach the state-of-the-art in the public leaderboard.
arXiv Detail & Related papers (2020-05-22T15:01:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.