Increasing Probability Mass on Answer Choices Does Not Always Improve
Accuracy
- URL: http://arxiv.org/abs/2305.14596v2
- Date: Tue, 31 Oct 2023 22:07:10 GMT
- Title: Increasing Probability Mass on Answer Choices Does Not Always Improve
Accuracy
- Authors: Sarah Wiegreffe, Matthew Finlayson, Oyvind Tafjord, Peter Clark,
Ashish Sabharwal
- Abstract summary: Spreading probability mass across multiple surface forms with identical meaning is thought to cause an underestimation of a model's true performance.
We propose a mathematical formalism for SFC which allows us to quantify and bound its impact for the first time.
We identify a simple method for reducing it -- namely, increasing probability mass on the given answer choices by a) including them in the prompt and b) using in-context learning with even just one example.
- Score: 60.18632773935895
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When pretrained language models (LMs) are applied to discriminative tasks
such as multiple-choice questions, they place probability mass on vocabulary
tokens that aren't among the given answer choices. Spreading probability mass
across multiple surface forms with identical meaning (such as "bath" and
"bathtub") is thought to cause an underestimation of a model's true
performance, referred to as the "surface form competition" (SFC) hypothesis.
This has motivated the introduction of various probability normalization
methods. However, many core questions remain unanswered. How do we measure SFC?
Are there direct ways of reducing it, and does doing so improve task
performance?
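The SFC hypothesis above can be illustrated with a toy example: when probability mass is split across synonymous surface forms, a distractor can win under per-string argmax even though the shared meaning carries the most total mass. All token names and probabilities below are invented for illustration.

```python
# Toy illustration of the surface form competition (SFC) hypothesis.
# Hypothetical next-token distribution from a language model:
probs = {
    "bath": 0.25,
    "bathtub": 0.20,   # same meaning as "bath"
    "shower": 0.30,    # a distractor surface form
    "kitchen": 0.10,
    "garden": 0.15,
}

# Per-string argmax: the distractor "shower" wins (0.30 is the largest
# single-string probability) ...
argmax_string = max(probs, key=probs.get)

# ... even though the *meaning* {bath, bathtub} carries the most mass.
synonyms = {"bath": "BATH", "bathtub": "BATH", "shower": "SHOWER",
            "kitchen": "KITCHEN", "garden": "GARDEN"}
meaning_mass = {}
for token, p in probs.items():
    meaning = synonyms[token]
    meaning_mass[meaning] = meaning_mass.get(meaning, 0.0) + p
argmax_meaning = max(meaning_mass, key=meaning_mass.get)

print(argmax_string)   # "shower" wins under string-level argmax
print(argmax_meaning)  # "BATH" wins once synonym mass is aggregated
```

Under this view, string-level scoring underestimates the model: the model "prefers" the bath meaning (0.45 total mass) but loses to the single surface form "shower".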
We propose a mathematical formalism for SFC which allows us to quantify and
bound its impact for the first time. We identify a simple method for reducing
it -- namely, increasing probability mass on the given answer choices by a)
including them in the prompt and b) using in-context learning with even just
one example. We show this method eliminates the impact of SFC in the majority
of instances. Our experiments on three diverse datasets and six LMs reveal
several additional surprising findings. For example, both normalization and
prompting methods for reducing SFC can be ineffective or even detrimental to
task performance for some LMs. We conclude with practical insights for
effectively prompting LMs for multiple-choice tasks.
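The quantity the abstract centers on, probability mass placed on the given answer choices versus mass that leaks to off-choice tokens, can be sketched as follows. This is a minimal sketch with a hypothetical distribution and hypothetical helper names, not the paper's actual formalism or implementation.

```python
# Sketch: total probability mass on the answer choices, and answer
# selection restricted to the choice set. Values are hypothetical.

def mass_on_choices(probs, choices):
    """Total probability the model assigns to the given answer choices."""
    return sum(probs.get(c, 0.0) for c in choices)

def pick_answer(probs, choices):
    """Argmax over the given choices only; equivalent to renormalizing
    the distribution onto the choice set before taking the argmax."""
    return max(choices, key=lambda c: probs.get(c, 0.0))

# Hypothetical next-token distribution; note that mass also leaks to
# off-choice tokens like "the" and "a".
probs = {"Paris": 0.40, "London": 0.15, "the": 0.25, "a": 0.20}
choices = ["Paris", "London", "Berlin"]

print(mass_on_choices(probs, choices))  # 0.55: 45% of mass is off-choice
print(pick_answer(probs, choices))      # "Paris"
```

The paper's proposed intervention, listing the choices in the prompt and adding an in-context example, aims to push `mass_on_choices` toward 1, so that scoring restricted to the choices and scoring over the full vocabulary agree.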
Related papers
- SimpleStrat: Diversifying Language Model Generation with Stratification [26.933029655072488]
Prior approaches rely on increasing temperature to increase diversity.
We show this approach produces lower quality individual generations as temperature increases.
We propose SimpleStrat, an alternative approach that uses the language model itself to partition the space into strata.
arXiv Detail & Related papers (2024-10-11T17:54:14Z)
- FSM: A Finite State Machine Based Zero-Shot Prompting Paradigm for Multi-Hop Question Answering [26.398873686905063]
Large Language Models (LLMs) with chain-of-thought (CoT) prompting have demonstrated impressive abilities on simple natural language inference tasks.
We propose a prompting method, Finite State Machine (FSM) to enhance the reasoning capabilities of LLM for complex tasks.
arXiv Detail & Related papers (2024-07-03T10:01:01Z)
- CASE: Commonsense-Augmented Score with an Expanded Answer Space [13.915710684653174]
We propose CASE, a Commonsense-Augmented Score with an Expanded Answer Space.
CASE addresses the limitation by assigning importance weights to individual words based on their semantic relations to other words in the input.
We then also follow prior work in expanding the answer space by generating lexically-divergent answers that are conceptually-similar to the choices.
arXiv Detail & Related papers (2023-11-03T03:15:26Z)
- Test-Time Self-Adaptive Small Language Models for Question Answering [63.91013329169796]
We show and investigate the capabilities of smaller self-adaptive LMs, only with unlabeled test data.
Our proposed self-adaption strategy demonstrates significant performance improvements on benchmark QA datasets.
arXiv Detail & Related papers (2023-10-20T06:49:32Z)
- Momentum Contrastive Pre-training for Question Answering [54.57078061878619]
MCROSS introduces a momentum contrastive learning framework to align the answer probability between cloze-like and natural query-passage sample pairs.
Our method achieves noticeable improvement compared with all baselines in both supervised and zero-shot scenarios.
arXiv Detail & Related papers (2022-12-12T08:28:22Z)
- Can Q-learning solve Multi Armed Bantids? [0.0]
We show that current reinforcement learning algorithms are not capable of solving Multi-Armed-Bandit problems.
This stems from variance differences between policies, which causes two problems.
We propose the Adaptive Symmetric Reward Noising (ASRN) method, which equalizes the reward variance across different policies.
arXiv Detail & Related papers (2021-10-21T07:08:30Z) - An Investigation of Replay-based Approaches for Continual Learning [79.0660895390689]
Continual learning (CL) is a major challenge of machine learning (ML) and describes the ability to learn several tasks sequentially without catastrophic forgetting (CF)
Several solution classes have been proposed, of which so-called replay-based approaches seem very promising due to their simplicity and robustness.
We empirically investigate replay-based approaches of continual learning and assess their potential for applications.
arXiv Detail & Related papers (2021-08-15T15:05:02Z)
- A Semantic-based Method for Unsupervised Commonsense Question Answering [40.18557352036813]
Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data.
We present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering.
arXiv Detail & Related papers (2021-05-31T08:21:52Z)
- MS-Ranker: Accumulating Evidence from Potentially Correct Candidates for Answer Selection [59.95429407899612]
We propose a novel reinforcement learning based multi-step ranking model, named MS-Ranker.
We explicitly consider the potential correctness of candidates and update the evidence with a gating mechanism.
Our model significantly outperforms existing methods that do not rely on external resources.
arXiv Detail & Related papers (2020-10-10T10:36:58Z)
- L2R2: Leveraging Ranking for Abductive Reasoning [65.40375542988416]
The abductive natural language inference task (αNLI) is proposed to evaluate the abductive reasoning ability of a learning system.
A novel L2R2 approach is proposed under the learning-to-rank framework.
Experiments on the ART dataset reach the state-of-the-art in the public leaderboard.
arXiv Detail & Related papers (2020-05-22T15:01:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.