Surface Form Competition: Why the Highest Probability Answer Isn't Always Right
- URL: http://arxiv.org/abs/2104.08315v1
- Date: Fri, 16 Apr 2021 18:57:19 GMT
- Title: Surface Form Competition: Why the Highest Probability Answer Isn't Always Right
- Authors: Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, Luke Zettlemoyer
- Abstract summary: Domain Conditional Pointwise Mutual Information compensates for surface form competition.
It achieves consistent gains in zero-shot performance over both calibrated and uncalibrated scoring functions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have shown promising results in zero-shot settings
(Brown et al., 2020; Radford et al., 2019). For example, they can perform
multiple choice tasks simply by conditioning on a question and selecting the
answer with the highest probability.
However, ranking by string probability can be problematic due to surface form
competition, wherein different surface forms compete for probability mass, even
if they represent the same underlying concept, e.g. "computer" and "PC." Since
probability mass is finite, this lowers the probability of the correct answer,
due to competition from other strings that are valid answers (but not one of
the multiple choice options).
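To make the effect concrete, here is a toy illustration with made-up probabilities (not taken from the paper): the mass for one concept is split across its surface forms, so a wrong option with a single surface form can win under raw string probability.

```python
# Hypothetical next-token probabilities over candidate answer strings.
# "computer", "PC", and "machine" all name the correct concept, so its
# mass is split; "banana" is wrong but holds all of its mass in one string.
answer_probs = {
    "computer": 0.20,
    "PC": 0.15,
    "machine": 0.10,
    "banana": 0.25,
}

# Argmax over raw string probability picks the wrong answer.
best_string = max(answer_probs, key=answer_probs.get)

# Total mass on the correct concept exceeds the winner's mass.
concept_mass = answer_probs["computer"] + answer_probs["PC"] + answer_probs["machine"]

print(best_string)   # "banana"
print(concept_mass)  # 0.45
```

The correct concept holds 0.45 of the mass in total, yet no single surface form beats "banana" at 0.25.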
We introduce Domain Conditional Pointwise Mutual Information, an alternative
scoring function that directly compensates for surface form competition by
simply reweighting each option according to a term that is proportional to its a
priori likelihood within the context of the specific zero-shot task. It
achieves consistent gains in zero-shot performance over both calibrated (Zhao
et al., 2021) and uncalibrated scoring functions on all GPT-2 and GPT-3 models
over a variety of multiple choice datasets.
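The scoring rule can be sketched as a log-space ratio: score each option by log P(y | x) minus log P(y | x_domain), where x_domain is a short domain premise. The numbers below are made up to stand in for actual model queries; this is a minimal sketch, not the paper's implementation.

```python
import math

def dcpmi_score(log_p_answer_given_prompt, log_p_answer_given_domain):
    """Domain Conditional PMI: log P(y|x) - log P(y|x_domain).

    Both arguments are total log-probabilities of the answer string,
    conditioned on the full prompt and on a short domain premise
    (e.g. "The answer is:") respectively.
    """
    return log_p_answer_given_prompt - log_p_answer_given_domain

# Hypothetical log-probabilities. "PC" is a priori rarer than "computer",
# so raw probability under-ranks it; dividing by the domain prior compensates.
raw = {"computer": math.log(0.20), "PC": math.log(0.15), "banana": math.log(0.25)}
domain_prior = {"computer": math.log(0.10), "PC": math.log(0.02), "banana": math.log(0.20)}

scores = {y: dcpmi_score(raw[y], domain_prior[y]) for y in raw}
best = max(scores, key=scores.get)
print(best)  # "PC": low raw probability, but far above its domain prior
```

Under raw probability "banana" would win; under DCPMI the option that gains the most probability from the question itself wins.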
Related papers
- Differentiating Choices via Commonality for Multiple-Choice Question Answering [54.04315943420376]
In multiple-choice question answering, the other choices can provide valuable clues for identifying the right answer.
Existing models often rank each choice separately, overlooking the context provided by the other choices.
We propose a novel model, called DCQA, that differentiates choices by identifying and eliminating their commonality.
arXiv Detail & Related papers (2024-08-21T12:05:21Z) - Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy [60.18632773935895]
Spreading probability mass across multiple surface forms with identical meaning, known as surface form competition (SFC), is thought to cause an underestimation of a model's true performance.
We propose a mathematical formalism for SFC which allows us to quantify and bound its impact for the first time.
We identify a simple method for reducing it -- namely, increasing probability mass on the given answer choices by a) including them in the prompt and b) using in-context learning with even just one example.
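The prompt construction this summary describes can be sketched as follows. The exact prompt format and the helper names here are assumptions for illustration, not the paper's template: the point is simply that the answer choices appear verbatim in the prompt, optionally preceded by one worked example.

```python
def format_block(question, choices):
    """Render a question with lettered answer choices (hypothetical format)."""
    letters = "ABCDEFGH"
    opts = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"Question: {question}\n{opts}\nAnswer:"

def build_mc_prompt(question, choices, demo=None):
    """Build a multiple-choice prompt that lists the choices in the prompt,
    optionally prefixed with a single in-context example (one-shot)."""
    blocks = []
    if demo is not None:
        demo_question, demo_choices, demo_answer = demo
        blocks.append(format_block(demo_question, demo_choices) + " " + demo_answer)
    blocks.append(format_block(question, choices))
    return "\n\n".join(blocks)

prompt = build_mc_prompt(
    "What device do people use to browse the web?",
    ["computer", "banana", "cloud"],
    demo=("Which fruit is yellow?", ["banana", "rock", "car"], "A. banana"),
)
print(prompt)
```

Because the choices are in the context, the model is encouraged to concentrate probability mass on those exact strings rather than on unlisted paraphrases.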
arXiv Detail & Related papers (2023-05-24T00:27:00Z) - Reconciling Individual Probability Forecasts [78.0074061846588]
We show that two parties who agree on the data cannot disagree on how to model individual probabilities.
We conclude that although individual probabilities are unknowable, they are contestable via a computationally and data efficient process.
arXiv Detail & Related papers (2022-09-04T20:20:35Z) - Conflict-free joint sampling for preference satisfaction through quantum interference [0.0]
Two problems exist regarding the optimal joint decision-making method.
First, as the number of choices increases, the computational cost of calculating the optimal joint selection probability matrix explodes.
Second, to derive the optimal joint selection probability matrix, all players must disclose their probabilistic preferences.
arXiv Detail & Related papers (2022-08-05T10:38:17Z) - What Can Secondary Predictions Tell Us? An Exploration on Question-Answering with SQuAD-v2.0 [0.0]
We define the Golden Rank (GR) of an example as the rank of its most confident prediction that exactly matches a ground truth.
For the 16 transformer models we analyzed, the majority of exactly matched golden answers in secondary prediction space hover very close to the top rank.
We derive a new aggregate statistic over entire test sets, named the Golden Rank Interpolated Median (GRIM) that quantifies the proximity of failed predictions to the top choice made by the model.
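The two statistics can be sketched directly from their definitions above. This is a minimal interpretation, assuming ranked prediction lists and exact-match gold answers; a plain median over the failed examples' golden ranks stands in for the paper's interpolated median.

```python
def golden_rank(ranked_predictions, gold_answers):
    """Rank (0 = top) of the highest-ranked prediction that exactly
    matches a ground-truth answer; None if no prediction matches."""
    for rank, pred in enumerate(ranked_predictions):
        if pred in gold_answers:
            return rank
    return None

def median_of(values):
    """Median, averaging the two middle values for even-length input."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

# Hypothetical test set: (model's ranked predictions, gold answers) per example.
examples = [
    (["cat", "dog"], {"cat"}),          # top prediction correct: GR = 0
    (["dog", "cat", "cow"], {"cat"}),   # failed at top, gold at rank 1
    (["cow", "pig", "cat"], {"cat"}),   # failed at top, gold at rank 2
]

ranks = [golden_rank(preds, gold) for preds, gold in examples]
failed = [r for r in ranks if r is not None and r > 0]  # GRIM looks at failures
grim = median_of(failed)
print(grim)  # 1.5
```

A small GRIM means that even when the model's top choice is wrong, the correct answer tends to sit just below it in the ranking.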
arXiv Detail & Related papers (2022-06-29T01:17:47Z) - Feature Selection by a Mechanism Design [0.0]
We study the selection problem where the players are the candidates and the payoff function is a performance measurement.
In theory, an irrelevant feature is equivalent to a dummy player in the game, which contributes nothing to all modeling situations.
In our mechanism design, the end goal perfectly matches the expected model performance with the expected sum of individual marginal effects.
arXiv Detail & Related papers (2021-10-05T23:53:14Z) - A Semantic-based Method for Unsupervised Commonsense Question Answering [40.18557352036813]
Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data.
We present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering.
arXiv Detail & Related papers (2021-05-31T08:21:52Z) - Generative Context Pair Selection for Multi-hop Question Answering [60.74354009152721]
We propose a generative context selection model for multi-hop question answering.
Our proposed generative passage selection model performs better (4.9% higher than the baseline) on an adversarial held-out set.
arXiv Detail & Related papers (2021-04-18T07:00:48Z) - On Steady-State Evolutionary Algorithms and Selective Pressure: Why Inverse Rank-Based Allocation of Reproductive Trials is Best [9.290757451344673]
We analyse the impact of selective pressure on the global optimisation capabilities of steady-state EAs.
For the standard bimodal benchmark function TwoMax, we rigorously prove that uniform parent selection leads, with high probability, to exponential runtimes for locating both optima.
On the other hand, we prove that selecting the worst individual as parent leads to efficient global optimisation with overwhelming probability for reasonable population sizes.
arXiv Detail & Related papers (2021-03-18T17:27:05Z) - MS-Ranker: Accumulating Evidence from Potentially Correct Candidates for Answer Selection [59.95429407899612]
We propose a novel reinforcement learning based multi-step ranking model, named MS-Ranker.
We explicitly consider the potential correctness of candidates and update the evidence with a gating mechanism.
Our model significantly outperforms existing methods that do not rely on external resources.
arXiv Detail & Related papers (2020-10-10T10:36:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.