Surface Form Competition: Why the Highest Probability Answer Isn't Always Right
- URL: http://arxiv.org/abs/2104.08315v1
- Date: Fri, 16 Apr 2021 18:57:19 GMT
- Title: Surface Form Competition: Why the Highest Probability Answer Isn't Always Right
- Authors: Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, Luke Zettlemoyer
- Abstract summary: Domain Conditional Pointwise Mutual Information compensates for surface form competition.
It achieves consistent gains in zero-shot performance over both calibrated and uncalibrated scoring functions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models have shown promising results in zero-shot settings
(Brown et al., 2020; Radford et al., 2019). For example, they can perform
multiple choice tasks simply by conditioning on a question and selecting the
answer with the highest probability.
However, ranking by string probability can be problematic due to surface form
competition, wherein different surface forms compete for probability mass, even
if they represent the same underlying concept, e.g. "computer" and "PC." Since
probability mass is finite, this lowers the probability of the correct answer,
due to competition from other strings that are valid answers (but not one of
the multiple choice options).
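To make the effect concrete, here is a toy illustration with made-up probabilities (not taken from the paper): the mass for one concept is split across its surface forms, so a wrong option with a single surface form can win under raw string probability.

```python
# Hypothetical next-token probabilities over candidate answer strings.
# "computer", "PC", and "machine" all name the correct concept, so its
# mass is split; "banana" is wrong but holds all of its mass in one string.
answer_probs = {
    "computer": 0.20,
    "PC": 0.15,
    "machine": 0.10,
    "banana": 0.25,
}

# Argmax over raw string probability picks the wrong answer.
best_string = max(answer_probs, key=answer_probs.get)

# Total mass on the correct concept exceeds the winner's mass.
concept_mass = answer_probs["computer"] + answer_probs["PC"] + answer_probs["machine"]

print(best_string)   # "banana"
print(concept_mass)  # 0.45
```

The correct concept holds 0.45 of the mass in total, yet no single surface form beats "banana" at 0.25.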
We introduce Domain Conditional Pointwise Mutual Information, an alternative
scoring function that directly compensates for surface form competition by
simply reweighting each option according to a term that is proportional to its a
priori likelihood within the context of the specific zero-shot task. It
achieves consistent gains in zero-shot performance over both calibrated (Zhao
et al., 2021) and uncalibrated scoring functions on all GPT-2 and GPT-3 models
over a variety of multiple choice datasets.
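The scoring rule can be sketched as a log-space ratio: score each option by log P(y | x) minus log P(y | x_domain), where x_domain is a short domain premise. The numbers below are made up to stand in for actual model queries; this is a minimal sketch, not the paper's implementation.

```python
import math

def dcpmi_score(log_p_answer_given_prompt, log_p_answer_given_domain):
    """Domain Conditional PMI: log P(y|x) - log P(y|x_domain).

    Both arguments are total log-probabilities of the answer string,
    conditioned on the full prompt and on a short domain premise
    (e.g. "The answer is:") respectively.
    """
    return log_p_answer_given_prompt - log_p_answer_given_domain

# Hypothetical log-probabilities. "PC" is a priori rarer than "computer",
# so raw probability under-ranks it; dividing by the domain prior compensates.
raw = {"computer": math.log(0.20), "PC": math.log(0.15), "banana": math.log(0.25)}
domain_prior = {"computer": math.log(0.10), "PC": math.log(0.02), "banana": math.log(0.20)}

scores = {y: dcpmi_score(raw[y], domain_prior[y]) for y in raw}
best = max(scores, key=scores.get)
print(best)  # "PC": low raw probability, but far above its domain prior
```

Under raw probability "banana" would win; under DCPMI the option that gains the most probability from the question itself wins.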
Related papers
- Differentiating Choices via Commonality for Multiple-Choice Question Answering [54.04315943420376]
In multiple-choice question answering, the other choices can provide valuable clues for identifying the right answer.
Existing models often rank each choice separately, overlooking the context provided by the other choices.
We propose a novel model, called DCQA, that differentiates choices by identifying and eliminating their commonality.
arXiv Detail & Related papers (2024-08-21T12:05:21Z) - Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy [60.18632773935895]
Spreading probability mass across multiple surface forms with identical meaning, known as surface form competition (SFC), is thought to cause an underestimation of a model's true performance.
We propose a mathematical formalism for SFC which allows us to quantify and bound its impact for the first time.
We identify a simple method for reducing it -- namely, increasing probability mass on the given answer choices by a) including them in the prompt and b) using in-context learning with even just one example.
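The prompt construction this summary describes can be sketched as follows. The exact prompt format and the helper names here are assumptions for illustration, not the paper's template: the point is simply that the answer choices appear verbatim in the prompt, optionally preceded by one worked example.

```python
def format_block(question, choices):
    """Render a question with lettered answer choices (hypothetical format)."""
    letters = "ABCDEFGH"
    opts = "\n".join(f"{letters[i]}. {c}" for i, c in enumerate(choices))
    return f"Question: {question}\n{opts}\nAnswer:"

def build_mc_prompt(question, choices, demo=None):
    """Build a multiple-choice prompt that lists the choices in the prompt,
    optionally prefixed with a single in-context example (one-shot)."""
    blocks = []
    if demo is not None:
        demo_question, demo_choices, demo_answer = demo
        blocks.append(format_block(demo_question, demo_choices) + " " + demo_answer)
    blocks.append(format_block(question, choices))
    return "\n\n".join(blocks)

prompt = build_mc_prompt(
    "What device do people use to browse the web?",
    ["computer", "banana", "cloud"],
    demo=("Which fruit is yellow?", ["banana", "rock", "car"], "A. banana"),
)
print(prompt)
```

Because the choices are in the context, the model is encouraged to concentrate probability mass on those exact strings rather than on unlisted paraphrases.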
arXiv Detail & Related papers (2023-05-24T00:27:00Z) - Reconciling Individual Probability Forecasts [78.0074061846588]
We show that two parties who agree on the data cannot disagree on how to model individual probabilities.
We conclude that although individual probabilities are unknowable, they are contestable via a computationally and data efficient process.
arXiv Detail & Related papers (2022-09-04T20:20:35Z) - Conflict-free joint sampling for preference satisfaction through quantum interference [0.0]
Two problems exist regarding the optimal joint decision-making method.
First, as the number of choices increases, the computational cost of calculating the optimal joint selection probability matrix explodes.
Second, to derive the optimal joint selection probability matrix, all players must disclose their probabilistic preferences.
arXiv Detail & Related papers (2022-08-05T10:38:17Z) - What Can Secondary Predictions Tell Us? An Exploration on Question-Answering with SQuAD-v2.0 [0.0]
We define the Golden Rank (GR) of an example as the rank of its most confident prediction that exactly matches a ground truth.
For the 16 transformer models we analyzed, the majority of exactly matched golden answers in secondary prediction space hover very close to the top rank.
We derive a new aggregate statistic over entire test sets, named the Golden Rank Interpolated Median (GRIM) that quantifies the proximity of failed predictions to the top choice made by the model.
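The two statistics can be sketched directly from their definitions above. This is a minimal interpretation, assuming ranked prediction lists and exact-match gold answers; a plain median over the failed examples' golden ranks stands in for the paper's interpolated median.

```python
def golden_rank(ranked_predictions, gold_answers):
    """Rank (0 = top) of the highest-ranked prediction that exactly
    matches a ground-truth answer; None if no prediction matches."""
    for rank, pred in enumerate(ranked_predictions):
        if pred in gold_answers:
            return rank
    return None

def median_of(values):
    """Median, averaging the two middle values for even-length input."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

# Hypothetical test set: (model's ranked predictions, gold answers) per example.
examples = [
    (["cat", "dog"], {"cat"}),          # top prediction correct: GR = 0
    (["dog", "cat", "cow"], {"cat"}),   # failed at top, gold at rank 1
    (["cow", "pig", "cat"], {"cat"}),   # failed at top, gold at rank 2
]

ranks = [golden_rank(preds, gold) for preds, gold in examples]
failed = [r for r in ranks if r is not None and r > 0]  # GRIM looks at failures
grim = median_of(failed)
print(grim)  # 1.5
```

A small GRIM means that even when the model's top choice is wrong, the correct answer tends to sit just below it in the ranking.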
arXiv Detail & Related papers (2022-06-29T01:17:47Z) - Feature Selection by a Mechanism Design [0.0]
We study the selection problem where the players are the candidates and the payoff function is a performance measurement.
In theory, an irrelevant feature is equivalent to a dummy player in the game, which contributes nothing to all modeling situations.
In our mechanism design, the end goal perfectly matches the expected model performance with the expected sum of individual marginal effects.
arXiv Detail & Related papers (2021-10-05T23:53:14Z) - A Semantic-based Method for Unsupervised Commonsense Question Answering [40.18557352036813]
Unsupervised commonsense question answering is appealing since it does not rely on any labeled task data.
We present a novel SEmantic-based Question Answering method (SEQA) for unsupervised commonsense question answering.
arXiv Detail & Related papers (2021-05-31T08:21:52Z) - Generative Context Pair Selection for Multi-hop Question Answering [60.74354009152721]
We propose a generative context selection model for multi-hop question answering.
Our proposed generative passage selection model performs better (4.9% higher than the baseline) on an adversarial held-out set.
arXiv Detail & Related papers (2021-04-18T07:00:48Z) - On Steady-State Evolutionary Algorithms and Selective Pressure: Why Inverse Rank-Based Allocation of Reproductive Trials is Best [9.290757451344673]
We analyse the impact of selective pressure on the global optimisation capabilities of steady-state EAs.
For the standard bimodal benchmark function TwoMax, we rigorously prove that uniform parent selection leads, with high probability, to exponential runtimes for locating both optima.
On the other hand, we prove that selecting the worst individual as parent leads to efficient global optimisation with overwhelming probability for reasonable population sizes.
arXiv Detail & Related papers (2021-03-18T17:27:05Z) - MS-Ranker: Accumulating Evidence from Potentially Correct Candidates for Answer Selection [59.95429407899612]
We propose a novel reinforcement learning based multi-step ranking model, named MS-Ranker.
We explicitly consider the potential correctness of candidates and update the evidence with a gating mechanism.
Our model significantly outperforms existing methods that do not rely on external resources.
arXiv Detail & Related papers (2020-10-10T10:36:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.