What Can Secondary Predictions Tell Us? An Exploration on
Question-Answering with SQuAD-v2.0
- URL: http://arxiv.org/abs/2206.14348v1
- Date: Wed, 29 Jun 2022 01:17:47 GMT
- Title: What Can Secondary Predictions Tell Us? An Exploration on
Question-Answering with SQuAD-v2.0
- Authors: Michael Kamfonas and Gabriel Alon
- Abstract summary: We define the Golden Rank (GR) of an example as the rank of its most confident prediction that exactly matches a ground truth.
For the 16 transformer models we analyzed, the majority of exactly matched golden answers in secondary prediction space hover very close to the top rank.
We derive a new aggregate statistic over entire test sets, named the Golden Rank Interpolated Median (GRIM), which quantifies the proximity of failed predictions to the top choice made by the model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Performance in natural language processing, and specifically for the
question-answer task, is typically measured by comparing a model's most
confident (primary) prediction to golden answers (the ground truth). We are
making the case that it is also useful to quantify how close a model came to
predicting a correct answer even for examples that failed. We define the Golden
Rank (GR) of an example as the rank of its most confident prediction that
exactly matches a ground truth, and show why such a match always exists. For
the 16 transformer models we analyzed, the majority of exactly matched golden
answers in secondary prediction space hover very close to the top rank. We
refer to secondary predictions as those with rank greater than 0, i.e., any
prediction below the most confident one when candidates are ordered by
descending confidence. We demonstrate how the GR can be used to classify
questions and visualize their spectrum of difficulty, from persistent near
successes to persistent extreme failures. We derive a new aggregate statistic
over entire test sets, named the Golden Rank Interpolated Median (GRIM), which
quantifies the proximity of failed predictions to the top choice made by the
model. To develop some intuition and explore the applicability of these metrics
we use the Stanford Question Answering Dataset (SQuAD-2) and a few popular
transformer models from the Hugging Face hub. We first demonstrate that the
GRIM is not directly correlated with the F1 and exact match (EM) scores. We
then calculate and visualize these scores for various transformer
architectures, probe their applicability in error analysis by clustering failed
predictions, and compare how they relate to other training diagnostics such as
the EM and F1 scores. We finally suggest various research goals, such as
broadening data collection for these metrics and their possible use in
adversarial training.
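To make the two metrics concrete, the sketch below (a minimal illustration, not the authors' code) computes the Golden Rank for a single example and the GRIM over a test set. It assumes we already have each example's n-best answer strings sorted by descending confidence (for instance, from a Hugging Face extractive-QA pipeline) together with its golden answers; the simplified exact-match normalization and the textbook interpolated-median formula are assumptions and may differ from the paper's exact definitions.

    # A minimal sketch (not the authors' code) of the Golden Rank (GR) and the
    # Golden Rank Interpolated Median (GRIM) described in the abstract.
    import re
    import string
    from typing import Sequence

    import numpy as np


    def _normalize(text: str) -> str:
        # Simplified SQuAD-style exact-match normalization: lower-case,
        # drop punctuation, collapse whitespace.
        text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
        return re.sub(r"\s+", " ", text).strip()


    def golden_rank(ranked_predictions: Sequence[str], golden_answers: Sequence[str]) -> int:
        # Rank (0 = most confident) of the first prediction that exactly matches
        # any golden answer. Rank 0 means the primary prediction was correct;
        # ranks > 0 fall in the secondary prediction space.
        golden = {_normalize(a) for a in golden_answers}
        for rank, pred in enumerate(ranked_predictions):
            if _normalize(pred) in golden:
                return rank
        # The paper argues a match always exists when the candidate list covers
        # every answer span (plus the null answer); a truncated n-best list may not.
        raise ValueError("no exact match in the (possibly truncated) n-best list")


    def grim(golden_ranks: Sequence[int]) -> float:
        # GRIM approximated as a textbook interpolated median over the discrete
        # integer ranks (class width 1); the paper's exact interpolation may differ.
        vals = np.asarray(golden_ranks, dtype=float)
        n = len(vals)
        med = float(np.median(vals))
        ties = int(np.sum(vals == med))
        if ties == 0:
            # Median falls between two distinct ranks; nothing to interpolate.
            return med
        below = int(np.sum(vals < med))
        return (med - 0.5) + (n / 2.0 - below) / ties


    # Toy usage: two primary successes (rank 0) and three near misses at
    # secondary ranks 1, 1 and 3.
    print(golden_rank(["in 1947", "1947", "the 1940s"], ["1947"]))  # -> 1
    print(grim([0, 0, 1, 1, 3]))                                    # -> 0.75

With real models, the ranked prediction list would come from the n-best output of a SQuAD-2 question-answering head, and the resulting GR distribution over a test set could be bucketed to visualize the spectrum of difficulty described above.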
Related papers
- Correct after Answer: Enhancing Multi-Span Question Answering with Post-Processing Method [11.794628063040108]
Multi-Span Question Answering (MSQA) requires models to extract one or multiple answer spans from a given context to answer a question.
We propose the Answering-Classifying-Correcting (ACC) framework, which employs a post-processing strategy to handle incorrect predictions.
arXiv Detail & Related papers (2024-10-22T08:04:32Z)
- When to Accept Automated Predictions and When to Defer to Human Judgment? [1.9922905420195367]
We analyze how the outputs of a trained neural network change using clustering to measure distances between outputs and class centroids.
We propose this distance as a metric to evaluate the confidence of predictions under distribution shifts.
arXiv Detail & Related papers (2024-07-10T16:45:52Z)
- Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method can not only beat BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z)
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim at answering a question with its relevant paragraph and the previous question-answer pairs that occurred during the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our models, Answer Selection-based realistic Conversation Question Answering, on two standard ConvQA datasets.
arXiv Detail & Related papers (2023-02-10T09:42:07Z)
- VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives [84.48039784446166]
We show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason metrics.
Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets.
Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful.
arXiv Detail & Related papers (2022-06-22T17:02:01Z)
- Social-Implicit: Rethinking Trajectory Prediction Evaluation and The Effectiveness of Implicit Maximum Likelihood Estimation [21.643073517681973]
Average Mahalanobis Distance (AMD) is a metric that quantifies how close the whole set of generated samples is to the ground truth.
Average Maximum Eigenvalue (AMV) is a metric that quantifies the overall spread of the predictions.
We introduce the usage of Implicit Maximum Likelihood Estimation (IMLE) as a replacement for traditional generative models to train our model, Social-Implicit.
arXiv Detail & Related papers (2022-03-06T21:28:40Z)
- COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences [21.11065466376105]
Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI).
Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets.
We introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements.
arXiv Detail & Related papers (2021-06-02T06:31:55Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
- Ambiguity in Sequential Data: Predicting Uncertain Futures with Recurrent Models [110.82452096672182]
We propose an extension of the Multiple Hypothesis Prediction (MHP) model to handle ambiguous predictions with sequential data.
We also introduce a novel metric for ambiguous problems, which is better suited to account for uncertainties.
arXiv Detail & Related papers (2020-03-10T09:15:42Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.