What Can Secondary Predictions Tell Us? An Exploration on
Question-Answering with SQuAD-v2.0
- URL: http://arxiv.org/abs/2206.14348v1
- Date: Wed, 29 Jun 2022 01:17:47 GMT
- Title: What Can Secondary Predictions Tell Us? An Exploration on
Question-Answering with SQuAD-v2.0
- Authors: Michael Kamfonas and Gabriel Alon
- Abstract summary: We define the Golden Rank (GR) of an example as the rank of its most confident prediction that exactly matches a ground truth.
For the 16 transformer models we analyzed, the majority of exactly matched golden answers in secondary prediction space hover very close to the top rank.
We derive a new aggregate statistic over entire test sets, named the Golden Rank Interpolated Median (GRIM), which quantifies the proximity of failed predictions to the top choice made by the model.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Performance in natural language processing, and specifically for the
question-answer task, is typically measured by comparing a model's most
confident (primary) prediction to golden answers (the ground truth). We are
making the case that it is also useful to quantify how close a model came to
predicting a correct answer even for examples that failed. We define the Golden
Rank (GR) of an example as the rank of its most confident prediction that
exactly matches a ground truth, and show why such a match always exists. For
the 16 transformer models we analyzed, the majority of exactly matched golden
answers in secondary prediction space hover very close to the top rank. We
refer to secondary predictions as those with rank greater than 0, i.e., any
prediction below the most confident one when candidates are ordered by
descending confidence. We demonstrate how the GR can be used to classify
questions and visualize their spectrum of difficulty, from persistent near
successes to persistent extreme failures. We derive a new aggregate statistic
over entire test sets, named the Golden Rank Interpolated Median (GRIM), which
quantifies the proximity of failed predictions to the top choice made by the
model. To develop some intuition and explore the applicability of these metrics
we use the Stanford Question Answering Dataset (SQuAD-2) and a few popular
transformer models from the Hugging Face hub. We first demonstrate that the
GRIM is not directly correlated with the F1 and exact match (EM) scores. We
then calculate and visualize these scores for various transformer
architectures, probe their applicability in error analysis by clustering failed
predictions, and compare how they relate to other training diagnostics such as
the EM and F1 scores. We finally suggest various research goals, such as
broadening data collection for these metrics and their possible use in
adversarial training.
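To make the two metrics concrete, the sketch below (a minimal illustration, not the authors' code) computes the Golden Rank for a single example and the GRIM over a test set. It assumes we already have each example's n-best answer strings sorted by descending confidence (for instance, from a Hugging Face extractive-QA pipeline) together with its golden answers; the simplified exact-match normalization and the textbook interpolated-median formula are assumptions and may differ from the paper's exact definitions.

    # A minimal sketch (not the authors' code) of the Golden Rank (GR) and the
    # Golden Rank Interpolated Median (GRIM) described in the abstract.
    import re
    import string
    from typing import Sequence

    import numpy as np


    def _normalize(text: str) -> str:
        # Simplified SQuAD-style exact-match normalization: lower-case,
        # drop punctuation, collapse whitespace.
        text = "".join(ch for ch in text.lower() if ch not in set(string.punctuation))
        return re.sub(r"\s+", " ", text).strip()


    def golden_rank(ranked_predictions: Sequence[str], golden_answers: Sequence[str]) -> int:
        # Rank (0 = most confident) of the first prediction that exactly matches
        # any golden answer. Rank 0 means the primary prediction was correct;
        # ranks > 0 fall in the secondary prediction space.
        golden = {_normalize(a) for a in golden_answers}
        for rank, pred in enumerate(ranked_predictions):
            if _normalize(pred) in golden:
                return rank
        # The paper argues a match always exists when the candidate list covers
        # every answer span (plus the null answer); a truncated n-best list may not.
        raise ValueError("no exact match in the (possibly truncated) n-best list")


    def grim(golden_ranks: Sequence[int]) -> float:
        # GRIM approximated as a textbook interpolated median over the discrete
        # integer ranks (class width 1); the paper's exact interpolation may differ.
        vals = np.asarray(golden_ranks, dtype=float)
        n = len(vals)
        med = float(np.median(vals))
        ties = int(np.sum(vals == med))
        if ties == 0:
            # Median falls between two distinct ranks; nothing to interpolate.
            return med
        below = int(np.sum(vals < med))
        return (med - 0.5) + (n / 2.0 - below) / ties


    # Toy usage: two primary successes (rank 0) and three near misses at
    # secondary ranks 1, 1 and 3.
    print(golden_rank(["in 1947", "1947", "the 1940s"], ["1947"]))  # -> 1
    print(grim([0, 0, 1, 1, 3]))                                    # -> 0.75

With real models, the ranked prediction list would come from the n-best output of a SQuAD-2 question-answering head, and the resulting GR distribution over a test set could be bucketed to visualize the spectrum of difficulty described above.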
Related papers
- Correct after Answer: Enhancing Multi-Span Question Answering with Post-Processing Method [11.794628063040108]
Multi-Span Question Answering (MSQA) requires models to extract one or multiple answer spans from a given context to answer a question.
We propose the Answering-Classifying-Correcting (ACC) framework, which employs a post-processing strategy to handle incorrect predictions.
arXiv Detail & Related papers (2024-10-22T08:04:32Z)
- When to Accept Automated Predictions and When to Defer to Human Judgment? [1.9922905420195367]
We analyze how the outputs of a trained neural network change using clustering to measure distances between outputs and class centroids.
We propose this distance as a metric to evaluate the confidence of predictions under distribution shifts.
arXiv Detail & Related papers (2024-07-10T16:45:52Z)
- Unsupervised Dense Retrieval with Relevance-Aware Contrastive Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method can not only beat BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
arXiv Detail & Related papers (2023-06-05T18:20:27Z)
- Realistic Conversational Question Answering with Answer Selection based on Calibrated Confidence and Uncertainty Measurement [54.55643652781891]
Conversational Question Answering (ConvQA) models aim at answering a question with its relevant paragraph and the previous question-answer pairs that occurred during the conversation.
We propose to filter out inaccurate answers in the conversation history based on their estimated confidences and uncertainties from the ConvQA model.
We validate our models, Answer Selection-based realistic Conversation Question Answering, on two standard ConvQA datasets.
arXiv Detail & Related papers (2023-02-10T09:42:07Z)
- VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives [84.48039784446166]
We show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason metrics.
Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets.
Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful.
arXiv Detail & Related papers (2022-06-22T17:02:01Z)
- Social-Implicit: Rethinking Trajectory Prediction Evaluation and The Effectiveness of Implicit Maximum Likelihood Estimation [21.643073517681973]
Average Mahalanobis Distance (AMD) is a metric that quantifies how close the whole set of generated samples is to the ground truth.
Average Maximum Eigenvalue (AMV) is a metric that quantifies the overall spread of the predictions.
We introduce the usage of Implicit Maximum Likelihood Estimation (IMLE) as a replacement for traditional generative models to train our model, Social-Implicit.
arXiv Detail & Related papers (2022-03-06T21:28:40Z)
- COM2SENSE: A Commonsense Reasoning Benchmark with Complementary Sentences [21.11065466376105]
Commonsense reasoning is intuitive for humans but has been a long-term challenge for artificial intelligence (AI).
Recent advancements in pretrained language models have shown promising results on several commonsense benchmark datasets.
We introduce a new commonsense reasoning benchmark dataset comprising natural language true/false statements.
arXiv Detail & Related papers (2021-06-02T06:31:55Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
- Ambiguity in Sequential Data: Predicting Uncertain Futures with Recurrent Models [110.82452096672182]
We propose an extension of the Multiple Hypothesis Prediction (MHP) model to handle ambiguous predictions with sequential data.
We also introduce a novel metric for ambiguous problems, which is better suited to account for uncertainties.
arXiv Detail & Related papers (2020-03-10T09:15:42Z)
- Meta-Learned Confidence for Few-shot Learning [60.6086305523402]
A popular transductive inference technique for few-shot metric-based approaches is to update the prototype of each class with the mean of the most confident query examples.
We propose to meta-learn the confidence for each query sample, to assign optimal weights to unlabeled queries.
We validate our few-shot learning model with meta-learned confidence on four benchmark datasets.
arXiv Detail & Related papers (2020-02-27T10:22:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.