Towards Confident Machine Reading Comprehension
- URL: http://arxiv.org/abs/2101.07942v2
- Date: Wed, 24 Feb 2021 04:32:30 GMT
- Title: Towards Confident Machine Reading Comprehension
- Authors: Rishav Chakravarti, Avirup Sil
- Abstract summary: We propose a novel post-prediction confidence estimation model, which we call Mr.C (short for Mr. Confident).
Mr.C can be trained to improve a system's ability to refrain from making incorrect predictions, with improvements of up to 4 points as measured by Area Under the Curve (AUC) scores.
- Score: 7.989756186727329
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: There has been considerable progress on academic benchmarks for the Reading
Comprehension (RC) task with State-of-the-Art models closing the gap with human
performance on extractive question answering. Datasets such as SQuAD 2.0 & NQ
have also introduced an auxiliary task requiring models to predict when a
question has no answer in the text. However, in production settings, it is also
necessary to provide confidence estimates for the performance of the underlying
RC model at both answer extraction and "answerability" detection. We propose a
novel post-prediction confidence estimation model, which we call Mr.C (short
for Mr. Confident), that can be trained to improve a system's ability to
refrain from making incorrect predictions with improvements of up to 4 points
as measured by Area Under the Curve (AUC) scores. Mr.C can benefit from a novel
white-box feature that leverages the underlying RC model's gradients.
Performance prediction is particularly important in cases of domain shift (as
measured by training RC models on SQuAD 2.0 and evaluating on NQ), where Mr.C
improves not only AUC but also traditional answerability prediction (as
measured by a 5-point improvement in F1).
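The abstract gives enough to sketch the idea, though not the exact recipe: train a small binary classifier on features of each (question, prediction) pair, labeled by whether the base RC model was correct. The feature set below, including the gradient-norm stand-in for the white-box feature, is an illustrative assumption, not the paper's actual feature list.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Hypothetical per-prediction features. The abstract says Mr.C combines the
# RC model's outputs with a white-box gradient feature; the exact feature
# set here is assumed for illustration.
def build_features(start_logit, end_logit, null_score, grad_norm):
    # grad_norm: e.g. the L2 norm of the loss gradient w.r.t. the input
    # embeddings when back-propagating the model's own predicted answer.
    return [start_logit, end_logit, null_score, grad_norm]

# One row per (question, predicted answer); label 1 iff the base RC model's
# prediction was correct on held-out annotated data.
X = np.array([build_features(7.2, 6.9, -3.1, 0.8),   # confident, correct
              build_features(1.1, 0.4, 2.0, 5.3),    # shaky, incorrect
              build_features(5.0, 4.8, -1.0, 1.2),
              build_features(0.2, 0.1, 3.5, 6.0)])
y = np.array([1, 0, 1, 0])

mr_c = LogisticRegression().fit(X, y)      # post-prediction estimator
confidence = mr_c.predict_proba(X)[:, 1]   # P(prediction is correct)
print("AUC:", roc_auc_score(y, confidence))
```

Thresholding `confidence` is what lets a deployed system refrain from answering; sweeping the threshold traces out the curve whose area the paper's AUC metric summarizes.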
Related papers
- Uncertainty Quantification in Retrieval Augmented Question Answering [57.05827081638329]
We propose to quantify the uncertainty of a QA model via estimating the utility of the passages it is provided with.
We train a lightweight neural model to predict passage utility for a target QA model and show that while simple information-theoretic metrics can predict answer correctness to a certain extent, our approach efficiently approximates or outperforms more expensive sampling-based methods.
arXiv Detail & Related papers (2025-02-25T11:24:52Z)
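A minimal sketch of the passage-utility idea above, assuming utility is labeled by whether the target QA model answered correctly from that passage (the paper's exact formulation and architecture may differ):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# (question, passage) pairs labeled by whether the target QA model answered
# correctly when given the passage -- an assumed proxy for "utility".
pairs = [("who wrote Hamlet", "Hamlet is a tragedy by William Shakespeare."),
         ("who wrote Hamlet", "The Globe Theatre was rebuilt in 1997."),
         ("capital of France", "Paris is the capital of France."),
         ("capital of France", "France borders Spain and Italy.")]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(q + " [SEP] " + p for q, p in pairs)
utility = LogisticRegression().fit(X, labels)  # the lightweight predictor

# At answer time, low predicted utility for the retrieved passages signals
# high uncertainty in the QA model's output.
print(utility.predict_proba(X)[:, 1])
```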
- Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening.
Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training.
We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z)
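The verifier view above can be sketched as best-of-n selection where the model scores its own candidates. `generate` and `self_score` below are placeholders rather than the paper's method, and the paper applies this signal during post-training, not only at inference:

```python
import random

def generate(prompt: str) -> str:
    # Placeholder for sampling one response from the language model.
    return random.choice(["draft A", "draft B", "draft C"])

def self_score(prompt: str, response: str) -> float:
    # Placeholder for the model verifying its own output, e.g. the
    # probability it assigns to "yes" when asked if the answer is correct.
    return random.random()

def sharpen(prompt: str, n: int = 8) -> str:
    # Best-of-n with the model as its own verifier: the output distribution
    # is "sharpened" toward responses the model itself rates highly.
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda r: self_score(prompt, r))

print(sharpen("What is the capital of France?"))
```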
- Towards Robust Extractive Question Answering Models: Rethinking the Training Methodology [0.34530027457862006]
Previous research has shown that existing models, when trained on EQA datasets that include unanswerable questions, demonstrate a significant lack of robustness.
Our proposed training method includes a novel loss function for the EQA problem and challenges an implicit assumption present in numerous EQA datasets.
Our models exhibit significantly enhanced robustness against two types of adversarial attacks, with a performance decrease of only about a third compared to the default models.
arXiv Detail & Related papers (2024-09-29T20:35:57Z)
- RICA2: Rubric-Informed, Calibrated Assessment of Actions [8.641411594566714]
We present RICA2, a deep probabilistic model for action quality assessment (AQA) that integrates score rubrics and accounts for prediction uncertainty.
We demonstrate that our method establishes a new state of the art on public benchmarks, including FineDiving, MTL-AQA, and JIGSAWS, with superior performance in score prediction and uncertainty calibration.
arXiv Detail & Related papers (2024-08-04T20:35:33Z)
- Uncertainty-aware Language Modeling for Selective Question Answering [107.47864420630923]
We present an automatic large language model (LLM) conversion approach that produces uncertainty-aware LLMs.
Our approach is model- and data-agnostic, is computationally efficient, and does not rely on external models or systems.
arXiv Detail & Related papers (2023-11-26T22:47:54Z)
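Downstream, such uncertainty estimates enable selective question answering: answer only above a confidence threshold, otherwise abstain. A generic sketch with placeholder model calls (not this paper's API):

```python
def selective_answer(question, answer_fn, confidence_fn, threshold=0.7):
    # answer_fn / confidence_fn stand in for the uncertainty-aware LLM's
    # decoding and confidence-estimation calls (placeholders, not an API).
    answer = answer_fn(question)
    if confidence_fn(question, answer) >= threshold:
        return answer
    return None  # abstain: defer to a human or a fallback system

print(selective_answer("What is the capital of France?",
                       answer_fn=lambda q: "Paris",
                       confidence_fn=lambda q, a: 0.9))  # -> "Paris"
```

Raising `threshold` lowers coverage (fewer questions answered) in exchange for lower risk on the questions that are answered.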
- Toward Reliable Human Pose Forecasting with Uncertainty [51.628234388046195]
We develop an open-source library for human pose forecasting, including multiple models and supporting several datasets.
We devise two types of uncertainty in the problem to increase performance and convey better trust.
arXiv Detail & Related papers (2023-04-13T17:56:08Z)
- VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives [84.48039784446166]
We show that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason metrics.
Our best performing method, Visual Feature Importance Supervision (VisFIS), outperforms strong baselines on benchmark VQA datasets.
Predictions are more accurate when explanations are plausible and faithful, and not when they are plausible but not faithful.
arXiv Detail & Related papers (2022-06-22T17:02:01Z)
- Balancing Cost and Quality: An Exploration of Human-in-the-loop Frameworks for Automated Short Answer Scoring [36.58449231222223]
Short answer scoring (SAS) is the task of grading short text written by a learner.
We present the first study exploring the use of a human-in-the-loop framework for minimizing grading cost.
We find that our human-in-the-loop framework allows automatic scoring models and human graders to achieve the target scoring quality.
arXiv Detail & Related papers (2022-06-16T16:43:18Z)
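A toy sketch of the routing idea at the core of such frameworks, with illustrative thresholds and dummy scorers (not the paper's actual policy):

```python
def route(answers, auto_score, confidence, threshold=0.8):
    # Trust the automatic score above the threshold; otherwise pay the
    # cost of one human grading. Threshold and scorers are illustrative.
    machine, human_queue = [], []
    for ans in answers:
        if confidence(ans) >= threshold:
            machine.append((ans, auto_score(ans)))
        else:
            human_queue.append(ans)
    return machine, human_queue

machine, humans = route(["ans1", "ans2", "ans3"],
                        auto_score=lambda a: 3,  # dummy scoring model
                        confidence=lambda a: 0.4 if a == "ans2" else 0.9)
print(len(machine), "scored automatically;", len(humans), "sent to graders")
# Sweeping `threshold` traces the cost (human gradings) vs. quality curve.
```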
- Towards More Fine-grained and Reliable NLP Performance Prediction [85.78131503006193]
We make two contributions to improving performance prediction for NLP tasks.
First, we examine performance predictors for holistic measures of accuracy like F1 or BLEU.
Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration.
arXiv Detail & Related papers (2021-02-10T15:23:20Z)
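The calibration angle is easy to make concrete: a predictor is calibrated if, among predictions made with confidence c, roughly a fraction c are correct. Expected calibration error is one standard way to measure the gap (a common metric, not necessarily the paper's exact formulation):

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    # Bin predictions by confidence; ECE is the weighted mean gap between
    # each bin's average confidence and its empirical accuracy.
    confidences, correct = np.asarray(confidences), np.asarray(correct)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

print(expected_calibration_error([0.9, 0.8, 0.6, 0.95], [1, 1, 0, 1]))
```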
- RECONSIDER: Re-Ranking using Span-Focused Cross-Attention for Open Domain Question Answering [49.024513062811685]
We develop a simple and effective re-ranking approach (RECONSIDER) for span-extraction tasks.
RECONSIDER is trained on positive and negative examples extracted from high confidence predictions of MRC models.
It uses in-passage span annotations to perform span-focused re-ranking over a smaller candidate set.
arXiv Detail & Related papers (2020-10-21T04:28:42Z)
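The re-ranking step can be sketched generically: score the base MRC model's top-k candidate spans with a second model and keep the best. `rerank_score` below is a placeholder for the trained span-focused cross-attention re-ranker, not the paper's implementation:

```python
def reconsider(question, candidates, rerank_score):
    # candidates: the base MRC model's top-k (span, passage) predictions.
    # rerank_score: placeholder for the trained span-focused re-ranker,
    # which attends jointly over question, passage, and the marked span.
    return max(candidates, key=lambda c: rerank_score(question, *c))

top_k = [("Paris", "Paris is the capital of France."),
         ("France", "Paris is the capital of France.")]
best = reconsider("What is the capital of France?", top_k,
                  rerank_score=lambda q, span, ctx:
                      1.0 if span == "Paris" else 0.0)  # dummy scorer
print(best[0])  # -> Paris
```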
- Value-driven Hindsight Modelling [68.658900923595]
Value estimation is a critical component of the reinforcement learning (RL) paradigm.
Model learning can make use of the rich transition structure present in sequences of observations, but this approach is usually not sensitive to the reward function.
We develop an approach for representation learning in RL that sits in between these two extremes.
This provides tractable prediction targets that are directly relevant for a task, and can thus accelerate learning the value function.
arXiv Detail & Related papers (2020-02-19T18:10:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.