F1 is Not Enough! Models and Evaluation Towards User-Centered
Explainable Question Answering
- URL: http://arxiv.org/abs/2010.06283v1
- Date: Tue, 13 Oct 2020 10:53:20 GMT
- Title: F1 is Not Enough! Models and Evaluation Towards User-Centered
Explainable Question Answering
- Authors: Hendrik Schuff, Heike Adel, Ngoc Thang Vu
- Abstract summary: We show that current models and evaluation settings have shortcomings regarding the coupling of answer and explanation.
We propose a hierarchical model and a new regularization term to strengthen the answer-explanation coupling.
Our scores are better aligned with user experience, making them promising candidates for model selection.
- Score: 30.95495958937006
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explainable question answering systems predict an answer together with an
explanation showing why the answer has been selected. The goal is to enable
users to assess the correctness of the system and understand its reasoning
process. However, we show that current models and evaluation settings have
shortcomings regarding the coupling of answer and explanation which might cause
serious issues in user experience. As a remedy, we propose a hierarchical model
and a new regularization term to strengthen the answer-explanation coupling as
well as two evaluation scores to quantify the coupling. We conduct experiments
on the HOTPOTQA benchmark data set and perform a user study. The user study
shows that our models increase the ability of the users to judge the
correctness of the system and that scores like F1 are not enough to estimate
the usefulness of a model in a practical setting with human users. Our scores
are better aligned with user experience, making them promising candidates for
model selection.
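To make the contrast concrete, here is a minimal sketch of why answer F1 alone can hide a broken answer-explanation coupling. The coupling_rate check below is an illustrative assumption made for this listing, not one of the two scores actually proposed in the paper.

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Standard token-level F1 between predicted and gold answer strings."""
    pred, ref = prediction.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def coupling_rate(predictions) -> float:
    """Hypothetical coupling check: fraction of predictions whose answer string
    occurs in at least one predicted explanation sentence."""
    coupled = sum(
        any(p["answer"].lower() in s.lower() for s in p["explanation"])
        for p in predictions
    )
    return coupled / len(predictions)

# A system can score high answer F1 while its explanations never support the answer,
# which is exactly the user-experience problem the paper targets.
preds = [
    {"answer": "Paris", "explanation": ["The Eiffel Tower stands in Paris."]},
    {"answer": "1889", "explanation": ["It was built for a world's fair."]},
]
print(coupling_rate(preds))  # 0.5
```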
Related papers
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversation has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z)
- Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution about the usefulness of saliency-based explanations and their potential for misunderstanding.
arXiv Detail & Related papers (2023-12-10T23:13:23Z)
- What if you said that differently?: How Explanation Formats Affect Human Feedback Efficacy and User Perception [53.4840989321394]
We analyze the effect of rationales that QA models generate to support their answers.
We present users with incorrect answers and corresponding rationales in various formats.
We measure the effectiveness of this feedback in patching these rationales through in-context learning.
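One possible way to operationalize that last step is to place the user's feedback on the faulty rationale directly into an in-context prompt; the helper below is a hypothetical illustration of such a setup, not the paper's exact protocol.

```python
def build_repair_prompt(question: str, wrong_answer: str,
                        rationale: str, feedback: str) -> str:
    """Hypothetical in-context-learning prompt: the user's feedback on a faulty
    rationale is shown to the model so it can revise both rationale and answer."""
    return (
        f"Question: {question}\n"
        f"Previous answer: {wrong_answer}\n"
        f"Previous rationale: {rationale}\n"
        f"User feedback: {feedback}\n"
        "Revised rationale and answer:"
    )
```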
arXiv Detail & Related papers (2023-11-16T04:26:32Z)
- Evaluating Correctness and Faithfulness of Instruction-Following Models for Question Answering [26.34649731975005]
Retriever-augmented instruction-following models are attractive alternatives to fine-tuned approaches for question answering (QA).
While the model responses tend to be natural and fluent, the additional verbosity makes traditional QA evaluation metrics unreliable for accurately quantifying model performance.
We use automatic and human evaluation to assess these models along two dimensions: 1) how well they satisfy the user's information need (correctness) and 2) whether they produce a response based on the provided knowledge (faithfulness).
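The two dimensions can be illustrated with a simple lexical proxy; the snippet below sketches only the faithfulness side (correctness would compare the response against gold answers instead), and knowledge_overlap is an assumed helper, not the paper's metric.

```python
def knowledge_overlap(response: str, passages: list[str]) -> float:
    """Rough faithfulness proxy: share of response tokens that also appear in the
    provided knowledge passages."""
    resp_tokens = response.lower().split()
    knowledge = set(" ".join(passages).lower().split())
    if not resp_tokens:
        return 0.0
    return sum(t in knowledge for t in resp_tokens) / len(resp_tokens)
```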
arXiv Detail & Related papers (2023-07-31T17:41:00Z)
- Improving Selective Visual Question Answering by Learning from Your Peers [74.20167944693424]
Visual Question Answering (VQA) models can have difficulties abstaining from answering when they are wrong.
We propose the Learning from Your Peers (LYP) approach for training multimodal selection functions that make abstention decisions.
Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model.
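A deliberately simplified sketch of that idea follows, assuming a single peer model trained on a disjoint split and a one-dimensional confidence feature; LYP's actual selection function is a learned multimodal model rather than a threshold.

```python
import numpy as np

def peer_targets(peer_predictions, gold_answers):
    """LYP-style targets (sketch): whether a peer model trained on a different
    data split answered each example correctly."""
    return np.array([int(p == g) for p, g in zip(peer_predictions, gold_answers)])

def fit_confidence_threshold(confidences, targets):
    """Toy selection function: the confidence threshold that best separates
    examples the peer got right from those it missed; at test time the model
    abstains whenever its confidence falls below this threshold."""
    best_t, best_acc = 0.0, 0.0
    for t in np.linspace(0.0, 1.0, 101):
        acc = np.mean((confidences >= t) == targets.astype(bool))
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t
```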
arXiv Detail & Related papers (2023-06-14T21:22:01Z)
- What Else Do I Need to Know? The Effect of Background Information on Users' Reliance on QA Systems [23.69129423040988]
We study how users interact with QA systems in the absence of sufficient information to assess their predictions.
Our study reveals that users rely on model predictions even when they lack the information needed to assess the model's correctness.
arXiv Detail & Related papers (2023-05-23T17:57:12Z)
- Towards Teachable Reasoning Systems [29.59387051046722]
We develop a teachable reasoning system for question answering (QA).
Our approach is three-fold: First, generated chains of reasoning show how answers are implied by the system's own internal beliefs.
Second, users can interact with the explanations to identify erroneous model beliefs and provide corrections.
Third, we augment the model with a dynamic memory of such corrections.
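The third component can be pictured as a small key-value store consulted before the system re-asserts a belief; the class below is an assumed minimal interface, not the paper's implementation.

```python
class CorrectionMemory:
    """Minimal sketch of a dynamic memory of user corrections, keyed by the
    erroneous belief they correct."""

    def __init__(self):
        self._corrections: dict[str, str] = {}

    def add(self, belief: str, correction: str) -> None:
        self._corrections[belief.lower()] = correction

    def lookup(self, belief: str):
        """Return a stored correction for this belief, if the user has given one."""
        return self._corrections.get(belief.lower())

memory = CorrectionMemory()
memory.add("Whales are fish", "Whales are mammals")
print(memory.lookup("whales are fish"))  # "Whales are mammals"
```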
arXiv Detail & Related papers (2022-04-27T17:15:07Z)
- Using Interactive Feedback to Improve the Accuracy and Explainability of Question Answering Systems Post-Deployment [20.601284299825895]
We focus on two kinds of improvements: 1) improving the QA system's performance itself, and 2) providing the model with the ability to explain the correctness or incorrectness of an answer.
We collect a retrieval-based QA dataset, FeedbackQA, which contains interactive feedback from users.
We show that feedback data not only improves the accuracy of the deployed QA system but also that of stronger non-deployed systems.
arXiv Detail & Related papers (2022-04-06T18:17:09Z)
- A New Score for Adaptive Tests in Bayesian and Credal Networks [64.80185026979883]
A test is adaptive when its sequence and number of questions are dynamically tuned on the basis of the estimated skills of the test taker.
We present an alternative family of scores, based on the mode of the posterior probabilities, and hence easier to explain.
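In the simplest discrete case, such a score just reports the most probable skill level; the function below is a toy illustration under that assumption, not the scoring rule from the paper.

```python
def posterior_mode_score(posterior: dict[str, float]) -> str:
    """Toy mode-based score: report the most probable skill level instead of an
    expectation, which is arguably easier to explain to the test taker."""
    return max(posterior, key=posterior.get)

print(posterior_mode_score({"novice": 0.2, "intermediate": 0.5, "expert": 0.3}))
# -> "intermediate"
```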
arXiv Detail & Related papers (2021-05-25T20:35:42Z)
- MS-Ranker: Accumulating Evidence from Potentially Correct Candidates for Answer Selection [59.95429407899612]
We propose a novel reinforcement-learning-based multi-step ranking model named MS-Ranker.
We explicitly consider the potential correctness of candidates and update the evidence with a gating mechanism.
Our model significantly outperforms existing methods that do not rely on external resources.
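One way to picture the gating mechanism is a running evidence vector that is updated more strongly by candidates judged likely to be correct; the update below is an illustrative sketch, not MS-Ranker's actual equations.

```python
import numpy as np

def update_evidence(evidence, candidate_vec, correctness_score):
    """Gated evidence update (sketch): a sigmoid gate derived from the candidate's
    estimated correctness decides how much it overwrites the running evidence."""
    gate = 1.0 / (1.0 + np.exp(-correctness_score))
    return gate * candidate_vec + (1.0 - gate) * evidence

evidence = np.zeros(4)
for vec, score in [(np.ones(4), 2.0), (np.full(4, 0.5), -1.0)]:
    evidence = update_evidence(evidence, vec, score)
```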
arXiv Detail & Related papers (2020-10-10T10:36:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.