Human Evaluation of Spoken vs. Visual Explanations for Open-Domain QA
- URL: http://arxiv.org/abs/2012.15075v1
- Date: Wed, 30 Dec 2020 08:19:02 GMT
- Title: Human Evaluation of Spoken vs. Visual Explanations for Open-Domain QA
- Authors: Ana Valeria Gonzalez, Gagan Bansal, Angela Fan, Robin Jia, Yashar
Mehdad and Srinivasan Iyer
- Abstract summary: We study whether explanations help users correctly decide when to accept or reject an ODQA system's answer.
Our results show that explanations derived from retrieved evidence passages can outperform strong baselines (calibrated confidence) across modalities.
We show common failure cases of current explanations, emphasize end-to-end evaluation of explanations, and caution against evaluating them in proxy modalities that are different from deployment.
- Score: 22.76153284711981
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: While research on explaining predictions of open-domain QA (ODQA) systems to users is gaining momentum, most works have failed to evaluate the extent to which explanations improve user trust. The few works that do evaluate explanations with user studies employ settings that may deviate from end users' usage in the wild: ODQA is most ubiquitous in voice assistants, yet current research evaluates explanations only on a visual display, and may erroneously extrapolate conclusions about the most performant explanations to other modalities. To alleviate these issues, we conduct user studies that measure whether explanations help users correctly decide when to accept or reject an ODQA system's answer. Unlike prior work, we control for explanation modality, e.g., whether explanations are communicated to users through a spoken or visual interface, and contrast effectiveness across modalities. Our results show that explanations derived from retrieved evidence passages can outperform strong baselines (calibrated confidence) across modalities, but the best explanation strategy in fact changes with the modality. We show common failure cases of current explanations, emphasize end-to-end evaluation of explanations, and caution against evaluating them in proxy modalities that differ from deployment.
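As a concrete illustration of the "calibrated confidence" baseline the abstract mentions, the sketch below shows one common way such a score could be produced and verbalized to a user: Platt-scaling the ODQA system's raw answer score on held-out data. The abstract does not specify the calibration method or interface wording; the data, function names, and spoken message here are hypothetical, not the authors' implementation.

```python
# Minimal, hypothetical sketch of a "calibrated confidence" baseline:
# Platt-scale raw ODQA answer scores on held-out data, then turn the
# calibrated probability into the message a voice assistant would speak
# (or a visual interface would display). Not the authors' implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Held-out (raw answer score, was the answer correct?) pairs.
raw_scores = np.array([[2.1], [0.3], [-1.2], [4.0], [0.8], [-0.5]])
was_correct = np.array([1, 0, 0, 1, 1, 0])

# Platt scaling: a logistic regression mapping raw score -> P(answer correct).
calibrator = LogisticRegression().fit(raw_scores, was_correct)

def confidence_explanation(raw_score: float) -> str:
    """Verbalize the calibrated confidence for a single predicted answer."""
    p_correct = calibrator.predict_proba([[raw_score]])[0, 1]
    return f"I am about {round(100 * p_correct)}% confident in this answer."

print(confidence_explanation(1.5))
```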
Related papers
- Auditing Local Explanations is Hard [14.172657936593582]
We investigate an auditing framework in which a third-party auditor or a collective of users attempts to sanity-check explanations.
We prove upper and lower bounds on the number of queries an auditor needs to succeed within this framework.
Our results suggest that for complex high-dimensional settings, merely providing a pointwise prediction and explanation could be insufficient.
arXiv Detail & Related papers (2024-07-18T08:34:05Z)
- Rethinking the Evaluation of Dialogue Systems: Effects of User Feedback on Crowdworkers and LLMs [57.16442740983528]
In ad-hoc retrieval, evaluation relies heavily on user actions, including implicit feedback.
The role of user feedback in annotators' assessment of turns in a conversational setting has been little studied.
We focus on how the evaluation of task-oriented dialogue systems (TDSs) is affected by considering user feedback, explicit or implicit, as provided through the follow-up utterance of a turn being evaluated.
arXiv Detail & Related papers (2024-04-19T16:45:50Z)
- Introducing User Feedback-based Counterfactual Explanations (UFCE) [49.1574468325115]
Counterfactual explanations (CEs) have emerged as a viable solution for generating comprehensible explanations in XAI.
UFCE allows for the inclusion of user constraints to determine the smallest modifications in the subset of actionable features.
UFCE outperforms two well-known CE methods in terms of proximity, sparsity, and feasibility.
arXiv Detail & Related papers (2024-02-26T20:09:44Z)
- What if you said that differently?: How Explanation Formats Affect Human Feedback Efficacy and User Perception [53.4840989321394]
We analyze the effect of rationales generated by QA models to support their answers.
We present users with incorrect answers and corresponding rationales in various formats.
We measure the effectiveness of this feedback in patching these rationales through in-context learning.
arXiv Detail & Related papers (2023-11-16T04:26:32Z)
- Continually Improving Extractive QA via Human Feedback [59.49549491725224]
We study continually improving an extractive question answering (QA) system via human user feedback.
We conduct experiments involving thousands of user interactions under diverse setups to broaden the understanding of learning from feedback over time.
arXiv Detail & Related papers (2023-05-21T14:35:32Z)
- Explanation Selection Using Unlabeled Data for Chain-of-Thought Prompting [80.9896041501715]
Explanations that have not been "tuned" for a task, such as off-the-shelf explanations written by nonexperts, may lead to mediocre performance.
This paper tackles the problem of how to optimize explanation-infused prompts in a blackbox fashion.
arXiv Detail & Related papers (2023-02-09T18:02:34Z)
- How (Not) To Evaluate Explanation Quality [29.40729766120284]
We formulate desired characteristics of explanation quality that apply across tasks and domains.
We propose actionable guidelines to overcome obstacles that limit today's evaluation of explanation quality.
arXiv Detail & Related papers (2022-10-13T16:06:59Z)
- Features of Explainability: How users understand counterfactual and causal explanations for categorical and continuous features in XAI [10.151828072611428]
Counterfactual explanations are increasingly used to address interpretability, recourse, and bias in AI decisions.
We tested the effects of counterfactual and causal explanations on the objective accuracy of users' predictions.
We also found that users understand explanations referring to categorical features more readily than those referring to continuous features.
arXiv Detail & Related papers (2022-04-21T15:01:09Z)
- Improving Conversational Question Answering Systems after Deployment using Feedback-Weighted Learning [69.42679922160684]
We propose feedback-weighted learning based on importance sampling to improve upon an initial supervised system using binary user feedback.
Our work opens the prospect to exploit interactions with real users and improve conversational systems after deployment.
arXiv Detail & Related papers (2020-11-01T19:50:34Z)
- Explaining reputation assessments [6.87724532311602]
We propose an approach to explain the rationale behind assessments from quantitative reputation models.
Our approach adapts, extends and combines existing approaches for explaining decisions made using multi-attribute decision models.
arXiv Detail & Related papers (2020-06-15T23:19:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.