Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating
Explainable AI Systems
- URL: http://arxiv.org/abs/2001.08298v1
- Date: Wed, 22 Jan 2020 22:14:28 GMT
- Title: Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating
Explainable AI Systems
- Authors: Zana Bu\c{c}inca, Phoebe Lin, Krzysztof Z. Gajos, Elena L. Glassman
- Abstract summary: We evaluate two currently common techniques for evaluating XAI systems.
We show that evaluations with proxy tasks did not predict the results of the evaluations with the actual decision-making tasks.
Our results suggest that by employing misleading evaluation methods, our field may be inadvertently slowing its progress toward developing human+AI teams that can reliably perform better than humans or AIs alone.
- Score: 14.940404609343432
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Explainable artificially intelligent (XAI) systems form part of
sociotechnical systems, e.g., human+AI teams tasked with making decisions. Yet,
current XAI systems are rarely evaluated by measuring the performance of
human+AI teams on actual decision-making tasks. We conducted two online
experiments and one in-person think-aloud study to evaluate two currently
common techniques for evaluating XAI systems: (1) using proxy, artificial tasks
such as how well humans predict the AI's decision from the given explanations,
and (2) using subjective measures of trust and preference as predictors of
actual performance. The results of our experiments demonstrate that evaluations
with proxy tasks did not predict the results of the evaluations with the actual
decision-making tasks. Further, the subjective measures on evaluations with
actual decision-making tasks did not predict the objective performance on those
same tasks. Our results suggest that by employing misleading evaluation
methods, our field may be inadvertently slowing its progress toward developing
human+AI teams that can reliably perform better than humans or AIs alone.
Related papers
- How much informative is your XAI? A decision-making assessment task to
objectively measure the goodness of explanations [53.01494092422942]
The number and complexity of personalised and user-centred approaches to XAI have rapidly grown in recent years.
It emerged that user-centred approaches to XAI positively affect the interaction between users and systems.
We propose an assessment task to objectively and quantitatively measure the goodness of XAI systems.
arXiv Detail & Related papers (2023-12-07T15:49:39Z) - Training Towards Critical Use: Learning to Situate AI Predictions
Relative to Human Knowledge [22.21959942886099]
We introduce a process-oriented notion of appropriate reliance called critical use that centers the human's ability to situate AI predictions against knowledge that is uniquely available to them but unavailable to the AI model.
We conduct a randomized online experiment in a complex social decision-making setting: child maltreatment screening.
We find that, by providing participants with accelerated, low-stakes opportunities to practice AI-assisted decision-making, novices came to exhibit patterns of disagreement with AI that resemble those of experienced workers.
arXiv Detail & Related papers (2023-08-30T01:54:31Z) - The Impact of Imperfect XAI on Human-AI Decision-Making [8.305869611846775]
We evaluate how incorrect explanations influence humans' decision-making behavior in a bird species identification task.
Our findings reveal the influence of imperfect XAI and humans' level of expertise on their reliance on AI and human-AI team performance.
arXiv Detail & Related papers (2023-07-25T15:19:36Z) - An Experimental Investigation into the Evaluation of Explainability
Methods [60.54170260771932]
This work compares 14 different metrics when applied to nine state-of-the-art XAI methods and three dummy methods (e.g., random saliency maps) used as references.
Experimental results show which of these metrics produces highly correlated results, indicating potential redundancy.
arXiv Detail & Related papers (2023-05-25T08:07:07Z) - Knowing About Knowing: An Illusion of Human Competence Can Hinder
Appropriate Reliance on AI Systems [13.484359389266864]
This paper addresses whether the Dunning-Kruger Effect (DKE) can hinder appropriate reliance on AI systems.
DKE is a metacognitive bias due to which less-competent individuals overestimate their own skill and performance.
We found that participants who overestimate their performance tend to exhibit under-reliance on AI systems.
arXiv Detail & Related papers (2023-01-25T14:26:10Z) - Advancing Human-AI Complementarity: The Impact of User Expertise and
Algorithmic Tuning on Joint Decision Making [10.890854857970488]
Many factors can impact success of Human-AI teams, including a user's domain expertise, mental models of an AI system, trust in recommendations, and more.
Our study examined user performance in a non-trivial blood vessel labeling task where participants indicated whether a given blood vessel was flowing or stalled.
Our results show that while recommendations from an AI-Assistant can aid user decision making, factors such as users' baseline performance relative to the AI and complementary tuning of AI error types significantly impact overall team performance.
arXiv Detail & Related papers (2022-08-16T21:39:58Z) - Doubting AI Predictions: Influence-Driven Second Opinion Recommendation [92.30805227803688]
We propose a way to augment human-AI collaboration by building on a common organizational practice: identifying experts who are likely to provide complementary opinions.
The proposed approach aims to leverage productive disagreement by identifying whether some experts are likely to disagree with an algorithmic assessment.
arXiv Detail & Related papers (2022-04-29T20:35:07Z) - Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations.
It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z) - Crowdsourcing Evaluation of Saliency-based XAI Methods [18.18238526746074]
We propose a new human-based evaluation scheme using crowdsourcing to evaluate XAI methods.
Our method is inspired by a human computation game, "Peek-a-boom"
We evaluate the saliency maps of various XAI methods on two datasets with automated and crowd-based evaluation schemes.
arXiv Detail & Related papers (2021-06-27T17:37:53Z) - Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork [54.309495231017344]
We argue that AI systems should be trained in a human-centered manner, directly optimized for team performance.
We study this proposal for a specific type of human-AI teaming, where the human overseer chooses to either accept the AI recommendation or solve the task themselves.
Our experiments with linear and non-linear models on real-world, high-stakes datasets show that the most accuracy AI may not lead to highest team performance.
arXiv Detail & Related papers (2020-04-27T19:06:28Z) - Effect of Confidence and Explanation on Accuracy and Trust Calibration
in AI-Assisted Decision Making [53.62514158534574]
We study whether features that reveal case-specific model information can calibrate trust and improve the joint performance of the human and AI.
We show that confidence score can help calibrate people's trust in an AI model, but trust calibration alone is not sufficient to improve AI-assisted decision making.
arXiv Detail & Related papers (2020-01-07T15:33:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.