Related papers: Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems

Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems

URL: http://arxiv.org/abs/2001.08298v1
Date: Wed, 22 Jan 2020 22:14:28 GMT
Title: Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems
Authors: Zana Bu\c{c}inca, Phoebe Lin, Krzysztof Z. Gajos, Elena L. Glassman
Abstract summary: We evaluate two currently common techniques for evaluating XAI systems. We show that evaluations with proxy tasks did not predict the results of the evaluations with the actual decision-making tasks. Our results suggest that by employing misleading evaluation methods, our field may be inadvertently slowing its progress toward developing human+AI teams that can reliably perform better than humans or AIs alone.
Score: 14.940404609343432
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Explainable artificially intelligent (XAI) systems form part of sociotechnical systems, e.g., human+AI teams tasked with making decisions. Yet, current XAI systems are rarely evaluated by measuring the performance of human+AI teams on actual decision-making tasks. We conducted two online experiments and one in-person think-aloud study to evaluate two currently common techniques for evaluating XAI systems: (1) using proxy, artificial tasks such as how well humans predict the AI's decision from the given explanations, and (2) using subjective measures of trust and preference as predictors of actual performance. The results of our experiments demonstrate that evaluations with proxy tasks did not predict the results of the evaluations with the actual decision-making tasks. Further, the subjective measures on evaluations with actual decision-making tasks did not predict the objective performance on those same tasks. Our results suggest that by employing misleading evaluation methods, our field may be inadvertently slowing its progress toward developing human+AI teams that can reliably perform better than humans or AIs alone.

Related papers

Bias in the Loop: How Humans Evaluate AI-Generated Suggestions [9.578382668831988]
Human-AI collaboration increasingly drives decision-making across industries, from medical diagnosis to content moderation.<n>We know little about the psychological factors that determine when these collaborations succeed or fail.<n>We conducted a randomized experiment with 2,784 participants to examine how task design and individual characteristics shape human responses to AI-generated suggestions.
arXiv Detail & Related papers (2025-09-10T11:43:29Z)
Engaging with AI: How Interface Design Shapes Human-AI Collaboration in High-Stakes Decision-Making [8.948482790298645]
We examine how various decision-support mechanisms impact user engagement, trust, and human-AI collaborative task performance. Our findings reveal that mechanisms like AI confidence levels, text explanations, and performance visualizations enhanced human-AI collaborative task performance.
arXiv Detail & Related papers (2025-01-28T02:03:00Z)
Raising the Stakes: Performance Pressure Improves AI-Assisted Decision Making [57.53469908423318]
We show the effects of performance pressure on AI advice reliance when laypeople complete a common AI-assisted task. We find that when the stakes are high, people use AI advice more appropriately than when stakes are lower, regardless of the presence of an AI explanation.
arXiv Detail & Related papers (2024-10-21T22:39:52Z)
Study on the Helpfulness of Explainable Artificial Intelligence [0.0]
Legal, business, and ethical requirements motivate using effective XAI. We propose to evaluate XAI methods via the user's ability to successfully perform a proxy task. In other words, we address the helpfulness of XAI for human decision-making.
arXiv Detail & Related papers (2024-10-14T14:03:52Z)
To Err Is AI! Debugging as an Intervention to Facilitate Appropriate Reliance on AI Systems [11.690126756498223]
Vision for optimal human-AI collaboration requires 'appropriate reliance' of humans on AI systems. In practice, the performance disparity of machine learning models on out-of-distribution data makes dataset-specific performance feedback unreliable.
arXiv Detail & Related papers (2024-09-22T09:43:27Z)
How much informative is your XAI? A decision-making assessment task to objectively measure the goodness of explanations [53.01494092422942]
The number and complexity of personalised and user-centred approaches to XAI have rapidly grown in recent years. It emerged that user-centred approaches to XAI positively affect the interaction between users and systems. We propose an assessment task to objectively and quantitatively measure the goodness of XAI systems.
arXiv Detail & Related papers (2023-12-07T15:49:39Z)
Training Towards Critical Use: Learning to Situate AI Predictions Relative to Human Knowledge [22.21959942886099]
We introduce a process-oriented notion of appropriate reliance called critical use that centers the human's ability to situate AI predictions against knowledge that is uniquely available to them but unavailable to the AI model. We conduct a randomized online experiment in a complex social decision-making setting: child maltreatment screening. We find that, by providing participants with accelerated, low-stakes opportunities to practice AI-assisted decision-making, novices came to exhibit patterns of disagreement with AI that resemble those of experienced workers.
arXiv Detail & Related papers (2023-08-30T01:54:31Z)
The Impact of Imperfect XAI on Human-AI Decision-Making [8.305869611846775]
We evaluate how incorrect explanations influence humans' decision-making behavior in a bird species identification task. Our findings reveal the influence of imperfect XAI and humans' level of expertise on their reliance on AI and human-AI team performance.
arXiv Detail & Related papers (2023-07-25T15:19:36Z)
An Experimental Investigation into the Evaluation of Explainability Methods [60.54170260771932]
This work compares 14 different metrics when applied to nine state-of-the-art XAI methods and three dummy methods (e.g., random saliency maps) used as references. Experimental results show which of these metrics produces highly correlated results, indicating potential redundancy.
arXiv Detail & Related papers (2023-05-25T08:07:07Z)
Advancing Human-AI Complementarity: The Impact of User Expertise and Algorithmic Tuning on Joint Decision Making [10.890854857970488]
Many factors can impact success of Human-AI teams, including a user's domain expertise, mental models of an AI system, trust in recommendations, and more. Our study examined user performance in a non-trivial blood vessel labeling task where participants indicated whether a given blood vessel was flowing or stalled. Our results show that while recommendations from an AI-Assistant can aid user decision making, factors such as users' baseline performance relative to the AI and complementary tuning of AI error types significantly impact overall team performance.
arXiv Detail & Related papers (2022-08-16T21:39:58Z)
Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations. It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z)
Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork [54.309495231017344]
We argue that AI systems should be trained in a human-centered manner, directly optimized for team performance. We study this proposal for a specific type of human-AI teaming, where the human overseer chooses to either accept the AI recommendation or solve the task themselves. Our experiments with linear and non-linear models on real-world, high-stakes datasets show that the most accuracy AI may not lead to highest team performance.
arXiv Detail & Related papers (2020-04-27T19:06:28Z)
Effect of Confidence and Explanation on Accuracy and Trust Calibration in AI-Assisted Decision Making [53.62514158534574]
We study whether features that reveal case-specific model information can calibrate trust and improve the joint performance of the human and AI. We show that confidence score can help calibrate people's trust in an AI model, but trust calibration alone is not sufficient to improve AI-assisted decision making.
arXiv Detail & Related papers (2020-01-07T15:33:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.