On Evaluating Explanation Utility for Human-AI Decision Making in NLP
- URL: http://arxiv.org/abs/2407.03545v1
- Date: Wed, 3 Jul 2024 23:53:27 GMT
- Title: On Evaluating Explanation Utility for Human-AI Decision Making in NLP
- Authors: Fateme Hashemi Chaleshtori, Atreya Ghosal, Alexander Gill, Purbid Bambroo, Ana Marasović
- Abstract summary: We review existing metrics and establish requirements for datasets to be suitable for application-grounded evaluations.
We demonstrate the importance of reassessing the state of the art to form and study human-AI teams.
- Score: 39.58317527488534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Is explainability a false promise? This debate has emerged from the insufficient evidence that explanations aid people in situations they are introduced for. More human-centered, application-grounded evaluations of explanations are needed to settle this. Yet, with no established guidelines for such studies in NLP, researchers accustomed to standardized proxy evaluations must discover appropriate measurements, tasks, datasets, and sensible models for human-AI teams in their studies. To help with this, we first review fitting existing metrics. We then establish requirements for datasets to be suitable for application-grounded evaluations. Among over 50 datasets available for explainability research in NLP, we find that 4 meet our criteria. By finetuning Flan-T5-3B, we demonstrate the importance of reassessing the state of the art to form and study human-AI teams. Finally, we present the exemplar studies of human-AI decision-making for one of the identified suitable tasks -- verifying the correctness of a legal claim given a contract.
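The abstract mentions finetuning Flan-T5-3B for the exemplar task of verifying a legal claim against a contract, but no code is given here. Below is a minimal sketch of how such a finetuning run could look with Hugging Face Transformers, assuming google/flan-t5-xl as the ~3B checkpoint, a hypothetical list of contract/claim/label examples (a real study would use a contract-claim dataset such as ContractNLI), and hyperparameters chosen for illustration only.

```python
# Minimal sketch (not the authors' code): finetune Flan-T5-3B to verify
# whether a legal claim is supported by a contract, framed as seq2seq
# generation of a short verdict label.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "google/flan-t5-xl"  # the ~3B-parameter Flan-T5 checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Hypothetical training examples; field names and labels are illustrative.
examples = [
    {"contract": "The Receiving Party shall not disclose ...",
     "claim": "The recipient may share confidential information publicly.",
     "label": "contradiction"},
]

def preprocess(batch):
    # Pose claim verification as an instruction-style text-to-text task.
    inputs = [
        f"contract: {c}\nclaim: {h}\nIs the claim supported by the contract?"
        for c, h in zip(batch["contract"], batch["claim"])
    ]
    enc = tokenizer(inputs, truncation=True, max_length=1024)
    enc["labels"] = tokenizer(text_target=batch["label"],
                              truncation=True, max_length=8)["input_ids"]
    return enc

train_ds = Dataset.from_list(examples).map(
    preprocess, batched=True, remove_columns=["contract", "claim", "label"]
)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-3b-claim-verification",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size of 16
    learning_rate=1e-4,
    num_train_epochs=3,
    bf16=True,  # assumes an accelerator with bfloat16 support
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

At inference time, calling model.generate on the same prompt format yields a verdict string that a human-AI decision-making study could show to participants alongside an explanation.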
Related papers
- CAUS: A Dataset for Question Generation based on Human Cognition Leveraging Large Language Models [4.962252439662465]
We introduce the Curious About Uncertain Scene dataset to enable Large Language Models to emulate human cognitive processes for resolving uncertainties.
Our approach involves providing scene descriptions embedded with uncertainties to stimulate the generation of reasoning and queries.
Our results demonstrate that GPT-4 can effectively generate pertinent questions and grasp their nuances, particularly when given appropriate context and instructions.
arXiv Detail & Related papers (2024-04-18T01:31:19Z) - Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution regarding the usefulness and potential for misunderstanding in saliency-based explanations.
arXiv Detail & Related papers (2023-12-10T23:13:23Z) - Notion of Explainable Artificial Intelligence -- An Empirical Investigation from A Users Perspective [0.3069335774032178]
This study investigates user-centric explainable AI, with recommendation systems as the study context.
We conducted focus group interviews to collect qualitative data on the recommendation system.
Our findings reveal that end users want a non-technical and tailor-made explanation with on-demand supplementary information.
arXiv Detail & Related papers (2023-11-01T22:20:14Z) - SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation [78.23119125463964]
We develop SocREval, a novel approach for prompt design in reference-free reasoning evaluation.
SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics.
arXiv Detail & Related papers (2023-09-29T18:25:46Z) - Are Human Explanations Always Helpful? Towards Objective Evaluation of Human Natural Language Explanations [27.624182544486334]
We build on the view that the quality of a human-annotated explanation can be measured based on its helpfulness.
We define a new metric that can take into consideration the helpfulness of an explanation for model performance.
arXiv Detail & Related papers (2023-05-04T19:31:50Z) - AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation models in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - A Survey on Methods and Metrics for the Assessment of Explainability under the Proposed AI Act [2.294014185517203]
This study identifies the requirements that a metric for assessing explainability should possess to ease compliance with the AI Act.
Our analysis proposes that metrics to measure the kind of explainability endorsed by the proposed AI Act shall be risk-focused, model-agnostic, goal-aware, intelligible & accessible.
arXiv Detail & Related papers (2021-10-21T14:27:24Z)