On Evaluating Explanation Utility for Human-AI Decision Making in NLP
- URL: http://arxiv.org/abs/2407.03545v2
- Date: Tue, 05 Nov 2024 01:38:45 GMT
- Title: On Evaluating Explanation Utility for Human-AI Decision Making in NLP
- Authors: Fateme Hashemi Chaleshtori, Atreya Ghosal, Alexander Gill, Purbid Bambroo, Ana Marasović
- Abstract summary: We review existing metrics suitable for application-grounded evaluation.
We demonstrate the importance of reassessing the state of the art to form and study human-AI teams.
- Score: 39.58317527488534
- License:
- Abstract: Is explainability a false promise? This debate has emerged from the insufficient evidence that explanations help people in the situations they are introduced for. More human-centered, application-grounded evaluations of explanations are needed to settle this. Yet, with no established guidelines for such studies in NLP, researchers accustomed to standardized proxy evaluations must discover appropriate measurements, tasks, datasets, and sensible models for human-AI teams in their studies. To aid with this, we first review existing metrics suitable for application-grounded evaluation. We then establish criteria to select appropriate datasets and, using them, find that only 4 out of over 50 datasets available for explainability research in NLP meet them. We then demonstrate the importance of reassessing the state of the art to form and study human-AI teams: teaming people with models for certain tasks might only now start to make sense, and for others, it remains unsound. Finally, we present exemplar studies of human-AI decision-making for one of the identified tasks -- verifying the correctness of a legal claim given a contract. Our results show that providing AI predictions, with or without explanations, does not cause decision makers to speed up their work without compromising performance. We argue for revisiting the setup of human-AI teams and improving automatic deferral of instances to AI, where explanations could play a useful role.
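The abstract's closing argument, that human-AI teams could be improved by automatic deferral of instances to AI, can be made concrete with a minimal sketch. The routing rule, threshold, and names below are illustrative assumptions rather than the paper's method: a confidence-based router hands high-confidence instances to the model and keeps the rest for the human decision maker, which is where explanations could play the role the abstract suggests.
```python
# Illustrative sketch (not the paper's method): confidence-based routing of
# instances between an AI model and a human decision maker.
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Instance:
    claim: str      # e.g., a legal claim to verify
    contract: str   # the contract text the claim is checked against

def route_instances(
    instances: List[Instance],
    predict_with_confidence: Callable[[Instance], Tuple[str, float]],
    confidence_threshold: float = 0.9,  # assumed value; would be tuned on held-out data
) -> Tuple[List[Tuple[Instance, str]], List[Instance]]:
    """Split instances between the AI and the human.

    Instances where the model's confidence clears the threshold are deferred
    to the AI and resolved with its label; the rest are kept for the human,
    who could additionally be shown an explanation.
    """
    deferred_to_ai: List[Tuple[Instance, str]] = []
    kept_for_human: List[Instance] = []
    for instance in instances:
        label, confidence = predict_with_confidence(instance)
        if confidence >= confidence_threshold:
            deferred_to_ai.append((instance, label))
        else:
            kept_for_human.append(instance)
    return deferred_to_ai, kept_for_human
```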
Related papers
- CAUS: A Dataset for Question Generation based on Human Cognition Leveraging Large Language Models [4.962252439662465]
We introduce the Curious About Uncertain Scene dataset to enable Large Language Models to emulate human cognitive processes for resolving uncertainties.
Our approach involves providing scene descriptions embedded with uncertainties to stimulate the generation of reasoning and queries.
Our results demonstrate that GPT-4 can effectively generate pertinent questions and grasp their nuances, particularly when given appropriate context and instructions.
arXiv Detail & Related papers (2024-04-18T01:31:19Z) - Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution regarding the usefulness and potential for misunderstanding in saliency-based explanations.
arXiv Detail & Related papers (2023-12-10T23:13:23Z) - Notion of Explainable Artificial Intelligence -- An Empirical Investigation from A Users Perspective [0.3069335774032178]
This study investigates user-centric explainable AI, with recommendation systems as the study context.
We conducted focus group interviews to collect qualitative data on the recommendation system.
Our findings reveal that end users want a non-technical and tailor-made explanation with on-demand supplementary information.
arXiv Detail & Related papers (2023-11-01T22:20:14Z) - Training Towards Critical Use: Learning to Situate AI Predictions Relative to Human Knowledge [22.21959942886099]
We introduce a process-oriented notion of appropriate reliance called critical use that centers the human's ability to situate AI predictions against knowledge that is uniquely available to them but unavailable to the AI model.
We conduct a randomized online experiment in a complex social decision-making setting: child maltreatment screening.
We find that, by providing participants with accelerated, low-stakes opportunities to practice AI-assisted decision-making, novices came to exhibit patterns of disagreement with AI that resemble those of experienced workers.
arXiv Detail & Related papers (2023-08-30T01:54:31Z) - In Search of Verifiability: Explanations Rarely Enable Complementary Performance in AI-Advised Decision Making [25.18203172421461]
We argue explanations are only useful to the extent that they allow a human decision maker to verify the correctness of an AI's prediction.
We also compare the objective of complementary performance with that of appropriate reliance, decomposing the latter into the notions of outcome-graded and strategy-graded reliance.
arXiv Detail & Related papers (2023-05-12T18:28:04Z) - Assisting Human Decisions in Document Matching [52.79491990823573]
We devise a proxy matching task that allows us to evaluate which kinds of assistive information improve decision makers' performance.
We find that providing black-box model explanations reduces users' accuracy on the matching task.
On the other hand, custom methods that are designed to closely attend to some task-specific desiderata are found to be effective in improving user performance.
arXiv Detail & Related papers (2023-02-16T17:45:20Z) - Human-Centric Multimodal Machine Learning: Recent Advances and Testbed on AI-based Recruitment [66.91538273487379]
There is a certain consensus about the need to develop AI applications with a Human-Centric approach.
Human-Centric Machine Learning needs to be developed based on four main requirements: (i) utility and social good; (ii) privacy and data ownership; (iii) transparency and accountability; and (iv) fairness in AI-driven decision-making processes.
We study how current multimodal algorithms based on heterogeneous sources of information are affected by sensitive elements and inner biases in the data.
arXiv Detail & Related papers (2023-02-13T16:44:44Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task interpretability when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - The Role of AI in Drug Discovery: Challenges, Opportunities, and Strategies [97.5153823429076]
The benefits, challenges and drawbacks of AI in this field are reviewed.
The use of data augmentation, explainable AI, and the integration of AI with traditional experimental methods are also discussed.
arXiv Detail & Related papers (2022-12-08T23:23:39Z) - Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating Explainable AI Systems [14.940404609343432]
We examine two currently common techniques for evaluating XAI systems.
We show that evaluations with proxy tasks did not predict the results of the evaluations with the actual decision-making tasks.
Our results suggest that by employing misleading evaluation methods, our field may be inadvertently slowing its progress toward developing human+AI teams that can reliably perform better than humans or AIs alone.
arXiv Detail & Related papers (2020-01-22T22:14:28Z)