On Evaluating Explanation Utility for Human-AI Decision Making in NLP
- URL: http://arxiv.org/abs/2407.03545v1
- Date: Wed, 3 Jul 2024 23:53:27 GMT
- Title: On Evaluating Explanation Utility for Human-AI Decision Making in NLP
- Authors: Fateme Hashemi Chaleshtori, Atreya Ghosal, Alexander Gill, Purbid Bambroo, Ana Marasović
- Abstract summary: We review existing metrics and establish requirements for datasets to be suitable for application-grounded evaluations.
We demonstrate the importance of reassessing the state of the art to form and study human-AI teams.
- Score: 39.58317527488534
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Is explainability a false promise? This debate has emerged from the insufficient evidence that explanations aid people in situations they are introduced for. More human-centered, application-grounded evaluations of explanations are needed to settle this. Yet, with no established guidelines for such studies in NLP, researchers accustomed to standardized proxy evaluations must discover appropriate measurements, tasks, datasets, and sensible models for human-AI teams in their studies. To help with this, we first review fitting existing metrics. We then establish requirements for datasets to be suitable for application-grounded evaluations. Among over 50 datasets available for explainability research in NLP, we find that 4 meet our criteria. By finetuning Flan-T5-3B, we demonstrate the importance of reassessing the state of the art to form and study human-AI teams. Finally, we present the exemplar studies of human-AI decision-making for one of the identified suitable tasks -- verifying the correctness of a legal claim given a contract.
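The abstract mentions finetuning Flan-T5-3B for the exemplar task of verifying a legal claim against a contract, but no code is given here. Below is a minimal sketch of how such a finetuning run could look with Hugging Face Transformers, assuming google/flan-t5-xl as the ~3B checkpoint, a hypothetical list of contract/claim/label examples (a real study would use a contract-claim dataset such as ContractNLI), and hyperparameters chosen for illustration only.

```python
# Minimal sketch (not the authors' code): finetune Flan-T5-3B to verify
# whether a legal claim is supported by a contract, framed as seq2seq
# generation of a short verdict label.
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

MODEL_NAME = "google/flan-t5-xl"  # the ~3B-parameter Flan-T5 checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

# Hypothetical training examples; field names and labels are illustrative.
examples = [
    {"contract": "The Receiving Party shall not disclose ...",
     "claim": "The recipient may share confidential information publicly.",
     "label": "contradiction"},
]

def preprocess(batch):
    # Pose claim verification as an instruction-style text-to-text task.
    inputs = [
        f"contract: {c}\nclaim: {h}\nIs the claim supported by the contract?"
        for c, h in zip(batch["contract"], batch["claim"])
    ]
    enc = tokenizer(inputs, truncation=True, max_length=1024)
    enc["labels"] = tokenizer(text_target=batch["label"],
                              truncation=True, max_length=8)["input_ids"]
    return enc

train_ds = Dataset.from_list(examples).map(
    preprocess, batched=True, remove_columns=["contract", "claim", "label"]
)

args = Seq2SeqTrainingArguments(
    output_dir="flan-t5-3b-claim-verification",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,  # effective batch size of 16
    learning_rate=1e-4,
    num_train_epochs=3,
    bf16=True,  # assumes an accelerator with bfloat16 support
    logging_steps=50,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_ds,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```

At inference time, calling model.generate on the same prompt format yields a verdict string that a human-AI decision-making study could show to participants alongside an explanation.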
Related papers
- CAUS: A Dataset for Question Generation based on Human Cognition Leveraging Large Language Models [4.962252439662465]
We introduce the Curious About Uncertain Scene dataset to enable Large Language Models to emulate human cognitive processes for resolving uncertainties.
Our approach involves providing scene descriptions embedded with uncertainties to stimulate the generation of reasoning and queries.
Our results demonstrate that GPT-4 can effectively generate pertinent questions and grasp their nuances, particularly when given appropriate context and instructions.
arXiv Detail & Related papers (2024-04-18T01:31:19Z) - Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution regarding the usefulness and potential for misunderstanding in saliency-based explanations.
arXiv Detail & Related papers (2023-12-10T23:13:23Z) - Notion of Explainable Artificial Intelligence -- An Empirical Investigation from A Users Perspective [0.3069335774032178]
This study investigates user-centric explainable AI, with recommendation systems as the study context.
We conducted focus group interviews to collect qualitative data on the recommendation system.
Our findings reveal that end users want a non-technical and tailor-made explanation with on-demand supplementary information.
arXiv Detail & Related papers (2023-11-01T22:20:14Z) - SocREval: Large Language Models with the Socratic Method for Reference-Free Reasoning Evaluation [78.23119125463964]
We develop SocREval, a novel approach for prompt design in reference-free reasoning evaluation.
SocREval significantly improves GPT-4's performance, surpassing existing reference-free and reference-based reasoning evaluation metrics.
arXiv Detail & Related papers (2023-09-29T18:25:46Z) - Are Human Explanations Always Helpful? Towards Objective Evaluation of Human Natural Language Explanations [27.624182544486334]
We build on the view that the quality of a human-annotated explanation can be measured based on its helpfulness.
We define a new metric that can take into consideration the helpfulness of an explanation for model performance.
arXiv Detail & Related papers (2023-05-04T19:31:50Z) - AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models [122.63704560157909]
We introduce AGIEval, a novel benchmark designed to assess foundation models in the context of human-centric standardized exams.
We evaluate several state-of-the-art foundation models, including GPT-4, ChatGPT, and Text-Davinci-003.
GPT-4 surpasses average human performance on SAT, LSAT, and math competitions, attaining a 95% accuracy rate on the SAT Math test and a 92.5% accuracy on the English test of the Chinese national college entrance exam.
arXiv Detail & Related papers (2023-04-13T09:39:30Z) - Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation [136.16507050034755]
Existing human evaluation studies for summarization either exhibit a low inter-annotator agreement or have insufficient scale.
We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which is based on fine-grained semantic units.
We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset consisting of 22,000 summary-level annotations over 28 top-performing systems.
arXiv Detail & Related papers (2022-12-15T17:26:05Z) - ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning [63.77667876176978]
Large language models show improved downstream task performance when prompted to generate step-by-step reasoning to justify their final answers.
These reasoning steps greatly improve model interpretability and verification, but objectively studying their correctness is difficult.
We present ROSCOE, a suite of interpretable, unsupervised automatic scores that improve and extend previous text generation evaluation metrics.
arXiv Detail & Related papers (2022-12-15T15:52:39Z) - A Survey on Methods and Metrics for the Assessment of Explainability under the Proposed AI Act [2.294014185517203]
This study identifies the requirements that a metric for assessing explainability should possess to ease compliance with the AI Act.
Our analysis proposes that metrics to measure the kind of explainability endorsed by the proposed AI Act shall be risk-focused, model-agnostic, goal-aware, intelligible & accessible.
arXiv Detail & Related papers (2021-10-21T14:27:24Z)