How to Evaluate Explainability? -- A Case for Three Criteria
- URL: http://arxiv.org/abs/2209.00366v1
- Date: Thu, 1 Sep 2022 11:22:50 GMT
- Title: How to Evaluate Explainability? -- A Case for Three Criteria
- Authors: Timo Speith
- Abstract summary: We will provide a multidisciplinary motivation for three quality criteria concerning the information that systems should provide.
Our aim is to fuel the discussion regarding these criteria, such that adequate evaluation methods for them will be conceived.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The increasing complexity of software systems and the influence of
software-supported decisions in our society have sparked the need for software
that is safe, reliable, and fair. Explainability has been identified as a means
to achieve these qualities. It is recognized as an emerging non-functional
requirement (NFR) that has a significant impact on system quality. However, in
order to develop explainable systems, we need to understand when a system
satisfies this NFR. To this end, appropriate evaluation methods are required.
However, the field is crowded with evaluation methods, and there is no
consensus on which are the "right" ones. Worse, there is not even agreement
on which criteria should be evaluated. In this vision paper, we will provide a
multidisciplinary motivation for three such quality criteria concerning the
information that systems should provide: comprehensibility, fidelity, and
assessability. Our aim is to fuel the discussion regarding these criteria,
such that adequate evaluation methods for them will be conceived.
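To make one of the three criteria concrete, fidelity is often operationalized as the degree to which an explanation (e.g. an interpretable surrogate rule) agrees with the behavior of the model it explains. The following is a minimal illustrative sketch under that assumption; the models, the grid of inputs, and the function names are hypothetical and not taken from the paper.

```python
# Illustrative sketch (not from the paper): fidelity as the fraction of
# inputs on which an interpretable surrogate agrees with a black-box model.

def black_box(x):
    # Stand-in for an opaque model: classifies by a nonlinear product rule.
    return 1 if x[0] * x[1] > 0.25 else 0

def surrogate(x):
    # Simple interpretable rule offered as an "explanation" of black_box.
    return 1 if x[0] > 0.5 and x[1] > 0.5 else 0

def fidelity(model, explanation, samples):
    # Fraction of sample inputs on which the explanation mimics the model.
    agree = sum(model(x) == explanation(x) for x in samples)
    return agree / len(samples)

# Evaluate agreement on a coarse grid over the unit square.
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
print(round(fidelity(black_box, surrogate, grid), 3))  # → 0.818
```

A fidelity near 1.0 would indicate the surrogate faithfully tracks the model on the evaluated inputs; the gap here comes from regions where the product rule and the threshold rule disagree.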
Related papers
- Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems [88.80306881112313]
We will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI.
The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees.
We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest a number of potential solutions to them.
arXiv Detail & Related papers (2024-05-10T17:38:32Z)
- Functional trustworthiness of AI systems by statistically valid testing [7.717286312400472]
The authors are concerned about the safety, health, and rights of the European citizens due to inadequate measures and procedures required by the current draft of the EU Artificial Intelligence (AI) Act.
We observe that not only the current draft of the EU AI Act, but also the accompanying standardization efforts in CEN/CENELEC, have resorted to the position that real functional guarantees of AI systems would supposedly be unrealistic and too complex anyway.
arXiv Detail & Related papers (2023-10-04T11:07:52Z)
- A New Perspective on Evaluation Methods for Explainable Artificial Intelligence (XAI) [0.0]
We argue that it is best approached in a nuanced way that incorporates resource availability, domain characteristics, and considerations of risk.
This work aims to advance the field of Requirements Engineering for AI.
arXiv Detail & Related papers (2023-07-26T15:15:44Z)
- Revisiting the Performance-Explainability Trade-Off in Explainable Artificial Intelligence (XAI) [0.0]
We argue that it is best approached in a nuanced way that incorporates resource availability, domain characteristics, and considerations of risk.
This work aims to advance the field of Requirements Engineering for AI.
arXiv Detail & Related papers (2023-07-26T15:07:40Z)
- Towards Clear Expectations for Uncertainty Estimation [64.20262246029286]
Uncertainty Quantification (UQ) is crucial to achieve trustworthy Machine Learning (ML)
Most UQ methods suffer from disparate and inconsistent evaluation protocols.
This opinion paper offers a new perspective by specifying those requirements through five downstream tasks.
arXiv Detail & Related papers (2022-07-27T07:50:57Z)
- Towards a multi-stakeholder value-based assessment framework for algorithmic systems [76.79703106646967]
We develop a value-based assessment framework that visualizes closeness and tensions between values.
We give guidelines on how to operationalize them, while opening up the evaluation and deliberation process to a wide range of stakeholders.
arXiv Detail & Related papers (2022-05-09T19:28:32Z)
- Tailored Uncertainty Estimation for Deep Learning Systems [10.288326973530614]
We propose a framework that guides the selection of a suitable uncertainty estimation method.
Our framework provides strategies to validate this choice and to uncover structural weaknesses.
It anticipates prospective machine learning regulations that require evidence of the technical appropriateness of machine learning systems.
arXiv Detail & Related papers (2022-04-29T09:23:07Z)
- Trustworthy AI [75.99046162669997]
Brittleness to minor adversarial changes in the input data, the inability to explain decisions, and bias in training data are among the most prominent limitations.
We propose the tutorial on Trustworthy AI to address six critical issues in enhancing user and public trust in AI systems.
arXiv Detail & Related papers (2020-11-02T20:04:18Z)
- How Trustworthy are Performance Evaluations for Basic Vision Tasks? [46.0590176230731]
This paper examines performance evaluation criteria for basic vision tasks involving sets of objects, namely object detection, instance-level segmentation, and multi-object tracking.
The rankings of algorithms by an existing criterion can fluctuate with different choices of parameters, making their evaluations unreliable.
This work suggests a notion of trustworthiness for performance criteria, which requires (i) robustness to parameters for reliability, (ii) contextual meaningfulness in sanity tests, and (iii) consistency with mathematical requirements such as the metric properties.
arXiv Detail & Related papers (2020-08-08T14:21:15Z)
- Towards Faithfully Interpretable NLP Systems: How should we define and evaluate faithfulness? [58.13152510843004]
With the growing popularity of deep-learning based NLP models, comes a need for interpretable systems.
What is interpretability, and what constitutes a high-quality interpretation?
We call for more clearly differentiating among the desired criteria an interpretation should satisfy, and we focus on the faithfulness criterion.
arXiv Detail & Related papers (2020-04-07T20:15:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.