Evaluating Trustworthiness of AI-Enabled Decision Support Systems:
Validation of the Multisource AI Scorecard Table (MAST)
- URL: http://arxiv.org/abs/2311.18040v1
- Date: Wed, 29 Nov 2023 19:34:15 GMT
- Authors: Pouria Salehi, Yang Ba, Nayoung Kim, Ahmadreza Mosallanezhad, Anna
Pan, Myke C. Cohen, Yixuan Wang, Jieqiong Zhao, Shawaiz Bhatti, James Sung,
Erik Blasch, Michelle V. Mancenido, Erin K. Chiou
- Abstract summary: The Multisource AI Scorecard Table (MAST) is a checklist tool to inform the design and evaluation of trustworthy AI systems.
We evaluate whether MAST is associated with people's trust perceptions in AI-enabled decision support systems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Multisource AI Scorecard Table (MAST) is a checklist tool based on
analytic tradecraft standards to inform the design and evaluation of
trustworthy AI systems. In this study, we evaluate whether MAST is associated
with people's trust perceptions in AI-enabled decision support systems
(AI-DSSs). Evaluating trust in AI-DSSs poses challenges to researchers and
practitioners. These challenges include identifying the components,
capabilities, and potential of these systems, many of which are based on the
complex deep learning algorithms that drive DSS performance and preclude
complete manual inspection. We developed two interactive, AI-DSS test
environments using the MAST criteria. One emulated an identity verification
task in security screening, and another emulated a text summarization system to
aid in an investigative reporting task. Each test environment had one version
designed to match low-MAST ratings, and another designed to match high-MAST
ratings, with the hypothesis that MAST ratings would be positively related to
the trust ratings of these systems. A total of 177 subject matter experts were
recruited to interact with and evaluate these systems. Results generally show
higher MAST ratings for the high-MAST conditions than for the low-MAST
conditions, and that measures of trust perception are highly correlated with the
MAST ratings. We conclude that MAST can be a useful tool for designing and
evaluating systems that will engender high trust perceptions, including AI-DSS
that may be used to support visual screening and text summarization tasks.
However, higher MAST ratings may not translate to higher joint performance.
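The abstract's central finding is a strong association between MAST ratings and trust-perception measures. As a purely illustrative sketch (the paper's actual analysis and data are not reproduced here, and the scores below are invented), a rank correlation such as Spearman's rho is the kind of statistic such a claim typically rests on:

```python
# Hypothetical sketch only: illustrates a Spearman rank correlation between
# per-participant MAST ratings and trust-scale scores. All data are invented
# for demonstration; this is not the paper's dataset or analysis code.

def rank(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented example scores for six hypothetical participants.
mast_scores = [2.1, 3.4, 1.8, 4.0, 3.7, 2.9]
trust_scores = [3.0, 4.2, 2.5, 4.8, 4.5, 3.6]
print(round(spearman_rho(mast_scores, trust_scores), 3))  # → 1.0 (perfectly monotone toy data)
```

In practice a library routine such as `scipy.stats.spearmanr` would be used instead of a hand-rolled implementation; the point is only the shape of the analysis, not its results.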
Related papers
- MFE-ETP: A Comprehensive Evaluation Benchmark for Multi-modal Foundation Models on Embodied Task Planning [50.45558735526665]
We provide an in-depth and comprehensive evaluation of the performance of MFMs on embodied task planning.
We propose a new benchmark, named MFE-ETP, characterized by its complex and variable task scenarios.
Using the benchmark and evaluation platform, we evaluated several state-of-the-art MFMs and found that they significantly lag behind human-level performance.
arXiv Detail & Related papers (2024-07-06T11:07:18Z)
- Trustworthy Artificial Intelligence in the Context of Metrology [3.2873782624127834]
We review research at the National Physical Laboratory in the area of trustworthy artificial intelligence (TAI).
We describe three broad themes of TAI: technical, socio-technical and social, which play key roles in ensuring that the developed models are trustworthy and can be relied upon to make responsible decisions.
We discuss three research areas within TAI that we are working on at NPL, and examine the certification of AI systems in terms of adherence to the characteristics of TAI.
arXiv Detail & Related papers (2024-06-14T15:23:27Z)
- A-Bench: Are LMMs Masters at Evaluating AI-generated Images? [78.3699767628502]
A-Bench is a benchmark designed to diagnose whether large multi-modal models (LMMs) are masters at evaluating AI-generated images (AIGIs).
Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts, and tested across 18 leading LMMs.
arXiv Detail & Related papers (2024-06-05T08:55:02Z)
- Testing autonomous vehicles and AI: perspectives and challenges from cybersecurity, transparency, robustness and fairness [53.91018508439669]
The study explores the complexities of integrating Artificial Intelligence into Autonomous Vehicles (AVs).
It examines the challenges introduced by AI components and the impact on testing procedures.
The paper identifies significant challenges and suggests future directions for research and development of AI in AV technology.
arXiv Detail & Related papers (2024-02-21T08:29:42Z)
- PADTHAI-MM: A Principled Approach for Designing Trustable, Human-centered AI systems using the MAST Methodology [5.38932801848643]
The Multisource AI Scorecard Table (MAST), a checklist rating system, addresses this gap in designing and evaluating AI-enabled decision support systems.
We propose the Principled Approach for Designing Trustable Human-centered AI systems using MAST methodology.
We show that MAST-guided design can improve trust perceptions, and that MAST criteria can be linked to performance, process, and purpose information.
arXiv Detail & Related papers (2024-01-24T23:15:44Z)
- Towards Reliable AI: Adequacy Metrics for Ensuring the Quality of System-level Testing of Autonomous Vehicles [5.634825161148484]
We introduce a set of black-box test adequacy metrics called "Test suite Instance Space Adequacy" (TISA) metrics.
The TISA metrics offer a way to assess both the diversity and coverage of the test suite and the range of bugs detected during testing.
We evaluate the efficacy of the TISA metrics by examining their correlation with the number of bugs detected in system-level simulation testing of AVs.
arXiv Detail & Related papers (2023-11-14T10:16:05Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- Benchmarking Quality-Dependent and Cost-Sensitive Score-Level Multimodal Biometric Fusion Algorithms [58.156733807470395]
This paper reports a benchmarking study carried out within the framework of the BioSecure DS2 (Access Control) evaluation campaign.
The campaign targeted the application of physical access control in a medium-size establishment with some 500 persons.
To the best of our knowledge, this is the first attempt to benchmark quality-based multimodal fusion algorithms.
arXiv Detail & Related papers (2021-11-17T13:39:48Z)
- Statistical Perspectives on Reliability of Artificial Intelligence Systems [6.284088451820049]
We provide statistical perspectives on the reliability of AI systems.
We introduce a so-called SMART statistical framework for AI reliability research.
We discuss recent developments in modeling and analysis of AI reliability.
arXiv Detail & Related papers (2021-11-09T20:00:14Z)
- Multisource AI Scorecard Table for System Evaluation [3.74397577716445]
The paper describes a Multisource AI Scorecard Table (MAST) that provides the developer and user of an artificial intelligence (AI)/machine learning (ML) system with a standard checklist.
The paper explores how the analytic tradecraft standards outlined in Intelligence Community Directive (ICD) 203 can provide a framework for assessing the performance of an AI system.
arXiv Detail & Related papers (2021-02-08T03:37:40Z)
- SMT-based Safety Verification of Parameterised Multi-Agent Systems [78.04236259129524]
We study the verification of parameterised multi-agent systems (MASs).
In particular, we study whether unwanted states, characterised as a given state formula, are reachable in a given MAS.
arXiv Detail & Related papers (2020-08-11T15:24:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.