Evaluating Trustworthiness of AI-Enabled Decision Support Systems:
Validation of the Multisource AI Scorecard Table (MAST)
- URL: http://arxiv.org/abs/2311.18040v1
- Date: Wed, 29 Nov 2023 19:34:15 GMT
- Title: Evaluating Trustworthiness of AI-Enabled Decision Support Systems:
Validation of the Multisource AI Scorecard Table (MAST)
- Authors: Pouria Salehi, Yang Ba, Nayoung Kim, Ahmadreza Mosallanezhad, Anna
Pan, Myke C. Cohen, Yixuan Wang, Jieqiong Zhao, Shawaiz Bhatti, James Sung,
Erik Blasch, Michelle V. Mancenido, Erin K. Chiou
- Abstract summary: The Multisource AI Scorecard Table (MAST) is a checklist tool to inform the design and evaluation of trustworthy AI systems.
We evaluate whether MAST is associated with people's trust perceptions in AI-enabled decision support systems.
- Score: 10.983659980278926
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Multisource AI Scorecard Table (MAST) is a checklist tool based on
analytic tradecraft standards to inform the design and evaluation of
trustworthy AI systems. In this study, we evaluate whether MAST is associated
with people's trust perceptions in AI-enabled decision support systems
(AI-DSSs). Evaluating trust in AI-DSSs poses challenges to researchers and
practitioners. These challenges include identifying the components,
capabilities, and potential of these systems, many of which are based on the
complex deep learning algorithms that drive DSS performance and preclude
complete manual inspection. We developed two interactive, AI-DSS test
environments using the MAST criteria. One emulated an identity verification
task in security screening, and another emulated a text summarization system to
aid in an investigative reporting task. Each test environment had one version
designed to match low-MAST ratings, and another designed to match high-MAST
ratings, with the hypothesis that MAST ratings would be positively related to
the trust ratings of these systems. A total of 177 subject matter experts were
recruited to interact with and evaluate these systems. Results generally show
higher MAST ratings for the high-MAST conditions compared to the low-MAST
conditions, and that measures of trust perception are highly correlated with the
MAST ratings. We conclude that MAST can be a useful tool for designing and
evaluating systems that will engender high trust perceptions, including AI-DSS
that may be used to support visual screening and text summarization tasks.
However, higher MAST ratings may not translate to higher joint performance.
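As a rough illustration of the kind of association reported above, and not the authors' actual analysis pipeline, the sketch below correlates hypothetical per-participant MAST ratings with trust-perception scores. All variable names, scales, and values are invented for illustration.

```python
# Illustrative sketch only: hypothetical data, column names, and scales;
# not the study's actual dataset or statistical workflow.
import pandas as pd
from scipy.stats import spearmanr

# Hypothetical per-participant scores: an overall MAST rating for the system
# version the participant used, and a trust-perception score from a
# post-task questionnaire.
ratings = pd.DataFrame({
    "condition":   ["low", "low", "low", "high", "high", "high"],
    "mast_score":  [1.4, 1.7, 1.5, 2.6, 2.8, 2.5],
    "trust_score": [2.1, 2.5, 2.3, 4.2, 4.6, 4.0],
})

# A rank-based correlation is an assumption-light choice for questionnaire
# data; the paper's exact statistical tests may differ.
rho, p_value = spearmanr(ratings["mast_score"], ratings["trust_score"])
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3f}")

# Condition-level means mirroring the low- vs high-MAST manipulation.
print(ratings.groupby("condition")[["mast_score", "trust_score"]].mean())
```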
Related papers
- Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation [2.2241228857601727]
This paper presents an interdisciplinary meta-review of about 100 studies that discuss shortcomings in quantitative benchmarking practices.
It brings together many fine-grained issues in the design and application of benchmarks with broader sociotechnical issues.
Our review also highlights a series of systemic flaws in current practices, such as misaligned incentives, construct validity issues, unknown unknowns, and problems with the gaming of benchmark results.
arXiv Detail & Related papers (2025-02-10T15:25:06Z) - AI-Compass: A Comprehensive and Effective Multi-module Testing Tool for AI Systems [26.605694684145313]
In this study, we design and implement a testing tool, AI-Compass, to comprehensively and effectively evaluate AI systems.
The tool extensively assesses adversarial robustness and model interpretability, and performs neuron analysis.
Our research sheds light on a general solution for the AI systems testing landscape.
arXiv Detail & Related papers (2024-11-09T11:15:17Z) - MMIE: Massive Multimodal Interleaved Comprehension Benchmark for Large Vision-Language Models [71.36392373876505]
We introduce MMIE, a large-scale benchmark for evaluating interleaved multimodal comprehension and generation in Large Vision-Language Models (LVLMs).
MMIE comprises 20K meticulously curated multimodal queries, spanning 3 categories, 12 fields, and 102 subfields, including mathematics, coding, physics, literature, health, and arts.
It supports both interleaved inputs and outputs, offering a mix of multiple-choice and open-ended question formats to evaluate diverse competencies.
arXiv Detail & Related papers (2024-10-14T04:15:00Z) - MFE-ETP: A Comprehensive Evaluation Benchmark for Multi-modal Foundation Models on Embodied Task Planning [50.45558735526665]
We provide an in-depth and comprehensive evaluation of the performance of MFMs on embodied task planning.
We propose a new benchmark, named MFE-ETP, characterized by its complex and variable task scenarios.
Using the benchmark and evaluation platform, we evaluated several state-of-the-art MFMs and found that they significantly lag behind human-level performance.
arXiv Detail & Related papers (2024-07-06T11:07:18Z) - Testing autonomous vehicles and AI: perspectives and challenges from cybersecurity, transparency, robustness and fairness [53.91018508439669]
The study explores the complexities of integrating Artificial Intelligence into Autonomous Vehicles (AVs).
It examines the challenges introduced by AI components and the impact on testing procedures.
The paper identifies significant challenges and suggests future directions for research and development of AI in AV technology.
arXiv Detail & Related papers (2024-02-21T08:29:42Z) - PADTHAI-MM: Principles-based Approach for Designing Trustworthy, Human-centered AI using MAST Methodology [5.215782336985273]
The Multisource AI Scorecard Table (MAST) was designed to bridge the gap by offering a systematic, tradecraft-centered approach to evaluating AI-enabled decision support systems.
We introduce an iterative design framework called the Principles-based Approach for Designing Trustworthy, Human-centered AI using MAST Methodology (PADTHAI-MM).
We demonstrate this framework in our development of the Reporting Assistant for Defense and Intelligence Tasks (READIT).
arXiv Detail & Related papers (2024-01-24T23:15:44Z) - Towards Reliable AI: Adequacy Metrics for Ensuring the Quality of
System-level Testing of Autonomous Vehicles [5.634825161148484]
We introduce a set of black-box test adequacy metrics called "Test suite Instance Space Adequacy" (TISA) metrics.
The TISA metrics offer a way to assess both the diversity and coverage of the test suite and the range of bugs detected during testing.
We evaluate the efficacy of the TISA metrics by examining their correlation with the number of bugs detected in system-level simulation testing of AVs.
arXiv Detail & Related papers (2023-11-14T10:16:05Z) - From Static Benchmarks to Adaptive Testing: Psychometrics in AI Evaluation [60.14902811624433]
We discuss a paradigm shift from static evaluation methods to adaptive testing.
This involves estimating the characteristics and value of each test item in the benchmark and dynamically adjusting items in real-time.
We analyze the current approaches, advantages, and underlying reasons for adopting psychometrics in AI evaluation.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - Benchmarking Quality-Dependent and Cost-Sensitive Score-Level Multimodal
Biometric Fusion Algorithms [58.156733807470395]
This paper reports a benchmarking study carried out within the framework of the BioSecure DS2 (Access Control) evaluation campaign.
The campaign targeted the application of physical access control in a medium-size establishment with some 500 persons.
To the best of our knowledge, this is the first attempt to benchmark quality-based multimodal fusion algorithms.
arXiv Detail & Related papers (2021-11-17T13:39:48Z) - Multisource AI Scorecard Table for System Evaluation [3.74397577716445]
The paper describes a Multisource AI Scorecard Table (MAST) that provides the developer and user of an artificial intelligence (AI)/machine learning (ML) system with a standard checklist.
The paper explores how the analytic tradecraft standards outlined in Intelligence Community Directive (ICD) 203 can provide a framework for assessing the performance of an AI system; a minimal checklist sketch appears after this list.
arXiv Detail & Related papers (2021-02-08T03:37:40Z) - SMT-based Safety Verification of Parameterised Multi-Agent Systems [78.04236259129524]
We study the verification of parameterised multi-agent systems (MASs).
In particular, we study whether unwanted states, characterised as a given state formula, are reachable in a given MAS.
arXiv Detail & Related papers (2020-08-11T15:24:05Z)
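To make the checklist framing in the MAST entry above concrete, here is a minimal sketch of a MAST-style scorecard. The criterion labels are paraphrased from the ICD 203 analytic tradecraft standards that MAST draws on, and the nine-criterion, 1-to-3 rating scale is an assumption for illustration; consult the MAST papers for the authoritative rubric.

```python
# Minimal sketch of a MAST-style scorecard, assuming nine criteria adapted from
# the ICD 203 analytic tradecraft standards and a 1-3 rating per criterion.
# Criterion labels and the scale are paraphrased assumptions, not quoted from MAST.
from dataclasses import dataclass, field
from typing import Dict

CRITERIA = (
    "sourcing",                   # credibility of underlying sources/data
    "uncertainty",                # expresses and explains uncertainty
    "distinguishing",             # separates underlying data from assumptions
    "analysis_of_alternatives",
    "customer_relevance",
    "logical_argumentation",
    "consistency",                # explains change to or consistency of judgments
    "accuracy",
    "visualization",
)

@dataclass
class MastScorecard:
    ratings: Dict[str, int] = field(default_factory=dict)

    def rate(self, criterion: str, score: int) -> None:
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        if score not in (1, 2, 3):  # assumed low / moderate / high scale
            raise ValueError("score must be 1, 2, or 3")
        self.ratings[criterion] = score

    def total(self) -> int:
        # Higher totals correspond to what the validation study calls
        # "high-MAST" system designs.
        return sum(self.ratings.values())


card = MastScorecard()
card.rate("sourcing", 3)
card.rate("uncertainty", 2)
print(card.total())  # -> 5
```

In the validation study summarized at the top of this page, an overall MAST rating of this kind is what was compared against participants' trust-perception measures.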