Evaluating Trustworthiness of AI-Enabled Decision Support Systems:
Validation of the Multisource AI Scorecard Table (MAST)
- URL: http://arxiv.org/abs/2311.18040v1
- Date: Wed, 29 Nov 2023 19:34:15 GMT
- Authors: Pouria Salehi, Yang Ba, Nayoung Kim, Ahmadreza Mosallanezhad, Anna
Pan, Myke C. Cohen, Yixuan Wang, Jieqiong Zhao, Shawaiz Bhatti, James Sung,
Erik Blasch, Michelle V. Mancenido, Erin K. Chiou
- Abstract summary: The Multisource AI Scorecard Table (MAST) is a checklist tool to inform the design and evaluation of trustworthy AI systems.
We evaluate whether MAST is associated with people's trust perceptions in AI-enabled decision support systems.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Multisource AI Scorecard Table (MAST) is a checklist tool based on
analytic tradecraft standards to inform the design and evaluation of
trustworthy AI systems. In this study, we evaluate whether MAST is associated
with people's trust perceptions in AI-enabled decision support systems
(AI-DSSs). Evaluating trust in AI-DSSs poses challenges to researchers and
practitioners. These challenges include identifying the components,
capabilities, and potential of these systems, many of which are based on the
complex deep learning algorithms that drive DSS performance and preclude
complete manual inspection. We developed two interactive, AI-DSS test
environments using the MAST criteria. One emulated an identity verification
task in security screening, and another emulated a text summarization system to
aid in an investigative reporting task. Each test environment had one version
designed to match low-MAST ratings, and another designed to match high-MAST
ratings, with the hypothesis that MAST ratings would be positively related to
the trust ratings of these systems. A total of 177 subject matter experts were
recruited to interact with and evaluate these systems. Results generally show
higher MAST ratings for the high-MAST conditions than for the low-MAST
conditions, and that measures of trust perception are highly correlated with the
MAST ratings. We conclude that MAST can be a useful tool for designing and
evaluating systems that will engender high trust perceptions, including AI-DSS
that may be used to support visual screening and text summarization tasks.
However, higher MAST ratings may not translate to higher joint performance.
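The abstract's central finding is a strong association between MAST ratings and trust-perception measures. As a purely illustrative sketch (the paper's actual analysis and data are not reproduced here, and the scores below are invented), a rank correlation such as Spearman's rho is the kind of statistic such a claim typically rests on:

```python
# Hypothetical sketch only: illustrates a Spearman rank correlation between
# per-participant MAST ratings and trust-scale scores. All data are invented
# for demonstration; this is not the paper's dataset or analysis code.

def rank(values):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across any run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = rank(x), rank(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Invented example scores for six hypothetical participants.
mast_scores = [2.1, 3.4, 1.8, 4.0, 3.7, 2.9]
trust_scores = [3.0, 4.2, 2.5, 4.8, 4.5, 3.6]
print(round(spearman_rho(mast_scores, trust_scores), 3))  # → 1.0 (perfectly monotone toy data)
```

In practice a library routine such as `scipy.stats.spearmanr` would be used instead of a hand-rolled implementation; the point is only the shape of the analysis, not its results.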
Related papers
- MFE-ETP: A Comprehensive Evaluation Benchmark for Multi-modal Foundation Models on Embodied Task Planning [50.45558735526665]
We provide an in-depth and comprehensive evaluation of the performance of MFMs on embodied task planning.
We propose a new benchmark, named MFE-ETP, characterized by its complex and variable task scenarios.
Using the benchmark and evaluation platform, we evaluated several state-of-the-art MFMs and found that they significantly lag behind human-level performance.
arXiv Detail & Related papers (2024-07-06T11:07:18Z)
- Trustworthy Artificial Intelligence in the Context of Metrology [3.2873782624127834]
We review research at the National Physical Laboratory in the area of trustworthy artificial intelligence (TAI).
We describe three broad themes of TAI: technical, socio-technical and social, which play key roles in ensuring that the developed models are trustworthy and can be relied upon to make responsible decisions.
We discuss three research areas within TAI that we are working on at NPL, and examine the certification of AI systems in terms of adherence to the characteristics of TAI.
arXiv Detail & Related papers (2024-06-14T15:23:27Z)
- A-Bench: Are LMMs Masters at Evaluating AI-generated Images? [78.3699767628502]
A-Bench is a benchmark designed to diagnose whether large multi-modal models (LMMs) are masters at evaluating AI-generated images (AIGIs).
Ultimately, 2,864 AIGIs from 16 text-to-image models are sampled, each paired with question-answers annotated by human experts, and tested across 18 leading LMMs.
arXiv Detail & Related papers (2024-06-05T08:55:02Z)
- Testing autonomous vehicles and AI: perspectives and challenges from cybersecurity, transparency, robustness and fairness [53.91018508439669]
The study explores the complexities of integrating Artificial Intelligence into Autonomous Vehicles (AVs).
It examines the challenges introduced by AI components and the impact on testing procedures.
The paper identifies significant challenges and suggests future directions for research and development of AI in AV technology.
arXiv Detail & Related papers (2024-02-21T08:29:42Z)
- PADTHAI-MM: A Principled Approach for Designing Trustable, Human-centered AI systems using the MAST Methodology [5.38932801848643]
The Multisource AI Scorecard Table (MAST), a checklist rating system, addresses this gap in designing and evaluating AI-enabled decision support systems.
We propose the Principled Approach for Designing Trustable Human-centered AI systems using MAST methodology.
We show that MAST-guided design can improve trust perceptions, and that MAST criteria can be linked to performance, process, and purpose information.
arXiv Detail & Related papers (2024-01-24T23:15:44Z)
- Towards Reliable AI: Adequacy Metrics for Ensuring the Quality of System-level Testing of Autonomous Vehicles [5.634825161148484]
We introduce a set of black-box test adequacy metrics called "Test suite Instance Space Adequacy" (TISA) metrics.
The TISA metrics offer a way to assess both the diversity and coverage of the test suite and the range of bugs detected during testing.
We evaluate the efficacy of the TISA metrics by examining their correlation with the number of bugs detected in system-level simulation testing of AVs.
arXiv Detail & Related papers (2023-11-14T10:16:05Z)
- SQUARE: Automatic Question Answering Evaluation using Multiple Positive and Negative References [73.67707138779245]
We propose a new evaluation metric: SQuArE (Sentence-level QUestion AnsweRing Evaluation).
We evaluate SQuArE on both sentence-level extractive (Answer Selection) and generative (GenQA) QA systems.
arXiv Detail & Related papers (2023-09-21T16:51:30Z)
- Benchmarking Quality-Dependent and Cost-Sensitive Score-Level Multimodal Biometric Fusion Algorithms [58.156733807470395]
This paper reports a benchmarking study carried out within the framework of the BioSecure DS2 (Access Control) evaluation campaign.
The campaign targeted the application of physical access control in a medium-size establishment with some 500 persons.
To the best of our knowledge, this is the first attempt to benchmark quality-based multimodal fusion algorithms.
arXiv Detail & Related papers (2021-11-17T13:39:48Z)
- Statistical Perspectives on Reliability of Artificial Intelligence Systems [6.284088451820049]
We provide statistical perspectives on the reliability of AI systems.
We introduce a so-called SMART statistical framework for AI reliability research.
We discuss recent developments in modeling and analysis of AI reliability.
arXiv Detail & Related papers (2021-11-09T20:00:14Z)
- Multisource AI Scorecard Table for System Evaluation [3.74397577716445]
The paper describes a Multisource AI Scorecard Table (MAST) that provides the developer and user of an artificial intelligence (AI)/machine learning (ML) system with a standard checklist.
The paper explores how the analytic tradecraft standards outlined in Intelligence Community Directive (ICD) 203 can provide a framework for assessing the performance of an AI system.
arXiv Detail & Related papers (2021-02-08T03:37:40Z)
- SMT-based Safety Verification of Parameterised Multi-Agent Systems [78.04236259129524]
We study the verification of parameterised multi-agent systems (MASs).
In particular, we study whether unwanted states, characterised as a given state formula, are reachable in a given MAS.
arXiv Detail & Related papers (2020-08-11T15:24:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.