Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs
 - URL: http://arxiv.org/abs/2103.06076v1
 - Date: Wed, 10 Mar 2021 14:26:14 GMT
 - Title: Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs
 - Authors: Solon Barocas, Anhong Guo, Ece Kamar, Jacquelyn Krones, Meredith Ringel Morris, Jennifer Wortman Vaughan, Duncan Wadsworth, Hanna Wallach
 - Abstract summary: We argue that a deeper understanding of the choices, considerations, and tradeoffs involved in designing disaggregated evaluations will better enable researchers, practitioners, and the public to understand the ways in which AI systems may be underperforming for particular groups of people.
 - Score: 42.401239658653914
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract:   Several pieces of work have uncovered performance disparities by conducting
"disaggregated evaluations" of AI systems. We build on these efforts by
focusing on the choices that must be made when designing a disaggregated
evaluation, as well as some of the key considerations that underlie these
design choices and the tradeoffs between these considerations. We argue that a
deeper understanding of the choices, considerations, and tradeoffs involved in
designing disaggregated evaluations will better enable researchers,
practitioners, and the public to understand the ways in which AI systems may be
underperforming for particular groups of people.
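
To make the paper's core idea concrete, here is a minimal Python sketch of a disaggregated evaluation: it reports a performance metric separately for each group rather than a single aggregate score. The record fields ("group", "label", "prediction") and the use of accuracy as the metric are illustrative assumptions, not details taken from the paper.

# Minimal sketch of a disaggregated evaluation: report one score per group
# so that underperformance for any particular group becomes visible.
# The field names and the accuracy metric are illustrative assumptions.
from collections import defaultdict

def disaggregated_accuracy(records):
    """records: iterable of dicts with 'group', 'label', and 'prediction' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["group"]] += 1
        correct[r["group"]] += int(r["label"] == r["prediction"])
    # One score per group instead of a single aggregate number.
    return {group: correct[group] / total[group] for group in total}

if __name__ == "__main__":
    data = [
        {"group": "A", "label": 1, "prediction": 1},
        {"group": "A", "label": 0, "prediction": 1},
        {"group": "B", "label": 1, "prediction": 1},
        {"group": "B", "label": 1, "prediction": 1},
    ]
    print(disaggregated_accuracy(data))  # e.g. {'A': 0.5, 'B': 1.0}

Even this small sketch runs into several of the design choices the paper examines, such as which groups to disaggregate by, which metric to report for each group, and how to handle groups with very few evaluation examples.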
 
       
      
        Related papers
        - A Conceptual Framework for AI Capability Evaluations [0.0]
We propose a conceptual framework for analyzing AI capability evaluations.
It offers a structured, descriptive approach that systematizes the analysis of widely used methods and terminology.
It also enables researchers to identify methodological weaknesses, assists practitioners in designing evaluations, and provides policymakers with a tool to scrutinize, compare, and navigate complex evaluation landscapes.
arXiv  Detail & Related papers  (2025-06-23T00:19:27Z) - AI Automatons: AI Systems Intended to Imitate Humans [54.19152688545896]
There is a growing proliferation of AI systems designed to mimic people's behavior, work, abilities, likenesses, or humanness.
The research, design, deployment, and availability of such AI systems have prompted growing concerns about a wide range of possible legal, ethical, and other social impacts.
arXiv  Detail & Related papers  (2025-03-04T03:55:38Z) - On Benchmarking Human-Like Intelligence in Machines [77.55118048492021]
We argue that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities.
We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically-invalid tasks.
arXiv  Detail & Related papers  (2025-02-27T20:21:36Z) - Re-evaluating Open-ended Evaluation of Large Language Models [50.23008729038318]
We show that the current Elo-based rating systems can be susceptible to and even reinforce biases in data, intentional or accidental.
We propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy.
arXiv  Detail & Related papers  (2025-02-27T15:07:47Z) - The Value of AI Advice: Personalized and Value-Maximizing AI Advisors Are Necessary to Reliably Benefit Experts and Organizations [8.434663608756253]
Despite advances in AI's performance, AI advisors can undermine experts' decisions and increase the time and effort experts must invest to make decisions.
We stress the importance of assessing the value AI advice brings to real-world contexts when designing and evaluating AI advisors.
Our results highlight the need for system-level, value-driven development of AI advisors that advise selectively, are tailored to experts' unique behaviors, and are optimized for context-specific trade-offs between decision improvements and advising costs.
arXiv  Detail & Related papers  (2024-12-27T08:50:54Z) - Towards Objective and Unbiased Decision Assessments with LLM-Enhanced Hierarchical Attention Networks [6.520709313101523]
This work investigates cognitive bias identification in high-stakes decision-making processes carried out by human experts.
We propose a bias-aware, AI-augmented workflow that surpasses human judgment.
In our experiments, both the proposed model and the agentic workflow significantly improve on both human judgment and alternative models.
arXiv  Detail & Related papers  (2024-11-13T10:42:11Z) - Negotiating the Shared Agency between Humans & AI in the Recommender System [1.4249472316161877]
Concerns about user agency have arisen due to the inherent opacity of algorithms (information asymmetry) and the one-way nature of their output (power asymmetry).
We seek to understand how types of agency impact user perception and experience, and bring empirical evidence to refine the guidelines and designs for human-AI interactive systems.
arXiv  Detail & Related papers  (2024-03-23T19:23:08Z) - Beyond Recommender: An Exploratory Study of the Effects of Different AI Roles in AI-Assisted Decision Making [48.179458030691286]
We examine three AI roles: Recommender, Analyzer, and Devil's Advocate.
Our results show each role's distinct strengths and limitations in task performance, reliance appropriateness, and user experience.
These insights offer valuable implications for designing AI assistants with adaptive functional roles according to different situations.
arXiv  Detail & Related papers  (2024-03-04T07:32:28Z) - Evaluative Item-Contrastive Explanations in Rankings [47.24529321119513]
This paper advocates for the application of a specific form of Explainable AI -- namely, contrastive explanations -- as well-suited for addressing ranking problems.
The present work introduces Evaluative Item-Contrastive Explanations tailored for ranking systems and illustrates its application and characteristics through an experiment conducted on publicly available data.
arXiv  Detail & Related papers  (2023-12-14T15:40:51Z) - Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
It has been claimed that large language models (LLMs) can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv  Detail & Related papers  (2023-04-13T13:08:38Z) - Video Surveillance System Incorporating Expert Decision-making Process: A Case Study on Detecting Calving Signs in Cattle [5.80793470875286]
In this study, we examine the framework of a video surveillance AI system that presents the reasoning behind predictions by incorporating experts' decision-making processes with rich domain knowledge of the notification target.
In our case study, we designed a system for detecting signs of calving in cattle based on the proposed framework and evaluated the system through a user study with people involved in livestock farming.
arXiv  Detail & Related papers  (2023-01-10T12:06:49Z) - Doubting AI Predictions: Influence-Driven Second Opinion Recommendation [92.30805227803688]
We propose a way to augment human-AI collaboration by building on a common organizational practice: identifying experts who are likely to provide complementary opinions.
The proposed approach aims to leverage productive disagreement by identifying whether some experts are likely to disagree with an algorithmic assessment.
arXiv  Detail & Related papers  (2022-04-29T20:35:07Z) - AI for human assessment: What do professional assessors need? [33.88509725285237]
This case study aims to help professional assessors make decisions in human assessment, in which they conduct interviews with assessees and evaluate their suitability for certain job roles.
A computational system that can extract nonverbal cues of assessees would be beneficial to assessors by supporting their decision making.
We developed such a system based on an unsupervised anomaly detection algorithm using multimodal behavioral features such as facial keypoints, pose, head pose, and gaze.
arXiv  Detail & Related papers  (2022-04-18T03:35:37Z) - Inverse Online Learning: Understanding Non-Stationary and Reactionary Policies [79.60322329952453]
We show how to develop interpretable representations of how agents make decisions.
By understanding the decision-making processes underlying a set of observed trajectories, we cast the policy inference problem as the inverse to this online learning problem.
We introduce a practical algorithm for retrospectively estimating such perceived effects, alongside the process through which agents update them.
Through application to the analysis of UNOS organ donation acceptance decisions, we demonstrate that our approach can bring valuable insights into the factors that govern decision processes and how they change over time.
arXiv  Detail & Related papers  (2022-03-14T17:40:42Z) - Assessing the Fairness of AI Systems: AI Practitioners' Processes, Challenges, and Needs for Support [18.148737010217953]
We conduct interviews and workshops with AI practitioners to identify their processes, challenges, and needs for support.
We find that practitioners face challenges when choosing performance metrics and when identifying the most relevant direct stakeholders and demographic groups.
We identify impacts on fairness work stemming from a lack of engagement with direct stakeholders, business imperatives that prioritize customers over marginalized groups, and the drive to deploy AI systems at scale.
arXiv  Detail & Related papers  (2021-12-10T17:14:34Z) 
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     