Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs
 - URL: http://arxiv.org/abs/2103.06076v1
 - Date: Wed, 10 Mar 2021 14:26:14 GMT
 - Title: Designing Disaggregated Evaluations of AI Systems: Choices, Considerations, and Tradeoffs
 - Authors: Solon Barocas, Anhong Guo, Ece Kamar, Jacquelyn Krones, Meredith Ringel Morris, Jennifer Wortman Vaughan, Duncan Wadsworth, Hanna Wallach
 - Abstract summary: We argue that a deeper understanding of the choices, considerations, and tradeoffs involved in designing disaggregated evaluations will better enable researchers, practitioners, and the public to understand the ways in which AI systems may be underperforming for particular groups of people.
 - Score: 42.401239658653914
 - License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
 - Abstract:   Several pieces of work have uncovered performance disparities by conducting
"disaggregated evaluations" of AI systems. We build on these efforts by
focusing on the choices that must be made when designing a disaggregated
evaluation, as well as some of the key considerations that underlie these
design choices and the tradeoffs between these considerations. We argue that a
deeper understanding of the choices, considerations, and tradeoffs involved in
designing disaggregated evaluations will better enable researchers,
practitioners, and the public to understand the ways in which AI systems may be
underperforming for particular groups of people.
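
To make the paper's core idea concrete, here is a minimal Python sketch of a disaggregated evaluation: it reports a performance metric separately for each group rather than a single aggregate score. The record fields ("group", "label", "prediction") and the use of accuracy as the metric are illustrative assumptions, not details taken from the paper.

# Minimal sketch of a disaggregated evaluation: report one score per group
# so that underperformance for any particular group becomes visible.
# The field names and the accuracy metric are illustrative assumptions.
from collections import defaultdict

def disaggregated_accuracy(records):
    """records: iterable of dicts with 'group', 'label', and 'prediction' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["group"]] += 1
        correct[r["group"]] += int(r["label"] == r["prediction"])
    # One score per group instead of a single aggregate number.
    return {group: correct[group] / total[group] for group in total}

if __name__ == "__main__":
    data = [
        {"group": "A", "label": 1, "prediction": 1},
        {"group": "A", "label": 0, "prediction": 1},
        {"group": "B", "label": 1, "prediction": 1},
        {"group": "B", "label": 1, "prediction": 1},
    ]
    print(disaggregated_accuracy(data))  # e.g. {'A': 0.5, 'B': 1.0}

Even this small sketch runs into several of the design choices the paper examines, such as which groups to disaggregate by, which metric to report for each group, and how to handle groups with very few evaluation examples.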
 
       
      
        Related papers
        - A Conceptual Framework for AI Capability Evaluations [0.0]
We propose a conceptual framework for analyzing AI capability evaluations.
It offers a structured, descriptive approach that systematizes the analysis of widely used methods and terminology.
It also enables researchers to identify methodological weaknesses, assists practitioners in designing evaluations, and provides policymakers with a tool to scrutinize, compare, and navigate complex evaluation landscapes.
arXiv  Detail & Related papers  (2025-06-23T00:19:27Z) - AI Automatons: AI Systems Intended to Imitate Humans [54.19152688545896]
There is a growing proliferation of AI systems designed to mimic people's behavior, work, abilities, likenesses, or humanness.
The research, design, deployment, and availability of such AI systems have prompted growing concerns about a wide range of possible legal, ethical, and other social impacts.
arXiv  Detail & Related papers  (2025-03-04T03:55:38Z) - On Benchmarking Human-Like Intelligence in Machines [77.55118048492021]
We argue that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities.
We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically-invalid tasks.
arXiv  Detail & Related papers  (2025-02-27T20:21:36Z) - Re-evaluating Open-ended Evaluation of Large Language Models [50.23008729038318]
We show that the current Elo-based rating systems can be susceptible to and even reinforce biases in data, intentional or accidental.
We propose evaluation as a 3-player game, and introduce novel game-theoretic solution concepts to ensure robustness to redundancy.
arXiv  Detail & Related papers  (2025-02-27T15:07:47Z) - The Value of AI Advice: Personalized and Value-Maximizing AI Advisors Are Necessary to Reliably Benefit Experts and Organizations [8.434663608756253]
Despite advances in AI's performance, AI advisors can undermine experts' decisions and increase the time and effort experts must invest to make decisions.
We stress the importance of assessing the value AI advice brings to real-world contexts when designing and evaluating AI advisors.
Our results highlight the need for system-level, value-driven development of AI advisors that advise selectively, are tailored to experts' unique behaviors, and are optimized for context-specific trade-offs between decision improvements and advising costs.
arXiv  Detail & Related papers  (2024-12-27T08:50:54Z) - Towards Objective and Unbiased Decision Assessments with LLM-Enhanced Hierarchical Attention Networks [6.520709313101523]
This work investigates cognitive bias identification in high-stakes decision-making processes carried out by human experts.
We propose a bias-aware, AI-augmented workflow that surpasses human judgment.
In our experiments, both the proposed model and the agentic workflow significantly improve on both human judgment and alternative models.
arXiv  Detail & Related papers  (2024-11-13T10:42:11Z) - Negotiating the Shared Agency between Humans & AI in the Recommender System [1.4249472316161877]
Concerns about user agency have arisen due to the inherent opacity of algorithms (information asymmetry) and the one-way nature of their output (power asymmetry).
We seek to understand how types of agency impact user perception and experience, and bring empirical evidence to refine the guidelines and designs for human-AI interactive systems.
arXiv  Detail & Related papers  (2024-03-23T19:23:08Z) - Beyond Recommender: An Exploratory Study of the Effects of Different AI Roles in AI-Assisted Decision Making [48.179458030691286]
We examine three AI roles: Recommender, Analyzer, and Devil's Advocate.
Our results show each role's distinct strengths and limitations in task performance, reliance appropriateness, and user experience.
These insights offer valuable implications for designing AI assistants with adaptive functional roles according to different situations.
arXiv  Detail & Related papers  (2024-03-04T07:32:28Z) - Evaluative Item-Contrastive Explanations in Rankings [47.24529321119513]
This paper advocates for the application of a specific form of Explainable AI -- namely, contrastive explanations -- as well-suited for addressing ranking problems.
The present work introduces Evaluative Item-Contrastive Explanations tailored for ranking systems and illustrates its application and characteristics through an experiment conducted on publicly available data.
arXiv  Detail & Related papers  (2023-12-14T15:40:51Z) - Perspectives on Large Language Models for Relevance Judgment [56.935731584323996]
It has been claimed that large language models (LLMs) can assist with relevance judgments.
It is not clear whether automated judgments can reliably be used in evaluations of retrieval systems.
arXiv  Detail & Related papers  (2023-04-13T13:08:38Z) - Video Surveillance System Incorporating Expert Decision-making Process: A Case Study on Detecting Calving Signs in Cattle [5.80793470875286]
In this study, we examine the framework of a video surveillance AI system that presents the reasoning behind predictions by incorporating experts' decision-making processes with rich domain knowledge of the notification target.
In our case study, we designed a system for detecting signs of calving in cattle based on the proposed framework and evaluated the system through a user study with people involved in livestock farming.
arXiv  Detail & Related papers  (2023-01-10T12:06:49Z) - Doubting AI Predictions: Influence-Driven Second Opinion Recommendation [92.30805227803688]
We propose a way to augment human-AI collaboration by building on a common organizational practice: identifying experts who are likely to provide complementary opinions.
The proposed approach aims to leverage productive disagreement by identifying whether some experts are likely to disagree with an algorithmic assessment.
arXiv  Detail & Related papers  (2022-04-29T20:35:07Z) - AI for human assessment: What do professional assessors need? [33.88509725285237]
This case study aims to help professional assessors make decisions in human assessment, in which they conduct interviews with assessees and evaluate their suitability for certain job roles.
A computational system that can extract nonverbal cues of assessees would be beneficial to assessors by supporting their decision making.
We developed such a system based on an unsupervised anomaly detection algorithm using multimodal behavioral features such as facial keypoints, pose, head pose, and gaze.
arXiv  Detail & Related papers  (2022-04-18T03:35:37Z) - Inverse Online Learning: Understanding Non-Stationary and Reactionary Policies [79.60322329952453]
We show how to develop interpretable representations of how agents make decisions.
By understanding the decision-making processes underlying a set of observed trajectories, we cast the policy inference problem as the inverse to this online learning problem.
We introduce a practical algorithm for retrospectively estimating such perceived effects, alongside the process through which agents update them.
Through application to the analysis of UNOS organ donation acceptance decisions, we demonstrate that our approach can bring valuable insights into the factors that govern decision processes and how they change over time.
arXiv  Detail & Related papers  (2022-03-14T17:40:42Z) - Assessing the Fairness of AI Systems: AI Practitioners' Processes, Challenges, and Needs for Support [18.148737010217953]
We conduct interviews and workshops with AI practitioners to identify their processes, challenges, and needs for support.
We find that practitioners face challenges when choosing performance metrics and when identifying the most relevant direct stakeholders and demographic groups.
We identify impacts on fairness work stemming from a lack of engagement with direct stakeholders, business imperatives that prioritize customers over marginalized groups, and the drive to deploy AI systems at scale.
arXiv  Detail & Related papers  (2021-12-10T17:14:34Z) 
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     