Bias in the Loop: How Humans Evaluate AI-Generated Suggestions
- URL: http://arxiv.org/abs/2509.08514v1
- Date: Wed, 10 Sep 2025 11:43:29 GMT
- Title: Bias in the Loop: How Humans Evaluate AI-Generated Suggestions
- Authors: Jacob Beck, Stephanie Eckman, Christoph Kern, Frauke Kreuter,
- Abstract summary: Human-AI collaboration increasingly drives decision-making across industries, from medical diagnosis to content moderation.<n>We know little about the psychological factors that determine when these collaborations succeed or fail.<n>We conducted a randomized experiment with 2,784 participants to examine how task design and individual characteristics shape human responses to AI-generated suggestions.
- Score: 9.578382668831988
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human-AI collaboration increasingly drives decision-making across industries, from medical diagnosis to content moderation. While AI systems promise efficiency gains by providing automated suggestions for human review, these workflows can trigger cognitive biases that degrade performance. We know little about the psychological factors that determine when these collaborations succeed or fail. We conducted a randomized experiment with 2,784 participants to examine how task design and individual characteristics shape human responses to AI-generated suggestions. Using a controlled annotation task, we manipulated three factors: AI suggestion quality in the first three instances, task burden through required corrections, and performance-based financial incentives. We collected demographics, attitudes toward AI, and behavioral data to assess four performance metrics: accuracy, correction activity, overcorrection, and undercorrection. Two patterns emerged that challenge conventional assumptions about human-AI collaboration. First, requiring corrections for flagged AI errors reduced engagement and increased the tendency to accept incorrect suggestions, demonstrating how cognitive shortcuts influence collaborative outcomes. Second, individual attitudes toward AI emerged as the strongest predictor of performance, surpassing demographic factors. Participants skeptical of AI detected errors more reliably and achieved higher accuracy, while those favorable toward automation exhibited dangerous overreliance on algorithmic suggestions. The findings reveal that successful human-AI collaboration depends not only on algorithmic performance but also on who reviews AI outputs and how review processes are structured. Effective human-AI collaborations require consideration of human psychology: selecting diverse evaluator samples, measuring attitudes, and designing workflows that counteract cognitive biases.
Related papers
- When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration [79.69935257008467]
We introduce Knowledge Integration and Transfer Evaluation (KITE), a conceptual and experimental framework for Human-AI knowledge transfer capabilities.<n>We conduct the first large-scale human study (N=118) explicitly designed to measure it.<n>In our two-phase setup, humans first ideate with an AI on problem-solving strategies, then independently implement solutions, isolating model explanations' influence on human understanding.
arXiv Detail & Related papers (2025-06-05T20:48:16Z) - On Benchmarking Human-Like Intelligence in Machines [77.55118048492021]
We argue that current AI evaluation paradigms are insufficient for assessing human-like cognitive capabilities.<n>We identify a set of key shortcomings: a lack of human-validated labels, inadequate representation of human response variability and uncertainty, and reliance on simplified and ecologically-invalid tasks.
arXiv Detail & Related papers (2025-02-27T20:21:36Z) - Engaging with AI: How Interface Design Shapes Human-AI Collaboration in High-Stakes Decision-Making [8.948482790298645]
We examine how various decision-support mechanisms impact user engagement, trust, and human-AI collaborative task performance.<n>Our findings reveal that mechanisms like AI confidence levels, text explanations, and performance visualizations enhanced human-AI collaborative task performance.
arXiv Detail & Related papers (2025-01-28T02:03:00Z) - How Performance Pressure Influences AI-Assisted Decision Making [52.997197698288936]
We show how pressure and explainable AI (XAI) techniques interact with AI advice-taking behavior.<n>Our results show complex interaction effects, with different combinations of pressure and XAI techniques either improving or worsening AI advice taking behavior.
arXiv Detail & Related papers (2024-10-21T22:39:52Z) - To Err Is AI! Debugging as an Intervention to Facilitate Appropriate Reliance on AI Systems [11.690126756498223]
Vision for optimal human-AI collaboration requires 'appropriate reliance' of humans on AI systems.
In practice, the performance disparity of machine learning models on out-of-distribution data makes dataset-specific performance feedback unreliable.
arXiv Detail & Related papers (2024-09-22T09:43:27Z) - Beyond Recommender: An Exploratory Study of the Effects of Different AI
Roles in AI-Assisted Decision Making [48.179458030691286]
We examine three AI roles: Recommender, Analyzer, and Devil's Advocate.
Our results show each role's distinct strengths and limitations in task performance, reliance appropriateness, and user experience.
These insights offer valuable implications for designing AI assistants with adaptive functional roles according to different situations.
arXiv Detail & Related papers (2024-03-04T07:32:28Z) - Improving Human-AI Collaboration With Descriptions of AI Behavior [14.904401331154062]
People work with AI systems to improve their decision making, but often under- or over-rely on AI predictions and perform worse than they would have unassisted.
To help people appropriately rely on AI aids, we propose showing them behavior descriptions.
arXiv Detail & Related papers (2023-01-06T00:33:08Z) - Advancing Human-AI Complementarity: The Impact of User Expertise and
Algorithmic Tuning on Joint Decision Making [10.890854857970488]
Many factors can impact success of Human-AI teams, including a user's domain expertise, mental models of an AI system, trust in recommendations, and more.
Our study examined user performance in a non-trivial blood vessel labeling task where participants indicated whether a given blood vessel was flowing or stalled.
Our results show that while recommendations from an AI-Assistant can aid user decision making, factors such as users' baseline performance relative to the AI and complementary tuning of AI error types significantly impact overall team performance.
arXiv Detail & Related papers (2022-08-16T21:39:58Z) - Deciding Fast and Slow: The Role of Cognitive Biases in AI-assisted
Decision-making [46.625616262738404]
We use knowledge from the field of cognitive science to account for cognitive biases in the human-AI collaborative decision-making setting.
We focus specifically on anchoring bias, a bias commonly encountered in human-AI collaboration.
arXiv Detail & Related papers (2020-10-15T22:25:41Z) - Proxy Tasks and Subjective Measures Can Be Misleading in Evaluating
Explainable AI Systems [14.940404609343432]
We evaluate two currently common techniques for evaluating XAI systems.
We show that evaluations with proxy tasks did not predict the results of the evaluations with the actual decision-making tasks.
Our results suggest that by employing misleading evaluation methods, our field may be inadvertently slowing its progress toward developing human+AI teams that can reliably perform better than humans or AIs alone.
arXiv Detail & Related papers (2020-01-22T22:14:28Z) - Effect of Confidence and Explanation on Accuracy and Trust Calibration
in AI-Assisted Decision Making [53.62514158534574]
We study whether features that reveal case-specific model information can calibrate trust and improve the joint performance of the human and AI.
We show that confidence score can help calibrate people's trust in an AI model, but trust calibration alone is not sufficient to improve AI-assisted decision making.
arXiv Detail & Related papers (2020-01-07T15:33:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.