Modeling Human Beliefs about AI Behavior for Scalable Oversight
- URL: http://arxiv.org/abs/2502.21262v2
- Date: Mon, 20 Oct 2025 20:33:56 GMT
- Title: Modeling Human Beliefs about AI Behavior for Scalable Oversight
- Authors: Leon Lang, Patrick Forré,
- Abstract summary: We propose modeling evaluators' beliefs to interpret their feedback more reliably.<n>We formalize human belief models, analyze their theoretical role in value learning, and characterize when ambiguity remains.<n>These representations could be used to learn correct values from human feedback even when evaluators misunderstand the AI's behavior.
- Score: 16.068386375496086
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As AI systems advance beyond human capabilities, scalable oversight becomes critical: how can we supervise AI that exceeds our abilities? A key challenge is that human evaluators may form incorrect beliefs about AI behavior in complex tasks, leading to unreliable feedback and poor value inference. To address this, we propose modeling evaluators' beliefs to interpret their feedback more reliably. We formalize human belief models, analyze their theoretical role in value learning, and characterize when ambiguity remains. To reduce reliance on precise belief models, we introduce "belief model covering" as a relaxation. This motivates our preliminary proposal to use the internal representations of adapted foundation models to mimic human evaluators' beliefs. These representations could be used to learn correct values from human feedback even when evaluators misunderstand the AI's behavior. Our work suggests that modeling human beliefs can improve value learning and outlines practical research directions for implementing this approach to scalable oversight.
Related papers
- An Approach to Grounding AI Model Evaluations in Human-derived Criteria [0.0]
We propose a novel approach to augment existing benchmarks with human-derived evaluation criteria.<n>Grounding our study in the Perception Test and OpenEQA benchmarks, we conducted in-depth interviews and large-scale surveys.<n>Our findings reveal that participants perceive AI as lacking in interpretive and empathetic skills yet hold high expectations for AI performance.
arXiv Detail & Related papers (2025-09-04T21:40:32Z) - When Models Know More Than They Can Explain: Quantifying Knowledge Transfer in Human-AI Collaboration [79.69935257008467]
We introduce Knowledge Integration and Transfer Evaluation (KITE), a conceptual and experimental framework for Human-AI knowledge transfer capabilities.<n>We conduct the first large-scale human study (N=118) explicitly designed to measure it.<n>In our two-phase setup, humans first ideate with an AI on problem-solving strategies, then independently implement solutions, isolating model explanations' influence on human understanding.
arXiv Detail & Related papers (2025-06-05T20:48:16Z) - Measurement of LLM's Philosophies of Human Nature [113.47929131143766]
We design the standardized psychological scale specifically targeting large language models (LLM)
We show that current LLMs exhibit a systemic lack of trust in humans.
We propose a mental loop learning framework, which enables LLM to continuously optimize its value system.
arXiv Detail & Related papers (2025-04-03T06:22:19Z) - The Superalignment of Superhuman Intelligence with Large Language Models [63.96120398355404]
We discuss the concept of superalignment from the learning perspective to answer this question.<n>We highlight some key research problems in superalignment, namely, weak-to-strong generalization, scalable oversight, and evaluation.<n>We present a conceptual framework for superalignment, which consists of three modules: an attacker which generates adversary queries trying to expose the weaknesses of a learner model; a learner which will refine itself by learning from scalable feedbacks generated by a critic model along with minimal human experts; and a critic which generates critics or explanations for a given query-response pair, with a target of improving the learner by criticizing.
arXiv Detail & Related papers (2024-12-15T10:34:06Z) - Modelling Human Values for AI Reasoning [2.320648715016106]
We detail a formal model of human values for their explicit computational representation.
We show how this model can provide the foundational apparatus for AI-based reasoning over values.
We propose a roadmap for future integrated, and interdisciplinary, research into human values in AI.
arXiv Detail & Related papers (2024-02-09T12:08:49Z) - Evaluating the Utility of Model Explanations for Model Development [54.23538543168767]
We evaluate whether explanations can improve human decision-making in practical scenarios of machine learning model development.
To our surprise, we did not find evidence of significant improvement on tasks when users were provided with any of the saliency maps.
These findings suggest caution regarding the usefulness and potential for misunderstanding in saliency-based explanations.
arXiv Detail & Related papers (2023-12-10T23:13:23Z) - Exploration with Principles for Diverse AI Supervision [88.61687950039662]
Training large transformers using next-token prediction has given rise to groundbreaking advancements in AI.
While this generative AI approach has produced impressive results, it heavily leans on human supervision.
This strong reliance on human oversight poses a significant hurdle to the advancement of AI innovation.
We propose a novel paradigm termed Exploratory AI (EAI) aimed at autonomously generating high-quality training data.
arXiv Detail & Related papers (2023-10-13T07:03:39Z) - Intent-aligned AI systems deplete human agency: the need for agency
foundations research in AI safety [2.3572498744567127]
We argue that alignment to human intent is insufficient for safe AI systems.
We argue that preservation of long-term agency of humans may be a more robust standard.
arXiv Detail & Related papers (2023-05-30T17:14:01Z) - Fairness in AI and Its Long-Term Implications on Society [68.8204255655161]
We take a closer look at AI fairness and analyze how lack of AI fairness can lead to deepening of biases over time.
We discuss how biased models can lead to more negative real-world outcomes for certain groups.
If the issues persist, they could be reinforced by interactions with other risks and have severe implications on society in the form of social unrest.
arXiv Detail & Related papers (2023-04-16T11:22:59Z) - A Review of the Role of Causality in Developing Trustworthy AI Systems [16.267806768096026]
State-of-the-art AI models largely lack an understanding of the cause-effect relationship that governs human understanding of the real world.
Recently, causal modeling and inference methods have emerged as powerful tools to improve the trustworthiness aspects of AI models.
arXiv Detail & Related papers (2023-02-14T11:08:26Z) - When to Make Exceptions: Exploring Language Models as Accounts of Human
Moral Judgment [96.77970239683475]
AI systems need to be able to understand, interpret and predict human moral judgments and decisions.
A central challenge for AI safety is capturing the flexibility of the human moral mind.
We present a novel challenge set consisting of rule-breaking question answering.
arXiv Detail & Related papers (2022-10-04T09:04:27Z) - Modeling Human Behavior Part I -- Learning and Belief Approaches [0.0]
We focus on techniques which learn a model or policy of behavior through exploration and feedback.
Next generation autonomous and adaptive systems will largely include AI agents and humans working together as teams.
arXiv Detail & Related papers (2022-05-13T07:33:49Z) - Best-Response Bayesian Reinforcement Learning with Bayes-adaptive POMDPs
for Centaurs [22.52332536886295]
We present a novel formulation of the interaction between the human and the AI as a sequential game.
We show that in this case the AI's problem of helping bounded-rational humans make better decisions reduces to a Bayes-adaptive POMDP.
We discuss ways in which the machine can learn to improve upon its own limitations as well with the help of the human.
arXiv Detail & Related papers (2022-04-03T21:00:51Z) - The Response Shift Paradigm to Quantify Human Trust in AI
Recommendations [6.652641137999891]
Explainability, interpretability and how much they affect human trust in AI systems are ultimately problems of human cognition as much as machine learning.
We developed and validated a general purpose Human-AI interaction paradigm which quantifies the impact of AI recommendations on human decisions.
Our proof-of-principle paradigm allows one to quantitatively compare the rapidly growing set of XAI/IAI approaches in terms of their effect on the end-user.
arXiv Detail & Related papers (2022-02-16T22:02:09Z) - Uncalibrated Models Can Improve Human-AI Collaboration [10.106324182884068]
We show that presenting AI models as more confident than they actually are can improve human-AI performance.
We first learn a model for how humans incorporate AI advice using data from thousands of human interactions.
arXiv Detail & Related papers (2022-02-12T04:51:00Z) - Cybertrust: From Explainable to Actionable and Interpretable AI (AI2) [58.981120701284816]
Actionable and Interpretable AI (AI2) will incorporate explicit quantifications and visualizations of user confidence in AI recommendations.
It will allow examining and testing of AI system predictions to establish a basis for trust in the systems' decision making.
arXiv Detail & Related papers (2022-01-26T18:53:09Z) - Trustworthy AI: A Computational Perspective [54.80482955088197]
We focus on six of the most crucial dimensions in achieving trustworthy AI: (i) Safety & Robustness, (ii) Non-discrimination & Fairness, (iii) Explainability, (iv) Privacy, (v) Accountability & Auditability, and (vi) Environmental Well-Being.
For each dimension, we review the recent related technologies according to a taxonomy and summarize their applications in real-world systems.
arXiv Detail & Related papers (2021-07-12T14:21:46Z) - Effect of Confidence and Explanation on Accuracy and Trust Calibration
in AI-Assisted Decision Making [53.62514158534574]
We study whether features that reveal case-specific model information can calibrate trust and improve the joint performance of the human and AI.
We show that confidence score can help calibrate people's trust in an AI model, but trust calibration alone is not sufficient to improve AI-assisted decision making.
arXiv Detail & Related papers (2020-01-07T15:33:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.