The Limits of Predicting Agents from Behaviour
- URL: http://arxiv.org/abs/2506.02923v1
- Date: Tue, 03 Jun 2025 14:24:58 GMT
- Title: The Limits of Predicting Agents from Behaviour
- Authors: Alexis Bellot, Jonathan Richens, Tom Everitt
- Abstract summary: We provide a precise answer under the assumption that the agent's behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments. We discuss the implications of these results for several research areas including fairness and safety.
- Score: 16.80911584745046
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As the complexity of AI systems and their interactions with the world increases, generating explanations for their behaviour is important for safely deploying AI. For agents, the most natural abstractions for predicting behaviour attribute beliefs, intentions and goals to the system. If an agent behaves as if it has a certain goal or belief, then we can make reasonable predictions about how it will behave in novel situations, including those where comprehensive safety evaluations are untenable. How well can we infer an agent's beliefs from their behaviour, and how reliably can these inferred beliefs predict the agent's behaviour in novel situations? We provide a precise answer to this question under the assumption that the agent's behaviour is guided by a world model. Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments, which represent a theoretical limit for predicting intentional agents from behavioural data alone. We discuss the implications of these results for several research areas including fairness and safety.
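The flavour of these bounds can be illustrated with a toy identification exercise. The sketch below is not the paper's construction; the goals, states, and utilities are all invented. It shows why behavioural data alone can leave several goals indistinguishable, so that predictions in an unseen state are at best set-valued:

```python
# Minimal sketch (not the paper's construction): infer which candidate goals
# are consistent with observed behaviour, then bound behaviour in a new state.

# Candidate goals, each mapping (state, action) -> utility. Hypothetical numbers.
GOALS = {
    "reach_exit":  {("s0", "left"): 1.0, ("s0", "right"): 0.0,
                    ("s1", "left"): 0.2, ("s1", "right"): 0.8},
    "collect_key": {("s0", "left"): 1.0, ("s0", "right"): 0.3,
                    ("s1", "left"): 0.9, ("s1", "right"): 0.1},
}

# Behavioural data: states where we observed the agent act.
observations = [("s0", "left")]

def optimal_action(goal, state):
    actions = [a for (s, a) in goal if s == state]
    return max(actions, key=lambda a: goal[(state, a)])

# Keep every goal under which the observed actions were optimal.
consistent = {
    name: goal for name, goal in GOALS.items()
    if all(optimal_action(goal, s) == a for s, a in observations)
}

# In an unseen state, behaviour is pinned down only if all consistent goals
# agree; otherwise the data alone merely bounds the set of possible actions.
predicted = {optimal_action(g, "s1") for g in consistent.values()}
print(predicted)  # {'left', 'right'} -> a set-valued prediction, not a point
```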
Related papers
- Model Editing as a Double-Edged Sword: Steering Agent Ethical Behavior Toward Beneficence or Harm [57.00627691433355]
We frame agent behavior steering as a model editing task, which we term Behavior Editing. We introduce BehaviorBench, a benchmark grounded in psychological moral theories. We demonstrate that Behavior Editing can be used to promote ethical and benevolent behavior or, conversely, to induce harmful or malicious behavior.
arXiv Detail & Related papers (2025-06-25T16:51:51Z) - Safe Explicable Policy Search [3.3869539907606603]
We present Safe Explicable Policy Search (SEPS), which aims to provide a learning approach to explicable behavior generation while minimizing the safety risk. We formulate SEPS as a constrained optimization problem in which the agent aims to maximize an explicability score subject to constraints on safety. We evaluate SEPS in Safety Gym environments and in a physical robot experiment to show that it can learn explicable behaviors that adhere to the agent's safety requirements and are efficient.
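As a rough illustration of that constrained-optimization framing (not SEPS itself), the sketch below maximizes a made-up explicability score subject to a safety budget via Lagrangian dual ascent; the score, cost, and budget are all hypothetical placeholders:

```python
import numpy as np

# Illustrative sketch only: a Lagrangian relaxation of the kind of constrained
# objective SEPS describes (maximize explicability subject to a safety budget).

def explicability_score(theta):  # hypothetical stand-in
    return -np.sum((theta - 1.0) ** 2)

def safety_cost(theta):          # hypothetical stand-in
    return np.sum(theta ** 2)

SAFETY_BUDGET = 2.0
theta, lam = np.zeros(3), 0.0

for _ in range(500):
    # Numerical gradients keep the sketch self-contained.
    eps = 1e-4
    grad = np.array([
        (explicability_score(theta + eps * e) - lam * safety_cost(theta + eps * e)
         - explicability_score(theta - eps * e) + lam * safety_cost(theta - eps * e)) / (2 * eps)
        for e in np.eye(3)
    ])
    theta += 0.05 * grad  # ascend the Lagrangian in theta
    lam = max(0.0, lam + 0.05 * (safety_cost(theta) - SAFETY_BUDGET))  # dual ascent

print(theta, safety_cost(theta))  # the cost should settle near the budget
```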
arXiv Detail & Related papers (2025-03-10T20:52:41Z) - Intention-aware policy graphs: answering what, how, and why in opaque agents [0.1398098625978622]
Agents are a special kind of AI-based software in that they interact in complex environments and have an increased potential for emergent behaviour.
We propose a Probabilistic Graphical Model, along with a pipeline for designing such a model.
We contribute measurements that evaluate the interpretability and reliability of explanations provided.
This model can be constructed by taking partial observations of the agent's actions and world states.
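A minimal sketch of that construction step, assuming the simplest possible case: the conditional action distributions of a policy-graph-style model estimated by counting over partially observed (state, action) pairs. The states and actions below are invented:

```python
from collections import Counter, defaultdict

# Illustrative sketch: estimate P(action | world state) from partial
# observations of (state, action) pairs. All data here is hypothetical.
observations = [("low_fuel", "refuel"), ("low_fuel", "refuel"),
                ("low_fuel", "continue"), ("clear_road", "continue")]

counts = defaultdict(Counter)
for state, action in observations:
    counts[state][action] += 1

policy_graph = {
    state: {a: n / sum(c.values()) for a, n in c.items()}
    for state, c in counts.items()
}
print(policy_graph["low_fuel"])  # {'refuel': 0.666..., 'continue': 0.333...}
```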
arXiv Detail & Related papers (2024-09-27T09:31:45Z) - Performative Prediction on Games and Mechanism Design [69.7933059664256]
We study a collective risk dilemma where agents decide whether to trust predictions based on past accuracy. As predictions shape collective outcomes, social welfare arises naturally as a metric of concern. We show how to achieve better trade-offs and use them for mechanism design.
arXiv Detail & Related papers (2024-08-09T16:03:44Z) - Select to Perfect: Imitating desired behavior from large multi-agent data [28.145889065013687]
Desired characteristics for AI agents can be expressed by assigning desirability scores.
We first assess the effect of each individual agent's behavior on the collective desirability score.
We propose the concept of an agent's Exchange Value, which quantifies an individual agent's contribution to the collective desirability score.
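The Exchange Value can be pictured as a marginal-contribution quantity. The sketch below estimates a Shapley-style contribution by Monte Carlo over random sub-teams; the desirability function and per-agent scores are made up, and this is not the paper's exact definition:

```python
import random

# Hypothetical sketch of a Shapley-style contribution estimate in the spirit
# of an Exchange Value: how much does adding one agent change the team's
# collective desirability score?

def desirability(team):  # made-up collective score
    scores = {"a": 0.9, "b": 0.4, "c": -0.5}
    return sum(scores[m] for m in team)

def exchange_value(agent, others, samples=2000, rng=random.Random(0)):
    total = 0.0
    for _ in range(samples):
        coalition = [m for m in others if rng.random() < 0.5]  # random sub-team
        total += desirability(coalition + [agent]) - desirability(coalition)
    return total / samples

print(exchange_value("c", ["a", "b"]))  # negative: 'c' drags the score down
```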
arXiv Detail & Related papers (2024-05-06T15:48:24Z) - PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety [70.84902425123406]
Multi-agent systems, when enhanced with Large Language Models (LLMs), exhibit profound capabilities in collective intelligence.
However, the potential misuse of this intelligence for malicious purposes presents significant risks.
We propose a framework (PsySafe) grounded in agent psychology, focusing on identifying how dark personality traits in agents can lead to risky behaviors.
Our experiments reveal several intriguing phenomena, such as the collective dangerous behaviors among agents, agents' self-reflection when engaging in dangerous behavior, and the correlation between agents' psychological assessments and dangerous behaviors.
arXiv Detail & Related papers (2024-01-22T12:11:55Z) - Analyzing Intentional Behavior in Autonomous Agents under Uncertainty [3.0099979365586265]
Principled accountability for autonomous decision-making in uncertain environments requires distinguishing intentional outcomes from negligent designs and from actual accidents.
We propose analyzing the behavior of autonomous agents through a quantitative measure of the evidence of intentional behavior.
In a case study, we show how our method can distinguish between 'intentional' and 'accidental' traffic collisions.
arXiv Detail & Related papers (2023-07-04T07:36:11Z) - CAMMARL: Conformal Action Modeling in Multi Agent Reinforcement Learning [5.865719902445064]
We propose CAMMARL, a novel multi-agent reinforcement learning algorithm.
It involves modeling the actions of other agents in different situations in the form of confidence sets.
We show that CAMMARL elevates the capabilities of an autonomous agent in MARL by modeling conformal prediction sets.
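A standard split-conformal recipe conveys the idea behind such action sets, though CAMMARL's own procedure may differ. In the sketch below, the predicted action probabilities stand in for a hypothetical model's outputs and the calibration data is synthetic:

```python
import numpy as np

# Illustrative split-conformal sketch: build a set of candidate actions that
# contains the other agent's true action with probability ~1 - alpha.

rng = np.random.default_rng(0)
n_cal, n_actions, alpha = 200, 4, 0.1

probs = rng.dirichlet(np.ones(n_actions), size=n_cal)  # model's predicted probs
labels = np.array([rng.choice(n_actions, p=p) for p in probs])  # true actions

# Nonconformity score: 1 - probability assigned to the true action.
scores = 1.0 - probs[np.arange(n_cal), labels]
q = np.quantile(scores, np.ceil((n_cal + 1) * (1 - alpha)) / n_cal)

def conformal_action_set(p_new):
    return [a for a in range(n_actions) if 1.0 - p_new[a] <= q]

print(conformal_action_set(np.array([0.70, 0.20, 0.06, 0.04])))
```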
arXiv Detail & Related papers (2023-06-19T19:03:53Z) - Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark [61.43264961005614]
We develop a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios.
We evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations.
Our results show that agents can act both competently and morally, so concrete progress can be made in machine ethics.
arXiv Detail & Related papers (2023-04-06T17:59:03Z) - What Should I Know? Using Meta-gradient Descent for Predictive Feature Discovery in a Single Stream of Experience [63.75363908696257]
Computational reinforcement learning seeks to construct an agent's perception of the world through predictions of future sensations.
An open challenge in this line of work is determining, from the infinitely many predictions the agent could possibly make, which predictions might best support decision-making.
We introduce a meta-gradient descent process by which an agent learns 1) what predictions to make, 2) the estimates for its chosen predictions, and 3) how to use those estimates to generate policies that maximize future reward.
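A toy version of this two-level structure (not the paper's algorithm) appears below: an outer loop adapts a mixing weight that selects which prediction to make, an inner loop fits that prediction, and the meta-loss scores how useful the fitted prediction is for a synthetic downstream signal:

```python
import numpy as np

# Toy meta-gradient sketch with synthetic data: the outer loop learns WHICH
# prediction to make; the inner loop learns the prediction itself.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                # agent's sensations
cumulants = np.stack([X[:, 0], X[:, 1]], 1)  # two candidate prediction targets
downstream = X[:, 0] * 2.0                   # reward-relevant signal (feature 0)

def inner_fit(eta):
    """Fit a predictor of the eta-mixed cumulant; return its usefulness loss."""
    target = cumulants @ np.array([eta, 1 - eta])
    w = np.linalg.lstsq(X, target, rcond=None)[0]  # inner learning step
    pred = X @ w
    a = np.linalg.lstsq(pred[:, None], downstream, rcond=None)[0]
    return np.mean((pred * a - downstream) ** 2)   # outer (meta) loss

eta, lr, eps = 0.5, 0.5, 1e-3
for _ in range(100):  # meta-gradient descent via finite differences
    g = (inner_fit(eta + eps) - inner_fit(eta - eps)) / (2 * eps)
    eta = float(np.clip(eta - lr * g, 0.0, 1.0))

print(eta)  # drifts toward 1.0: predicting feature 0 best supports the task
```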
arXiv Detail & Related papers (2022-06-13T21:31:06Z) - Heterogeneous-Agent Trajectory Forecasting Incorporating Class Uncertainty [54.88405167739227]
We present HAICU, a method for heterogeneous-agent trajectory forecasting that explicitly incorporates agents' class probabilities.
We additionally present PUP, a new challenging real-world autonomous driving dataset.
We demonstrate that incorporating class probabilities in trajectory forecasting significantly improves performance in the face of uncertainty.
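The core idea can be sketched as probability-weighted, class-conditional rollouts; the motion models and class probabilities below are hypothetical stand-ins for HAICU's learned components:

```python
import numpy as np

# Illustrative sketch: rather than committing to a hard class label, weight
# class-conditional trajectory forecasts by the class probabilities.
pos = np.array([0.0, 0.0])
class_probs = {"pedestrian": 0.3, "cyclist": 0.7}  # upstream classifier output
velocity = {"pedestrian": np.array([1.0, 0.0]),    # class-conditional motion
            "cyclist":    np.array([4.0, 0.5])}

def forecast(pos, horizon=3):
    """Per-class rollouts plus the probability-weighted mean forecast."""
    rollouts = {c: np.array([pos + velocity[c] * t for t in range(1, horizon + 1)])
                for c in class_probs}
    mean = sum(class_probs[c] * rollouts[c] for c in class_probs)
    return rollouts, mean

rollouts, mean = forecast(pos)
print(mean[-1])  # expected position at the horizon under class uncertainty
```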
arXiv Detail & Related papers (2021-04-26T10:28:34Z) - A Unifying Bayesian Formulation of Measures of Interpretability in Human-AI Interaction [25.239891076153025]
We present a unifying Bayesian framework that models a human observer's evolving beliefs about an agent.
We show that the definitions of interpretability measures like explicability, legibility and predictability from the prior literature fall out as special cases of our general framework.
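A minimal sketch of the framework's central object, assuming a two-model hypothesis space: the observer maintains a posterior over candidate agent models and updates it by Bayes' rule as actions arrive. The models and likelihoods below are invented:

```python
import numpy as np

# Minimal sketch: an observer's evolving posterior over candidate agent
# models, updated from observed actions. Likelihoods are hypothetical.
models = {  # P(action | model), over action indices [wave, point]
    "helpful":  np.array([0.8, 0.2]),
    "confused": np.array([0.5, 0.5]),
}
belief = {m: 0.5 for m in models}  # uniform prior

for action in [0, 0, 1, 0]:        # observed action indices
    for m in belief:
        belief[m] *= models[m][action]  # Bayes: prior x likelihood
    z = sum(belief.values())
    belief = {m: b / z for m, b in belief.items()}

print(belief)  # legible behaviour makes one model dominate the posterior
```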
arXiv Detail & Related papers (2021-04-21T20:06:33Z) - Maximizing Information Gain in Partially Observable Environments via Prediction Reward [64.24528565312463]
This paper tackles the challenge of using belief-based rewards for a deep RL agent.
We derive the exact error between negative entropy and the expected prediction reward.
This insight provides theoretical motivation for several fields using prediction rewards.
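For log-loss prediction rewards, a standard identity makes the relationship concrete (the paper's exact statement may differ): the gap between the negative entropy of the agent's belief $p$ and the expected prediction reward under a predictor $q$ is a KL divergence.

```latex
% Standard identity behind log-loss prediction rewards: the expected
% prediction reward equals the negative entropy minus a KL term.
\[
  \mathbb{E}_{s \sim p}\!\left[\log q(s)\right]
  \;=\; -\,H(p) \;-\; D_{\mathrm{KL}}\!\left(p \,\Vert\, q\right)
\]
% so the error between -H(p) and the expected prediction reward is exactly
% the divergence D_KL(p || q), which vanishes when the prediction matches
% the belief.
```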
arXiv Detail & Related papers (2020-05-11T08:13:49Z)