Performance of Bounded-Rational Agents With the Ability to Self-Modify
- URL: http://arxiv.org/abs/2011.06275v2
- Date: Mon, 18 Jan 2021 09:55:26 GMT
- Title: Performance of Bounded-Rational Agents With the Ability to Self-Modify
- Authors: Jakub Tětek, Marek Sklenka, Tomáš Gavenčiak
- Abstract summary: Self-modification of agents embedded in complex environments is hard to avoid.
It has been argued that intelligent agents have an incentive to avoid modifying their utility function so that their future instances work towards the same goals.
We show that this result is no longer true for agents with bounded rationality.
- Score: 1.933681537640272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-modification of agents embedded in complex environments is hard to
avoid, whether it happens via direct means (e.g. own code modification) or
indirectly (e.g. influencing the operator, exploiting bugs or the environment).
It has been argued that intelligent agents have an incentive to avoid modifying
their utility function so that their future instances work towards the same
goals.
Everitt et al. (2016) formally show that providing an option to self-modify
is harmless for perfectly rational agents. We show that this result is no
longer true for agents with bounded rationality. In such agents,
self-modification may cause exponential deterioration in performance and
gradual misalignment of a previously aligned agent. We investigate how the size
of this effect depends on the type and magnitude of imperfections in the
agent's rationality (1-4 below). We also discuss the model's assumptions and the
wider space of problems and framings.
We examine four ways in which an agent can be bounded-rational: it either (1)
doesn't always choose the optimal action, (2) is not perfectly aligned with
human values, (3) has an inaccurate model of the environment, or (4) uses the
wrong temporal discounting factor. We show that while in cases (2)-(4) the
misalignment caused by the agent's imperfection does not increase over time,
in case (1) the misalignment may grow exponentially.
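The qualitative claim above can be illustrated with a small toy simulation. This is only a sketch, not the paper's formal model: the vector-valued utility, the alignment proxy, and the parameters `eps` and `drift` are invented here purely for illustration. The idea is that an agent which occasionally picks a suboptimal action, and whose mistaken actions can rewrite its own utility function, passes each perturbation on to all of its future instances, so its alignment with the original utility drifts away over time; if the imperfection cannot persist as a self-modification, the misalignment stays bounded.

```python
import numpy as np

def cosine(u, v):
    """Alignment proxy: cosine similarity between two utility vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def simulate(T=200, dim=8, eps=0.05, drift=0.1, self_modify=True, seed=0):
    """Toy model of case (1): with probability `eps` per step the agent
    picks a suboptimal action; if self-modification is available, that
    action may perturb the agent's utility vector, and every later step
    builds on the perturbed utility."""
    rng = np.random.default_rng(seed)
    u_star = np.ones(dim) / np.sqrt(dim)   # fixed "true" utility
    u = u_star.copy()                      # the agent's current utility
    history = []
    for _ in range(T):
        if rng.random() < eps and self_modify:
            # a mistaken action rewrites the utility slightly; the next
            # instance of the agent inherits the rewritten utility
            u = u + drift * rng.normal(size=dim)
            u = u / np.linalg.norm(u)
        history.append(cosine(u, u_star))
    return history

# With self-modification the small per-step errors compound over time;
# without it, a one-off imperfection cannot persist and alignment stays put.
print("final alignment with self-modification:    %.3f" % simulate(self_modify=True)[-1])
print("final alignment without self-modification: %.3f" % simulate(self_modify=False)[-1])
```

In this toy, the expected alignment decays roughly geometrically in the number of accumulated self-modifications, mirroring the exponential deterioration the abstract attributes to case (1), whereas an agent that cannot persist its errors as modifications keeps its misalignment constant, in the spirit of cases (2)-(4).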
Related papers
- AgentMisalignment: Measuring the Propensity for Misaligned Behaviour in LLM-Based Agents [0.0]
We introduce a misalignment propensity benchmark, AgentMisalignment, consisting of a suite of realistic scenarios.
We organise our evaluations into subcategories of misaligned behaviours, including goal-guarding, resisting shutdown, sandbagging, and power-seeking.
We report the performance of frontier models on our benchmark, observing higher misalignment on average when evaluating more capable models.
arXiv Detail & Related papers (2025-06-04T14:46:47Z) - The Limits of Predicting Agents from Behaviour [16.80911584745046]
We provide a precise answer under the assumption that the agent's behaviour is guided by a world model.
Our contribution is the derivation of novel bounds on the agent's behaviour in new (unseen) deployment environments.
We discuss the implications of these results for several research areas including fairness and safety.
arXiv Detail & Related papers (2025-06-03T14:24:58Z) - Toward a Theory of Agents as Tool-Use Decision-Makers [89.26889709510242]
We argue that true autonomy requires agents to be grounded in a coherent epistemic framework that governs what they know, what they need to know, and how to acquire that knowledge efficiently.
We propose a unified theory that treats internal reasoning and external actions as equivalent epistemic tools, enabling agents to systematically coordinate introspection and interaction.
This perspective shifts the design of agents from mere action executors to knowledge-driven intelligence systems, offering a principled path toward building foundation agents capable of adaptive, efficient, and goal-directed behavior.
arXiv Detail & Related papers (2025-06-01T07:52:16Z) - Partial Identifiability in Inverse Reinforcement Learning For Agents With Non-Exponential Discounting [64.13583792391783]
Inverse reinforcement learning (IRL) aims to infer an agent's preferences from observations of its behaviour.
One of the central difficulties in IRL is that multiple preferences may lead to the same observed behaviour.
We show that, in general, IRL is unable to infer enough information about the reward function $R$ to identify the correct optimal policy.
arXiv Detail & Related papers (2024-12-15T11:08:58Z) - On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents [58.79302663733703]
Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents.
The impact of clumsy or even malicious agents (those that frequently make errors in their tasks) on the overall performance of the system remains underexplored.
This paper investigates the resilience of various system structures to faulty agents on different downstream tasks.
arXiv Detail & Related papers (2024-08-02T03:25:20Z) - Agents Need Not Know Their Purpose [0.0]
This paper describes oblivious agents: agents architected in such a way that their effective utility function is an aggregation of hidden sub-functions.
We show that an oblivious agent, behaving rationally, constructs an internal approximation of designers' intentions.
arXiv Detail & Related papers (2024-02-15T06:15:46Z) - On the Convergence of Bounded Agents [80.67035535522777]
One view says that a bounded agent has converged when the minimal number of states needed to describe the agent's future behavior cannot decrease.
A second view says that a bounded agent has converged just when the agent's performance only changes if the agent's internal state changes.
arXiv Detail & Related papers (2023-07-20T17:27:29Z) - Decision-Making Among Bounded Rational Agents [5.24482648010213]
We introduce the concept of bounded rationality from an information-theoretic view into the game-theoretic framework.
This allows the robots to reason about other agents' sub-optimal behaviors and act accordingly under their own computational constraints.
We demonstrate that the resulting framework allows the robots to reason about different levels of rational behavior in other agents and to compute reasonable strategies under their computational constraints (a minimal sketch of this information-theoretic notion of bounded rationality appears after this list).
arXiv Detail & Related papers (2022-10-17T00:29:24Z) - On Avoiding Power-Seeking by Artificial Intelligence [93.9264437334683]
We do not know how to align a very intelligent AI agent's behavior with human interests.
I investigate whether we can build smart AI agents which have limited impact on the world, and which do not autonomously seek power.
arXiv Detail & Related papers (2022-06-23T16:56:21Z) - Formalizing the Problem of Side Effect Regularization [81.97441214404247]
We propose a formal criterion for side effect regularization via the assistance game framework.
In these games, the agent solves a partially observable Markov decision process.
We show that this POMDP is solved by trading off the proxy reward with the agent's ability to achieve a range of future tasks.
arXiv Detail & Related papers (2022-06-23T16:36:13Z) - Heterogeneous-Agent Trajectory Forecasting Incorporating Class Uncertainty [54.88405167739227]
We present HAICU, a method for heterogeneous-agent trajectory forecasting that explicitly incorporates agents' class probabilities.
We additionally present PUP, a new challenging real-world autonomous driving dataset.
We demonstrate that incorporating class probabilities in trajectory forecasting significantly improves performance in the face of uncertainty.
arXiv Detail & Related papers (2021-04-26T10:28:34Z) - Empirically Verifying Hypotheses Using Reinforcement Learning [58.09414653169534]
This paper formulates hypothesis verification as an RL problem.
We aim to build an agent that, given a hypothesis about the dynamics of the world, can take actions to generate observations which can help predict whether the hypothesis is true or false.
arXiv Detail & Related papers (2020-06-29T01:01:10Z) - Pessimism About Unknown Unknowns Inspires Conservatism [24.085795452335145]
We define an idealized Bayesian reinforcement learner which follows a policy that maximizes the worst-case expected reward over a set of world-models.
A scalar parameter tunes the agent's pessimism by changing the size of the set of world-models taken into account.
Since pessimism discourages exploration, at each timestep, the agent may defer to a mentor, who may be a human or some known-safe policy.
arXiv Detail & Related papers (2020-06-15T20:46:33Z) - Distributing entanglement with separable states: assessment of encoding and decoding imperfections [55.41644538483948]
Entanglement can be distributed using a carrier which is always separable from the rest of the systems involved.
We consider the effect of incoherent dynamics acting alongside imperfect unitary interactions.
We show that entanglement gain is possible even with substantial unitary errors.
arXiv Detail & Related papers (2020-02-11T15:25:19Z)
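As referenced in the Decision-Making Among Bounded Rational Agents entry above, a standard information-theoretic way to model bounded rationality is a softmax (maximum-entropy) policy whose inverse-temperature parameter interpolates between uniformly random and fully rational behaviour. The sketch below is a generic illustration of that idea, not the specific formulation used in that paper; the function name and the `beta` parameter are chosen here for exposition.

```python
import numpy as np

def bounded_rational_policy(action_values, beta):
    """Softmax (maximum-entropy) policy over a vector of action values.

    `beta` acts as a rationality (inverse-temperature) parameter:
    beta -> 0 yields a uniform random policy, while beta -> infinity
    recovers the fully rational argmax policy."""
    v = np.asarray(action_values, dtype=float)
    z = beta * (v - v.max())       # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# The same action values under low and high rationality.
values = [1.0, 0.5, 0.0]
print(bounded_rational_policy(values, beta=0.1))   # nearly uniform
print(bounded_rational_policy(values, beta=10.0))  # nearly deterministic argmax
```

The single scalar `beta` plays the role of a computational budget: lowering it makes the agent increasingly indifferent among actions, which is one concrete way to realise case (1) of the main paper, an agent that does not always choose the optimal action.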
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.