Instrumental goals in advanced AI systems: Features to be managed and not failures to be eliminated?
- URL: http://arxiv.org/abs/2510.25471v1
- Date: Wed, 29 Oct 2025 12:47:15 GMT
- Title: Instrumental goals in advanced AI systems: Features to be managed and not failures to be eliminated?
- Authors: Willem Fourie
- Abstract summary: In artificial intelligence (AI) alignment research, instrumental goals, also called instrumental subgoals or instrumental convergent goals, are widely associated with advanced AI systems. These goals, which include tendencies such as power-seeking and self-preservation, become problematic when they conflict with human aims. This article proposes an alternative framing: that instrumental goals may be understood as features to be accepted and managed rather than failures to be limited.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In artificial intelligence (AI) alignment research, instrumental goals, also called instrumental subgoals or instrumental convergent goals, are widely associated with advanced AI systems. These goals, which include tendencies such as power-seeking and self-preservation, become problematic when they conflict with human aims. Conventional alignment theory treats instrumental goals as sources of risk that become problematic through failure modes such as reward hacking or goal misgeneralization, and attempts to limit the symptoms of instrumental goals, notably resource acquisition and self-preservation. This article proposes an alternative framing: that a philosophical argument can be constructed according to which instrumental goals may be understood as features to be accepted and managed rather than failures to be limited. Drawing on Aristotle's ontology and its modern interpretations, an ontology of concrete, goal-directed entities, it argues that advanced AI systems can be seen as artifacts whose formal and material constitution gives rise to effects distinct from their designers' intentions. In this view, the instrumental tendencies of such systems correspond to per se outcomes of their constitution rather than accidental malfunctions. The implication is that efforts should focus less on eliminating instrumental goals and more on understanding, managing, and directing them toward human-aligned ends.
Related papers
- Mitigating loss of control in advanced AI systems through instrumental goal trajectories [0.0]
We develop instrumental goal trajectories to expand options beyond the model. We label these pathways the procurement, governance and finance instrumental goal trajectories (IGTs). IGTs offer concrete avenues for defining capability levels and for broadening how corrigibility and interruptibility are implemented.
arXiv Detail & Related papers (2026-02-02T06:13:21Z)
- Institutional AI: A Governance Framework for Distributional AGI Safety [1.3763052684269788]
We identify three structural problems that emerge from core properties of AI models. The solution is Institutional AI, a system-level approach that treats alignment as a question of effective governance of AI agent collectives.
arXiv Detail & Related papers (2026-01-15T17:08:26Z)
- AI Deception: Risks, Dynamics, and Controls [153.71048309527225]
This project provides a comprehensive and up-to-date overview of the AI deception field. We identify a formal definition of AI deception, grounded in signaling theory from studies of animal deception. We organize the landscape of AI deception research as a deception cycle, consisting of two key components: deception emergence and deception treatment.
arXiv Detail & Related papers (2025-11-27T16:56:04Z)
- Language Models Identify Ambiguities and Exploit Loopholes [67.74087963315213]
We study the responses of large language models (LLMs) to loopholes. We find that models which exploit loopholes explicitly identify and reason about both ambiguity and conflicting goals.
arXiv Detail & Related papers (2025-08-27T03:40:17Z)
- Stochastic, Dynamic, Fluid Autonomy in Agentic AI: Implications for Authorship, Inventorship, and Liability [0.2209921757303168]
Agentic AI systems autonomously pursue goals, adapting strategies through implicit learning. Human and machine contributions become irreducibly entangled in intertwined creative processes. We argue that legal and policy frameworks may need to treat human and machine contributions as functionally equivalent.
arXiv Detail & Related papers (2025-04-05T04:44:59Z)
- AI Automatons: AI Systems Intended to Imitate Humans [54.19152688545896]
There is a growing proliferation of AI systems designed to mimic people's behavior, work, abilities, likenesses, or humanness. The research, design, deployment, and availability of such AI systems have prompted growing concerns about a wide range of possible legal, ethical, and other social impacts.
arXiv Detail & Related papers (2025-03-04T03:55:38Z)
- Using AI Alignment Theory to understand the potential pitfalls of regulatory frameworks [55.2480439325792]
This paper critically examines the European Union's Artificial Intelligence Act (EU AI Act) using insights from Alignment Theory (AT) research, which focuses on the potential pitfalls of technical alignment in Artificial Intelligence. Applying these concepts to the EU AI Act uncovers potential vulnerabilities and areas for improvement in the regulation.
arXiv Detail & Related papers (2024-10-10T17:38:38Z)
- Towards Measuring Goal-Directedness in AI Systems [0.0]
A key prerequisite for AI systems to pursue unintended goals is that they behave in a coherent, goal-directed manner.
We propose a new family of definitions of the goal-directedness of a policy that analyze whether it is well-modeled as near-optimal for many reward functions.
Our contribution is a simpler, more easily computable definition of goal-directedness, intended to help address the question of whether AI systems could pursue dangerous goals.
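The idea of a policy being "well-modeled as near-optimal for many reward functions" can be illustrated with a toy sketch. This is a hypothetical simplification for intuition only, not the paper's actual definition: for a one-step decision problem, it scores a policy by the best normalized return it achieves across randomly sampled reward functions.

```python
import random

def normalized_return(policy, reward):
    """Expected reward of a stochastic one-step policy, rescaled so the
    best single action scores 1.0 and the worst scores 0.0."""
    best, worst = max(reward), min(reward)
    if best == worst:
        return 1.0
    expected = sum(p * r for p, r in zip(policy, reward))
    return (expected - worst) / (best - worst)

def goal_directedness(policy, n_samples=1000, seed=0):
    """Toy proxy: the highest normalized return the policy achieves over
    many randomly sampled reward functions. A policy that is near-optimal
    for *some* sampled reward scores close to 1."""
    rng = random.Random(seed)
    return max(
        normalized_return(policy, [rng.random() for _ in policy])
        for _ in range(n_samples)
    )

greedy = [1.0, 0.0, 0.0]         # deterministically picks action 0
uniform = [1 / 3, 1 / 3, 1 / 3]  # indifferent among all actions

print(goal_directedness(greedy))   # near 1: optimal whenever action 0 has the top reward
print(goal_directedness(uniform))  # stays below 2/3: never near-optimal for any sampled reward
```

The contrast is the point: a policy that consistently optimizes something scores near 1 for at least one reward function, while an indifferent policy cannot be modeled as near-optimal for any of them.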
arXiv Detail & Related papers (2024-10-07T01:34:42Z)
- Position: Towards Bidirectional Human-AI Alignment [109.57781720848669]
We argue that the research community should explicitly define and critically reflect on "alignment" to account for the bidirectional and dynamic relationship between humans and AI. We introduce the Bidirectional Human-AI Alignment framework, which not only incorporates traditional efforts to align AI with human values but also introduces the critical, underexplored dimension of aligning humans with AI.
arXiv Detail & Related papers (2024-06-13T16:03:25Z)
- The Reasons that Agents Act: Intention and Instrumental Goals [24.607124467778036]
There is no universally accepted theory of intention applicable to AI agents.
We operationalise the intention with which an agent acts, relating it to the reasons for which it chooses its decision. Our definition captures the intuitive notion of intent and satisfies desiderata set out by past work.
arXiv Detail & Related papers (2024-02-11T14:39:40Z)
- AI Alignment: A Comprehensive Survey [69.61425542486275]
AI alignment aims to make AI systems behave in line with human intentions and values. We identify four principles as the key objectives of AI alignment: Robustness, Interpretability, Controllability, and Ethicality. We decompose current alignment research into two key components: forward alignment and backward alignment.
arXiv Detail & Related papers (2023-10-30T15:52:15Z)
- Building Bridges: Generative Artworks to Explore AI Ethics [56.058588908294446]
In recent years, there has been an increased emphasis on understanding and mitigating adverse impacts of artificial intelligence (AI) technologies on society.
A significant challenge in the design of ethical AI systems is that there are multiple stakeholders in the AI pipeline, each with their own set of constraints and interests.
This position paper outlines some potential ways in which generative artworks can help bridge these stakeholder perspectives by serving as accessible and powerful educational tools.
arXiv Detail & Related papers (2021-06-25T22:31:55Z)
- An interdisciplinary conceptual study of Artificial Intelligence (AI) for helping benefit-risk assessment practices: Towards a comprehensive qualification matrix of AI programs and devices (pre-print 2020) [55.41644538483948]
This paper proposes a comprehensive analysis of existing concepts from different disciplines that tackle the notion of intelligence.
The aim is to identify shared notions or discrepancies to consider for qualifying AI systems.
arXiv Detail & Related papers (2021-05-07T12:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.