Measuring AI agent autonomy: Towards a scalable approach with code inspection
- URL: http://arxiv.org/abs/2502.15212v1
- Date: Fri, 21 Feb 2025 04:58:40 GMT
- Title: Measuring AI agent autonomy: Towards a scalable approach with code inspection
- Authors: Peter Cihon, Merlin Stein, Gagan Bansal, Sam Manning, Kevin Xu
- Abstract summary: We introduce a code-based assessment of autonomy that eliminates the need to run an AI agent to perform specific tasks. We demonstrate this approach with the AutoGen framework and select applications.
- Score: 8.344207672507334
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: AI agents are AI systems that can achieve complex goals autonomously. Assessing the level of agent autonomy is crucial for understanding both their potential benefits and risks. Current assessments of autonomy often focus on specific risks and rely on run-time evaluations -- observations of agent actions during operation. We introduce a code-based assessment of autonomy that eliminates the need to run an AI agent to perform specific tasks, thereby reducing the costs and risks associated with run-time evaluations. Using this code-based framework, the orchestration code used to run an AI agent can be scored according to a taxonomy that assesses attributes of autonomy: impact and oversight. We demonstrate this approach with the AutoGen framework and select applications.
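As a rough illustration of what such a code-based assessment might look like in practice (a sketch under our own assumptions, not the authors' released tooling), the snippet below statically scans an AutoGen-style orchestration script for signals relevant to the paper's two attributes: oversight (e.g., the `human_input_mode` setting, a real AutoGen parameter) and impact (e.g., how many tools are wired to agents; the function names checked here are assumptions). The scoring scheme itself is illustrative, not the paper's taxonomy.

```python
# Hypothetical sketch of code-based autonomy scoring (not the paper's tool).
# Statically inspects orchestration source for oversight and impact signals.
import ast

# Illustrative assumption: AutoGen-style values, from most to least oversight.
OVERSIGHT_RANK = {"ALWAYS": 0, "TERMINATE": 1, "NEVER": 2}

def score_orchestration(source: str) -> dict:
    """Return crude impact/oversight indicators for an orchestration script."""
    tree = ast.parse(source)
    oversight, tool_count = 0, 0
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            # Oversight signal: human_input_mode="ALWAYS" | "TERMINATE" | "NEVER"
            for kw in node.keywords:
                if kw.arg == "human_input_mode" and isinstance(kw.value, ast.Constant):
                    oversight = max(oversight, OVERSIGHT_RANK.get(kw.value.value, 0))
            # Impact signal: tool registrations (checked names are assumptions)
            func = node.func
            name = getattr(func, "attr", getattr(func, "id", ""))
            if name in {"register_function", "register_for_execution"}:
                tool_count += 1
    return {"oversight_level": oversight, "tool_count": tool_count}

if __name__ == "__main__":
    example = 'agent = ConversableAgent("a", human_input_mode="NEVER")'
    print(score_orchestration(example))  # {'oversight_level': 2, 'tool_count': 0}
```

Because the analysis never executes the agent, it can be run cheaply over many applications, which is the scalability argument the paper makes for code inspection over run-time evaluation.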
Related papers
- Advancing Responsible Innovation in Agentic AI: A study of Ethical Frameworks for Household Automation [1.6766200616088744]
This article analyzes agentic AI and its applications, focusing on its move from reactive to proactive autonomy, privacy, fairness, and user control. Vulnerable user groups, such as elderly individuals, children, and neurodivergent individuals, who face higher risks of surveillance, bias, and privacy violations, were studied. Design imperatives are highlighted, such as tailored explainability, granular consent mechanisms, and robust override controls.
arXiv Detail & Related papers (2025-07-21T06:10:02Z) - Taming Uncertainty via Automation: Observing, Analyzing, and Optimizing Agentic AI Systems [1.9751175705897066]
Large Language Models (LLMs) are increasingly deployed within agentic systems: collections of interacting, LLM-powered agents that execute complex, adaptive workflows using memory, tools, and dynamic planning. Traditional software observability and operations practices fall short in addressing these challenges. This paper introduces AgentOps: a comprehensive framework for observing, analyzing, optimizing, and automating the operation of agentic AI systems.
arXiv Detail & Related papers (2025-07-15T12:54:43Z) - Levels of Autonomy for AI Agents [9.324309359500198]
We argue that an agent's level of autonomy can be treated as a deliberate design decision, separate from its capability and operational environment. We define five levels of escalating agent autonomy, characterized by the roles a user can take when interacting with an agent. We highlight a potential application of our framework towards AI autonomy certificates to govern agent behavior in single- and multi-agent systems.
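As a loose illustration of how a role-based scale might be encoded (the level and role names below are our assumptions, not the paper's exact taxonomy), an ordered enumeration makes autonomy a checkable design parameter:

```python
# Hypothetical five-level autonomy scale keyed to user roles.
# Names are illustrative assumptions, not the paper's taxonomy.
from enum import IntEnum

class AutonomyLevel(IntEnum):
    L1_USER_AS_OPERATOR = 1      # user performs actions; agent advises
    L2_USER_AS_COLLABORATOR = 2  # user and agent share execution
    L3_USER_AS_CONSULTANT = 3    # agent executes; user consulted as needed
    L4_USER_AS_APPROVER = 4      # agent executes; user approves key steps
    L5_USER_AS_OBSERVER = 5      # agent executes end to end; user monitors

def requires_human_in_loop(level: AutonomyLevel) -> bool:
    """A certificate-style check: levels below L5 keep a human in the loop."""
    return level < AutonomyLevel.L5_USER_AS_OBSERVER
```

Treating the level as an explicit, machine-readable value is what would let an autonomy certificate be verified rather than merely asserted.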
arXiv Detail & Related papers (2025-06-14T12:14:36Z) - Robot-Gated Interactive Imitation Learning with Adaptive Intervention Mechanism [48.41735416075536]
Interactive Imitation Learning (IIL) allows agents to acquire desired behaviors through human interventions. We propose the Adaptive Intervention Mechanism (AIM), a novel robot-gated IIL algorithm that learns an adaptive criterion for requesting human demonstrations.
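One way to picture a robot-gated criterion (a generic sketch under our own assumptions, not AIM's learned rule) is a gate that requests a human demonstration whenever the policy's uncertainty exceeds an adaptive threshold:

```python
# Generic sketch of a robot-gated intervention rule (not AIM itself):
# request a human demo when policy uncertainty exceeds an adaptive threshold.
import numpy as np

def should_request_demo(action_probs: np.ndarray, threshold: float) -> bool:
    """Gate on policy entropy as a stand-in for a learned request criterion."""
    entropy = -np.sum(action_probs * np.log(action_probs + 1e-12))
    return entropy > threshold

def update_threshold(threshold: float, requested: bool, rate: float = 0.01) -> float:
    """Raise the bar after each request so queries taper off as the agent improves."""
    return threshold + rate if requested else threshold - rate
```

The point of making the criterion adaptive is to concentrate costly human effort early in training and release it as the policy matures.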
arXiv Detail & Related papers (2025-06-10T18:43:26Z) - TRiSM for Agentic AI: A Review of Trust, Risk, and Security Management in LLM-based Agentic Multi-Agent Systems [2.462408812529728]
This review presents a structured analysis of Trust, Risk, and Security Management (TRiSM) in the context of LLM-based Agentic Multi-Agent Systems (AMAS). We begin by examining the conceptual foundations of Agentic AI and highlight its architectural distinctions from traditional AI agents. We then adapt and extend the AI TRiSM framework for Agentic AI, structured around four key pillars: Explainability, ModelOps, Security, and Privacy & Governance.
arXiv Detail & Related papers (2025-06-04T16:26:11Z) - Threat Modeling for AI: The Case for an Asset-Centric Approach [0.23408308015481666]
With AI systems now able to autonomously execute code, interact with external systems, and operate without human oversight, traditional security approaches fall short. This paper introduces an asset-centric methodology for threat modeling AI systems.
arXiv Detail & Related papers (2025-05-08T18:57:08Z) - Agentic Knowledgeable Self-awareness [79.25908923383776]
KnowSelf is a data-centric approach that equips agents with human-like knowledgeable self-awareness.
Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge.
arXiv Detail & Related papers (2025-04-04T16:03:38Z) - AgentDAM: Privacy Leakage Evaluation for Autonomous Web Agents [75.85554113398626]
We develop a benchmark called AgentDAM to evaluate how well existing and future AI agents can limit processing of potentially private information.
Our benchmark simulates realistic web interaction scenarios and is adaptable to all existing web navigation agents.
arXiv Detail & Related papers (2025-03-12T19:30:31Z) - Fully Autonomous AI Agents Should Not be Developed [58.88624302082713]
This paper argues that fully autonomous AI agents should not be developed. In support of this position, we build from prior scientific literature and current product marketing to delineate different AI agent levels. Our analysis reveals that risks to people increase with the autonomy of a system.
arXiv Detail & Related papers (2025-02-04T19:00:06Z) - Agentic AI: Autonomy, Accountability, and the Algorithmic Society [0.2209921757303168]
Agentic Artificial Intelligence (AI) can autonomously pursue long-term goals, make decisions, and execute complex, multi-turn workflows. This transition from advisory roles to proactive execution challenges established legal, economic, and creative frameworks. We explore challenges in three interrelated domains: creativity and intellectual property, legal and ethical considerations, and competitive effects.
arXiv Detail & Related papers (2025-02-01T03:14:59Z) - MISR: Measuring Instrumental Self-Reasoning in Frontier Models [7.414638276983446]
We evaluate the instrumental self-reasoning ability of large language model (LLM) agents.
We find that instrumental self-reasoning ability emerges only in the most capable frontier models.
Our evaluations can be used to measure increases in instrumental self-reasoning ability in future models.
arXiv Detail & Related papers (2024-12-05T06:20:47Z) - Engineering Trustworthy AI: A Developer Guide for Empirical Risk Minimization [53.80919781981027]
Key requirements for trustworthy AI can be translated into design choices for the components of empirical risk minimization.
We hope to provide actionable guidance for building AI systems that meet emerging standards for trustworthiness of AI.
arXiv Detail & Related papers (2024-10-25T07:53:32Z) - Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems.
This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process.
We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
arXiv Detail & Related papers (2024-10-14T17:57:02Z) - "A Good Bot Always Knows Its Limitations": Assessing Autonomous System Decision-making Competencies through Factorized Machine Self-confidence [5.167803438665586]
This paper presents the Factorized Machine Self-confidence (FaMSeC) framework, which holistically considers several major factors driving competency in algorithmic decision-making.
In FaMSeC, self-confidence indicators are derived via 'problem-solving statistics' embedded in Markov decision process solvers.
We include detailed descriptions and examples for Markov decision process agents, and show how outcome assessment and solver quality factors can be found for a range of tasking contexts.
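For instance (a minimal sketch under our own assumptions, not FaMSeC's exact formulation), an outcome-assessment indicator can be estimated from Monte Carlo rollouts of an MDP policy as the probability that return meets a task requirement:

```python
# Minimal sketch of an outcome-assessment self-confidence indicator
# (our simplification, not FaMSeC's formulation): estimate the probability
# that a policy's return meets a task requirement from sampled rollouts.
import random

def outcome_confidence(rollout_returns: list[float], required_return: float) -> float:
    """Fraction of sampled rollouts whose return meets the requirement."""
    return sum(r >= required_return for r in rollout_returns) / len(rollout_returns)

# Stand-in for statistics that would come from an MDP solver's rollouts.
returns = [random.gauss(10.0, 3.0) for _ in range(1000)]
print(f"self-confidence: {outcome_confidence(returns, required_return=8.0):.2f}")
```

The appeal of such indicators is that they fall out of the solver's own computation, so the system can report competency without a separate evaluation pipeline.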
arXiv Detail & Related papers (2024-07-29T01:22:04Z) - Training Compute Thresholds: Features and Functions in AI Regulation [0.7234862895932991]
Regulators in the US and EU are using thresholds based on training compute to identify general-purpose AI (GPAI) models that may pose risks of large-scale societal harm.
We argue that training compute is currently the most suitable metric for identifying GPAI models that deserve regulatory oversight and further scrutiny.
As GPAI technology and market structures evolve, regulators should update compute thresholds and complement them with other metrics in regulatory review processes.
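In practice such a threshold reduces to a simple comparison; the sketch below uses the EU AI Act's 10^25 FLOP systemic-risk presumption and the 10^26 FLOP reporting threshold from the 2023 US Executive Order (how training FLOP is estimated is out of scope here):

```python
# Sketch of a compute-threshold screen. 1e25 FLOP is the EU AI Act's
# systemic-risk presumption; 1e26 is the 2023 US Executive Order's
# reporting threshold. Estimating training FLOP is a separate problem.
EU_AI_ACT_FLOP = 1e25
US_EO_FLOP = 1e26

def regulatory_flags(training_flop: float) -> dict:
    return {
        "eu_systemic_risk_presumption": training_flop > EU_AI_ACT_FLOP,
        "us_eo_reporting": training_flop > US_EO_FLOP,
    }

print(regulatory_flags(4e25))
# {'eu_systemic_risk_presumption': True, 'us_eo_reporting': False}
```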
arXiv Detail & Related papers (2024-05-17T14:10:24Z) - Visibility into AI Agents [9.067567737098594]
Increased delegation of commercial, scientific, governmental, and personal activities to AI agents may exacerbate existing societal risks.
We assess three categories of measures to increase visibility into AI agents: agent identifiers, real-time monitoring, and activity logging.
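Of the three measures, activity logging combined with agent identifiers is the easiest to picture in code; a minimal sketch (our illustration, with hypothetical names, not a system proposed in the paper) wraps each tool call in an identifier-stamped log record:

```python
# Minimal sketch of agent identifiers plus activity logging (hypothetical
# names): wrap each tool call in a log record stamped with the agent's ID.
import functools, json, logging, time

logging.basicConfig(level=logging.INFO)

def logged_tool(agent_id: str):
    def decorator(tool):
        @functools.wraps(tool)
        def wrapper(*args, **kwargs):
            result = tool(*args, **kwargs)
            logging.info(json.dumps({
                "agent_id": agent_id,   # agent identifier
                "tool": tool.__name__,  # which capability was exercised
                "ts": time.time(),      # timestamp; enables real-time monitoring
            }))
            return result
        return wrapper
    return decorator

@logged_tool(agent_id="agent-042")
def search(query: str) -> str:
    return f"results for {query!r}"

search("compute thresholds")
```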
arXiv Detail & Related papers (2024-01-23T23:18:33Z) - Interactive Autonomous Navigation with Internal State Inference and Interactivity Estimation [58.21683603243387]
We propose three auxiliary tasks with relational-temporal reasoning and integrate them into the standard Deep Learning framework.
These auxiliary tasks provide additional supervision signals to infer the behavior patterns of other interactive agents.
Our approach achieves robust and state-of-the-art performance in terms of standard evaluation metrics.
arXiv Detail & Related papers (2023-11-27T18:57:42Z) - Modelling Assessment Rubrics through Bayesian Networks: a Pragmatic Approach [40.06500618820166]
This paper presents an approach to deriving a learner model directly from an assessment rubric.
We illustrate how the approach can be applied to automate the human assessment of an activity developed for testing computational thinking skills.
arXiv Detail & Related papers (2022-09-07T10:09:12Z) - Differential Assessment of Black-Box AI Agents [29.98710357871698]
We propose a novel approach to differentially assess black-box AI agents that have drifted from their previously known models.
We leverage sparse observations of the drifted agent's current behavior and knowledge of its initial model to generate an active querying policy.
Empirical evaluation shows that our approach is much more efficient than re-learning the agent model from scratch.
arXiv Detail & Related papers (2022-03-24T17:48:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the list (including all information) and is not responsible for any consequences of its use.