Taming Uncertainty via Automation: Observing, Analyzing, and Optimizing Agentic AI Systems
- URL: http://arxiv.org/abs/2507.11277v1
- Date: Tue, 15 Jul 2025 12:54:43 GMT
- Title: Taming Uncertainty via Automation: Observing, Analyzing, and Optimizing Agentic AI Systems
- Authors: Dany Moshkovich, Sergey Zeltyn
- Abstract summary: Large Language Models (LLMs) are increasingly deployed within agentic systems: collections of interacting, LLM-powered agents that execute complex, adaptive workflows using memory, tools, and dynamic planning. Traditional software observability and operations practices fall short in addressing these challenges. This paper introduces AgentOps: a comprehensive framework for observing, analyzing, optimizing, and automating operation of agentic AI systems.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) are increasingly deployed within agentic systems: collections of interacting, LLM-powered agents that execute complex, adaptive workflows using memory, tools, and dynamic planning. While enabling powerful new capabilities, these systems also introduce unique forms of uncertainty stemming from probabilistic reasoning, evolving memory states, and fluid execution paths. Traditional software observability and operations practices fall short in addressing these challenges. This paper introduces AgentOps: a comprehensive framework for observing, analyzing, optimizing, and automating the operation of agentic AI systems. We identify distinct needs across four key roles (developers, testers, site reliability engineers (SREs), and business users), each of whom engages with the system at different points in its lifecycle. We present the AgentOps Automation Pipeline, a six-stage process encompassing behavior observation, metric collection, issue detection, root cause analysis, optimized recommendations, and runtime automation. Throughout, we emphasize the critical role of automation in managing uncertainty and enabling self-improving AI systems: not by eliminating uncertainty, but by taming it to ensure safe, adaptive, and effective operation.
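The six-stage pipeline can be pictured as a sequence of transformations over a shared execution context, with each stage consuming the artifacts of the one before it. The Python sketch below is a hypothetical illustration, not the paper's implementation: the stage names come from the abstract, while every class, function, and threshold is an invented stand-in.

```python
# Hypothetical sketch of the six-stage AgentOps Automation Pipeline.
# Stage names follow the abstract; all identifiers are illustrative.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class PipelineContext:
    """Accumulates artifacts as an execution flows through the stages."""
    traces: list = field(default_factory=list)           # 1. behavior observation
    metrics: dict = field(default_factory=dict)          # 2. metric collection
    issues: list = field(default_factory=list)           # 3. issue detection
    root_causes: list = field(default_factory=list)      # 4. root cause analysis
    recommendations: list = field(default_factory=list)  # 5. optimized recommendations
    actions: list = field(default_factory=list)          # 6. runtime automation

def observe(ctx):
    ctx.traces.append({"agent": "planner", "event": "tool_call"})

def collect(ctx):
    ctx.metrics["latency_ms"] = 1240

def detect(ctx):
    if ctx.metrics.get("latency_ms", 0) > 1000:  # invented threshold
        ctx.issues.append("high_latency")

def analyze(ctx):
    if "high_latency" in ctx.issues:
        ctx.root_causes.append("slow_tool_backend")

def recommend(ctx):
    ctx.recommendations.extend(f"add_cache:{rc}" for rc in ctx.root_causes)

def automate(ctx):
    ctx.actions.extend(ctx.recommendations)  # apply changes at runtime

STAGES: list[Callable] = [observe, collect, detect, analyze, recommend, automate]

ctx = PipelineContext()
for stage in STAGES:
    stage(ctx)
print(ctx.actions)  # -> ['add_cache:slow_tool_backend']
```

A staged layout like this also makes it natural for the four roles named in the abstract to plug in at different points (developers around observation, SREs around root cause analysis, and so on), which is plausibly part of the motivation for separating the stages.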
Related papers
- A Survey on Agent Workflow -- Status and Future [2.817843718857682]
This survey provides a comprehensive review of agent workflow systems. We classify existing systems along two key dimensions: functional capabilities and architectural features. We highlight common patterns, potential technical challenges, and emerging trends.
arXiv Detail & Related papers (2025-08-02T04:15:30Z)
- SafeMobile: Chain-level Jailbreak Detection and Automated Evaluation for Multimodal Mobile Agents [58.21223208538351]
This work explores the security issues surrounding mobile multimodal agents. It attempts to construct a risk discrimination mechanism by incorporating behavioral sequence information, and it designs an automated assisted assessment scheme based on a large language model.
arXiv Detail & Related papers (2025-07-01T15:10:00Z)
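One way to read SafeMobile's "behavioral sequence information" is that risk is scored over the whole action chain rather than per step. Below is a toy sketch under that assumption; the transition rules are invented stand-ins, while the actual mechanism is learned and LLM-assisted.

```python
# Toy chain-level risk check over an agent's action sequence.
# The risky transitions here are invented; SafeMobile's mechanism is learned.
RISKY_TRANSITIONS = {("read_contacts", "open_url"), ("read_sms", "send_network")}

def chain_risk(actions: list[str]) -> float:
    """Score a behavior sequence: risky step pairs matter, not isolated steps."""
    hits = sum((a, b) in RISKY_TRANSITIONS for a, b in zip(actions, actions[1:]))
    return hits / max(len(actions) - 1, 1)

seq = ["open_app", "read_sms", "send_network", "close_app"]
print(chain_risk(seq))  # ~0.33 -> could be escalated to LLM-assisted assessment
```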
- SV-LLM: An Agentic Approach for SoC Security Verification using Large Language Models [8.912091484067508]
We introduce SV-LLM, a novel multi-agent assistant system designed to automate and enhance system-on-chip (SoC) security verification. By integrating specialized agents for tasks like verification question answering, security asset identification, threat modeling, test plan and property generation, vulnerability detection, and simulation-based bug validation, SV-LLM streamlines the workflow. The system aims to reduce manual intervention, improve accuracy, and accelerate security analysis, supporting proactive identification and mitigation of risks early in the design cycle.
arXiv Detail & Related papers (2025-06-25T13:31:13Z)
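SV-LLM's division of labor suggests a dispatch-table architecture: one specialized agent per verification task. A minimal hypothetical sketch, with the agent roles taken from the abstract and all identifiers invented:

```python
# Hypothetical dispatch table for a multi-agent SoC-verification assistant.
# Agent roles mirror the abstract; routing logic is an invented illustration.
AGENTS = {
    "qa": lambda req: f"answering: {req}",
    "asset_id": lambda req: f"identifying security assets in: {req}",
    "threat_model": lambda req: f"modeling threats for: {req}",
    "testgen": lambda req: f"generating test plans/properties for: {req}",
    "vuln": lambda req: f"scanning for vulnerabilities in: {req}",
    "sim_validate": lambda req: f"validating bug via simulation: {req}",
}

def route(task_type: str, request: str) -> str:
    """Send a verification request to the agent specialized for it."""
    if task_type not in AGENTS:
        raise ValueError(f"no agent for task {task_type!r}")
    return AGENTS[task_type](request)

print(route("threat_model", "AES key-store block"))
```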
- AssetOpsBench: Benchmarking AI Agents for Task Automation in Industrial Asset Operations and Maintenance [7.110126223593506]
This paper envisions a future where AI agents autonomously manage tasks that previously required distinct expertise and manual coordination. We introduce AssetOpsBench -- a unified framework and environment designed to guide the development, orchestration, and evaluation of domain-specific agents. We outline the key requirements for such holistic systems and provide actionable insights into building agents that integrate perception, reasoning, and control for real-world industrial operations.
arXiv Detail & Related papers (2025-06-04T10:57:35Z)
- Automating Safety Enhancement for LLM-based Agents with Synthetic Risk Scenarios [77.86600052899156]
Large Language Model (LLM)-based agents are increasingly deployed in real-world applications. We propose AutoSafe, the first framework that systematically enhances agent safety through fully automated synthetic data generation. We show that AutoSafe boosts safety scores by 45% on average and achieves a 28.91% improvement on real-world tasks.
arXiv Detail & Related papers (2025-05-23T10:56:06Z)
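AutoSafe's core loop, as described, is: synthesize a risky scenario, run the agent on it, and keep the unsafe trajectories as training data. A minimal sketch of that loop, with stub functions standing in for the LLM-driven generation and judging:

```python
# Sketch of a fully automated synthetic-safety-data loop in AutoSafe's spirit.
# generate/judge are stubs standing in for LLM calls; all names are invented.
import random

def generate_risk_scenario(seed: int) -> dict:
    random.seed(seed)
    domain = random.choice(["email", "filesystem", "payments"])
    return {"domain": domain, "instruction": f"risky {domain} task #{seed}"}

def agent_is_safe(scenario: dict) -> bool:
    return random.random() > 0.45  # stub for an agent rollout plus a safety judge

unsafe_cases = []
for seed in range(100):
    scenario = generate_risk_scenario(seed)
    if not agent_is_safe(scenario):
        unsafe_cases.append(scenario)  # becomes data for safety fine-tuning

print(f"collected {len(unsafe_cases)} unsafe trajectories for fine-tuning")
```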
- Interactive Agents to Overcome Ambiguity in Software Engineering [61.40183840499932]
AI agents are increasingly being deployed to automate tasks, often based on ambiguous and underspecified user instructions. Making unwarranted assumptions and failing to ask clarifying questions can lead to suboptimal outcomes. We study the ability of LLM agents to handle ambiguous instructions in interactive code-generation settings by evaluating the performance of proprietary and open-weight models.
arXiv Detail & Related papers (2025-02-18T17:12:26Z)
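The failure mode studied here, acting on ambiguity instead of asking, can be illustrated with a toy ask-or-act gate. The heuristic below is entirely invented; the paper evaluates LLM agents, not hand-written rules.

```python
# Toy decision rule for when an interactive coding agent should ask a
# clarifying question instead of guessing; the heuristic is invented.
AMBIGUOUS_MARKERS = ("etc", "somehow", "maybe", "something like")

def needs_clarification(instruction: str) -> bool:
    text = instruction.lower()
    vague = any(marker in text for marker in AMBIGUOUS_MARKERS)
    underspecified = len(text.split()) < 5  # too short to pin down intent
    return vague or underspecified

for instr in ["sort the list", "parse the config file into a dict, maybe"]:
    action = "ask user" if needs_clarification(instr) else "write code"
    print(f"{instr!r} -> {action}")
```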
- AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds [12.464941027105306]
AI for IT Operations (AIOps) aims to automate complex operational tasks, such as fault localization and root cause analysis, to reduce human workload and minimize customer impact. Recent advances in Large Language Models (LLMs) and AI agents are revolutionizing AIOps by enabling end-to-end and multitask automation. We present AIOPSLAB, a framework that not only deploys microservice cloud environments, injects faults, generates workloads, and exports telemetry data, but also orchestrates these components and provides interfaces for interacting with and evaluating agents.
arXiv Detail & Related papers (2025-01-12T04:17:39Z)
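The abstract enumerates AIOPSLAB's responsibilities: environment deployment, fault injection, workload generation, telemetry export, and agent evaluation. A hypothetical interface grouping those responsibilities (the method names are assumptions, not the framework's actual API):

```python
# Hypothetical interface echoing AIOPSLAB's components; names are assumptions.
class AIOpsEnv:
    def __init__(self, services: list[str]):
        self.services = services
        self.telemetry: list[dict] = []

    def inject_fault(self, service: str, fault: str) -> None:
        self.telemetry.append({"service": service, "event": f"fault:{fault}"})

    def generate_workload(self, rps: int) -> None:
        self.telemetry.append({"event": "workload", "rps": rps})

    def evaluate(self, agent_diagnosis: str, ground_truth: str) -> bool:
        """Score an agent's diagnosis against the injected fault."""
        return agent_diagnosis == ground_truth

env = AIOpsEnv(["cart", "checkout"])
env.inject_fault("checkout", "pod_crash")
env.generate_workload(rps=50)
print(env.evaluate("checkout:pod_crash", "checkout:pod_crash"))  # True
```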
- A Multi-AI Agent System for Autonomous Optimization of Agentic AI Solutions via Iterative Refinement and LLM-Driven Feedback Loops [3.729242965449096]
This paper introduces a framework for autonomously optimizing Agentic AI solutions across industries. The framework achieves optimal performance without human input by autonomously generating and testing hypotheses. Case studies show significant improvements in output quality, relevance, and actionability.
arXiv Detail & Related papers (2024-12-22T20:08:04Z)
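The hypothesis-generation-and-testing loop described here amounts to iterative refinement gated by an LLM judge. A minimal sketch, with trivial stand-ins for both the refiner and the judge:

```python
# Minimal hypothesize-test-refine loop in the spirit of LLM-driven
# autonomous optimization; both functions are stubs, all names illustrative.
def llm_feedback_score(solution: str) -> float:
    words = solution.split()
    return len(set(words)) / max(len(words), 1)  # stub for an LLM judge

def refine(solution: str) -> str:
    return " ".join(dict.fromkeys(solution.split()))  # stub candidate edit

best = "plan plan act act observe"
best_score = llm_feedback_score(best)
for _ in range(5):  # iterate until no hypothesis improves the score
    candidate = refine(best)
    score = llm_feedback_score(candidate)
    if score <= best_score:
        break
    best, best_score = candidate, score
print(best, best_score)  # -> 'plan act observe' 1.0
```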
- AgentOps: Enabling Observability of LLM Agents [12.49728300301026]
Large language model (LLM) agents raise significant AI safety concerns due to their autonomous and non-deterministic behavior. We present a comprehensive taxonomy of AgentOps, identifying the artifacts and associated data that should be traced throughout the entire lifecycle of agents to achieve effective observability. Our taxonomy serves as a reference template for developers to design and implement AgentOps infrastructure that supports monitoring, logging, and analytics.
arXiv Detail & Related papers (2024-11-08T02:31:03Z)
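A taxonomy of traceable artifacts naturally maps onto a trace-record schema. The record below is an illustrative guess at the kinds of fields such a schema might carry; it is not the paper's taxonomy.

```python
# A minimal trace record covering artifact types an AgentOps taxonomy
# might track across an agent's lifecycle; all fields are illustrative.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AgentTraceEvent:
    agent_id: str
    session_id: str
    kind: str          # e.g. "llm_call", "tool_call", "memory_write", "plan_step"
    payload: dict      # prompts, tool arguments, plan deltas, ...
    latency_ms: float
    timestamp: str

event = AgentTraceEvent(
    agent_id="planner-1",
    session_id="sess-42",
    kind="tool_call",
    payload={"tool": "search", "query": "error budget"},
    latency_ms=87.5,
    timestamp=datetime.now(timezone.utc).isoformat(),
)
print(asdict(event))  # would be shipped to a logging/analytics backend
```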
- Agent-as-a-Judge: Evaluate Agents with Agents [61.33974108405561]
We introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems.
This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process.
We present DevAI, a new benchmark of 55 realistic automated AI development tasks.
arXiv Detail & Related papers (2024-10-14T17:57:02Z)
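Agent-as-a-Judge's key addition over LLM-as-a-Judge is feedback on intermediate steps, not just final outputs. A toy sketch of that protocol shape, with stubs for both the worker and the judge:

```python
# Toy Agent-as-a-Judge loop: a judge scores each intermediate step of a
# task-solving trajectory, not only the final answer. Both agents are stubs.
def worker_steps(task: str) -> list[str]:
    return [f"{task}: step {i}" for i in range(1, 4)]  # stub trajectory

def judge_step(step: str) -> tuple[bool, str]:
    ok = "step 2" not in step  # pretend step 2 is flawed
    return ok, ("looks correct" if ok else "re-plan this step")

revisions = []
for step in worker_steps("build parser"):
    ok, feedback = judge_step(step)
    if not ok:
        revisions.append((step, feedback))  # fed back to the worker agent
print(revisions)  # [('build parser: step 2', 're-plan this step')]
```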
- The Foundations of Computational Management: A Systematic Approach to Task Automation for the Integration of Artificial Intelligence into Existing Workflows [55.2480439325792]
This article introduces Computational Management, a systematic approach to task automation.
The article offers three easy step-by-step procedures to begin the process of implementing AI within a workflow.
arXiv Detail & Related papers (2024-02-07T01:45:14Z)
- TaskBench: Benchmarking Large Language Models for Task Automation [82.2932794189585]
We introduce TaskBench, a framework to evaluate the capability of large language models (LLMs) in task automation.
Specifically, task decomposition, tool selection, and parameter prediction are assessed.
Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation.
arXiv Detail & Related papers (2023-11-30T18:02:44Z)
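TaskBench's three assessed dimensions map naturally onto per-dimension comparisons against gold annotations. A deliberately simplified sketch using exact match; the benchmark's real metrics are richer and human-verified.

```python
# Sketch of TaskBench's three assessed dimensions as simple exact-match
# checks against a gold record; data and scoring here are toy examples.
gold = {
    "subtasks": ["fetch page", "extract table"],
    "tool": "html_parser",
    "params": {"url": "https://example.com"},
}
prediction = {
    "subtasks": ["fetch page", "extract table"],
    "tool": "html_parser",
    "params": {"url": "https://example.com"},
}

scores = {
    "task_decomposition": prediction["subtasks"] == gold["subtasks"],
    "tool_selection": prediction["tool"] == gold["tool"],
    "parameter_prediction": prediction["params"] == gold["params"],
}
print(scores)  # all True for this toy example
```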