Related papers: Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems

Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems

URL: http://arxiv.org/abs/2509.00115v3
Date: Sat, 13 Sep 2025 03:25:07 GMT
Title: Adaptive Monitoring and Real-World Evaluation of Agentic AI Systems
Authors: Manish Shukla,
Abstract summary: Multi-agent systems that combine large language models with external tools are rapidly transitioning from research laboratories into high-stakes domains.<n>This "Advanced" sequel fills that gap by providing an algorithmic instantiation or empirical evidence.<n>AMDM cuts anomaly-detection latency from 12.3 s to 5.6 s on simulated goal drift and reduces false-positive rates from 4.5% to 0.9%.
Score: 3.215065407261898
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Agentic artificial intelligence (AI) -- multi-agent systems that combine large language models with external tools and autonomous planning -- are rapidly transitioning from research laboratories into high-stakes domains. Our earlier "Basic" paper introduced a five-axis framework and proposed preliminary metrics such as goal drift and harm reduction but did not provide an algorithmic instantiation or empirical evidence. This "Advanced" sequel fills that gap. First, we revisit recent benchmarks and industrial deployments to show that technical metrics still dominate evaluations: a systematic review of 84 papers from 2023--2025 found that 83% report capability metrics while only 30% consider human-centred or economic axes [2]. Second, we formalise an Adaptive Multi-Dimensional Monitoring (AMDM) algorithm that normalises heterogeneous metrics, applies per-axis exponentially weighted moving-average thresholds and performs joint anomaly detection via the Mahalanobis distance [7]. Third, we conduct simulations and real-world experiments. AMDM cuts anomaly-detection latency from 12.3 s to 5.6 s on simulated goal drift and reduces false-positive rates from 4.5% to 0.9% compared with static thresholds. We present a comparison table and ROC/PR curves, and we reanalyse case studies to surface missing metrics. Code, data and a reproducibility checklist accompany this paper to facilitate replication. The code supporting this work is available at https://github.com/Manishms18/Adaptive-Multi-Dimensional-Monitoring.

Related papers

Temporal Attack Pattern Detection in Multi-Agent AI Workflows: An Open Framework for Training Trace-Based Security Models [0.0]
We present an openly documented methodology for fine-tuning language models to detect temporal attack patterns in multi-agent AI.<n>We curate a dataset of 80,851 examples from 18 public cybersecurity sources and 35,026 synthetic OpenTelemetry traces.<n>Our custom benchmark accuracy improves from 42.86% to 74.29%, a statistically significant 31.4-point gain.
arXiv Detail & Related papers (2025-12-29T09:41:22Z)
Towards a Science of Scaling Agent Systems [79.64446272302287]
We formalize a definition for agent evaluation and characterize scaling laws as the interplay between agent quantity, coordination structure, modelic, and task properties.<n>We derive a predictive model using coordination metrics, that cross-validated R2=0, enabling prediction on unseen task domains.<n>We identify three effects: (1) a tool-coordination trade-off: under fixed computational budgets, tool-heavy tasks suffer disproportionately from multi-agent overhead, and (2) a capability saturation: coordination yields diminishing or negative returns once single-agent baselines exceed 45%.
arXiv Detail & Related papers (2025-12-09T06:52:21Z)
Explainable Deep Learning for Brain Tumor Classification: Comprehensive Benchmarking with Dual Interpretability and Lightweight Deployment [4.259927630334864]
This study provides a full deep learning system for automated classification of brain tumors from MRI images.<n>Inception-ResNet V2 reached state-of-the-art performance, achieving a 99.53% accuracy on testing.<n>This end-to-end solution considers accuracy, interpretability, and deployability of trustworthy AI.
arXiv Detail & Related papers (2025-11-20T17:21:40Z)
Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning [70.56067503630486]
We argue that sixth-generation (6G) intelligence is not fluent token prediction but calibrated the capacity to imagine and choose.<n>We show that WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35-80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster inference.
arXiv Detail & Related papers (2025-11-04T17:22:22Z)
ShortcutBreaker: Low-Rank Noisy Bottleneck with Global Perturbation Attention for Multi-Class Unsupervised Anomaly Detection [59.89803740308262]
ShortcutBreaker is a novel unified feature-reconstruction framework for MUAD tasks.<n>It features two key innovations to address the issue of shortcuts.<n>The proposed method achieves a remarkable image-level AUROC of 99.8%, 98.9%, 90.6%, and 87.8% on four datasets.
arXiv Detail & Related papers (2025-10-21T06:51:30Z)
Systematic Review: Anomaly Detection in Connected and Autonomous Vehicles [0.0]
This systematic review focuses on anomaly detection for connected and autonomous vehicles. The most commonly used Artificial Intelligence (AI) algorithms employed in anomaly detection are neural networks like LSTM, CNN, and autoencoders, alongside one-class SVM. There is a need for future research to investigate the deployment of anomaly detection to a vehicle to assess its performance on the road.
arXiv Detail & Related papers (2024-05-04T18:31:38Z)
Anomaly Detection for Incident Response at Scale [1.284857579394658]
We present a machine learning-based anomaly detection product that monitors Walmart's business and system health in real-time. During the validation over 3 months, the product served predictions from over 3000 models to more than 25 application, platform, and operation teams. AIDR has achieved success with various internal teams with lower time to detection and fewer false positives than previous methods.
arXiv Detail & Related papers (2024-04-24T00:46:19Z)
Self-supervised Feature Adaptation for 3D Industrial Anomaly Detection [59.41026558455904]
We focus on multi-modal anomaly detection. Specifically, we investigate early multi-modal approaches that attempted to utilize models pre-trained on large-scale visual datasets. We propose a Local-to-global Self-supervised Feature Adaptation (LSFA) method to finetune the adaptors and learn task-oriented representation toward anomaly detection.
arXiv Detail & Related papers (2024-01-06T07:30:41Z)
Tool-Augmented Reward Modeling [58.381678612409]
We propose a tool-augmented preference modeling approach, named Themis, to address limitations by empowering RMs with access to external environments. Our study delves into the integration of external tools into RMs, enabling them to interact with diverse external sources. In human evaluations, RLHF trained with Themis attains an average win rate of 32% when compared to baselines.
arXiv Detail & Related papers (2023-10-02T09:47:40Z)
A Study of Unsupervised Evaluation Metrics for Practical and Automatic Domain Adaptation [15.728090002818963]
Unsupervised domain adaptation (UDA) methods facilitate the transfer of models to target domains without labels. In this paper, we aim to find an evaluation metric capable of assessing the quality of a transferred model without access to target validation labels.
arXiv Detail & Related papers (2023-08-01T05:01:05Z)
Inertial Hallucinations -- When Wearable Inertial Devices Start Seeing Things [82.15959827765325]
We propose a novel approach to multimodal sensor fusion for Ambient Assisted Living (AAL) We address two major shortcomings of standard multimodal approaches, limited area coverage and reduced reliability. Our new framework fuses the concept of modality hallucination with triplet learning to train a model with different modalities to handle missing sensors at inference time.
arXiv Detail & Related papers (2022-07-14T10:04:18Z)
RLAD: Time Series Anomaly Detection through Reinforcement Learning and Active Learning [17.089402177923297]
We introduce a new semi-supervised, time series anomaly detection algorithm. It uses deep reinforcement learning and active learning to efficiently learn and adapt to anomalies in real-world time series data. It requires no manual tuning of parameters and outperforms all state-of-art methods we compare with.
arXiv Detail & Related papers (2021-03-31T15:21:15Z)
Inception Convolution with Efficient Dilation Search [121.41030859447487]
Dilation convolution is a critical mutant of standard convolution neural network to control effective receptive fields and handle large scale variance of objects. We propose a new mutant of dilated convolution, namely inception (dilated) convolution where the convolutions have independent dilation among different axes, channels and layers. We explore a practical method for fitting the complex inception convolution to the data, a simple while effective dilation search algorithm(EDO) based on statistical optimization is developed.
arXiv Detail & Related papers (2020-12-25T14:58:35Z)
SUOD: Accelerating Large-Scale Unsupervised Heterogeneous Outlier Detection [63.253850875265115]
Outlier detection (OD) is a key machine learning (ML) task for identifying abnormal objects from general samples. We propose a modular acceleration system, called SUOD, to address it.
arXiv Detail & Related papers (2020-03-11T00:22:50Z)

This list is automatically generated from the titles and abstracts of the papers in this site.