AI-NativeBench: An Open-Source White-Box Agentic Benchmark Suite for AI-Native Systems
- URL: http://arxiv.org/abs/2601.09393v1
- Date: Wed, 14 Jan 2026 11:32:07 GMT
- Title: AI-NativeBench: An Open-Source White-Box Agentic Benchmark Suite for AI-Native Systems
- Authors: Zirui Wang, Guangba Yu, Michael R. Lyu,
- Abstract summary: We introduce AI-NativeBench, the first application-centric and white-box AI-Native benchmark suite grounded in Model Context Protocol (MCP) and Agent-to-Agent (A2A) standards.<n>By treating agentic spans as first-class citizens within distributed traces, our methodology enables granular analysis of engineering characteristics beyond simple capabilities.<n>This work provides the first systematic evidence to guide the transition from measuring model capability to engineering reliable AI-Native systems.
- Score: 52.65695508605237
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The transition from Cloud-Native to AI-Native architectures is fundamentally reshaping software engineering, replacing deterministic microservices with probabilistic agentic services. However, this shift renders traditional black-box evaluation paradigms insufficient: existing benchmarks measure raw model capabilities while remaining blind to system-level execution dynamics. To bridge this gap, we introduce AI-NativeBench, the first application-centric and white-box AI-Native benchmark suite grounded in Model Context Protocol (MCP) and Agent-to-Agent (A2A) standards. By treating agentic spans as first-class citizens within distributed traces, our methodology enables granular analysis of engineering characteristics beyond simple capabilities. Leveraging this benchmark across 21 system variants, we uncover critical engineering realities invisible to traditional metrics: a parameter paradox where lightweight models often surpass flagships in protocol adherence, a pervasive inference dominance that renders protocol overhead secondary, and an expensive failure pattern where self-healing mechanisms paradoxically act as cost multipliers on unviable workflows. This work provides the first systematic evidence to guide the transition from measuring model capability to engineering reliable AI-Native systems. To facilitate reproducibility and further research, we have open-sourced the benchmark and dataset.
Related papers
- Engineering AI Agents for Clinical Workflows: A Case Study in Architecture,MLOps, and Governance [0.21748200848556345]
We show how a Human-in-the-Loop governance model is technically integrated not merely as a safety check, but as a critical, event-driven data source for continuous improvement.<n>We present the platform as a reference architecture, offering practical lessons for engineers building maintainable, scalable, and accountable AI-enabled systems in high-stakes domains.
arXiv Detail & Related papers (2026-01-31T14:33:57Z) - EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots [68.29056647487519]
Embodied AI is fueled by high-fidelity simulation and large-scale data collection.<n>However, this scaling capability remains bottlenecked by a reliance on labor-intensive manual oversight.<n>We introduce textscEmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies.
arXiv Detail & Related papers (2026-01-29T11:33:49Z) - Agentic Confidence Calibration [67.50096917021521]
Holistic Trajectory (HTC) is a novel diagnostic framework for AI agents.<n>HTC consistently surpasses strong baselines in both calibration and discrimination.<n>HTC provides interpretability by revealing the signals behind failure.
arXiv Detail & Related papers (2026-01-22T09:08:25Z) - SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal.<n>Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise.<n>We propose SoliReward, a systematic framework for video RM training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z) - Creative Adversarial Testing (CAT): A Novel Framework for Evaluating Goal-Oriented Agentic AI Systems [0.0]
Creative Adversarial Testing (CAT) is a novel approach designed to capture and analyze the complex relationship between Agentic AI tasks and the system's intended objectives.<n>We validate the CAT framework through extensive simulation using synthetic interaction data modeled after Alexa+ audio services.<n>Our results demonstrate that the CAT framework provides unprecedented insights into goal-task alignment, enabling more effective optimization and development of Agentic AI systems.
arXiv Detail & Related papers (2025-09-26T23:52:20Z) - A2R: An Asymmetric Two-Stage Reasoning Framework for Parallel Reasoning [57.727084580884075]
Asymmetric Two-Stage Reasoning framework designed to bridge gap between a model's potential and its actual performance.<n>A2R-Efficient is a "small-to-big" variant that combines a Qwen3-4B explorer with a Qwen3-8B synthesizer.<n>Results show A2R is not only a performance-boosting framework but also an efficient and practical solution for real-world applications.
arXiv Detail & Related papers (2025-09-26T08:27:03Z) - White-Basilisk: A Hybrid Model for Code Vulnerability Detection [45.03594130075282]
We introduce White-Basilisk, a novel approach to vulnerability detection that demonstrates superior performance.<n>White-Basilisk achieves results in vulnerability detection tasks with a parameter count of only 200M.<n>This research establishes new benchmarks in code security and provides empirical evidence that compact, efficiently designed models can outperform larger counterparts in specialized tasks.
arXiv Detail & Related papers (2025-07-11T12:39:25Z) - Robust Machine Unlearning for Quantized Neural Networks via Adaptive Gradient Reweighting with Similar Labels [5.868949328814509]
Model quantization enables efficient deployment of deep neural networks on edge devices through low-bit parameter representation.<n>Existing machine unlearning (MU) methods fail to address two fundamental limitations in quantized networks.<n>We propose Q-MUL, the first dedicated unlearning framework for quantized models.
arXiv Detail & Related papers (2025-03-18T05:22:13Z) - Joint Explainability-Performance Optimization With Surrogate Models for AI-Driven Edge Services [3.8688731303365533]
This paper explores the balance between the predictive accuracy of complex AI models and their approximation by surrogate ones.<n>We introduce a new algorithm based on multi-objective optimization (MOO) to simultaneously minimize both the complex model's prediction error and the error between its outputs and those of the surrogate.
arXiv Detail & Related papers (2025-03-10T19:04:09Z) - STAMP: Scalable Task And Model-agnostic Collaborative Perception [24.890993164334766]
STAMP is a task- and model-agnostic, collaborative perception pipeline for heterogeneous agents.<n>It minimizes computational overhead, enhances scalability, and preserves model security.<n>As a first-of-its-kind framework, STAMP aims to advance research in scalable and secure mobility systems towards Level 5 autonomy.
arXiv Detail & Related papers (2025-01-24T16:27:28Z) - Real-Time Anomaly Detection and Reactive Planning with Large Language Models [18.57162998677491]
Foundation models, e.g., large language models (LLMs), trained on internet-scale data possess zero-shot capabilities.
We present a two-stage reasoning framework that incorporates the judgement regarding potential anomalies into a safe control framework.
This enables our monitor to improve the trustworthiness of dynamic robotic systems, such as quadrotors or autonomous vehicles.
arXiv Detail & Related papers (2024-07-11T17:59:22Z) - QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that leveraging its insights, for example, improves the absolute performance of the Llama 2 model by up to 15% points relative.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.