Fugu-MT Paper Translation (Abstract): An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

Paper abstract: An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

arxiv url: http://arxiv.org/abs/2509.19185v2
Date: Wed, 24 Sep 2025 14:15:20 GMT
Status: Translation complete
System update date: 2025-09-25 14:09:11.261299
Title: An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
Title (reference translation): An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
Authors: Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan
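Abstract summary: Foundation model (FM)-based AI agents are rapidly being adopted across diverse domains. Their inherent non-determinism and non-reproducibility pose challenges for testing and quality assurance. This work conducts the first large-scale empirical study of testing practices in the AI agent ecosystem.
Reference score (independently computed attention score): 12.166151903597445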
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility pose testing and quality assurance challenges. While recent benchmarks provide task-level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. Crucially, this reveals a critical blind spot, as the Trigger component (prompts) remains neglected, appearing in around 1% of all tests. Our findings offer the first empirical testing baseline in FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To address it, framework developers should improve support for novel testing methods, application developers must adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
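Abstract (reference translation): Foundation model (FM)-based AI agents are rapidly being adopted across diverse domains, but their inherent non-determinism and non-reproducibility pose challenges for testing and quality assurance. Recent benchmarks provide task-level evaluations, but there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct a large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods such as DeepEval are seldom used (around 1%), whereas traditional patterns such as negative and membership testing are widely adapted to manage FM uncertainty. Deterministic components such as Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. The Trigger component (prompts) remains neglected, appearing in around 1% of all tests. The findings offer the first empirical testing baseline for FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To address this, framework developers should improve support for novel testing methods, and application developers must adopt prompt regression testing. Strengthening these practices is vital for building more robust and dependable AI agents.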
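
As a rough illustration of the patterns the abstract names (membership testing, negative testing, and prompt regression testing), the following minimal Python sketch may help. It is not taken from the paper: the stub agent `fake_agent_pick_tool`, the `ALLOWED_TOOLS` set, and the prompt template are hypothetical stand-ins, and a seeded random generator merely simulates FM non-determinism.

```python
import random

# Hypothetical names throughout; nothing here comes from the paper or a real framework.
ALLOWED_TOOLS = {"search", "calculator", "calendar"}

PROMPT_TEMPLATE = "You are a helpful agent. Answer the question: {question}"
# Stored "golden" prompt used by the regression check below.
EXPECTED_PROMPT = "You are a helpful agent. Answer the question: What is 2 + 2?"


def fake_agent_pick_tool(query: str, rng: random.Random) -> str:
    """Simulate a non-deterministic tool choice made by an FM-backed agent."""
    if not query.strip():
        raise ValueError("query must not be empty")
    return rng.choice(sorted(ALLOWED_TOOLS))


def check_membership(trials: int = 20) -> bool:
    """Membership test: the output must fall in an allowed set,
    rather than equal one exact value."""
    rng = random.Random(0)
    return all(
        fake_agent_pick_tool("What is 2 + 2?", rng) in ALLOWED_TOOLS
        for _ in range(trials)
    )


def check_negative() -> bool:
    """Negative test: invalid input must be rejected with an error."""
    try:
        fake_agent_pick_tool("   ", random.Random(0))
    except ValueError:
        return True
    return False


def check_prompt_regression() -> bool:
    """Prompt regression test: the rendered prompt must match the stored one,
    so unintended template edits are caught."""
    rendered = PROMPT_TEMPLATE.format(question="What is 2 + 2?")
    return rendered == EXPECTED_PROMPT


if __name__ == "__main__":
    print(check_membership(), check_negative(), check_prompt_regression())
```

The membership check tolerates variation in the output while still constraining it, which is one plausible reading of how such traditional patterns are adapted to FM uncertainty. The prompt check compares only the deterministic, rendered prompt text.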

Related papers