Fugu-MT Paper Translation (Abstract): Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol

Paper overview: Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol

arxiv url: http://arxiv.org/abs/2508.20737v1
Date: Thu, 28 Aug 2025 13:00:28 GMT
Status: Translation complete
Last updated in system: 2025-08-29 18:12:02.397855
Title: Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol
Title (reference translation): Rethinking Testing of LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol
Authors: Wei Ma, Yixiao Yang, Qiang Hu, Shi Ying, Zhi Jin, Bo Du, Zhenchang Xing, Tianlin Li, Junjie Shi, Yang Liu, Linxiao Jiang
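Abstract summary: Large language model (LLM) applications have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core.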
Reference score (independently computed attention score): 83.83217247686402
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Applications of Large Language Models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core. We then assess the applicability of traditional software testing methods in each layer: they are directly applicable at the shell layer, require semantic reinterpretation at the orchestration layer, and necessitate paradigm shifts at the inference core. A comparative analysis of Testing AI methods from the software engineering community and safety analysis techniques from the AI community reveals structural disconnects in testing unit abstraction, evaluation metrics, and lifecycle management. We identify four fundamental differences that underlie six core challenges. To address these, we propose four types of collaborative strategies (Retain, Translate, Integrate, and Runtime) and explore a closed-loop, trustworthy quality assurance framework that combines pre-deployment validation with runtime monitoring. Based on these strategies, we offer practical guidance and a protocol proposal to support the standardization and tooling of LLM application testing. We propose a protocol, Agent Interaction Communication Language (AICL), for communication between AI agents. AICL has test-oriented features and is easily integrated into current agent frameworks.
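Abstract (reference translation): Applications of large language models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool invocation, and multi-turn interactions. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: System Shell Layer, Prompt Orchestration Layer, and LLM Inference Core. It then evaluates the applicability of traditional software testing methods in each layer: directly applicable at the shell layer, requiring semantic reinterpretation at the orchestration layer, and requiring a paradigm shift at the inference core. A comparative analysis of Testing AI methods from the software engineering community and safety analysis techniques from the AI community reveals structural disconnects in testing unit abstraction, evaluation metrics, and lifecycle management. The authors identify four fundamental differences that underlie six core challenges. To address these, they propose four types of collaborative strategies (Retain, Translate, Integrate, and Runtime) and explore a closed-loop, trustworthy quality assurance framework that combines pre-deployment validation with runtime monitoring. Based on these strategies, they offer practical guidance and a protocol proposal to support the standardization and tooling of LLM application testing. They propose Agent Interaction Communication Language (AICL), a protocol for communication between AI agents. AICL has test-oriented features and can be easily integrated into current agent frameworks.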
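
The layer-by-layer assessment in the abstract can be written down as a small lookup table. The following is only an illustrative sketch: the layer names, the applicability wording, and the four strategy names come from the abstract, while the identifiers `Layer`, `APPLICABILITY`, `STRATEGIES`, and `applicability` are hypothetical names introduced here and are not taken from the paper or from AICL.

```python
from enum import Enum


class Layer(Enum):
    """The three layers of an LLM application named in the abstract."""
    SYSTEM_SHELL = "System Shell Layer"
    PROMPT_ORCHESTRATION = "Prompt Orchestration Layer"
    LLM_INFERENCE_CORE = "LLM Inference Core"


# How traditional software testing applies at each layer, paraphrased
# from the abstract. The dictionary itself is an illustrative assumption.
APPLICABILITY = {
    Layer.SYSTEM_SHELL: "directly applicable",
    Layer.PROMPT_ORCHESTRATION: "requires semantic reinterpretation",
    Layer.LLM_INFERENCE_CORE: "necessitates a paradigm shift",
}

# The four collaborative strategies named in the abstract. The abstract
# does not say how they map onto layers, so no mapping is assumed here.
STRATEGIES = ("Retain", "Translate", "Integrate", "Runtime")


def applicability(layer: Layer) -> str:
    """Return the abstract's assessment of traditional testing for a layer."""
    return APPLICABILITY[layer]


if __name__ == "__main__":
    for layer in Layer:
        print(f"{layer.value}: {applicability(layer)}")
```

Running the script prints one line per layer with its assessment, which may help when deciding where existing test suites can be reused as they are.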


Related papers