Fugu-MT Paper Translation (Summary): PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI

Paper summary: PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI

arxiv url: http://arxiv.org/abs/2605.15665v1
Date: Fri, 15 May 2026 06:43:07 GMT
Status: translation complete
Last updated in system: 2026-05-18 21:22:26.194418
Title: PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI
Title (reference translation): PRISM: Improving Prompt Reliability for Enterprise Conversational AI through Iterative Simulation and Monitoring
Authors: Keshava Chaitanya, Jahnavi Gundakaram
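Abstract summary: PRISM (Prompt Reliability via Iterative Simulation and Monitoring) is a closed-loop framework that treats prompt engineering as a continuous reliability engineering problem. We evaluate PRISM across 35 enterprise conversational agents over a three-week deployment period on the Yellow.ai V3 platform.
Reference score (independently computed attention score): 0.0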
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: Deploying large language model (LLM)-driven conversational agents in enterprise settings requires prompts that are simultaneously correct at launch and resilient to the non-deterministic behavioral drift that characterizes production LLM deployments. Existing prompt optimization frameworks address prompt quality as a one-time compile-time problem, leaving open the equally critical question of how to detect and repair prompt regressions caused by silent LLM behavior changes over time. We present PRISM (Prompt Reliability via Iterative Simulation and Monitoring), a closed-loop framework that treats prompt engineering as a continuous reliability engineering problem rather than a one-time authorship task. PRISM takes as input plain-language agent requirements, a set of configured tools and memory variables, and an initial draft prompt. It automatically generates test cases from requirements, simulates full multi-turn conversations against a platform-faithful LLM environment, evaluates pass/fail using an LLM-as-judge, diagnoses root causes of failures, and surgically repairs the prompt -- iterating until all tests pass. Critically, PRISM is designed to run on a scheduled basis (daily), treating LLM behavioral drift as a first-class reliability concern. We evaluate PRISM across 35 enterprise conversational agents over a three-week deployment period on the Yellow.ai V3 platform. PRISM reduces median prompt authoring time from 2 days to under 30 minutes, achieves 99% production reliability across all evaluated agents, and successfully identifies and repairs production regressions caused by LLM behavioral drift within a 24-hour detection window. Our results suggest that continuous, simulation-driven prompt optimization is both tractable and necessary for reliable enterprise conversational AI at scale.
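Abstract (reference translation): Deploying conversational agents driven by large language models (LLMs) in enterprise settings requires prompts that are both correct at launch and resilient to the non-deterministic behavioral drift that characterizes production LLM deployments. Existing prompt optimization frameworks treat prompt quality as a one-time, compile-time problem, and leave open the equally important question of how to detect and repair prompt regressions caused by silent changes in LLM behavior over time. PRISM (Prompt Reliability via Iterative Simulation and Monitoring) is a closed-loop framework that treats prompt engineering as a continuous reliability engineering problem rather than a one-time authoring task. PRISM takes as input plain-language agent requirements, a set of configured tools and memory variables, and an initial draft prompt. It automatically generates test cases from the requirements, simulates full multi-turn conversations against a platform-faithful LLM environment, evaluates pass/fail with an LLM-as-judge, diagnoses the root causes of failures, and surgically repairs the prompt, iterating until all tests pass. Importantly, PRISM is designed to run on a schedule (daily), treating LLM behavioral drift as a first-class reliability concern. We evaluate PRISM on 35 enterprise conversational agents over a three-week deployment period on the Yellow.ai V3 platform. PRISM reduces median prompt authoring time from 2 days to under 30 minutes, achieves 99% production reliability across all evaluated agents, and identifies and repairs production regressions caused by LLM behavioral drift within a 24-hour detection window. These results suggest that continuous, simulation-driven prompt optimization is both tractable and necessary for reliable enterprise conversational AI at scale.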
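
The loop described in the abstract (simulate, judge, diagnose, repair, repeat until every test passes) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `TestCase`, `repair_loop`, and the `toy_*` functions are hypothetical, the iteration cap `max_iters` is an added assumption, and the LLM-backed steps are replaced by simple string-based stand-ins so the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class TestCase:
    """One requirement-derived test (hypothetical structure, not from the paper)."""
    requirement: str
    user_turns: List[str]


def repair_loop(
    prompt: str,
    tests: List[TestCase],
    simulate: Callable[[str, TestCase], List[str]],
    judge: Callable[[TestCase, List[str]], bool],
    diagnose: Callable[[str, TestCase, List[str]], str],
    repair: Callable[[str, List[str]], str],
    max_iters: int = 5,  # assumed safety cap; the abstract only says "until all tests pass"
) -> Tuple[str, bool]:
    """Run simulate -> judge -> diagnose -> repair until all tests pass or the cap is hit."""
    for iteration in range(max_iters + 1):
        diagnoses = []
        for test in tests:
            transcript = simulate(prompt, test)
            if not judge(test, transcript):
                diagnoses.append(diagnose(prompt, test, transcript))
        if not diagnoses:
            return prompt, True   # every test passed
        if iteration == max_iters:
            break                 # out of repair attempts
        prompt = repair(prompt, diagnoses)
    return prompt, False


# Toy stand-ins for the LLM-backed components (assumptions for illustration only).
def toy_simulate(prompt: str, test: TestCase) -> List[str]:
    # The "agent" follows a requirement only if the prompt mentions it.
    follows = test.requirement.lower() in prompt.lower()
    reply = f"Sure. First, I need to {test.requirement}." if follows else "Sure, done."
    return [reply for _ in test.user_turns]


def toy_judge(test: TestCase, transcript: List[str]) -> bool:
    # Pass if any agent reply reflects the requirement.
    return any(test.requirement in reply for reply in transcript)


def toy_diagnose(prompt: str, test: TestCase, transcript: List[str]) -> str:
    # Root cause in this toy world is always a missing instruction.
    return test.requirement


def toy_repair(prompt: str, diagnoses: List[str]) -> str:
    # Append one targeted instruction per diagnosed failure.
    return prompt + "".join(f"\n- Always {d}." for d in diagnoses)


tests = [TestCase("ask for the order ID", ["I want a refund"])]
final_prompt, passed = repair_loop(
    "You are a support agent.", tests,
    toy_simulate, toy_judge, toy_diagnose, toy_repair,
)
print(passed)
print(final_prompt)
```

Passing the components in as callables keeps the control flow separate from any particular model or platform. The daily monitoring described in the abstract would correspond to invoking a loop like this on a schedule against the deployed prompt, which is outside the scope of this sketch.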
