Fugu-MT Paper Translation (Abstract): MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation

Paper overview: MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation

arxiv url: http://arxiv.org/abs/2508.19163v1
Date: Tue, 26 Aug 2025 16:12:12 GMT
Status: Translation complete
Last updated in system: 2025-08-27 17:42:38.913158
Title: MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation
Title (reference translation): MATRIX: Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation
Authors: Ernest Lim, Yajie Vera He, Jared Joselowitz, Kate Preston, Mohita Chowdhury, Louis Williams, Aisling Higham, Katrina Mason, Mariane Melo, Tom Lawton, Yan Jia, Ibrahim Habli
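Abstract summary: MATRIX is a structured engineering framework for safety-oriented evaluation of clinical dialogue agents. It integrates a safety-aligned taxonomy of clinical scenarios, expected system behaviors, and failure modes; BehvJudge, an evaluation tool that detects safety-relevant dialogue failures; and PatBot, a simulated patient agent. Across three experiments, MATRIX is shown to enable systematic, scalable safety evaluation.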
Reference score (independently computed attention metric): 3.9146063017280923
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Despite the growing use of large language models (LLMs) in clinical dialogue systems, existing evaluations focus on task completion or fluency, offering little insight into the behavioral and risk management requirements essential for safety-critical systems. This paper presents MATRIX (Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation), a structured, extensible framework for safety-oriented evaluation of clinical dialogue agents. MATRIX integrates three components: (1) a safety-aligned taxonomy of clinical scenarios, expected system behaviors and failure modes derived through structured safety engineering methods; (2) BehvJudge, an LLM-based evaluator for detecting safety-relevant dialogue failures, validated against expert clinician annotations; and (3) PatBot, a simulated patient agent capable of producing diverse, scenario-conditioned responses, evaluated for realism and behavioral fidelity with human factors expertise, and a patient-preference study. Across three experiments, we show that MATRIX enables systematic, scalable safety evaluation. BehvJudge with Gemini 2.5-Pro achieves expert-level hazard detection (F1 0.96, sensitivity 0.999), outperforming clinicians in a blinded assessment of 240 dialogues. We also conducted one of the first realism analyses of LLM-based patient simulation, showing that PatBot reliably simulates realistic patient behavior in quantitative and qualitative evaluations. Using MATRIX, we demonstrate its effectiveness in benchmarking five LLM agents across 2,100 simulated dialogues spanning 14 hazard scenarios and 10 clinical domains. MATRIX is the first framework to unify structured safety engineering with scalable, validated conversational AI evaluation, enabling regulator-aligned safety auditing. We release all evaluation tools, prompts, structured scenarios, and datasets.
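Abstract (reference translation): Despite the growing use of large language models (LLMs) in clinical dialogue systems, existing evaluations focus on task completion or fluency and offer little insight into the behavioral and risk management requirements essential for safety-critical systems. This paper presents MATRIX (Multi-Agent simulaTion fRamework for safe Interactions and conteXtual clinical conversational evaluation), a structured, extensible framework for safety-oriented evaluation of clinical dialogue agents. MATRIX integrates three components: (1) a safety-aligned taxonomy of clinical scenarios, expected system behaviors, and failure modes derived through structured safety engineering methods; (2) BehvJudge, an LLM-based evaluator for detecting safety-relevant dialogue failures; and (3) PatBot, a simulated patient agent. Across three experiments, MATRIX is shown to enable systematic, scalable safety evaluation. BehvJudge with Gemini 2.5-Pro achieves expert-level hazard detection (F1 0.96, sensitivity 0.999), outperforming clinicians in a blinded assessment of 240 dialogues. The authors also conducted one of the first realism analyses of LLM-based patient simulation, showing that PatBot reliably simulates realistic patient behavior in quantitative and qualitative evaluations. MATRIX is used to benchmark five LLM agents across 2,100 simulated dialogues spanning 14 hazard scenarios and 10 clinical domains. MATRIX is the first framework to unify structured safety engineering with scalable, validated conversational AI evaluation, enabling regulator-aligned safety auditing. All evaluation tools, prompts, structured scenarios, and datasets are released.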
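
The F1 and sensitivity figures quoted in the abstract are standard binary-classification metrics. As a minimal sketch of how such scores can be computed from per-dialogue hazard labels, the Python below uses a small made-up label set. The function names and the toy labels are hypothetical illustrations; they are not the paper's data, and this is not the released BehvJudge tooling.

```python
# Hypothetical sketch: sensitivity and F1 for binary hazard detection.
# Label convention (an assumption): 1 = hazard present, 0 = no hazard.

def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for two equal-length lists of 0/1 labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have the same length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

def sensitivity(tp, fn):
    """Fraction of true hazards that were detected (recall)."""
    total = tp + fn
    return tp / total if total else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Toy labels, invented purely for illustration.
reference_labels = [1, 1, 1, 0, 0, 1, 0, 1]  # e.g. expert annotations
judge_labels     = [1, 1, 1, 1, 0, 1, 0, 0]  # e.g. automated evaluator output

tp, fp, fn, tn = confusion_counts(reference_labels, judge_labels)
print("sensitivity:", sensitivity(tp, fn))
print("F1:", f1_score(tp, fp, fn))
```

A sensitivity close to 1, as reported in the abstract, means that almost no true hazards are missed. F1 also penalizes false alarms, which is why the two figures can differ.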

Related papers