Fugu-MT Paper Translation (Summary): PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators

Paper summary: PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators

arxiv url: http://arxiv.org/abs/2604.25840v1
Date: Tue, 28 Apr 2026 16:46:25 GMT
Status: Translation complete
Last updated in system: 2026-04-29 16:49:17.954956
Title: PSI-Bench: Towards Clinically Grounded and Interpretable Evaluation of Depression Patient Simulators
Title (reference translation): PSI-Bench: Toward Clinical and Interpretable Evaluation of Depression Patient Simulation
Authors: Nguyen Khoi Hoang, Shuhaib Mehri, Tse-An Hsu, Yi-Jyun Sun, Quynh Xuan Nguyen Truong, Khoa D Doan, Dilek Hakkani-Tür
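Abstract summary: PSI-Bench is an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depression patient simulator behavior. Using PSI-Bench, the authors benchmark seven LLMs across two simulator frameworks and find that simulators produce overly long, lexically diverse responses. A human study shows that the benchmark is strongly aligned with expert judgments.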
Reference score (independently computed attention score): 14.323763649788907
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Patient simulators are gaining traction in mental health training by providing scalable exposure to complex and sensitive patient interactions. Simulating depressed patients is particularly challenging, as safety constraints and high patient variability complicate simulations and underscore the need for simulators that capture diverse and realistic patient behaviors. However, existing evaluations heavily rely on LLM-judges with poorly specified prompts and do not assess behavioral diversity. We introduce PSI-Bench, an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depression patient simulator behavior across turn-, dialogue-, and population-level dimensions. Using PSI-Bench, we benchmark seven LLMs across two simulator frameworks and find that simulators produce overly long, lexically diverse responses, show reduced variability, resolve emotions too quickly, and follow a uniform negative-to-positive trajectory. We also show that the simulation framework has a larger impact on fidelity than the model scale. Results from a human study demonstrate that our benchmark is strongly aligned with expert judgments. Our work reveals key limitations of current depression patient simulators and provides an interpretable, extensible benchmark to guide future simulator design and evaluation.
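Abstract (reference translation): Patient simulators are gaining traction in mental health training by providing scalable exposure to complex and sensitive patient interactions. Simulating depressed patients is particularly difficult: safety constraints and high patient variability complicate simulation and highlight the need for simulators that capture diverse, realistic patient behaviors. However, existing evaluations rely heavily on LLM judges with poorly specified prompts and do not assess behavioral diversity. PSI-Bench is an automatic evaluation framework that provides interpretable, clinically grounded diagnostics of depression patient simulator behavior across turn-, dialogue-, and population-level dimensions. Using PSI-Bench, we benchmark seven LLMs across two simulator frameworks and find that simulators produce overly long, lexically diverse responses, show reduced variability, resolve emotions too quickly, and follow a uniform negative-to-positive trajectory. We also show that the simulation framework has a larger impact on fidelity than model scale. A human study shows that the benchmark is strongly aligned with expert judgments. Our work reveals key limitations of current depression patient simulators and provides an interpretable, extensible benchmark to guide future simulator design and evaluation.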

Related papers