Fugu-MT Paper Translation (Abstract): PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations

Paper abstract: PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations

arxiv url: http://arxiv.org/abs/2604.17359v1
Date: Sun, 19 Apr 2026 10:05:25 GMT
Status: Translation complete
Last updated in system: 2026-04-21 21:52:52.485972
Title: PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations
Title (reference translation): PsychBench: An Examination of Epidemiological Fidelity in Large Language Model Mental Health Simulations
Authors: Patrick Keough
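Abstract summary: Large language models are increasingly deployed to simulate patients for clinical training, research, and mental health tools. This paper introduces PsychBench, the first epidemiological audit of LLM patient simulation. Its results indicate that models produce clinically plausible individuals while misrepresenting the populations those individuals are drawn from.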
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models are increasingly deployed to simulate patients for clinical training, research, and mental health tools, yet population-level validity remains largely untested. We introduce PsychBench, the first epidemiological audit of LLM patient simulation: 28,800 profiles from four frontier models (GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, GLM-4.7) evaluated against NHANES and NESARC-III baselines across 120 intersectional cohorts. The central finding is a coherence-fidelity dissociation: models produce clinically plausible individuals while misrepresenting the populations they are drawn from. Variance compression ranges from 14 percent (GLM-4.7) to 62 percent (DeepSeek-V3), eliminating the distributional tails of clinical reality. Despite test-retest correlations above r = 0.90, 36.66 percent of cases cross diagnostic thresholds between runs. Symptom correlation matrices diverge across demographic groups beyond split-half noise, with transgender populations diverging three to five times more than racial differences. Calibration bias is systematic and asymmetric. Models overestimate depression severity for most groups by 3.6 to 6.1 points (Cohen d = 1.13 to 1.91), consistent with training on clinical corpora with elevated base rates. For transgender women the direction inverts: models capture only 8 to 46 percent of documented minority stress elevation, yielding a -5.42 residual (d = -1.55). Models also attribute irritability to Black men and fatigue to women beyond matched controls, encoding racialized and gendered assumptions. Patterns replicate across US and Chinese architectures, indicating failures tied to current training paradigms rather than isolated implementations. For most users, LLM mental health tools risk pathologizing ordinary distress; for transgender users, algorithmic erasure of genuine need. The patients look right. They do not represent real populations.
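Abstract (reference translation): Large language models are increasingly deployed to simulate patients for clinical training, research, and mental health tools, but their validity at the population level has hardly been tested. We introduce PsychBench, the first epidemiological audit of LLM patient simulation, covering 28,800 profiles from four frontier models (GPT-4o-mini, DeepSeek-V3, Gemini-3-Flash, GLM-4.7) evaluated against NHANES and NESARC-III baselines across 120 intersectional cohorts. The central finding is a dissociation between coherence and fidelity: models generate clinically plausible individuals while misrepresenting the populations those individuals are drawn from. Variance compression ranges from 14 percent (GLM-4.7) to 62 percent (DeepSeek-V3), which removes the distributional tails of clinical reality. Although test-retest correlations exceed r = 0.90, 36.66 percent of cases cross diagnostic thresholds between runs. Symptom correlation matrices diverge across demographic groups beyond split-half noise, and transgender populations diverge three to five times more than racial differences do. Calibration bias is systematic and asymmetric. Models overestimate depression severity for most groups by 3.6 to 6.1 points (Cohen's d = 1.13 to 1.91), which is consistent with training on clinical corpora that have elevated base rates. For transgender women the direction reverses: models capture only 8 to 46 percent of the documented minority stress elevation, leaving a residual of -5.42 (d = -1.55). Models also attribute irritability to Black men and fatigue to women beyond matched controls, encoding racialized and gendered assumptions. The patterns replicate across US and Chinese architectures, pointing to failures tied to current training paradigms rather than to isolated implementations. For most users, LLM mental health tools risk pathologizing ordinary distress; for transgender users, they risk algorithmic erasure of genuine need. The patients look right. They do not represent real populations.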
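The abstract reports three population-level quantities: variance compression, Cohen's d for calibration bias, and the share of cases that cross a diagnostic threshold between runs. The sketch below shows one plausible way to compute each from lists of severity scores. It is an illustration only: the abstract does not give the paper's exact formulas, so the definitions here (compression as one minus the variance ratio, pooled-SD Cohen's d, a single cutoff for threshold crossing) are assumptions, and all function names, the toy scores, and the cutoff of 10 are hypothetical.

```python
import math
import statistics


def variance_compression(simulated, reference):
    """Fraction of the reference variance that is missing from the simulated scores.

    Assumed definition: 1 - Var(simulated) / Var(reference), using sample variances.
    """
    return 1.0 - statistics.variance(simulated) / statistics.variance(reference)


def cohens_d(simulated, reference):
    """Standardized mean difference (simulated minus reference) with a pooled SD."""
    n1, n2 = len(simulated), len(reference)
    pooled_var = (
        (n1 - 1) * statistics.variance(simulated)
        + (n2 - 1) * statistics.variance(reference)
    ) / (n1 + n2 - 2)
    return (statistics.mean(simulated) - statistics.mean(reference)) / math.sqrt(pooled_var)


def threshold_crossing_rate(run_a, run_b, cutoff):
    """Share of paired cases that fall on opposite sides of the cutoff in two runs."""
    flips = sum((a >= cutoff) != (b >= cutoff) for a, b in zip(run_a, run_b))
    return flips / len(run_a)


# Toy data, invented for illustration (not taken from the paper).
reference = [0, 2, 4, 6, 8, 10, 12, 14]   # wide spread, lower mean
simulated = [6, 7, 8, 9, 10, 11, 12, 13]  # narrow spread, higher mean

run_a = [8, 9, 10, 11, 15, 3]             # same cases scored in two runs
run_b = [9, 10, 9, 12, 14, 4]

print(variance_compression(simulated, reference))
print(cohens_d(simulated, reference))
print(threshold_crossing_rate(run_a, run_b, cutoff=10))
```

In the toy runs every score moves by only one point, yet two of the six cases change diagnostic status because they sit next to the cutoff. This is the kind of situation in which a high test-retest correlation and frequent threshold crossing can coexist.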


Related papers