Fugu-MT Paper Translation (Abstract): Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

Paper overview: Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure

arxiv url: http://arxiv.org/abs/2603.16734v1
Date: Tue, 17 Mar 2026 16:16:35 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.404362
Title: Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure
Title (reference translation): Differential Harm Propensity in Personalized LLM Agents: The Curious Case of Mental Health Disclosure
Authors: Caglar Yildirim
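Abstract summary: Large language models (LLMs) are increasingly deployed as tool-using agents, shifting safety concerns from harmful text generation to harmful task completion. This study examined how mental health disclosure affects harmful behavior in agentic settings. The results suggest that personalization acts as a weak protective factor in agentic misuse settings but is fragile under minimal adversarial pressure.
Reference score (independently computed attention measure): 5.511540698163254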
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) are increasingly deployed as tool-using agents, shifting safety concerns from harmful text generation to harmful task completion. Deployed systems often condition on user profiles or persistent memory, yet agent safety evaluations typically ignore personalization signals. To address this gap, we investigated how mental health disclosure, a sensitive and realistic user-context cue, affects harmful behavior in agentic settings. Building on the AgentHarm benchmark, we evaluated frontier and open-source LLMs on multi-step malicious tasks (and their benign counterparts) under controlled prompt conditions that vary user-context personalization (no bio, bio-only, bio+mental health disclosure) and include a lightweight jailbreak injection. Our results reveal that harmful task completion is non-trivial across models: frontier lab models (e.g., GPT 5.2, Claude Sonnet 4.5, Gemini 3-Pro) still complete a measurable fraction of harmful tasks, while an open model (DeepSeek 3.2) exhibits substantially higher harmful completion. Adding a bio-only context generally reduces harm scores and increases refusals. Adding an explicit mental health disclosure often shifts outcomes further in the same direction, though effects are modest and not uniformly reliable after multiple-testing correction. Importantly, the refusal increase also appears on benign tasks, indicating a safety-utility trade-off via over-refusal. Finally, jailbreak prompting sharply elevates harm relative to benign conditions and can weaken or override the protective shift induced by personalization. Taken together, our results indicate that personalization can act as a weak protective factor in agentic misuse settings, but it is fragile under minimal adversarial pressure, highlighting the need for personalization-aware evaluations and safeguards that remain robust across user-context conditions.
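Abstract (reference translation): Large language models (LLMs) are becoming increasingly common as tool-using agents, which shifts safety concerns from harmful text generation to harmful task completion. Deployed systems are often conditioned on user profiles or persistent memory, yet agent safety evaluations usually ignore personalization signals. To address this gap, we examined how mental health disclosure, a sensitive and realistic user-context cue, affects harmful behavior in agentic settings. Building on the AgentHarm benchmark, we evaluated frontier and open-source LLMs on multi-step malicious tasks (and their benign counterparts) under controlled prompt conditions that vary user-context personalization (no bio, bio-only, bio+mental health disclosure) and include a lightweight jailbreak injection. Harmful task completion is non-trivial across models: frontier lab models (e.g., GPT 5.2, Claude Sonnet 4.5, Gemini 3-Pro) still complete a measurable fraction of harmful tasks, while an open model (DeepSeek 3.2) shows substantially higher harmful completion. Adding a bio-only context generally lowers harm scores and raises refusals. Adding an explicit mental health disclosure often moves outcomes further in the same direction, although the effects are modest and not uniformly reliable after multiple-testing correction. Importantly, the increase in refusals also appears on benign tasks, indicating a safety-utility trade-off through over-refusal. Finally, jailbreak prompting sharply raises harm relative to benign conditions and can weaken or override the protective shift induced by personalization. Taken together, these results indicate that personalization can act as a weak protective factor in agentic misuse settings but is fragile under minimal adversarial pressure, underscoring the need for personalization-aware evaluations and safeguards that remain robust across user-context conditions.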

Related papers