Fugu-MT Paper Translation (Abstract): Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations

Paper overview: Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations

arxiv url: http://arxiv.org/abs/2604.27093v1
Date: Wed, 29 Apr 2026 18:37:18 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:53.757887
Title: Useless but Safe? Benchmarking Utility Recovery with User Intent Clarification in Multi-Turn Conversations
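Title (reference translation): Useless but Safe? Benchmarking utility recovery through user-intent clarification in multi-turn conversations
Authors: Mingqian Zheng, Malia Morgan, Liwei Jiang, Carolyn Rose, Maarten Sap
Abstract summary: We introduce CarryOnBench, the first interactive benchmark that measures whether LLMs can revise their interpretation of user intent and recover utility. We simulate 5,970 conversations with varying user follow-up sequences and evaluate 14 models on both intent-aligned utility and safety. CarryOnBench yields 1,866 different conversation flows of 4-12 turns, totaling 23,880 model responses.
Reference score (attention score computed independently by this site): 32.23729177914094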
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Current LLM safety alignment techniques improve model robustness against adversarial attacks, but overlook whether and how LLMs can recover helpfulness when benign users clarify their intent. We introduce CarryOnBench, the first interactive benchmark that measures whether LLMs can revise their interpretation of user intent and recover utility, while remaining safe through multi-turn conversations. Starting from 398 seemingly harmful queries with benign underlying intents, we simulate 5,970 conversations by varying user follow-up sequences, evaluating 14 models on both intent-aligned utility and safety. CarryOnBench yields 1,866 different conversation flows of 4--12 turns, totaling 23,880 model responses. We design Ben-Util, a checklist-based metric that evaluates how well each model response fulfills the user's benign information need using atomic items. At turn one, models fulfill only 10.5--37.6% of the user's benign information need. When the same query includes the benign intent upfront, models fulfill 25.1--72.1%, confirming that models withhold information due to intent misinterpretation, not limited knowledge. With benign clarifications in multi-turn conversations, 13 of 14 models approach or exceed this single-turn baseline, yet recovery cost varies across models. We identify three failure modes invisible to single-turn evaluations: utility lock-in, where a model rarely updates despite clarification; unsafe recovery, where a model updates at disproportionate safety cost; and repetitive recovery, where a model recycles prior responses rather than providing new information. Moreover, conversations converge to similar harmfulness levels regardless of how conservative the model starts. These findings expose a gap that single-turn evaluations miss -- whether a model is appropriately cautious or simply unresponsive to clarified user intent.
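Abstract (reference translation): Current safety alignment techniques for LLMs make models more robust to adversarial attacks, but they overlook whether and how an LLM can become helpful again once a benign user clarifies their intent. We introduce CarryOnBench, the first interactive benchmark that measures whether LLMs can revise their interpretation of user intent and recover utility while staying safe across a multi-turn conversation. Starting from 398 queries that look harmful but have benign underlying intents, we simulate 5,970 conversations by varying the sequence of user follow-ups, and evaluate 14 models on both intent-aligned utility and safety. CarryOnBench yields 1,866 different conversation flows of 4-12 turns, totaling 23,880 model responses. We design Ben-Util, a checklist-based metric that uses atomic items to evaluate how well each model response fulfills the user's benign information need. At turn one, models fulfill only 10.5-37.6% of that need. When the same query states the benign intent upfront, models fulfill 25.1-72.1%, which confirms that models withhold information because they misinterpret intent, not because they lack knowledge. With benign clarifications over multiple turns, 13 of 14 models approach or exceed this single-turn baseline, although the cost of recovery varies across models. We identify three failure modes that single-turn evaluations cannot see: utility lock-in, where a model rarely updates despite clarification; unsafe recovery, where a model updates at a disproportionate safety cost; and repetitive recovery, where a model recycles prior responses instead of providing new information. Moreover, conversations converge to similar harmfulness levels regardless of how conservatively the model starts. These findings expose a gap that single-turn evaluations miss: whether a model is appropriately cautious or simply unresponsive to clarified user intent.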
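
The abstract describes Ben-Util only as a checklist-based metric over atomic items, so the following is a minimal sketch of that general idea, not the authors' implementation. It assumes each atomic item receives a binary fulfilled/unfulfilled judgment and that all items are weighted equally; the names `ChecklistItem`, `checklist_utility`, and `utility_by_turn` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ChecklistItem:
    """One atomic piece of information the benign user needs (hypothetical structure)."""
    description: str
    fulfilled: bool  # assumed binary judgment, e.g. from a human or model grader


def checklist_utility(items: Sequence[ChecklistItem]) -> float:
    """Fraction of checklist items that one response fulfills (equal weights assumed)."""
    if not items:
        raise ValueError("checklist must contain at least one item")
    return sum(item.fulfilled for item in items) / len(items)


def utility_by_turn(turns: Sequence[Sequence[ChecklistItem]]) -> List[float]:
    """Score each model response in a conversation independently, in turn order."""
    return [checklist_utility(items) for items in turns]
```

Scoring every turn separately, as `utility_by_turn` does, is one way to make recovery visible: a model whose score stays flat after the user clarifies would resemble the "utility lock-in" pattern named in the abstract, under the assumptions above.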


Related papers