Fugu-MT Paper Translation (Abstract): Questionnaire Responses Do not Capture the Safety of AI Agents

Paper overview: Questionnaire Responses Do not Capture the Safety of AI Agents

arxiv url: http://arxiv.org/abs/2603.14417v1
Date: Sun, 15 Mar 2026 15:01:09 GMT
Status: Translation complete
Last updated in system: 2026-03-17 16:19:35.800107
Title: Questionnaire Responses Do not Capture the Safety of AI Agents
Title (reference translation): Questionnaire Responses That Do Not Capture the Safety of AI Agents
Authors: Max Hellrigel-Holderbaum, Edward James Young
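Abstract summary: A fast-growing field of AI research is devoted to developing such assessments. Standard methods prompt large language models (LLMs) in a questionnaire format to describe their values or behavior in hypothetical scenarios. We argue that a structurally identical issue holds for current AI alignment approaches.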
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: As AI systems advance in capabilities, measuring their safety and alignment to human values is becoming paramount. A fast-growing field of AI research is devoted to developing such assessments. However, most current advances therein may be ill-suited for assessing AI systems across real-world deployments. Standard methods prompt large language models (LLMs) in a questionnaire-style to describe their values or behavior in hypothetical scenarios. By focusing on unaugmented LLMs, they fall short of evaluating AI agents, which could actually perform relevant behaviors, hence posing much greater risks. LLMs' engagement with scenarios described by questionnaire-style prompts differs starkly from that of agents based on the same LLMs, as reflected in divergences in the inputs, possible actions, environmental interactions, and internal processing. As such, LLMs' responses to scenario descriptions are unlikely to be representative of the corresponding LLM agents' behavior. We further contend that such assessments make strong assumptions concerning the ability and tendency of LLMs to report accurately about their counterfactual behavior. This makes them inadequate to assess risks from AI systems in real-world contexts as they lack construct validity. We then argue that a structurally identical issue holds for current AI alignment approaches. Lastly, we discuss improving safety assessments and alignment training by taking these shortcomings to heart.
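Abstract (reference translation): As the capabilities of AI systems improve, measuring their safety and their alignment with human values is becoming paramount. A fast-growing field of AI research is devoted to developing such assessments. However, most current advances may be ill-suited to assessing AI systems across real-world deployments. Standard methods prompt large language models (LLMs) in a questionnaire format to describe their values or behavior in hypothetical scenarios. By focusing on unaugmented LLMs, these methods fall short of evaluating AI agents, which can actually carry out the relevant behaviors and therefore pose far greater risks. An LLM's engagement with scenarios described by questionnaire-style prompts differs greatly from that of an agent based on the same LLM, as reflected in the inputs, possible actions, environmental interactions, and internal processing. As a result, LLMs' responses to scenario descriptions are unlikely to represent the behavior of the corresponding LLM agents. We further argue that these assessments make strong assumptions about the ability and tendency of LLMs to report accurately on their counterfactual behavior. Because they lack construct validity, they are inadequate for assessing risks from AI systems in real-world contexts. We then argue that a structurally identical issue holds for current AI alignment approaches. Finally, we discuss how to improve safety assessments and alignment training with these shortcomings in mind.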

Related papers