Fugu-MT Paper Translation (Abstract): Evaluating Language Models for Harmful Manipulation

Paper summary: Evaluating Language Models for Harmful Manipulation

arxiv url: http://arxiv.org/abs/2603.25326v2
Date: Fri, 27 Mar 2026 17:09:52 GMT
Status: Translation complete
Last updated in system: 2026-03-31 13:48:18.826692
Title: Evaluating Language Models for Harmful Manipulation
Title (reference translation): Evaluation of Language Models for Harmful Manipulation
Authors: Canfer Akbulut, Rasmi Elasmar, Abhishek Roy, Anthony Payne, Priyanka Suresh, Lujain Ibrahim, Seliem El-Sayed, Charvi Rastogi, Ashyana Kachra, Will Hawkins, Kristian Lum, Laura Weidinger
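Abstract summary: This paper proposes a framework for evaluating harmful AI manipulation through context-specific human-AI interaction studies. We evaluate an AI model with 10,101 participants across interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). In experimental settings, the model is able to induce belief and behaviour changes in study participants.
Reference score (independently computed attention score): 4.833632272271989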
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Interest in the concept of AI-driven harmful manipulation is growing, yet current approaches to evaluating it are limited. This paper introduces a framework for evaluating harmful AI manipulation via context-specific human-AI interaction studies. We illustrate the utility of this framework by assessing an AI model with 10,101 participants spanning interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). Overall, we find that the tested model can produce manipulative behaviours when prompted to do so and, in experimental settings, is able to induce belief and behaviour changes in study participants. We further find that context matters: AI manipulation differs between domains, suggesting that it needs to be evaluated in the high-stakes context(s) in which an AI system is likely to be used. We also identify significant differences across our tested geographies, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, we find that the frequency of manipulative behaviours (propensity) of an AI model is not consistently predictive of the likelihood of manipulative success (efficacy), underscoring the importance of studying these dimensions separately. To facilitate adoption of our evaluation framework, we detail our testing protocols and make relevant materials publicly available. We conclude by discussing open challenges in evaluating harmful manipulation by AI models.
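Abstract (reference translation): Interest in the concept of harmful manipulation by AI is growing, but current approaches to evaluating it are limited. This paper proposes a framework for evaluating harmful AI manipulation through context-specific human-AI interaction studies. We demonstrate the usefulness of this framework by evaluating an AI model with 10,101 participants across interactions in three AI use domains (public policy, finance, and health) and three locales (US, UK, and India). The results show that the tested model produces manipulative behaviours when prompted to do so and, under experimental conditions, can induce belief and behaviour changes in study participants. Because AI manipulation differs between domains, it needs to be evaluated in the high-stakes contexts in which an AI system is likely to be used. Significant differences were also found across the tested regions, suggesting that AI manipulation results from one geographic region may not generalise to others. Finally, the frequency of an AI model's manipulative behaviours (propensity) does not consistently predict the likelihood of manipulative success (efficacy), which highlights the importance of studying these dimensions separately. To make the evaluation framework easier to adopt, we detail our testing protocols and release the related materials. We discuss open challenges in evaluating harmful manipulation by AI models.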

Related papers