Fugu-MT Paper Translation (Abstract): RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

Paper abstract: RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

arxiv url: http://arxiv.org/abs/2606.13310v1
Date: Thu, 11 Jun 2026 13:07:02 GMT
Status: Translation complete
Last updated in system: 2026-06-12 15:55:27.803446
Title: RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue
Title (reference translation): RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue
Authors: Sara Candussio, Emanuele Ballarin, Lorenzo Bonin, Sandro Junior Della Rovere, Luca Bortolussi
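Abstract summary: We present RogueAI, an interactive web app that operationalizes this test as a one-on-two interrogation game. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We also introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy.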
Reference score (independently computed attention score): 2.7606655162305476
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings; the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive webapp that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight via debate. A three-day pilot deployment (467 initiated sessions, 415 completed, 1876 interaction turns in Italian) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally-present linguistic signature - differential helpfulness, brevity, hedging - that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, consistent with ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact's use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.
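Abstract (reference translation): The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings. We argue that the relevant question today is not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive web app that operationalizes this test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight. A three-day pilot deployment (467 initiated sessions, 415 completed, 1876 interaction turns) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally present linguistic signature (differential helpfulness, brevity, hedging) that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, consistent with ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact's use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.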

Related papers