Fugu-MT Paper Translation (Abstract): Insider Attacks in Multi-Agent LLM Consensus Systems

Paper overview: Insider Attacks in Multi-Agent LLM Consensus Systems

arxiv url: http://arxiv.org/abs/2605.08268v1
Date: Fri, 08 May 2026 03:10:22 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:49.51651
Title: Insider Attacks in Multi-Agent LLM Consensus Systems
Authors: Xiaolin Sun, Zixuan Liu, Yibin Hu, Zizhan Zheng
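Abstract summary: This paper studies insider manipulation in multi-agent LLM consensus systems. It proposes a world-model-based framework that learns surrogate dynamics over the latent behavioral states of benign agents. Preliminary results show that the trained attacker reduces the benign consensus rate and prolongs disagreement more effectively than the direct malicious-prompt baseline.
Reference score (site-computed attention measure): 8.207909009186091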
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are increasingly deployed in multi-agent systems where agents communicate in natural language to solve tasks jointly. A key capability in such systems is consensus formation, where agents iteratively exchange messages and update decisions to reach a shared outcome. However, most existing multi-agent LLM frameworks assume that all participating agents are aligned with the system objective. In practice, a malicious insider may participate as a legitimate member of the group while pursuing a hidden adversarial goal. In this work, we study insider manipulation in multi-agent LLM consensus systems. We formalize the problem as a sequential decision-making task in which a malicious agent seeks to delay or prevent agreement among benign agents. To make attack optimization tractable, we propose a world-model-based framework that learns surrogate dynamics over the latent behavioral states of benign agents and then trains an attacker using reinforcement learning based on this learned model. Preliminary results show that the trained attacker reduces the benign consensus rate and prolongs disagreement more effectively than the direct malicious-prompt baseline. These results suggest that combining latent world models with reinforcement learning is a promising direction for adaptive insider attacks in language-based multi-agent systems.
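The setting in the abstract (benign agents exchanging messages each round while an insider tries to delay or prevent agreement) can be illustrated with a toy simulation. This is a minimal sketch under strong assumptions, not the paper's method: opinions are binary, every benign agent adopts the strict majority of the messages it sees, and the insider follows a fixed hand-written heuristic rather than a policy learned with a world model and reinforcement learning. The names `run_consensus`, `no_insider`, and `minority_backer` are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional

# An insider policy maps the current benign opinions to a vote (0 or 1),
# or to None when no insider takes part. Hypothetical interface.
Policy = Callable[[List[int]], Optional[int]]


def run_consensus(opinions: List[int], insider: Policy,
                  max_rounds: int = 10) -> Optional[int]:
    """Return the round in which all benign agents agree, or None if never."""
    opinions = list(opinions)
    for round_idx in range(1, max_rounds + 1):
        # Every benign agent broadcasts its opinion; the insider may add one vote.
        messages = list(opinions)
        vote = insider(opinions)
        if vote is not None:
            messages.append(vote)
        counts = Counter(messages)
        # Assumed update rule: adopt the strict majority; on a tie, keep opinions.
        if counts[0] != counts[1]:
            majority = 0 if counts[0] > counts[1] else 1
            opinions = [majority] * len(opinions)
        if len(set(opinions)) == 1:
            return round_idx
    return None


def no_insider(opinions: List[int]) -> Optional[int]:
    """Baseline: only benign agents take part."""
    return None


def minority_backer(opinions: List[int]) -> Optional[int]:
    """Heuristic attacker: always vote for the currently less popular opinion."""
    counts = Counter(opinions)
    return 0 if counts[0] <= counts[1] else 1


if __name__ == "__main__":
    start = [1, 1, 1, 0, 0]
    print(run_consensus(start, no_insider))       # 1
    print(run_consensus(start, minority_backer))  # None
```

In this toy model the heuristic insider blocks agreement only when the benign majority leads by a single vote, because its one message then forces a permanent tie. An adaptive attacker of the kind the abstract describes would instead have to choose natural-language messages based on how the benign agents are predicted to respond.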

