Fugu-MT Paper Translation (Abstract): ToMPO: Training LLM Strategic Decision Making from a Multi-Agent Perspective

Paper abstract: ToMPO: Training LLM Strategic Decision Making from a Multi-Agent Perspective

arxiv url: http://arxiv.org/abs/2509.21134v1
Date: Thu, 25 Sep 2025 13:25:15 GMT
Status: Translation complete
System updated: 2025-09-26 20:58:12.931191
Title: ToMPO: Training LLM Strategic Decision Making from a Multi-Agent Perspective
Title (reference translation): ToMPO: Training LLM Strategic Decision-Making from a Multi-Agent Perspective
Authors: Yiwen Zhang, Ziang Chen, Fanqi Kong, Yizhe Huang, Xue Feng,
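Abstract summary: Large language models (LLMs) have been used for decision-making in complex scenarios. This paper proposes the ToMPO algorithm, which optimizes the perception of other individuals' strategies and of trends in the game situation. The ToMPO algorithm outperforms the GRPO method by 35% in terms of model output compliance and cooperative outcomes.
Reference score (independently computed attention score): 16.275962506416064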
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have been used to make decisions in complex scenarios that require models to think deeply, reason logically, and decide wisely. Many existing studies focus solely on multi-round conversations in social tasks or simulated environments, neglecting the various types of decisions and their interdependence. Current reinforcement learning methods struggle to consider the strategies of others during training. To address these issues, we first define a strategic decision-making problem that includes two types of decisions and their temporal dependencies. Furthermore, we propose the **T**heory **o**f **M**ind **P**olicy **O**ptimization **(ToMPO)** algorithm to optimize the perception of other individuals' strategies and of trends in the game situation. Compared to the Group Relative Policy Optimization (GRPO) algorithm, ToMPO enhances the LLM's strategic decision-making mainly by: 1) generating rollouts based on reasoning about the strategies of other individuals, 2) estimating advantages at both the graph level and the sample level, and 3) balancing global and partial rewards. The ToMPO algorithm outperforms the GRPO method by 35% in terms of model output compliance and cooperative outcomes. Additionally, when compared to models with parameter sizes 100 times larger, it shows an 18% improvement. This demonstrates the effectiveness of the ToMPO algorithm in enhancing the model's strategic decision-making capabilities.
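Abstract (reference translation): Large language models (LLMs) have been used to make decisions in complex scenarios. Many existing studies focus only on multi-round conversations in social tasks or simulated environments and neglect the various types of decisions and their interdependence. Current reinforcement learning methods struggle to take the strategies of others into account during training. To address these issues, we first define a strategic decision-making problem that includes two types of decisions and their temporal dependencies. We further propose the **T**heory **o**f **M**ind **P**olicy **O**ptimization **(ToMPO)** algorithm. Compared with the Group Relative Policy Optimization (GRPO) algorithm, ToMPO enhances LLM strategic decision-making mainly by: 1) generating rollouts based on reasoning about the strategies of others, 2) estimating advantages at both the graph level and the sample level, and 3) balancing global and partial rewards. The ToMPO algorithm outperforms the GRPO method by 35% in terms of model output compliance and cooperative outcomes. Furthermore, compared with models whose parameter sizes are 100 times larger, it shows an 18% improvement. This demonstrates the effectiveness of the ToMPO algorithm in enhancing the model's strategic decision-making capabilities.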
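
The abstract contrasts ToMPO with GRPO on advantage estimation and reward balancing but gives no formulas. As a rough orientation only, the sketch below shows the group-relative advantage that GRPO is generally understood to use (each rollout's reward standardized within its sampled group), plus a purely hypothetical convex blend of a global and a partial reward. The function names, the `eps` and `weight` parameters, and the blending rule are assumptions made for illustration; they are not taken from the paper and do not represent the ToMPO algorithm itself.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one rollout group.

    Uses the population standard deviation; actual implementations may
    differ in this detail. `eps` guards against division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def mix_rewards(global_reward, partial_reward, weight=0.5):
    """Hypothetical blend of a global and a partial reward.

    Illustrative assumption only: the abstract says ToMPO balances
    global and partial rewards but does not say how.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * global_reward + (1.0 - weight) * partial_reward


if __name__ == "__main__":
    # Three rollouts sampled for the same prompt, scored by some reward.
    print(group_relative_advantages([1.0, 2.0, 3.0]))
    print(mix_rewards(global_reward=1.0, partial_reward=0.0, weight=0.25))
```

Standardizing within the group removes the need for a learned value baseline, which is the usual motivation given for GRPO; how ToMPO extends this to graph-level and sample-level estimates is described only in the paper itself.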

Related papers