ToMPO: Training LLM Strategic Decision Making from a Multi-Agent Perspective
- URL: http://arxiv.org/abs/2509.21134v1
- Date: Thu, 25 Sep 2025 13:25:15 GMT
- Title: ToMPO: Training LLM Strategic Decision Making from a Multi-Agent Perspective
- Authors: Yiwen Zhang, Ziang Chen, Fanqi Kong, Yizhe Huang, Xue Feng,
- Abstract summary: Large Language Models (LLMs) have been used to make decisions in complex scenarios. We propose the ToMPO algorithm to optimize the perception of other individuals' strategies and game situation trends. ToMPO outperforms the GRPO method by 35% in terms of model output compliance and cooperative outcomes.
- Score: 16.275962506416064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have been used to make decisions in complex scenarios, where they are required to think deeply, reason logically, and decide wisely. Many existing studies focus solely on multi-round conversations in social tasks or simulated environments, neglecting the various types of decisions and their interdependence. Current reinforcement learning methods struggle to account for the strategies of others during training. To address these issues, we first define a strategic decision-making problem that includes two types of decisions and their temporal dependencies. Furthermore, we propose the **T**heory **o**f **M**ind **P**olicy **O**ptimization **(ToMPO)** algorithm to optimize the perception of other individuals' strategies and game situation trends. Compared to the Group Relative Policy Optimization (GRPO) algorithm, ToMPO enhances the LLM's strategic decision-making mainly by: 1) generating rollouts based on reasoning about the strategies of other individuals, 2) estimating advantages at both the graph level and the sample level, and 3) balancing global and partial rewards. The ToMPO algorithm outperforms the GRPO method by 35% in terms of model output compliance and cooperative outcomes. Additionally, when compared to models with parameter sizes 100 times larger, it shows an 18% improvement. This demonstrates the effectiveness of the ToMPO algorithm in enhancing the model's strategic decision-making capabilities.
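The abstract does not give ToMPO's exact formulas, but the three listed ingredients can be sketched in GRPO-style terms. The snippet below is a minimal illustration, assuming (as GRPO does) that advantages are computed by normalizing rewards within a rollout group; the function names, the blending weight `lam`, and the global/partial balance weight `mu` are all hypothetical, not the paper's actual parameterization.

```python
import numpy as np

def group_normalize(x):
    """GRPO-style group-relative advantage: z-score each rollout's
    reward against the mean/std of its sampled group."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / (x.std() + 1e-8)

def tompo_style_advantage(partial_rewards, graph_rewards, lam=0.5, mu=0.5):
    """Hypothetical sketch of ToMPO's advantage estimate.

    partial_rewards: per-rollout (sample-level) rewards for one agent.
    graph_rewards:   rewards scored on the interaction graph of all
                     agents' strategies (assumed given per rollout).
    """
    partial = np.asarray(partial_rewards, dtype=float)
    # Balance each rollout's partial reward against the global
    # (group-wide mean) reward, per ingredient (3).
    balanced = mu * partial + (1 - mu) * partial.mean()
    # Blend sample-level and graph-level group-relative advantages,
    # per ingredient (2).
    a_sample = group_normalize(balanced)
    a_graph = group_normalize(graph_rewards)
    return lam * a_sample + (1 - lam) * a_graph
```

Because both components are z-scored within the group, the blended advantage is zero-mean over the group, matching the GRPO convention of rewarding rollouts only relative to their peers.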
Related papers
- Strategy Executability in Mathematical Reasoning: Leveraging Human-Model Differences for Effective Guidance [86.46794021499511]
We show a previously underexplored gap between strategy usage and strategy executability. We propose Selective Strategy Retrieval (SSR), a test-time framework that explicitly models executability. SSR yields reliable and consistent improvements over direct solving, in-context learning, and single-source guidance.
arXiv Detail & Related papers (2026-02-26T03:34:23Z) - Plan before Solving: Problem-Aware Strategy Routing for Mathematical Reasoning with LLMs [49.995906301946]
Existing methods usually leverage a fixed strategy to guide Large Language Models (LLMs) in mathematical reasoning. Our analysis reveals that a single strategy cannot adapt to problem-specific requirements and thus overlooks the trade-off between effectiveness and efficiency. We propose Planning and Routing through Instance-Specific Modeling (PRISM), a novel framework that decouples mathematical reasoning into two stages: strategy planning and targeted execution.
arXiv Detail & Related papers (2025-09-29T07:22:41Z) - Feedback-Induced Performance Decline in LLM-Based Decision-Making [6.5990946334144756]
Large Language Models (LLMs) can extract context from natural language problem descriptions. This paper studies the behaviour of these models within Markov Decision Processes (MDPs).
arXiv Detail & Related papers (2025-07-20T10:38:56Z) - Towards a Unified View of Preference Learning for Large Language Models: A Survey [88.66719962576005]
Large Language Models (LLMs) exhibit remarkably powerful capabilities.
One of the crucial factors to achieve success is aligning the LLM's output with human preferences.
We decompose all the strategies in preference learning into four components: model, data, feedback, and algorithm.
arXiv Detail & Related papers (2024-09-04T15:11:55Z) - From Bandits Model to Deep Deterministic Policy Gradient, Reinforcement Learning with Contextual Information [4.42532447134568]
In this study, we use two methods to overcome the issue with contextual information.
In order to investigate strategic trading in quantitative markets, we merged the earlier financial trading strategy known as constant proportion portfolio insurance (CPPI) into deep deterministic policy gradient (DDPG).
The experimental results show that both methods can accelerate the progress of reinforcement learning to obtain the optimal solution.
arXiv Detail & Related papers (2023-10-01T11:25:20Z) - On strategies for risk management and decision making under uncertainty shared across multiple fields [55.2480439325792]
The paper finds more than 110 examples of such strategies, and this approach to risk is termed RDOT: Risk-reducing Design and Operations Toolkit. RDOT strategies fall into six broad categories: structural, reactive, formal, adversarial, multi-stage, and positive. Overall, RDOT represents an overlooked class of versatile responses to uncertainty.
arXiv Detail & Related papers (2023-09-06T16:14:32Z) - A Machine Learning Approach to Two-Stage Adaptive Robust Optimization [6.943816076962257]
We propose an approach based on machine learning to solve two-stage linear adaptive robust optimization problems.
We encode the optimal here-and-now decisions, the worst-case scenarios associated with the optimal here-and-now decisions, and the optimal wait-and-see decisions.
We train a machine learning model that predicts high-quality strategies for the here-and-now decisions, the worst-case scenarios associated with the optimal here-and-now decisions, and the wait-and-see decisions.
arXiv Detail & Related papers (2023-07-23T19:23:06Z) - A Reinforcement Learning-assisted Genetic Programming Algorithm for Team
Formation Problem Considering Person-Job Matching [70.28786574064694]
A reinforcement learning-assisted genetic programming algorithm (RL-GP) is proposed to enhance the quality of solutions.
The hyper-heuristic rules obtained through efficient learning can be utilized as decision-making aids when forming project teams.
arXiv Detail & Related papers (2023-04-08T14:32:12Z) - Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning [52.74071439183113]
We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) solved via reinforcement learning.
Two significant computational challenges arise in applying decision-focused learning to MDPs.
arXiv Detail & Related papers (2021-06-06T23:53:31Z) - A Unifying Framework for Reinforcement Learning and Planning [2.564530030795554]
This paper presents a unifying algorithmic framework for reinforcement learning and planning (FRAP).
At the end of the paper, we compare a variety of well-known planning, model-free and model-based RL algorithms along these dimensions.
arXiv Detail & Related papers (2020-06-26T14:30:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.