Divide-Fuse-Conquer: Eliciting "Aha Moments" in Multi-Scenario Games
- URL: http://arxiv.org/abs/2505.16401v3
- Date: Sun, 01 Jun 2025 14:49:25 GMT
- Title: Divide-Fuse-Conquer: Eliciting "Aha Moments" in Multi-Scenario Games
- Authors: Xiaoqing Zhang, Huabin Zheng, Ang Lv, Yuhan Liu, Zirui Song, Flood Sung, Xiuying Chen, Rui Yan
- Abstract summary: Large language models (LLMs) have been observed to suddenly exhibit advanced reasoning abilities during reinforcement learning (RL). We propose Divide-Fuse-Conquer, a framework designed to enhance generalization in multi-scenario RL.
- Score: 36.16284323379845
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large language models (LLMs) have been observed to suddenly exhibit advanced reasoning abilities during reinforcement learning (RL), resembling an "aha moment" triggered by simple outcome-based rewards. While RL has proven effective at eliciting such breakthroughs in tasks involving mathematics, coding, and vision, it faces significant challenges in multi-scenario games. The diversity of game rules, interaction modes, and environmental complexities often leads to policies that perform well in one scenario but fail to generalize to others. Simply combining multiple scenarios during training introduces additional challenges, such as training instability and poor performance. To overcome these challenges, we propose Divide-Fuse-Conquer, a framework designed to enhance generalization in multi-scenario RL. The approach starts by heuristically grouping games based on characteristics such as rules and difficulty. Specialized models are then trained to excel at the games within each group; this is what we refer to as the divide step. Next, we fuse the parameters of the group-specialized models into a new model and continue training it across multiple groups, until the scenarios in all groups are conquered. Experiments across 18 TextArena games show that Qwen2.5-32B-Align trained with the Divide-Fuse-Conquer strategy reaches a performance level comparable to Claude3.5, achieving 7 wins and 4 draws. We hope our approach can inspire future research on using reinforcement learning to improve the generalization of LLMs.
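The abstract describes the fuse step only at a high level. Below is a minimal sketch of one common way to merge specialized checkpoints, uniform averaging of PyTorch state dicts; the helper name `fuse_state_dicts` and the uniform weighting are assumptions for illustration, not the paper's actual procedure.

```python
# Hypothetical sketch of the "fuse" step as uniform parameter averaging.
# The paper fuses group-specialized model parameters into a new model;
# plain (weighted) averaging shown here is an assumption about how that
# could be instantiated, not the paper's confirmed method.
import torch

def fuse_state_dicts(state_dicts, weights=None):
    """Merge compatible model state dicts by (weighted) averaging."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    fused = {}
    for key, ref in state_dicts[0].items():
        avg = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
        fused[key] = avg.to(ref.dtype)  # preserve original dtype (e.g., bf16)
    return fused

# Usage: fuse two group-specialized checkpoints, then continue RL training
# on the union of their game groups (the "conquer" phase).
# fused = fuse_state_dicts([model_a.state_dict(), model_b.state_dict()])
# merged_model.load_state_dict(fused)
```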
Related papers
- A Benchmark for Generalizing Across Diverse Team Strategies in Competitive Pokémon [31.012853711707965]
Pokémon Video Game Championships (VGC) is a domain with an extraordinarily large space of possible team configurations.
We introduce VGC-Bench: a benchmark that provides critical infrastructure, standardizes evaluation protocols, and supplies human-play datasets.
In the restricted setting where an agent is trained and evaluated on a single-team configuration, our methods are able to win against a professional VGC competitor.
arXiv Detail & Related papers (2025-06-12T03:19:39Z)
- Play to Generalize: Learning to Reason Through Game Play [11.778612579151067]
We propose a novel post-training paradigm, Visual Game Learning, where MLLMs develop out-of-domain generalization of multimodal reasoning through playing arcade-like games.
Our findings suggest a new post-training paradigm: synthetic, rule-based games can serve as controllable and scalable pretext tasks.
arXiv Detail & Related papers (2025-06-09T17:59:57Z)
- G1: Bootstrapping Perception and Reasoning Abilities of Vision-Language Model via Reinforcement Learning [37.18982308118744]
Vision-Language Models (VLMs) excel in many direct multimodal tasks but struggle to translate this prowess into effective decision-making within interactive, visually rich environments like games.
We introduce VLM-Gym, a curated reinforcement learning environment featuring diverse visual games with unified interfaces and adjustable, compositional difficulty.
We train G0 models using pure RL-driven self-evolution, which demonstrate emergent perception and reasoning patterns.
Our resulting G1 models consistently surpass their teacher across all games and outperform leading proprietary models like Claude-3.7-Sonnet-Thinking.
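The summary mentions "unified interfaces" across heterogeneous games but gives no API. A hypothetical sketch of what such an interface could look like, loosely following Gymnasium conventions (the method names and signatures are assumptions, not VLM-Gym's actual API):

```python
# Hypothetical "unified interface" for heterogeneous visual games, in the
# spirit of the VLM-Gym description above; the exact API is an assumption.
from typing import Any, Protocol

class VisualGame(Protocol):
    def reset(self, difficulty: int) -> Any:
        """Start an episode at a compositional difficulty level;
        return an observation (e.g., a rendered game frame)."""
    def step(self, action: str) -> tuple[Any, float, bool]:
        """Apply a text action; return (observation, reward, done)."""
```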
arXiv Detail & Related papers (2025-05-19T17:54:39Z)
- RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning [125.96848846966087]
Training large language models (LLMs) as interactive agents presents unique challenges.
While reinforcement learning has enabled progress in static tasks, multi-turn agent RL training remains underexplored.
We propose StarPO, a general framework for trajectory-level agent RL, and introduce RAGEN, a modular system for training and evaluating LLM agents.
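The summary does not give StarPO's objective. As a purely illustrative sketch, trajectory-level credit assignment can be as simple as applying one scalar outcome reward to every agent-generated token in a multi-turn rollout; the REINFORCE-style loss below is a generic stand-in, not necessarily what StarPO or RAGEN implements:

```python
# Illustrative only: generic trajectory-level policy-gradient loss for a
# multi-turn LLM agent, crediting a single outcome reward to all agent
# tokens in the rollout. Not necessarily StarPO's actual objective.
import torch

def trajectory_pg_loss(token_logprobs: torch.Tensor,
                       action_mask: torch.Tensor,
                       trajectory_reward: float,
                       baseline: float = 0.0) -> torch.Tensor:
    """token_logprobs: (T,) log-probs of the sampled tokens over all turns.
    action_mask: (T,) 1.0 for agent-generated tokens, 0.0 for env/user text."""
    advantage = trajectory_reward - baseline  # one advantage for the whole trajectory
    return -(advantage * token_logprobs * action_mask).sum() / action_mask.sum()
```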
arXiv Detail & Related papers (2025-04-24T17:57:08Z)
- How Far Are We on the Decision-Making of LLMs? Evaluating LLMs' Gaming Ability in Multi-Agent Environments [83.78240828340681]
GAMA($\gamma$)-Bench is a new framework for evaluating Large Language Models' Gaming Ability in Multi-Agent environments.
$\gamma$-Bench includes eight classical game theory scenarios and a dynamic scoring scheme specially designed to assess LLMs' performance.
Our results indicate that GPT-3.5 demonstrates strong robustness but limited generalizability, which can be enhanced using methods like Chain-of-Thought.
arXiv Detail & Related papers (2024-03-18T14:04:47Z)
- Balancing the AI Strength of Roles in Self-Play Training with Regret Matching+ [1.5591858554014466]
A generalized model capable of controlling any character within the game presents a viable option.
This strategy not only conserves computational resources and time during the training phase but also reduces resource requirements during deployment.
A simple method based on Regret Matching+ is introduced, which yields more balanced strength when the model controls different roles.
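The summary does not detail how Regret Matching+ is wired into role-strength balancing, but the core RM+ update itself is standard; a minimal sketch:

```python
# Minimal sketch of the standard Regret Matching+ update (the building
# block the paper adapts); how it is applied to role balancing is not
# described in the summary above.
import numpy as np

def rm_plus_step(regrets_plus: np.ndarray, action_utilities: np.ndarray):
    """regrets_plus: nonnegative cumulative regrets, shape (n_actions,).
    action_utilities: utility of each action in the current round."""
    total = regrets_plus.sum()
    if total > 0:
        strategy = regrets_plus / total  # play in proportion to positive regret
    else:
        strategy = np.full_like(regrets_plus, 1.0 / regrets_plus.size)
    expected = strategy @ action_utilities
    # RM+ clips cumulative regrets at zero, unlike vanilla regret matching.
    new_regrets = np.maximum(regrets_plus + action_utilities - expected, 0.0)
    return strategy, new_regrets
```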
arXiv Detail & Related papers (2024-01-23T08:27:38Z)
- Improving Generalization of Alignment with Human Preferences through Group Invariant Learning [56.19242260613749]
Reinforcement Learning from Human Feedback (RLHF) enables the generation of responses more aligned with human preferences.
Previous work shows that Reinforcement Learning (RL) often exploits shortcuts to attain high rewards and overlooks challenging samples.
We propose a novel approach that can learn a consistent policy via RL across various data groups or domains.
arXiv Detail & Related papers (2023-10-18T13:54:15Z)
- All by Myself: Learning Individualized Competitive Behaviour with a Contrastive Reinforcement Learning optimization [57.615269148301515]
In a competitive game scenario, a set of agents have to learn decisions that maximize their goals and minimize their adversaries' goals at the same time.
We propose a novel model composed of three neural layers that learn a representation of a competitive game, learn to map the strategies of specific opponents, and learn how to disrupt them.
Our experiments demonstrate that our model achieves better performance when playing against offline, online, and competitive-specific models, in particular when playing against the same opponent multiple times.
arXiv Detail & Related papers (2023-10-02T08:11:07Z)
- Mimicking To Dominate: Imitation Learning Strategies for Success in Multiagent Competitive Games [13.060023718506917]
We develop a new multi-agent imitation learning model for predicting the opponents' next moves.
We also present a new multi-agent reinforcement learning algorithm that combines our imitation learning model and policy training into one single training process.
Experimental results show that our approach achieves superior performance compared to existing state-of-the-art multi-agent RL algorithms.
arXiv Detail & Related papers (2023-08-20T07:30:13Z)
- TiZero: Mastering Multi-Agent Football with Curriculum Learning and Self-Play [19.98100026335148]
TiZero is a self-evolving, multi-agent system that learns from scratch.
It outperforms previous systems by a large margin on the Google Research Football environment.
arXiv Detail & Related papers (2023-02-15T08:19:18Z)
- Multi-Game Decision Transformers [49.257185338595434]
We show that a single transformer-based model can play a suite of up to 46 Atari games simultaneously at close-to-human performance.
We compare several approaches in this multi-game setting, such as online and offline RL methods and behavioral cloning.
We find that our Multi-Game Decision Transformer models offer the best scalability and performance.
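Decision Transformers cast control as return-conditioned sequence modeling; the schematic below shows the interleaved (return-to-go, state, action) layout they train on. It is a simplification: the Multi-Game variant's actual tokenization (image patches, discretized returns) is richer than shown.

```python
# Schematic of the Decision Transformer input: each timestep contributes a
# (return-to-go, state, action) triple, flattened into one token stream.
# Simplified relative to the Multi-Game DT's real tokenization.
def build_dt_sequence(returns_to_go, states, actions):
    """Interleave per-timestep (R_t, s_t, a_t); at inference time the model
    is conditioned on a high target return and asked to predict a_t."""
    sequence = []
    for r, s, a in zip(returns_to_go, states, actions):
        sequence.extend([("return", r), ("state", s), ("action", a)])
    return sequence
```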
arXiv Detail & Related papers (2022-05-30T16:55:38Z)
- Generating Diverse and Competitive Play-Styles for Strategy Games [58.896302717975445]
We propose Portfolio Monte Carlo Tree Search with Progressive Unpruning for playing a turn-based strategy game (Tribes).
We show how it can be parameterized so a quality-diversity algorithm (MAP-Elites) is used to achieve different play-styles while keeping a competitive level of play.
Our results show that this algorithm is capable of achieving these goals even for an extensive collection of game levels beyond those used for training.
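MAP-Elites is a standard quality-diversity method: discretize a behavior space into cells and keep the best-scoring solution per cell. A minimal sketch of that archive update follows; the play-style descriptor and discretization used for Tribes are assumptions, as the summary does not specify them.

```python
# Minimal sketch of the MAP-Elites archive update. Each cell of a
# discretized behavior (play-style) space keeps only its best elite;
# the descriptor/discretization for Tribes is an assumption here.
def map_elites_update(archive, candidate, fitness, descriptor, discretize):
    """archive: dict mapping cell -> (candidate, fitness)."""
    cell = discretize(descriptor)  # e.g., bucketed aggression/economy stats
    if cell not in archive or fitness > archive[cell][1]:
        archive[cell] = (candidate, fitness)
    return archive
```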
arXiv Detail & Related papers (2021-04-17T20:33:24Z)