Related papers: Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning

Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning

URL: http://arxiv.org/abs/2411.11099v1
Date: Sun, 17 Nov 2024 15:00:39 GMT
Title: Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning
Authors: Ting Zhu, Yue Jin, Jeremie Houssineau, Giovanni Montana,
Abstract summary: We introduce MaxMax Q-Learning (MMQ), which employs an iterative process of sampling and evaluating potential next states. This approach refines approximations of ideal state transitions, aligning more closely with the optimal joint policy of collaborating agents. Our results demonstrate that MMQ frequently outperforms existing baselines, exhibiting enhanced convergence and sample efficiency.
Score: 11.988291170853806
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In decentralized multi-agent reinforcement learning, agents learning in isolation can lead to relative over-generalization (RO), where optimal joint actions are undervalued in favor of suboptimal ones. This hinders effective coordination in cooperative tasks, as agents tend to choose actions that are individually rational but collectively suboptimal. To address this issue, we introduce MaxMax Q-Learning (MMQ), which employs an iterative process of sampling and evaluating potential next states, selecting those with maximal Q-values for learning. This approach refines approximations of ideal state transitions, aligning more closely with the optimal joint policy of collaborating agents. We provide theoretical analysis supporting MMQ's potential and present empirical evaluations across various environments susceptible to RO. Our results demonstrate that MMQ frequently outperforms existing baselines, exhibiting enhanced convergence and sample efficiency.

Related papers

Heterogeneous Agent Collaborative Reinforcement Learning [52.99813668995983]
Heterogeneous Agent Collaborative Reinforcement Learning (HACRL)<n>Building on this paradigm, we propose HACPO, a collaborative RL algorithm that enables principled rollout sharing to maximize sample utilization and cross-agent knowledge transfer.<n>Experiments across diverse heterogeneous model combinations and reasoning benchmarks show that HACPO consistently improves all participating agents, outperforming GSPO by an average of 3.3% while using only half the rollout cost.
arXiv Detail & Related papers (2026-03-03T05:09:49Z)
CARL: Critical Action Focused Reinforcement Learning for Multi-Step Agent [53.56274149236814]
We propose CARL, a critical-action-focused reinforcement learning algorithm tailored for multi-step agents.<n>Carl achieves both stronger performance and higher efficiency during training and inference across diverse evaluation settings.
arXiv Detail & Related papers (2025-12-04T16:15:46Z)
Evaluating Generalization Capabilities of LLM-Based Agents in Mixed-Motive Scenarios Using Concordia [100.74015791021044]
Large Language Model (LLM) agents have demonstrated impressive capabilities for social interaction.<n>Existing evaluation methods fail to measure how well these capabilities generalize to novel social situations.<n>We present empirical results from the NeurIPS 2024 Concordia Contest, where agents were evaluated on their ability to achieve mutual gains.
arXiv Detail & Related papers (2025-12-03T00:11:05Z)
Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization [80.09112808413133]
Mujica is a planner that decomposes questions into acyclic graph of subquestions and a worker that resolves questions via retrieval and reasoning.<n>MyGO is a novel reinforcement learning method that replaces traditional policy updates with gradient Likelihood Maximum Estimation.<n> Empirical results across multiple datasets demonstrate the effectiveness of MujicaMyGO in enhancing multi-hop QA performance.
arXiv Detail & Related papers (2025-05-20T18:33:03Z)
From Novice to Expert: LLM Agent Policy Optimization via Step-wise Reinforcement Learning [62.54484062185869]
We introduce StepAgent, which utilizes step-wise reward to optimize the agent's reinforcement learning process. We propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment.
arXiv Detail & Related papers (2024-11-06T10:35:11Z)
Process Reward Model with Q-Value Rankings [18.907163177605607]
Process Reward Modeling (PRM) is critical for complex reasoning and decision-making tasks. We introduce the Process Q-value Model (PQM), a novel framework that redefines PRM in the context of a Markov Decision Process. PQM optimize Q-value rankings based on a novel comparative loss function, enhancing the model's ability to capture the intricate dynamics among sequential decisions.
arXiv Detail & Related papers (2024-10-15T05:10:34Z)
POWQMIX: Weighted Value Factorization with Potentially Optimal Joint Actions Recognition for Cooperative Multi-Agent Reinforcement Learning [17.644279061872442]
Value function factorization methods are commonly used in cooperative multi-agent reinforcement learning. We propose the Potentially Optimal Joint Actions Weighted Qmix (POWQmix) algorithm, which recognizes the potentially optimal joint actions and assigns higher weights to the corresponding losses during training. Experiments in matrix games, difficulty-enhanced predator-prey, and StarCraft II Multi-Agent Challenge environments demonstrate that our algorithm outperforms the state-of-the-art value-based multi-agent reinforcement learning methods.
arXiv Detail & Related papers (2024-05-13T03:27:35Z)
LOQA: Learning with Opponent Q-Learning Awareness [1.1666234644810896]
We introduce Learning with Opponent Q-Learning Awareness (LOQA), a decentralized reinforcement learning algorithm tailored to optimize an agent's individual utility. LOQA achieves state-of-the-art performance in benchmark scenarios such as the Iterated Prisoner's Dilemma and the Coin Game.
arXiv Detail & Related papers (2024-05-02T06:33:01Z)
ROMA-iQSS: An Objective Alignment Approach via State-Based Value Learning and ROund-Robin Multi-Agent Scheduling [44.276285521929424]
We introduce a decentralized state-based value learning algorithm that enables agents to independently discover optimal states. Our theoretical analysis shows that our approach leads decentralized agents to an optimal collective policy. Empirical experiments further demonstrate that our method outperforms existing decentralized state-based and action-based value learning strategies.
arXiv Detail & Related papers (2024-04-05T09:39:47Z)
Let's reward step by step: Step-Level reward model as the Navigators for Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase. We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs. To explore the versatility of our approach, we develop a novel method to automatically generate step-level reward dataset for coding tasks and observed similar improved performance in the code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z)
Quantifying Agent Interaction in Multi-agent Reinforcement Learning for Cost-efficient Generalization [63.554226552130054]
Generalization poses a significant challenge in Multi-agent Reinforcement Learning (MARL) The extent to which an agent is influenced by unseen co-players depends on the agent's policy and the specific scenario. We present the Level of Influence (LoI), a metric quantifying the interaction intensity among agents within a given scenario and environment.
arXiv Detail & Related papers (2023-10-11T06:09:26Z)
Mimicking Better by Matching the Approximate Action Distribution [48.95048003354255]
We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations. We show that it requires considerable fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
arXiv Detail & Related papers (2023-06-16T12:43:47Z)
Residual Q-Networks for Value Function Factorizing in Multi-Agent Reinforcement Learning [0.0]
We propose a novel concept of Residual Q-Networks (RQNs) for Multi-Agent Reinforcement Learning (MARL) The RQN learns to transform the individual Q-value trajectories in a way that preserves the Individual-Global-Max criteria (IGM) The proposed method converges faster, with increased stability and shows robust performance in a wider family of environments.
arXiv Detail & Related papers (2022-05-30T16:56:06Z)
A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation [61.740187363451746]
Marginalized importance sampling (MIS) measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
arXiv Detail & Related papers (2021-06-12T20:21:38Z)
Model-based Multi-agent Policy Optimization with Adaptive Opponent-wise Rollouts [52.844741540236285]
This paper investigates the model-based methods in multi-agent reinforcement learning (MARL) We propose a novel decentralized model-based MARL method, named Adaptive Opponent-wise Rollout Policy (AORPO)
arXiv Detail & Related papers (2021-05-07T16:20:22Z)

This list is automatically generated from the titles and abstracts of the papers in this site.