Related papers: Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning

Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning

URL: http://arxiv.org/abs/2511.09792v1
Date: Fri, 14 Nov 2025 01:10:03 GMT
Title: Beyond Monotonicity: Revisiting Factorization Principles in Multi-Agent Q-Learning
Authors: Tianmeng Hu, Yongzheng Cui, Rui Tang, Biao Luo, Ke Li,
Abstract summary: Value decomposition is a central approach in multi-agent reinforcement learning (MARL)<n>Existing methods either enforce monotonicity constraints, which limit expressive power, or adopt softer surrogates at the cost of algorithmic complexity.<n>We show that unconstrained, non-monotonic factorization reliably recovers IGM-optimal solutions and consistently outperforms monotonic baselines.
Score: 24.476713156225685
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Value decomposition is a central approach in multi-agent reinforcement learning (MARL), enabling centralized training with decentralized execution by factorizing the global value function into local values. To ensure individual-global-max (IGM) consistency, existing methods either enforce monotonicity constraints, which limit expressive power, or adopt softer surrogates at the cost of algorithmic complexity. In this work, we present a dynamical systems analysis of non-monotonic value decomposition, modeling learning dynamics as continuous-time gradient flow. We prove that, under approximately greedy exploration, all zero-loss equilibria violating IGM consistency are unstable saddle points, while only IGM-consistent solutions are stable attractors of the learning dynamics. Extensive experiments on both synthetic matrix games and challenging MARL benchmarks demonstrate that unconstrained, non-monotonic factorization reliably recovers IGM-optimal solutions and consistently outperforms monotonic baselines. Additionally, we investigate the influence of temporal-difference targets and exploration strategies, providing actionable insights for the design of future value-based MARL algorithms.

Related papers

Harnessing Consistency for Robust Test-Time LLM Ensemble [88.55393815158608]
CoRE is a plug-and-play technique that harnesses model consistency for robust LLM ensemble.<n> Token-level consistency captures fine-grained disagreements by applying a low-pass filter to downweight uncertain tokens.<n>Model-level consistency models global agreement by promoting model outputs with high self-confidence.
arXiv Detail & Related papers (2025-10-12T04:18:45Z)
WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training [64.0932926819307]
We present Warmup-Stable and Merge (WSM), a framework that establishes a formal connection between learning rate decay and model merging.<n>WSM provides a unified theoretical foundation for emulating various decay strategies.<n>Our framework consistently outperforms the widely-adopted Warmup-Stable-Decay (WSD) approach across multiple benchmarks.
arXiv Detail & Related papers (2025-07-23T16:02:06Z)
Offline Multi-agent Reinforcement Learning via Score Decomposition [51.23590397383217]
offline cooperative multi-agent reinforcement learning (MARL) faces unique challenges due to distributional shifts.<n>This work is the first work to explicitly address the distributional gap between offline and online MARL.
arXiv Detail & Related papers (2025-05-09T11:42:31Z)
Small-Scale-Fading-Aware Resource Allocation in Wireless Federated Learning [27.931985523249352]
This paper proposes a small-scale-fading-aware resource allocation strategy using a multi-agent reinforcement learning (MARL) framework.<n>We establish a one-step convergence bound of the FL algorithm and formulate the resource allocation problem as a decentralized partially observable Markov decision process.<n>In our framework, each client serves as an agent that dynamically determines spectrum and power allocations within each coherence time slot.
arXiv Detail & Related papers (2025-05-06T13:41:59Z)
QLLM: Do We Really Need a Mixing Network for Credit Assignment in Multi-Agent Reinforcement Learning? [6.0636611835869205]
Credit assignment has remained a fundamental challenge in multi-agent reinforcement learning (MARL)<n>We propose a novel algorithm, textbfQLLM, which facilitates the automatic construction of credit assignment functions using large language models (LLMs)<n>Extensive experiments conducted on several standard MARL benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-17T14:07:11Z)
Reinforcement Learning under Latent Dynamics: Toward Statistical and Algorithmic Modularity [51.40558987254471]
Real-world applications of reinforcement learning often involve environments where agents operate on complex, high-dimensional observations. This paper addresses the question of reinforcement learning under $textitgeneral$ latent dynamics from a statistical and algorithmic perspective.
arXiv Detail & Related papers (2024-10-23T14:22:49Z)
Finite-Time Convergence and Sample Complexity of Actor-Critic Multi-Objective Reinforcement Learning [20.491176017183044]
This paper tackles the multi-objective reinforcement learning (MORL) problem. It introduces an innovative actor-critic algorithm named MOAC which finds a policy by iteratively making trade-offs among conflicting reward signals.
arXiv Detail & Related papers (2024-05-05T23:52:57Z)
Sample-Efficient Robust Multi-Agent Reinforcement Learning in the Face of Environmental Uncertainty [40.55653383218379]
This work focuses on learning in distributionally robust Markov games (RMGs) We propose a sample-efficient model-based algorithm (DRNVI) with finite-sample complexity guarantees for learning robust variants of various notions of game-theoretic equilibria.
arXiv Detail & Related papers (2024-04-29T17:51:47Z)
Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
QFree: A Universal Value Function Factorization for Multi-Agent Reinforcement Learning [2.287186762346021]
We propose QFree, a universal value function factorization method for multi-agent reinforcement learning. We show that QFree achieves the state-of-the-art performance in a general-purpose complex MARL benchmark environment.
arXiv Detail & Related papers (2023-11-01T08:07:16Z)
Distributionally Robust Model-based Reinforcement Learning with Large State Spaces [55.14361269378122]
Three major challenges in reinforcement learning are the complex dynamical systems with large state spaces, the costly data acquisition processes, and the deviation of real-world dynamics from the training environment deployment. We study distributionally robust Markov decision processes with continuous state spaces under the widely used Kullback-Leibler, chi-square, and total variation uncertainty sets. We propose a model-based approach that utilizes Gaussian Processes and the maximum variance reduction algorithm to efficiently learn multi-output nominal transition dynamics.
arXiv Detail & Related papers (2023-09-05T13:42:11Z)
Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation [93.52573037053449]
H-MARL (Hallucinated Multi-Agent Reinforcement Learning) learns successful equilibrium policies after a few interactions with the environment. We demonstrate our approach experimentally on an autonomous driving simulation benchmark.
arXiv Detail & Related papers (2022-03-14T17:24:03Z)
Trustworthy Multimodal Regression with Mixture of Normal-inverse Gamma Distributions [91.63716984911278]
We introduce a novel Mixture of Normal-Inverse Gamma distributions (MoNIG) algorithm, which efficiently estimates uncertainty in principle for adaptive integration of different modalities and produces a trustworthy regression result. Experimental results on both synthetic and different real-world data demonstrate the effectiveness and trustworthiness of our method on various multimodal regression tasks.
arXiv Detail & Related papers (2021-11-11T14:28:12Z)

This list is automatically generated from the titles and abstracts of the papers in this site.