Remembering the Markov Property in Cooperative MARL
- URL: http://arxiv.org/abs/2507.18333v1
- Date: Thu, 24 Jul 2025 11:59:42 GMT
- Title: Remembering the Markov Property in Cooperative MARL
- Authors: Kale-ab Abebe Tessera, Leonard Hinckeldey, Riccardo Zamboni, David Abel, Amos Storkey,
- Abstract summary: Co-adapting agents can learn brittle conventions, which then fail when partnered with non-adaptive agents. Modern MARL environments may not adequately test the core assumptions of Dec-POMDPs.
- Score: 6.730957202419779
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cooperative multi-agent reinforcement learning (MARL) is typically formalised as a Decentralised Partially Observable Markov Decision Process (Dec-POMDP), where agents must reason about the environment and other agents' behaviour. In practice, current model-free MARL algorithms use simple recurrent function approximators to address the challenge of reasoning about others using partial information. In this position paper, we argue that the empirical success of these methods is not due to effective Markov signal recovery, but rather to learning simple conventions that bypass environment observations and memory. Through a targeted case study, we show that co-adapting agents can learn brittle conventions, which then fail when partnered with non-adaptive agents. Crucially, the same models can learn grounded policies when the task design necessitates it, revealing that the issue is not a fundamental limitation of the learning models but a failure of the benchmark design. Our analysis also suggests that modern MARL environments may not adequately test the core assumptions of Dec-POMDPs. We therefore advocate for new cooperative environments built upon two core principles: (1) behaviours grounded in observations and (2) memory-based reasoning about other agents, ensuring success requires genuine skill rather than fragile, co-adapted agreements.
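As background for the abstract's framing (this is the standard textbook definition, not quoted from the paper), a Dec-POMDP is the tuple:

```latex
% Standard Dec-POMDP definition (background; notation varies across sources
% and may differ from the paper's).
\[
\mathcal{M} \;=\; \langle \mathcal{I},\, \mathcal{S},\, \{\mathcal{A}_i\}_{i\in\mathcal{I}},\,
P,\, R,\, \{\Omega_i\}_{i\in\mathcal{I}},\, O,\, \gamma \rangle
\]
% I: set of agents; S: states; A_i: agent i's actions; P(s' | s, a): joint
% transition function; R(s, a): shared team reward; Omega_i: agent i's
% observations; O(o | s', a): joint observation function; gamma: discount.
```

Because each agent conditions on its own observation history rather than the state, recovering a Markov signal requires memory; the paper's position is that learned conventions often sidestep both the observations and the memory.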
Related papers
- Probing Dec-POMDP Reasoning in Cooperative MARL [6.246549316580709]
We introduce a diagnostic suite combining statistically grounded performance comparisons and information-theoretic probes. We audit the behavioural complexity of baseline policies across 37 scenarios spanning MPE, SMAX, Overcooked, Hanabi, and MaBrax. Our diagnostics reveal that success on these benchmarks rarely requires genuine Dec-POMDP reasoning.
arXiv Detail & Related papers (2026-02-24T11:44:46Z)
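One way to picture the "information-theoretic probes" mentioned above (a hypothetical illustration, not the paper's actual diagnostic code): estimate the mutual information between an agent's actions and its observations; a convention-following policy should show little dependence on what it observes.

```python
import numpy as np

def mutual_information(x, y):
    """Plug-in estimate of I(X; Y) in nats for paired discrete samples."""
    x, y = np.asarray(x), np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (np.mean(x == xv) * np.mean(y == yv)))
    return mi

# Hypothetical usage: `obs_bins` and `actions` are discretised rollout data.
# Near-zero MI suggests the policy ignores its observations (a convention);
# substantial MI suggests observation-grounded behaviour.
obs_bins = np.array([0, 1, 0, 2, 1, 0, 2, 1])
actions  = np.array([1, 0, 1, 1, 0, 1, 1, 0])
print(f"I(action; obs) ~ {mutual_information(actions, obs_bins):.3f} nats")
```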
- Model-Based Reinforcement Learning Under Confounding [3.5690236380446163]
We investigate model-based reinforcement learning in contextual Markov decision processes (C-MDPs) in which the context is unobserved and induces confounding in the offline dataset. We adapt a proximal off-policy evaluation approach that identifies the confounded reward expectation using only observable state-action-reward trajectories, under mild invertibility conditions on proxy variables. The proposed formulation enables principled model learning and planning in confounded environments where contextual information is unobserved, unavailable, or impractical to collect.
arXiv Detail & Related papers (2025-12-08T13:02:00Z)
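A sketch of why the unobserved context confounds offline data (illustrative background on C-MDPs, not the paper's identification argument): if the behaviour policy that generated the dataset could see the context c, the observed transition frequencies are a mixture weighted by the posterior over c rather than its prior:

```latex
% Illustrative confounding argument (background, not the paper's derivation).
\[
\Pr_{\mathrm{data}}(s' \mid s, a)
  \;=\; \sum_{c} \Pr(c \mid s, a)\, P_c(s' \mid s, a)
  \;\neq\; \sum_{c} \Pr(c)\, P_c(s' \mid s, a).
\]
% The inequality arises whenever the behaviour policy's action choice depends
% on c, so naively fitting a single transition model to the dataset is biased.
```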
- The Curious Case of Analogies: Investigating Analogical Reasoning in Large Language Models [22.609819017261632]
Analogical reasoning is at the core of human cognition, serving as an important foundation for a variety of intellectual activities. While prior work has shown that LLMs can represent task patterns and surface-level concepts, it remains unclear whether these models can encode high-level relational concepts.
arXiv Detail & Related papers (2025-11-25T14:23:58Z)
- Causal Knowledge Transfer for Multi-Agent Reinforcement Learning in Dynamic Environments [1.2787026473187368]
Multi-agent reinforcement learning (MARL) has achieved notable success in environments where agents must learn coordinated behaviors. Traditional knowledge transfer methods in MARL struggle to generalize, and agents often require costly retraining to adapt. This paper introduces a causal knowledge transfer framework that enables RL agents to learn and share compact causal representations of paths within a non-stationary environment.
arXiv Detail & Related papers (2025-07-18T11:59:55Z)
- Generalization in Monitored Markov Decision Processes (Mon-MDPs) [9.81003561034599]
In many real-world scenarios, rewards are not always observable; this setting can be modeled as a monitored Markov decision process (Mon-MDP). This work explores Mon-MDPs using function approximation (FA) and investigates the challenges involved. We show that combining function approximation with a learned reward model enables agents to generalize from monitored states with observable rewards to unmonitored environment states with unobservable rewards.
arXiv Detail & Related papers (2025-05-13T21:58:25Z)
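A minimal sketch of the "function approximation plus learned reward model" recipe described above (my own toy construction with one-hot state-action features, not the paper's implementation): fit a reward model wherever the monitor reveals rewards, and fall back on its predictions in unmonitored states.

```python
import numpy as np

n_states, n_actions = 10, 2
dim = n_states * n_actions
alpha, gamma = 0.1, 0.95

def phi(s, a):
    """One-hot state-action features -- a deliberately simple choice of FA."""
    x = np.zeros(dim)
    x[s * n_actions + a] = 1.0
    return x

w_r = np.zeros(dim)  # weights of the learned reward model
w_q = np.zeros(dim)  # weights of the Q-function

def update(s, a, r_observed, s_next, done):
    """One TD update. r_observed is None when the state is unmonitored."""
    global w_r, w_q
    x = phi(s, a)
    if r_observed is not None:
        # Monitored: fit the reward model on the revealed reward.
        w_r += alpha * (r_observed - w_r @ x) * x
        r = r_observed
    else:
        # Unmonitored: substitute the reward model's prediction.
        r = w_r @ x
    q_next = 0.0 if done else max(w_q @ phi(s_next, b) for b in range(n_actions))
    w_q += alpha * (r + gamma * q_next - w_q @ x) * x
```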
- The Lessons of Developing Process Reward Models in Mathematical Reasoning [62.165534879284735]
Process Reward Models (PRMs) aim to identify and mitigate intermediate errors in reasoning processes. We develop a consensus filtering mechanism that effectively integrates Monte Carlo (MC) estimation with Large Language Models (LLMs). We release a new state-of-the-art PRM that outperforms existing open-source alternatives.
arXiv Detail & Related papers (2025-01-13T13:10:16Z)
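Reading the abstract literally, "consensus filtering" of step labels might look like the following (a hypothetical sketch; `llm_judge` stands in for an LLM call and is not a real API):

```python
def consensus_filter(steps, mc_labels, llm_judge):
    """Keep a step annotation only when the Monte Carlo estimate and the
    LLM judge agree on its correctness; drop disagreements as noisy.

    steps:      reasoning steps (strings)
    mc_labels:  bool per step, True if MC rollouts deem the step correct
    llm_judge:  hypothetical callable: step -> bool (LLM-as-critic verdict)
    """
    return [
        (step, mc_ok)
        for step, mc_ok in zip(steps, mc_labels)
        if llm_judge(step) == mc_ok
    ]

# Toy usage with a stand-in judge that rejects any step containing "5":
demo_judge = lambda step: "5" not in step
steps = ["2 + 2 = 4", "2 + 2 = 5"]
print(consensus_filter(steps, [True, True], demo_judge))
# -> [('2 + 2 = 4', True)]; the second step is dropped because MC and the
#    judge disagree about it.
```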
- Disentangling Memory and Reasoning Ability in Large Language Models [97.26827060106581]
We propose a new inference paradigm that decomposes the complex inference process into two distinct and clear actions. Our experimental results show that this decomposition improves model performance and enhances the interpretability of the inference process.
arXiv Detail & Related papers (2024-11-20T17:55:38Z)
- MR-Ben: A Meta-Reasoning Benchmark for Evaluating System-2 Thinking in LLMs [55.20845457594977]
Large language models (LLMs) have shown increasing capability in problem-solving and decision-making. We present MR-Ben, a process-based benchmark that demands meta-reasoning skill. Our meta-reasoning paradigm is especially suited for system-2 slow thinking.
arXiv Detail & Related papers (2024-06-20T03:50:23Z)
- Towards Effective Evaluations and Comparisons for LLM Unlearning Methods [97.2995389188179]
This paper seeks to refine the evaluation of machine unlearning for large language models. It addresses two key challenges: the robustness of evaluation metrics and the trade-offs between competing goals.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
- Efficient Model-based Multi-agent Reinforcement Learning via Optimistic Equilibrium Computation [93.52573037053449]
H-MARL (Hallucinated Multi-Agent Reinforcement Learning) learns successful equilibrium policies after a few interactions with the environment.
We demonstrate our approach experimentally on an autonomous driving simulation benchmark.
arXiv Detail & Related papers (2022-03-14T17:24:03Z)
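If H-MARL follows the single-agent "hallucinated" upper-confidence construction its name evokes (an assumption on my part; the summary above does not spell this out), the optimistic dynamics would look roughly like:

```latex
% Hallucinated upper-confidence dynamics (assumed single-agent H-UCRL form;
% the multi-agent details are not given in the summary above).
\[
\tilde{s}_{t+1} \;=\; \mu_t(s_t, a_t) \;+\; \beta_t\, \Sigma_t(s_t, a_t)\, \eta_t,
\qquad \eta_t \in [-1, 1]^d,
\]
% mu_t, Sigma_t: the learned model's mean and epistemic-uncertainty estimates;
% eta_t: an auxiliary "hallucinated" control chosen to make outcomes optimistic.
```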
- The Value Equivalence Principle for Model-Based Reinforcement Learning [29.368870568214007]
We argue that the limited representational resources of model-based RL agents are better used to build models that are directly useful for value-based planning.
We show that, as we augment the set of policies and functions considered, the class of value equivalent models shrinks.
We argue that the principle of value equivalence underlies a number of recent empirical successes in RL.
arXiv Detail & Related papers (2020-11-06T18:25:54Z)
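The principle summarised above has a compact formal statement (reconstructed from the standard literature, so notation may differ from the paper): a model m~ is value equivalent to the true model m* with respect to a set of policies and a set of functions when their Bellman operators agree on those sets:

```latex
% Value equivalence (standard statement; notation may differ from the paper).
\[
\mathcal{T}^{\pi}_{\tilde{m}}\, v \;=\; \mathcal{T}^{\pi}_{m^{*}}\, v
\qquad \text{for all } \pi \in \Pi,\; v \in \mathcal{V},
\]
% where the Bellman operator induced by a model m is
\[
(\mathcal{T}^{\pi}_{m} v)(s) \;=\;
\mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\left[ r_m(s, a) + \gamma\,
\mathbb{E}_{s' \sim P_m(\cdot \mid s, a)}\, v(s') \right].
\]
```

Enlarging the policy or function sets adds constraints, which is why the class of value-equivalent models shrinks as they grow, as the summary notes.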
- REMAX: Relational Representation for Multi-Agent Exploration [13.363887960136102]
We propose a learning-based exploration strategy to generate the initial states of a game.
We demonstrate that our method improves the training and performance of the MARL model more than existing exploration methods.
arXiv Detail & Related papers (2020-08-12T10:23:35Z)
- Off-policy Evaluation in Infinite-Horizon Reinforcement Learning with Latent Confounders [62.54431888432302]
We study an off-policy evaluation (OPE) problem in an infinite-horizon, ergodic Markov decision process with unobserved confounders.
We show how, given only a latent variable model for states and actions, policy value can be identified from off-policy data.
arXiv Detail & Related papers (2020-07-27T22:19:01Z)
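For context (the unconfounded baseline, not the paper's confounder-adjusted estimator), infinite-horizon OPE in an ergodic MDP typically rests on the stationary density-ratio identity for the long-run average reward:

```latex
% Unconfounded infinite-horizon OPE identity (background only; the paper's
% contribution is the extension to unobserved confounders via a latent model).
\[
v(\pi) \;=\; \mathbb{E}_{(s,a) \sim d_{b}}\!\left[
  \frac{d_{\pi}(s, a)}{d_{b}(s, a)}\; r(s, a)
\right],
\]
% d_pi, d_b: stationary state-action distributions of the target and
% behaviour policies. Unobserved confounders break this identity, which is
% why the paper requires a latent variable model for states and actions.
```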