Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation
- URL: http://arxiv.org/abs/2406.09068v3
- Date: Wed, 30 Oct 2024 12:08:43 GMT
- Title: Dispelling the Mirage of Progress in Offline MARL through Standardised Baselines and Evaluation
- Authors: Claude Formanek, Callum Rhys Tilbury, Louise Beyers, Jonathan Shock, Arnu Pretorius
- Abstract summary: Offline multi-agent reinforcement learning (MARL) is an emerging field with great promise for real-world applications.
The current state of research in offline MARL is plagued by inconsistencies in baselines and evaluation protocols.
- Score: 3.5490824406092405
- License:
- Abstract: Offline multi-agent reinforcement learning (MARL) is an emerging field with great promise for real-world applications. Unfortunately, the current state of research in offline MARL is plagued by inconsistencies in baselines and evaluation protocols, which ultimately makes it difficult to accurately assess progress, to trust newly proposed innovations, and for researchers to easily build upon prior work. In this paper, we first identify significant shortcomings in existing methodologies for measuring the performance of novel algorithms, through a representative study of published offline MARL work. Second, by comparing directly against this prior work, we demonstrate that simple, well-implemented baselines can achieve state-of-the-art (SOTA) results across a wide range of tasks. Specifically, we show that on 35 out of 47 datasets used in prior work (almost 75% of cases), we match or surpass the performance of the current purported SOTA. Strikingly, our baselines often substantially outperform these more sophisticated algorithms. Finally, we correct for the shortcomings highlighted in this prior work by introducing a straightforward standardised methodology for evaluation, and by providing our baseline implementations with statistically robust results across several scenarios, useful for comparisons in future work. Our proposal comprises simple and sensible steps that are easy to adopt, which in combination with solid baselines and comparative results could substantially improve the overall rigour of empirical science in offline MARL moving forward.
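To make "statistically robust results" concrete, here is a minimal sketch of the kind of aggregation such evaluation protocols typically prescribe: an interquartile mean (IQM) over independent runs, with a percentile-bootstrap confidence interval. This is an illustration in the spirit of the abstract, not the paper's exact protocol; the function names and score values are assumptions.

```python
import numpy as np

def iqm(scores: np.ndarray) -> float:
    """Interquartile mean: the mean of the middle 50% of values."""
    q25, q75 = np.percentile(scores, [25, 75])
    return float(scores[(scores >= q25) & (scores <= q75)].mean())

def bootstrap_ci(scores: np.ndarray, n_boot: int = 10_000,
                 alpha: float = 0.05, seed: int = 0):
    """Percentile-bootstrap confidence interval for the IQM.

    `scores` has shape (n_runs,): one final normalised return per
    independent run (seed) of an algorithm on a dataset.
    """
    rng = np.random.default_rng(seed)
    stats = np.array([
        iqm(rng.choice(scores, size=scores.size, replace=True))
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return iqm(scores), (lo, hi)

# Hypothetical scores from 10 independent runs on one dataset.
scores = np.array([0.62, 0.71, 0.58, 0.66, 0.69, 0.64, 0.73, 0.60, 0.67, 0.65])
point, (lo, hi) = bootstrap_ci(scores)
print(f"IQM = {point:.3f}, 95% CI = [{lo:.3f}, {hi:.3f}]")
```

Reporting an interval over many runs, rather than a best-seed point estimate, is the sort of simple, easy-to-adopt step the abstract alludes to.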
Related papers
- Coordination Failure in Cooperative Offline MARL [3.623224034411137]
We focus on coordination failure and investigate the role of joint actions in multi-agent policy gradients with offline data.
By using two-player games as an analytical tool, we demonstrate a simple yet overlooked failure mode of BRUD-based algorithms.
We propose an approach to mitigate such failure, by prioritising samples from the dataset based on joint-action similarity.
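A minimal, hypothetical rendering of "prioritising samples based on joint-action similarity": weight each logged transition by how close its joint action is to the one the current joint policy would select, then sample batches under those weights. The Gaussian-style kernel, array shapes, and names below are assumptions, not the paper's implementation.

```python
import numpy as np

def similarity_weights(dataset_joint_actions: np.ndarray,
                       policy_joint_actions: np.ndarray,
                       temperature: float = 1.0) -> np.ndarray:
    """Sampling weights favouring transitions whose logged joint action
    is close to the current policy's joint action at the same state.

    Both arrays have shape (n_transitions, n_agents * action_dim).
    """
    sq_dist = np.sum((dataset_joint_actions - policy_joint_actions) ** 2, axis=1)
    logits = -sq_dist / temperature   # closer joint actions get larger logits
    logits -= logits.max()            # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

# Draw a prioritised batch of transition indices under these weights.
rng = np.random.default_rng(0)
weights = similarity_weights(rng.normal(size=(1000, 6)), rng.normal(size=(1000, 6)))
batch_idx = rng.choice(1000, size=256, replace=True, p=weights)
```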
arXiv Detail & Related papers (2024-07-01T14:51:29Z)
- Benchmarking Educational Program Repair [4.981275578987307]
Large language models (LLMs) can be used to generate learning resources, improve error messages, and provide feedback on code.
There is a pressing need for standardization and benchmarks that facilitate the equitable comparison of competing approaches.
In this article, we propose a novel educational program repair benchmark.
arXiv Detail & Related papers (2024-05-08T18:23:59Z)
- Simple Ingredients for Offline Reinforcement Learning [86.1988266277766]
Offline reinforcement learning algorithms have proven effective on datasets highly connected to the target downstream task.
We show that existing methods struggle with diverse data: their performance considerably deteriorates as data collected for related but different tasks is simply added to the offline buffer.
We show that scale, more than algorithmic considerations, is the key factor influencing performance.
arXiv Detail & Related papers (2024-03-19T18:57:53Z)
- How much can change in a year? Revisiting Evaluation in Multi-Agent Reinforcement Learning [4.653136482223517]
We extend a previously published database of evaluation methodology, containing meta-data on MARL publications from top-rated conferences.
We compare the findings extracted from this updated database to the trends identified in the original work.
We do observe a trend towards more difficult scenarios in SMAC-v1 which, if continued into SMAC-v2, will encourage novel algorithmic development.
arXiv Detail & Related papers (2023-12-13T19:06:34Z)
- Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations [111.88727295707454]
This paper reexamines the research on out-of-distribution (OOD) robustness in the field of NLP.
We propose a benchmark construction protocol that ensures clear differentiation and challenging distribution shifts.
We conduct experiments on pre-trained language models for analysis and evaluation of OOD robustness.
arXiv Detail & Related papers (2023-06-07T17:47:03Z)
- Re-Evaluating LiDAR Scene Flow for Autonomous Driving [80.37947791534985]
Popular benchmarks for self-supervised LiDAR scene flow have unrealistic rates of dynamic motion, unrealistic correspondences, and unrealistic sampling patterns.
We evaluate a suite of top methods on real-world datasets.
We show that despite the emphasis placed on learning, most performance gains are caused by pre- and post-processing steps.
arXiv Detail & Related papers (2023-04-04T22:45:50Z)
- Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate the design choices involved, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches.
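One common way to "apply existing off-policy methods to leverage offline data" is to co-sample each training batch from both the offline dataset and the online replay buffer. A minimal sketch follows, assuming each buffer exposes a `sample(n, rng)` method returning a dict of arrays; the 50/50 split is illustrative, not a claim about this paper's exact recipe.

```python
import numpy as np

def mixed_batch(offline_buffer, online_buffer, rng,
                batch_size: int = 256, offline_frac: float = 0.5):
    """Build a training batch mixing offline and online transitions.

    Assumes each buffer exposes sample(n, rng) -> dict of identically
    keyed numpy arrays. offline_frac = 0.5 gives the symmetric split
    often used when running standard off-policy RL on top of an
    offline dataset.
    """
    n_offline = int(batch_size * offline_frac)
    offline = offline_buffer.sample(n_offline, rng)
    online = online_buffer.sample(batch_size - n_offline, rng)
    return {k: np.concatenate([offline[k], online[k]], axis=0) for k in offline}
```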
arXiv Detail & Related papers (2023-02-06T17:30:22Z)
- Towards a Standardised Performance Evaluation Protocol for Cooperative MARL [2.2977300225306583]
Multi-agent reinforcement learning (MARL) has emerged as a useful approach to solving decentralised decision-making problems at scale.
We take a closer look at this rapid development with a focus on evaluation methodologies employed across a large body of research in cooperative MARL.
We propose a standardised performance evaluation protocol for cooperative MARL.
arXiv Detail & Related papers (2022-09-21T16:40:03Z)
- Offline Stochastic Shortest Path: Learning, Evaluation and Towards Optimality [57.91411772725183]
In this paper, we consider the offline stochastic shortest path problem when the state space and the action space are finite.
We design simple value-based algorithms for tackling both offline policy evaluation (OPE) and offline policy learning tasks.
Our analysis of these simple algorithms yields strong instance-dependent bounds which can imply worst-case bounds that are near-minimax optimal.
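A hedged sketch of a simple value-based approach in this finite setting: estimate transitions and rewards empirically from the logged data, then run value iteration, keeping unvisited state-action pairs at a pessimistic zero. Note this uses discounting for simplicity, whereas stochastic shortest path problems are typically undiscounted with an absorbing goal state; all names here are assumptions, and the zero-value clamp is a crude stand-in for the instance-dependent penalties used in the theory.

```python
import numpy as np

def offline_value_iteration(transitions, n_states, n_actions,
                            gamma=0.99, n_iters=500):
    """Value iteration on empirical estimates built from logged data.

    `transitions` is an iterable of (s, a, r, s_next) tuples with
    integer states/actions.
    """
    counts = np.zeros((n_states, n_actions, n_states))
    rewards = np.zeros((n_states, n_actions))
    for s, a, r, s2 in transitions:
        counts[s, a, s2] += 1
        rewards[s, a] += r
    n_sa = counts.sum(axis=2)                 # visit counts per (s, a)
    visited = n_sa > 0
    p_hat = np.zeros_like(counts)
    p_hat[visited] = counts[visited] / n_sa[visited, None]
    r_hat = np.zeros_like(rewards)
    r_hat[visited] = rewards[visited] / n_sa[visited]

    v = np.zeros(n_states)
    for _ in range(n_iters):
        q = r_hat + gamma * p_hat @ v         # shape (n_states, n_actions)
        q[~visited] = 0.0                     # stay pessimistic off-support
        v = q.max(axis=1)
    return v, q
```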
arXiv Detail & Related papers (2022-06-10T07:44:56Z)
- MS MARCO: Benchmarking Ranking Models in the Large-Data Regime [57.37239054770001]
This paper uses the MS MARCO and TREC Deep Learning Track as a case study.
We show how the design of the evaluation effort can encourage or discourage certain outcomes.
We provide some analysis of certain pitfalls, and a statement of best practices for avoiding such pitfalls.
arXiv Detail & Related papers (2021-05-09T20:57:36Z)
- The Surprising Performance of Simple Baselines for Misinformation Detection [4.060731229044571]
We examine the performance of a broad set of modern transformer-based language models.
We present our framework as a baseline for creating and evaluating new methods for misinformation detection.
arXiv Detail & Related papers (2021-04-14T16:25:22Z)