Optimistic ε-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2502.03506v1
- Date: Wed, 05 Feb 2025 12:06:54 GMT
- Title: Optimistic ε-Greedy Exploration for Cooperative Multi-Agent Reinforcement Learning
- Authors: Ruoning Zhang, Siying Wang, Wenyu Chen, Yang Zhou, Zhitong Zhao, Zixuan Zhang, Ruijie Zhang,
- Abstract summary: We propose Optimistic $epsilon$-Greedy Exploration, focusing on enhancing exploration to correct value estimations.
We introduce an optimistic updating network to identify optimal actions and sample actions from its distribution with a probability of $epsilon$ during exploration.
Experimental results in various environments reveal that the Optimistic $epsilon$-Greedy Exploration effectively prevents the algorithm from suboptimal solutions.
- Score: 16.049852176246038
- License:
- Abstract: The Centralized Training with Decentralized Execution (CTDE) paradigm is widely used in cooperative multi-agent reinforcement learning. However, due to the representational limitations of traditional monotonic value decomposition methods, algorithms can underestimate optimal actions, leading policies to suboptimal solutions. To address this challenge, we propose Optimistic $\epsilon$-Greedy Exploration, focusing on enhancing exploration to correct value estimations. The underestimation arises from insufficient sampling of optimal actions during exploration, as our analysis indicated. We introduce an optimistic updating network to identify optimal actions and sample actions from its distribution with a probability of $\epsilon$ during exploration, increasing the selection frequency of optimal actions. Experimental results in various environments reveal that the Optimistic $\epsilon$-Greedy Exploration effectively prevents the algorithm from suboptimal solutions and significantly improves its performance compared to other algorithms.
Related papers
- Deep Reinforcement Learning for Online Optimal Execution Strategies [49.1574468325115]
This paper tackles the challenge of learning non-Markovian optimal execution strategies in dynamic financial markets.
We introduce a novel actor-critic algorithm based on Deep Deterministic Policy Gradient (DDPG)
We show that our algorithm successfully approximates the optimal execution strategy.
arXiv Detail & Related papers (2024-10-17T12:38:08Z) - Batched Bayesian optimization with correlated candidate uncertainties [44.38372821900645]
We propose an acquisition strategy for discrete optimization motivated by pure exploitation, qPO (multipoint of Optimality)
We apply our method to the model-guided exploration of large chemical libraries and provide empirical evidence that it performs better than or on par with state-of-the-art methods in batched Bayesian optimization.
arXiv Detail & Related papers (2024-10-08T20:13:12Z) - Discovering Preference Optimization Algorithms with and for Large Language Models [50.843710797024805]
offline preference optimization is a key method for enhancing and controlling the quality of Large Language Model (LLM) outputs.
We perform objective discovery to automatically discover new state-of-the-art preference optimization algorithms without (expert) human intervention.
Experiments demonstrate the state-of-the-art performance of DiscoPOP, a novel algorithm that adaptively blends logistic and exponential losses.
arXiv Detail & Related papers (2024-06-12T16:58:41Z) - Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer [52.09480867526656]
We identify the source of misalignment as a form of distributional shift and uncertainty in learning human preferences.
To mitigate overoptimization, we first propose a theoretical algorithm that chooses the best policy for an adversarially chosen reward model.
Using the equivalence between reward models and the corresponding optimal policy, the algorithm features a simple objective that combines a preference optimization loss and a supervised learning loss.
arXiv Detail & Related papers (2024-05-26T05:38:50Z) - Mastering the exploration-exploitation trade-off in Bayesian
Optimization [0.2538209532048867]
The acquisition function drives the choice of the next solution to evaluate, balancing between exploration and exploitation.
This paper proposes a novel acquisition function, mastering the trade-off between explorative and exploitative choices, adaptively.
arXiv Detail & Related papers (2023-05-15T13:19:03Z) - Learning Proximal Operators to Discover Multiple Optima [66.98045013486794]
We present an end-to-end method to learn the proximal operator across non-family problems.
We show that for weakly-ized objectives and under mild conditions, the method converges globally.
arXiv Detail & Related papers (2022-01-28T05:53:28Z) - Momentum Accelerates the Convergence of Stochastic AUPRC Maximization [80.8226518642952]
We study optimization of areas under precision-recall curves (AUPRC), which is widely used for imbalanced tasks.
We develop novel momentum methods with a better iteration of $O (1/epsilon4)$ for finding an $epsilon$stationary solution.
We also design a novel family of adaptive methods with the same complexity of $O (1/epsilon4)$, which enjoy faster convergence in practice.
arXiv Detail & Related papers (2021-07-02T16:21:52Z) - Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
We derive a novel policy gradient-style robust optimization approach, PG-BROIL, to balance expected performance and risk.
Results suggest PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
arXiv Detail & Related papers (2021-06-11T16:49:15Z) - An Efficient Algorithm for Deep Stochastic Contextual Bandits [10.298368632706817]
In contextual bandit problems, an agent selects an action based on certain observed context to maximize the reward over iterations.
Recently there have been a few studies using a deep neural network (DNN) to predict the expected reward for an action, and is trained by a gradient based method.
arXiv Detail & Related papers (2021-04-12T16:34:43Z) - Resource Aware Multifidelity Active Learning for Efficient Optimization [0.8717253904965373]
This paper introduces the Resource Aware Active Learning (RAAL) strategy to accelerate the optimization of black box functions.
The RAAL strategy optimally seeds multiple points at each allowing for a major speed up of the optimization task.
arXiv Detail & Related papers (2020-07-09T10:01:32Z) - Upper Trust Bound Feasibility Criterion for Mixed Constrained Bayesian
Optimization with Application to Aircraft Design [41.74498230885008]
We adapt the so-called super effcient global optimization algorithm to solve more accurately mixed constrained problems.
We show the good potential of the approach on a set of numerical experiments.
arXiv Detail & Related papers (2020-05-11T12:59:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.