RLCFR: Minimize Counterfactual Regret by Deep Reinforcement Learning
- URL: http://arxiv.org/abs/2009.06373v1
- Date: Thu, 10 Sep 2020 14:20:33 GMT
- Title: RLCFR: Minimize Counterfactual Regret by Deep Reinforcement Learning
- Authors: Huale Li, Xuan Wang, Fengwei Jia, Yifan Li, Yulin Wu, Jiajia Zhang,
Shuhan Qi
- Abstract summary: We propose RLCFR, a framework that aims to improve the generalization ability of the CFR method.
In RLCFR, the game strategy is solved by CFR within a reinforcement learning framework.
RLCFR then learns a policy that selects the appropriate regret-updating rule at each step of the iteration.
- Score: 15.126468724917288
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Counterfactual regret minimization (CFR) is a popular method for
decision-making problems in two-player zero-sum games with imperfect
information. Unlike existing studies, which mostly focus on solving
larger-scale problems or accelerating the solution process, we propose a
framework, RLCFR, that aims to improve the generalization ability of the CFR
method. In RLCFR, the game strategy is solved by CFR within a reinforcement
learning framework, and the dynamic procedure of iterative strategy updating
is modeled as a Markov decision process (MDP). RLCFR then learns a policy that
selects the appropriate regret-updating rule at each step of the iteration. In
addition, a stepwise reward function, proportional to how good the iterated
strategy is at each step, is formulated to learn the action policy. Extensive
experimental results on various games show that the generalization ability of
our method is significantly improved compared with existing state-of-the-art
methods.
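
To make the formulation concrete, the toy sketch below replays the idea in miniature: regret-matching iterates on rock-paper-scissors, with an epsilon-greedy bandit (a stand-in for the paper's learned deep policy) choosing among three candidate regret-update rules each iteration, and the stepwise reward taken as the drop in exploitability of the average strategy profile. The game, the rule set, and the bandit are our simplifications, not the authors' implementation.

# Minimal, simplified sketch of the RLCFR idea on rock-paper-scissors.
# A bandit (stand-in for the paper's learned deep policy) picks a
# regret-update rule each CFR iteration; the stepwise reward is the
# drop in exploitability of the average strategy profile.
import numpy as np

A = np.array([[0., -1., 1.],    # row player's payoffs (R, P, S)
              [1., 0., -1.],
              [-1., 1., 0.]])

def regret_matching(cum):
    pos = np.maximum(cum, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else np.ones(3) / 3

def exploitability(x_avg, y_avg):
    # NashConv of the average profile; zero exactly at equilibrium.
    return (A @ y_avg).max() - (x_avg @ A).min()

def apply_rule(cum, inst, t, rule):
    if rule == 0:                        # vanilla CFR accumulation
        return cum + inst
    if rule == 1:                        # CFR+-style clipping at zero
        return np.maximum(cum + inst, 0.0)
    return cum + t * inst                # linear-CFR-style weighting

rng = np.random.default_rng(0)
cum = [np.zeros(3), np.zeros(3)]         # cumulative regrets
avg = [np.zeros(3), np.zeros(3)]         # cumulative strategies
Q, N = np.zeros(3), np.zeros(3)          # bandit value per update rule
prev = None

for t in range(1, 5001):
    x, y = regret_matching(cum[0]), regret_matching(cum[1])
    avg[0] += x
    avg[1] += y
    u_row = A @ y                        # row action values vs y
    u_col = -(x @ A)                     # column action values vs x
    rule = rng.integers(3) if rng.random() < 0.1 else int(Q.argmax())
    cum[0] = apply_rule(cum[0], u_row - x @ u_row, t, rule)
    cum[1] = apply_rule(cum[1], u_col - y @ u_col, t, rule)
    gap = exploitability(avg[0] / t, avg[1] / t)
    if prev is not None:                 # stepwise reward: improvement
        N[rule] += 1
        Q[rule] += (prev - gap - Q[rule]) / N[rule]
    prev = gap

print(f"exploitability after {t} iters: {gap:.4f}, rule values: {Q}")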
Related papers
- Application of linear regression and quasi-Newton methods to the deep reinforcement learning in continuous action cases [0.0]
The Least Squares Deep Q Network (LS-DQN) method was proposed by Levine et al.
We propose the Double Least Squares Deep Deterministic Policy Gradient (DLS-DDPG) method to extend this approach to the continuous action case.
Numerical experiments conducted in MuJoCo environments showed that the proposed method improved performance, at least in some tasks.
arXiv Detail & Related papers (2025-03-19T08:10:54Z)
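
For context, the core move in LS-DQN is to periodically re-solve the network's output layer as a regularized least-squares problem over a large batch instead of relying on gradient steps alone. A toy rendering of that refit step, with made-up features and targets rather than the paper's code:

# Toy illustration of a least-squares refit of a value head:
# given fixed penultimate-layer features, solve ridge regression for
# the output weights instead of taking noisy gradient steps.
import numpy as np

rng = np.random.default_rng(1)
n, d = 512, 32                      # transitions, feature dimension
phi = rng.normal(size=(n, d))       # hypothetical last-hidden features
q_targets = phi @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

lam = 1.0                           # ridge regularizer
w = np.linalg.solve(phi.T @ phi + lam * np.eye(d), phi.T @ q_targets)

mse = np.mean((phi @ w - q_targets) ** 2)
print(f"refit MSE: {mse:.4f}")      # near the noise floor (~0.01)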
- Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [0.9549646359252346]
We propose dynamic Learning Rate for deep Reinforcement Learning (LRRL).
LRRL is a meta-learning approach that selects the learning rate based on the agent's performance during training.
Our empirical results demonstrate that LRRL can substantially improve the performance of deep RL algorithms.
arXiv Detail & Related papers (2024-10-16T14:15:28Z)
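
A bandit over learning rates is easy to illustrate outside of deep RL. The sketch below swaps the agent's return for a toy quadratic loss; the arm set, the epsilon, and the reward definition are our choices, not LRRL's:

# Toy bandit over learning rates: each "arm" is a step size, the
# reward is the observed loss decrease, and an epsilon-greedy rule
# picks the arm for the next optimization step.
import numpy as np

rng = np.random.default_rng(2)
lrs = np.array([1e-3, 1e-2, 1e-1])  # candidate learning rates (arms)
Q, N = np.zeros(3), np.zeros(3)

theta = np.full(10, 5.0)            # stand-in parameters
loss = lambda th: 0.5 * np.sum(th ** 2)

prev = loss(theta)
for step in range(200):
    arm = rng.integers(3) if rng.random() < 0.2 else int(Q.argmax())
    theta -= lrs[arm] * theta       # gradient of the quadratic is theta
    cur = loss(theta)
    N[arm] += 1
    Q[arm] += (prev - cur - Q[arm]) / N[arm]   # reward: improvement
    prev = cur

print("learned preference (expected improvement per arm):", Q)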
- Burning RED: Unlocking Subtask-Driven Reinforcement Learning and Risk-Awareness in Average-Reward Markov Decision Processes [7.028778922533688]
Average-reward Markov decision processes (MDPs) provide a foundational framework for sequential decision-making under uncertainty.
We study a unique structural property of average-reward MDPs and utilize it to introduce Reward-Extended Differential (or RED) reinforcement learning.
arXiv Detail & Related papers (2024-10-14T14:52:23Z)
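
The RED algorithm's internals are not given in the summary, so the sketch below shows only the background it builds on: tabular differential TD(0) for average-reward MDPs, which learns state values and the average reward together. The two-state chain is hypothetical:

# Tabular differential TD(0) for an average-reward MDP: learn state
# values and the average reward rho together from a single stream.
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical two-state chain: transition matrix P and rewards r.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
r = np.array([1.0, 0.0])

V = np.zeros(2)
rho = 0.0                            # average-reward estimate
alpha, eta = 0.05, 0.1
s = 0
for _ in range(20000):
    s_next = rng.choice(2, p=P[s])
    delta = r[s] - rho + V[s_next] - V[s]   # differential TD error
    V[s] += alpha * delta
    rho += eta * alpha * delta
    s = s_next

print(f"estimated average reward: {rho:.3f}")  # stationary value ~0.667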
- Warm-up Free Policy Optimization: Improved Regret in Linear Markov Decision Processes [12.76843681997386]
Policy Optimization (PO) methods are among the most popular Reinforcement Learning (RL) algorithms in practice.
This paper proposes a PO-based algorithm with rate-optimal regret guarantees under the linear Markov Decision Process (MDP) model.
Our algorithm achieves rate-optimal regret with improved dependence on the other parameters of the problem.
arXiv Detail & Related papers (2024-07-03T12:36:24Z)
- ReconBoost: Boosting Can Achieve Modality Reconcilement [89.4377895465204]
We study the modality-alternating learning paradigm to achieve reconcilement.
We propose a new method called ReconBoost that updates one modality at a time.
We show that the proposed method resembles Friedman's Gradient-Boosting (GB) algorithm, where the updated learner can correct errors made by others.
arXiv Detail & Related papers (2024-05-15T13:22:39Z)
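
The gradient-boosting analogy can be seen in a toy version of the alternating update: one linear learner per "modality" (feature group), each refit to the residual the other leaves, as in Friedman's gradient boosting with squared loss. Data and learners below are our stand-ins:

# Toy alternating-boosting sketch: a learner for modality 1 and a
# learner for modality 2 are refit in turns, each on the residual the
# current ensemble leaves (gradient boosting with squared loss).
import numpy as np

rng = np.random.default_rng(4)
n = 200
x1 = rng.normal(size=(n, 3))        # "modality" 1 features
x2 = rng.normal(size=(n, 3))        # "modality" 2 features
y = x1 @ np.array([1., -2., 0.5]) + x2 @ np.array([0.5, 1., -1.])

w1, w2 = np.zeros(3), np.zeros(3)   # one linear learner per modality
for round_ in range(20):
    # update modality 1 on the residual of the modality-2 learner
    res = y - x2 @ w2
    w1 = np.linalg.lstsq(x1, res, rcond=None)[0]
    # then update modality 2 on the residual of the modality-1 learner
    res = y - x1 @ w1
    w2 = np.linalg.lstsq(x2, res, rcond=None)[0]

mse = np.mean((x1 @ w1 + x2 @ w2 - y) ** 2)
print(f"ensemble MSE after alternating rounds: {mse:.6f}")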
- Hierarchical Deep Counterfactual Regret Minimization [53.86223883060367]
In this paper, we introduce the first hierarchical version of Deep CFR (HDCFR), an innovative method that boosts learning efficiency in tasks involving extensively large state spaces and deep game trees.
A notable advantage of HDCFR over previous works is its ability to facilitate learning with predefined (human) expertise and foster the acquisition of skills that can be transferred to similar tasks.
arXiv Detail & Related papers (2023-05-27T02:05:41Z)
- Reinforcement Learning Methods for Wordle: A POMDP/Adaptive Control Approach [0.3093890460224435]
We address the solution of the popular Wordle puzzle, using new reinforcement learning methods.
These methods yield on-line solution strategies for the Wordle puzzle that are very close to optimal at relatively modest computational cost.
arXiv Detail & Related papers (2022-11-15T03:46:41Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
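
In miniature, the "approximate upper confidence bound of the critics" amounts to acting on mean-plus-disagreement of an ensemble. The toy example below uses two hand-set critic tables and omits the actor-critic training and DICE correction entirely:

# Toy optimistic exploration: act greedily w.r.t. an upper confidence
# bound built from a critic ensemble (mean + beta * disagreement).
import numpy as np

q_ensemble = np.array([[1.0, 0.5, 0.2],    # critic 1's action values
                       [0.8, 1.1, 0.2]])   # critic 2's action values
beta = 1.0

mean = q_ensemble.mean(axis=0)
disagreement = q_ensemble.std(axis=0)      # proxy for uncertainty
ucb = mean + beta * disagreement

print("greedy action:", int(mean.argmax()))     # exploits: action 0
print("optimistic action:", int(ucb.argmax()))  # explores: action 1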
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement-learning-based zeroth-order algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
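
As background, ZO-RL starts from the standard two-point zeroth-order gradient estimator shown below; the paper's contribution is to replace the random perturbation sampler (marked in the comment) with a learned sampling policy, which we do not reproduce:

# Two-point zeroth-order gradient estimate: probe f along random
# directions u and average (f(x + mu*u) - f(x)) / mu * u.
import numpy as np

rng = np.random.default_rng(5)
f = lambda x: 0.5 * np.sum(x ** 2)    # toy objective; true grad is x

def zo_gradient(f, x, n_dirs=64, mu=1e-3):
    g = np.zeros_like(x)
    fx = f(x)
    for _ in range(n_dirs):
        u = rng.normal(size=x.shape)  # ZO-RL would learn this sampler
        g += (f(x + mu * u) - fx) / mu * u
    return g / n_dirs

x = np.array([1.0, -2.0, 3.0])
print("ZO estimate:", zo_gradient(f, x))
print("true gradient:", x)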
- Escaping from Zero Gradient: Revisiting Action-Constrained Reinforcement Learning via Frank-Wolfe Policy Optimization [5.072893872296332]
Action-constrained reinforcement learning (RL) is a widely-used approach in various real-world applications.
We propose a learning algorithm that decouples the action constraints from the policy parameter update.
We show that the proposed algorithm significantly outperforms the benchmark methods on a variety of control tasks.
arXiv Detail & Related papers (2021-02-22T14:28:03Z)
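
The Frank-Wolfe idea for constrained actions, in miniature: rather than projecting an unconstrained update, each step solves a linear subproblem over the constraint set, which for a box has a closed-form corner solution. The box and the concave toy objective below are ours:

# One Frank-Wolfe loop for maximizing an action objective over a box
# constraint: the linear subproblem over a box picks a corner.
import numpy as np

lo, hi = np.array([-1., -1.]), np.array([1., 1.])  # action box
grad_u = lambda a: np.array([1.0, -0.5]) - a       # grad of concave toy objective

a = np.zeros(2)                         # current (feasible) action
for k in range(50):
    g = grad_u(a)
    corner = np.where(g > 0, hi, lo)    # argmax of <g, s> over the box
    gamma = 2.0 / (k + 2.0)             # standard FW step size
    a = a + gamma * (corner - a)        # stays feasible by convexity

print("constrained maximizer:", a)      # -> approx [1.0, -0.5]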
- Logistic Q-Learning [87.00813469969167]
We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs.
The main feature of our algorithm is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error.
arXiv Detail & Related papers (2020-10-21T17:14:31Z)
- Discrete Action On-Policy Learning with Action-Value Critic [72.20609919995086]
Reinforcement learning (RL) in discrete action space is ubiquitous in real-world applications, but its complexity grows exponentially with the action-space dimension.
We construct a critic to estimate action-value functions, apply it to correlated actions, and combine these critic-estimated action values to control the variance of gradient estimation.
These efforts result in a new discrete action on-policy RL algorithm that empirically outperforms related on-policy algorithms relying on variance control techniques.
arXiv Detail & Related papers (2020-02-10T04:23:09Z)
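
The variance-control idea in this last entry shows up even in a bandit-sized example: with a critic that estimates values for all discrete actions, subtracting the policy-weighted mean from the sampled action's value leaves the policy gradient unbiased while shrinking its variance. The setup below is our toy, not the paper's algorithm:

# Policy gradient on a toy 3-armed bandit with an all-action critic
# baseline: advantage = Q(a) - sum_a' pi(a') Q(a') reduces variance
# without biasing the gradient.
import numpy as np

rng = np.random.default_rng(6)
true_r = np.array([1.0, 2.0, 0.5])      # expected reward per action
logits = np.zeros(3)
Q = np.zeros(3)                          # critic: action-value estimates

for step in range(3000):
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    a = rng.choice(3, p=pi)
    r = true_r[a] + rng.normal(scale=0.5)
    Q[a] += 0.1 * (r - Q[a])             # critic update
    adv = Q[a] - pi @ Q                  # all-action baseline
    grad = -pi * adv                     # d log pi(a) / d logits
    grad[a] += adv
    logits += 0.1 * grad

print("final policy:", np.round(pi, 3))  # concentrates on action 1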