RL-CFR: Improving Action Abstraction for Imperfect Information
Extensive-Form Games with Reinforcement Learning
- URL: http://arxiv.org/abs/2403.04344v1
- Date: Thu, 7 Mar 2024 09:12:23 GMT
- Title: RL-CFR: Improving Action Abstraction for Imperfect Information
Extensive-Form Games with Reinforcement Learning
- Authors: Boning Li, Zhixuan Fang and Longbo Huang
- Abstract summary: We introduce RL-CFR, a novel reinforcement learning (RL) approach for dynamic action abstraction.
RL-CFR builds upon our innovative Markov Decision Process (MDP) formulation, with states corresponding to public information and actions represented as feature vectors indicating specific action abstractions.
In experiments on Heads-up No-limit Texas Hold'em, RL-CFR outperforms ReBeL's replication and Slumbot, demonstrating significant win-rate margins of $64\pm 11$ and $84\pm 17$ mbb/hand, respectively.
- Score: 42.80561441946148
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effective action abstraction is crucial in tackling challenges associated
with large action spaces in Imperfect Information Extensive-Form Games
(IIEFGs). However, due to the vast state space and computational complexity in
IIEFGs, existing methods often rely on fixed abstractions, resulting in
sub-optimal performance. In response, we introduce RL-CFR, a novel
reinforcement learning (RL) approach for dynamic action abstraction. RL-CFR
builds upon our innovative Markov Decision Process (MDP) formulation, with
states corresponding to public information and actions represented as feature
vectors indicating specific action abstractions. The reward is defined as the
expected payoff difference between the selected and default action
abstractions. RL-CFR constructs a game tree with RL-guided action abstractions
and utilizes counterfactual regret minimization (CFR) for strategy derivation.
Impressively, it can be trained from scratch, achieving higher expected payoff
without increased CFR solving time. In experiments on Heads-up No-limit Texas
Hold'em, RL-CFR outperforms ReBeL's replication and Slumbot, demonstrating
significant win-rate margins of $64\pm 11$ and $84\pm 17$ mbb/hand,
respectively.
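As a rough illustration of the MDP formulation sketched above, the hedged Python snippet below treats the public information as the RL state, lets an action feature vector select a bet-size abstraction, and computes the reward as the payoff gap between the selected and a default abstraction. The class and function names, the fixed bet-fraction menu, and the stubbed solve_with_cfr routine are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the authors' code) of the MDP described in the
# abstract: the state is public information, the action is a feature vector
# that picks an action abstraction, and the reward is the expected-payoff
# gap between the selected and the default abstraction.
from dataclasses import dataclass
from typing import Sequence

import numpy as np

# Hypothetical menu of bet sizes (fractions of the pot); the actual
# abstraction space used by RL-CFR is not specified here.
BET_FRACTIONS = np.array([0.33, 0.5, 0.75, 1.0, 2.0])


@dataclass
class PublicState:
    """Public information only: board cards, pot size, betting history."""
    board: Sequence[str]
    pot: float
    history: Sequence[str]


def feature_to_abstraction(action_feature: np.ndarray) -> list[float]:
    """Map an RL action (feature vector) to a concrete set of bet sizes.
    Here: keep the bet fractions whose feature weight is positive."""
    assert action_feature.shape == BET_FRACTIONS.shape
    chosen = BET_FRACTIONS[action_feature > 0.0]
    return chosen.tolist() or [1.0]  # fall back to a single pot-size bet


def solve_with_cfr(state: PublicState, bet_sizes: list[float]) -> float:
    """Placeholder for building the abstracted game tree and running CFR;
    returns a dummy expected payoff instead of a real CFR solution."""
    rng = np.random.default_rng(abs(hash(tuple(bet_sizes))) % (2**32))
    return float(rng.normal(loc=len(bet_sizes), scale=1.0))


def rl_reward(state: PublicState, action_feature: np.ndarray,
              default_sizes: tuple[float, ...] = (0.5, 1.0)) -> float:
    """Reward = expected payoff under the selected abstraction minus the
    expected payoff under the default abstraction (as in the abstract)."""
    selected = feature_to_abstraction(action_feature)
    return solve_with_cfr(state, selected) - solve_with_cfr(state, list(default_sizes))


if __name__ == "__main__":
    state = PublicState(board=["As", "Kd", "7h"], pot=20.0, history=["call", "raise"])
    feature = np.array([1.0, -0.2, 0.4, -0.1, 0.8])  # hypothetical policy output
    print("reward:", rl_reward(state, feature))
```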
Related papers
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs).
This work investigates this problem through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld.
We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we termed thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z)
- Beyond Human Preferences: Exploring Reinforcement Learning Trajectory Evaluation and Improvement through LLMs [12.572869123617783]
Reinforcement learning (RL) faces challenges in evaluating policy trajectories within intricate game tasks.
Preference-based RL (PbRL) presents a pioneering framework that capitalizes on human preferences as pivotal reward signals.
We propose an LLM-enabled automatic preference generation framework named LLM4PG.
arXiv Detail & Related papers (2024-06-28T04:21:24Z)
- EventRL: Enhancing Event Extraction with Outcome Supervision for Large Language Models [48.136950450053476]
EventRL is a reinforcement learning approach developed to enhance event extraction for large language models (LLMs).
We evaluate EventRL against existing methods such as Few-Shot Prompting (FSP) and Supervised Fine-Tuning (SFT).
Our findings show that EventRL significantly outperforms these conventional approaches, improving performance in identifying and structuring events.
arXiv Detail & Related papers (2024-02-18T02:41:06Z)
- Improving Large Language Models via Fine-grained Reinforcement Learning with Minimum Editing Constraint [104.53687944498155]
Reinforcement learning (RL) has been widely used in training large language models (LLMs).
We propose a new RL method named RLMEC that incorporates a generative model as the reward model.
Based on the generative reward model, we design a token-level RL objective for training and an imitation-based regularization for stabilizing the RL process.
arXiv Detail & Related papers (2024-01-11T17:58:41Z)
- CODEX: A Cluster-Based Method for Explainable Reinforcement Learning [0.0]
We present a method that incorporates semantic clustering, which can effectively summarize RL agent behavior in the state-action space.
Experiments on the MiniGrid and StarCraft II gaming environments reveal that the semantic clusters retain temporal as well as entity information.
arXiv Detail & Related papers (2023-12-07T11:04:37Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- Data-Driven Evaluation of Training Action Space for Reinforcement Learning [1.370633147306388]
This paper proposes a Shapley-inspired methodology for training action space categorization and ranking.
To reduce exponential-time Shapley computations, the methodology includes a Monte Carlo simulation.
The proposed data-driven methodology generalizes across different domains, use cases, and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-08T04:53:43Z)
- A Simple Reward-free Approach to Constrained Reinforcement Learning [33.813302183231556]
This paper bridges reward-free RL and constrained RL. In particular, we propose a simple meta-algorithm such that, given any reward-free RL oracle, the approachability and constrained RL problems can be directly solved with negligible overhead in sample complexity.
arXiv Detail & Related papers (2021-07-12T06:27:30Z)
- Residual Reinforcement Learning from Demonstrations [51.56457466788513]
Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal.
We extend the residual formulation to learn from visual inputs and sparse rewards using demonstrations.
Our experimental evaluation on simulated manipulation tasks on a 6-DoF UR5 arm and a 28-DoF dexterous hand demonstrates that residual RL from demonstrations is able to generalize to unseen environment conditions more flexibly than either behavioral cloning or RL fine-tuning.
arXiv Detail & Related papers (2021-06-15T11:16:49Z)
- RLCFR: Minimize Counterfactual Regret by Deep Reinforcement Learning [15.126468724917288]
We propose a framework, RLCFR, which aims at improving the generalization ability of the CFR method.
In RLCFR, the game strategy is solved by CFR within a reinforcement learning framework.
Our method then learns a policy that selects an appropriate regret-updating rule at each iteration.
arXiv Detail & Related papers (2020-09-10T14:20:33Z)
- Learning Abstract Models for Strategic Exploration and Fast Reward Transfer [85.19766065886422]
We learn an accurate Markov Decision Process (MDP) over abstract states to avoid compounding errors.
Our approach achieves strong results on three of the hardest Arcade Learning Environment games.
We can reuse the learned abstract MDP for new reward functions, achieving higher reward in 1000x fewer samples than model-free methods trained from scratch.
arXiv Detail & Related papers (2020-07-12T03:33:50Z)