Actions Speak What You Want: Provably Sample-Efficient Reinforcement
Learning of the Quantal Stackelberg Equilibrium from Strategic Feedbacks
- URL: http://arxiv.org/abs/2307.14085v1
- Date: Wed, 26 Jul 2023 10:24:17 GMT
- Title: Actions Speak What You Want: Provably Sample-Efficient Reinforcement
Learning of the Quantal Stackelberg Equilibrium from Strategic Feedbacks
- Authors: Siyu Chen, Mengdi Wang, Zhuoran Yang
- Abstract summary: We study reinforcement learning for learning a Quantal Stackelberg Equilibrium (QSE) in an episodic Markov game with a leader-follower structure.
Our algorithms are based on (i) learning the quantal response model via maximum likelihood estimation and (ii) model-free or model-based RL for solving the leader's decision making problem.
- Score: 94.07688076435818
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study reinforcement learning (RL) for learning a Quantal Stackelberg
Equilibrium (QSE) in an episodic Markov game with a leader-follower structure.
Specifically, at the outset of the game, the leader announces her policy to the
follower and commits to it. The follower observes the leader's policy and, in
turn, adopts a quantal response policy by solving an entropy-regularized policy
optimization problem induced by the leader's policy. The goal of the leader is to
find her optimal policy, which yields the optimal expected total return, by
interacting with the follower and learning from data. A key challenge of this
problem is that the leader cannot observe the follower's reward, and needs to
infer the follower's quantal response model from his actions against the leader's
policies. We propose sample-efficient algorithms for both the online and
offline settings, in the context of function approximation. Our algorithms are
based on (i) learning the quantal response model via maximum likelihood
estimation and (ii) model-free or model-based RL for solving the leader's
decision making problem, and we show that they achieve sublinear regret upper
bounds. Moreover, we quantify the uncertainty of these estimators and leverage
the uncertainty to implement optimistic and pessimistic algorithms for online
and offline settings, respectively. Furthermore, when specialized to the linear and myopic
setting, our algorithms are also computationally efficient. Our theoretical
analysis features a novel performance-difference lemma that incorporates the
error of the quantal response model and might be of independent interest.
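As a rough illustration of part (i) above, the sketch below fits a quantal (logit) response model to observed follower actions by maximum likelihood, assuming a linear follower reward and a known temperature; the names (`quantal_response`, `fit_quantal_model`, `eta`) are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def quantal_response(theta, phi, eta=1.0):
    """Softmax (logit) response implied by an entropy-regularized follower.

    phi   : array of shape (num_actions, d) -- feature vectors phi(s, a) for one state.
    theta : array of shape (d,)             -- parameters of the follower's reward.
    Returns pi(a | s) proportional to exp(<theta, phi(s, a)> / eta).
    """
    logits = phi @ theta / eta
    logits -= logits.max()                      # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def neg_log_likelihood(theta, data, eta=1.0):
    """Negative log-likelihood of observed follower actions under the quantal model.

    data: list of (phi, a) pairs, where phi has shape (num_actions, d)
          and a is the index of the action the follower actually took.
    """
    nll = 0.0
    for phi, a in data:
        pi = quantal_response(theta, phi, eta)
        nll -= np.log(pi[a] + 1e-12)
    return nll

def fit_quantal_model(data, d, eta=1.0, lr=0.1, steps=500):
    """Plain gradient descent on the negative log-likelihood (convex for linear rewards)."""
    theta = np.zeros(d)
    for _ in range(steps):
        grad = np.zeros(d)
        for phi, a in data:
            pi = quantal_response(theta, phi, eta)
            # d/dtheta [-log pi(a)] = (E_pi[phi] - phi(a)) / eta
            grad += (pi @ phi - phi[a]) / eta
        theta -= lr * grad / len(data)
    return theta
```

The paper's algorithms additionally quantify the uncertainty of this estimate and add optimistic (online) or pessimistic (offline) adjustments on top of it; that machinery is omitted here.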
Related papers
- Nash Learning from Human Feedback [86.09617990412941]
We introduce an alternative pipeline for the fine-tuning of large language models using pairwise human feedback.
We term this approach Nash learning from human feedback (NLHF)
We present a novel algorithmic solution, Nash-MD, founded on the principles of mirror descent.
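For context only, here is a generic mirror-descent (exponentiated-gradient) update over the probability simplex, the principle Nash-MD is said to build on; it is not the Nash-MD algorithm itself, and the payoff vector `g` is a placeholder for whatever preference signal the method uses.

```python
import numpy as np

def mirror_descent_step(pi, g, eta=0.1):
    """One mirror-descent step with the entropy mirror map (exponentiated gradient).

    pi : current policy, a probability vector over actions.
    g  : estimated payoff/gradient for each action (placeholder signal).
    Returns the updated policy pi'(a) proportional to pi(a) * exp(eta * g(a)).
    """
    logits = np.log(pi + 1e-12) + eta * g
    logits -= logits.max()
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum()
```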
arXiv Detail & Related papers (2023-12-01T19:26:23Z)
- Stackelberg Batch Policy Learning [3.5426153040167754]
Batch reinforcement learning (RL) defines the task of learning from a fixed batch of data lacking exhaustive exploration.
Worst-case optimality algorithms, which calibrate a value-function model class from logged experience, have emerged as a promising paradigm for batch RL.
We propose a novel gradient-based learning algorithm: StackelbergLearner, in which the leader player updates according to the total derivative of its objective instead of the usual individual gradient.
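A toy sketch of the distinction the summary draws: the leader differentiates through the follower's response (total derivative) rather than treating the follower's action as fixed (individual gradient). The bilevel problem, matrices, and step sizes below are made up for illustration and are not StackelbergLearner itself.

```python
import numpy as np

# A toy bilevel problem, purely illustrative:
#   follower best response:  y*(x) = argmin_y 0.5 * ||y - A x||^2  =>  y*(x) = A x
#   leader objective:        J(x, y) = 0.5 * ||x||^2 - b^T y
A = np.array([[1.0, 2.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])

def follower_response(x):
    return A @ x

def individual_gradient(x, y):
    """Partial derivative of J w.r.t. x, treating the follower's y as fixed."""
    return x

def total_derivative(x):
    """Total derivative of J(x, y*(x)) w.r.t. x: includes dy*/dx = A."""
    dJ_dx = x                     # direct term
    dJ_dy = -b                    # leader's sensitivity to the follower's action
    return dJ_dx + A.T @ dJ_dy    # chain rule through the follower's response

x = np.array([0.5, 0.5])
for _ in range(100):
    x -= 0.1 * total_derivative(x)   # Stackelberg-style leader update
```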
arXiv Detail & Related papers (2023-09-28T06:18:34Z)
- Value-Distributional Model-Based Reinforcement Learning [59.758009422067]
Quantifying uncertainty about a policy's long-term performance is important to solve sequential decision-making tasks.
We study the problem from a model-based Bayesian reinforcement learning perspective.
We propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function.
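As a point of reference, value-distribution methods of this kind typically build on the quantile-regression (pinball) loss; a minimal sketch follows, and it is not EQR's exact objective.

```python
import numpy as np

def quantile_loss(predicted, target, taus):
    """Pinball loss used in quantile regression.

    predicted : array of shape (K,) -- predicted value at K quantile levels.
    target    : scalar sample of the (value) random variable.
    taus      : array of shape (K,) -- quantile levels in (0, 1), e.g. (i + 0.5) / K.
    """
    diff = target - predicted   # positive where we under-predicted
    return np.mean(np.where(diff > 0, taus * diff, (taus - 1.0) * diff))
```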
arXiv Detail & Related papers (2023-08-12T14:59:19Z)
- Reinforcement Learning with Human Feedback: Learning Dynamic Choices via Pessimism [91.52263068880484]
We study offline Reinforcement Learning with Human Feedback (RLHF)
We aim to learn the human's underlying reward and the MDP's optimal policy from a set of trajectories induced by human choices.
RLHF is challenging for multiple reasons: large state space but limited human feedback, the bounded rationality of human decisions, and the off-policy distribution shift.
arXiv Detail & Related papers (2023-05-29T01:18:39Z)
- Follower Agnostic Methods for Stackelberg Games [14.143502615941648]
We present an efficient algorithm to solve online Stackelberg games, featuring multiple followers, in a follower-agnostic manner.
Our approach works even when the leader has no knowledge about the followers' utility functions or strategy space.
arXiv Detail & Related papers (2023-02-02T21:21:14Z)
- No-Regret Learning in Dynamic Stackelberg Games [31.001205916012307]
In a Stackelberg game, a leader commits to a randomized strategy, and a follower chooses their best strategy in response.
We consider an extension of a standard Stackelberg game, called a discrete-time dynamic Stackelberg game, that has an underlying state space that affects the leader's rewards and available strategies and evolves in a Markovian manner depending on both the leader and follower's selected strategies.
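In generic notation (not the paper's), the commitment structure reads:

```latex
% Leader commits to \pi_\ell; the follower best-responds; the leader optimizes
% anticipating that response.
\pi_f^\ast(\pi_\ell) \in \arg\max_{\pi_f} \; V_f(\pi_\ell, \pi_f),
\qquad
\pi_\ell^\ast \in \arg\max_{\pi_\ell} \; V_\ell\bigl(\pi_\ell, \pi_f^\ast(\pi_\ell)\bigr).
```

In the dynamic version described above, both value functions additionally depend on the Markovian state.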
arXiv Detail & Related papers (2022-02-10T01:07:57Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
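One way such a method can avoid evaluating out-of-dataset actions is expectile regression, which IQL is known to use: the state value is regressed toward an upper expectile of the Q-values of actions that actually appear in the data. A minimal sketch of that loss, with illustrative names:

```python
import numpy as np

def expectile_loss(v_pred, q_target, tau=0.9):
    """Asymmetric L2 (expectile) loss: with tau > 0.5, V(s) is pushed toward the
    upper tail of Q(s, a) over actions present in the dataset, so no
    out-of-dataset action ever needs to be evaluated.

    v_pred   : array of predicted state values V(s) for a batch.
    q_target : array of target Q(s, a) values for the (s, a) pairs in the batch.
    """
    diff = q_target - v_pred
    weight = np.where(diff > 0, tau, 1.0 - tau)
    return np.mean(weight * diff ** 2)
```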
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration.
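A minimal sketch of the idea in the summary, assuming linear function approximation and a precomputed variance estimate; it is not the paper's exact estimator, and all argument names are illustrative.

```python
import numpy as np

def weighted_fqi_step(phi, rewards, next_values, sigma2, lam=1e-3):
    """One variance-weighted regression step that reweights the Bellman residual.

    phi         : (n, d) features of (s, a) pairs from the logged data.
    rewards     : (n,)   observed rewards.
    next_values : (n,)   discounted next-step value estimates, gamma * V_hat(s').
    sigma2      : (n,)   estimated variance of the value at each transition.
    Solves a ridge-regularized weighted least squares with weights 1 / sigma2.
    """
    targets = rewards + next_values
    w = 1.0 / np.maximum(sigma2, 1e-6)
    A = phi.T @ (w[:, None] * phi) + lam * np.eye(phi.shape[1])
    b = phi.T @ (w * targets)
    return np.linalg.solve(A, b)
```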
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL [48.552287941528]
Off-policy reinforcement learning holds the promise of sample-efficient learning of decision-making policies.
In the offline RL setting, standard off-policy RL methods can significantly underperform.
We introduce Expected-Max Q-Learning (EMaQ), which is more closely related to the resulting practical algorithm.
arXiv Detail & Related papers (2020-07-21T21:13:02Z)
- Model-free Reinforcement Learning for Stochastic Stackelberg Security Games [7.470839530834359]
We consider a sequential Stackelberg game with two players, a leader and a follower.
The follower has access to the state of the system while the leader does not.
We propose an RL algorithm based on Expected Sarsa that learns the Stackelberg equilibrium policy by simulating a model of the MDP.
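For reference, this is the standard tabular Expected Sarsa update that the proposed algorithm builds on; the paper's full method, including the simulated MDP model, is not shown.

```python
import numpy as np

def expected_sarsa_update(Q, s, a, r, s_next, policy, alpha=0.1, gamma=0.99):
    """Tabular Expected Sarsa update:
        Q(s, a) <- Q(s, a) + alpha * [r + gamma * sum_a' pi(a'|s') Q(s', a') - Q(s, a)]

    Q      : (num_states, num_actions) table of action values.
    policy : (num_states, num_actions) table of action probabilities pi(a | s).
    """
    expected_next = np.dot(policy[s_next], Q[s_next])
    td_target = r + gamma * expected_next
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```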
arXiv Detail & Related papers (2020-05-24T22:34:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.