Sequential Information Design: Markov Persuasion Process and Its
Efficient Reinforcement Learning
- URL: http://arxiv.org/abs/2202.10678v1
- Date: Tue, 22 Feb 2022 05:41:43 GMT
- Title: Sequential Information Design: Markov Persuasion Process and Its
Efficient Reinforcement Learning
- Authors: Jibang Wu, Zixuan Zhang, Zhe Feng, Zhaoran Wang, Zhuoran Yang, Michael
I. Jordan, Haifeng Xu
- Abstract summary: This paper proposes a novel model of sequential information design, namely the Markov persuasion processes (MPPs)
Planning in MPPs faces the unique challenge of finding a signaling policy that is simultaneously persuasive to the myopic receivers and optimal for the sender's long-term cumulative utility.
We design a provably efficient no-regret learning algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which features a novel combination of both optimism and pessimism principles.
- Score: 156.5667417159582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In today's economy, it has become important for Internet platforms to
consider the sequential information design problem in order to align their long-term
interests with the incentives of gig service providers. This paper proposes a novel model of
sequential information design, namely the Markov persuasion processes (MPPs),
where a sender, with informational advantage, seeks to persuade a stream of
myopic receivers to take actions that maximize the sender's cumulative
utilities in a finite horizon Markovian environment with varying prior and
utility functions. Planning in MPPs thus faces the unique challenge of finding
a signaling policy that is simultaneously persuasive to the myopic receivers
and optimal for the sender's long-term cumulative utility.
Nevertheless, at the population level, where the model is known, it turns out
that we can efficiently determine the optimal (resp. $\epsilon$-optimal) policy
with finite (resp. infinite) states and outcomes, through a modified
formulation of the Bellman equation.
Our main technical contribution is to study the MPP under the online
reinforcement learning (RL) setting, where the goal is to learn the optimal
signaling policy by interacting with the underlying MPP, without the
knowledge of the sender's utility functions, prior distributions, and the
Markov transition kernels. We design a provably efficient no-regret learning
algorithm, the Optimism-Pessimism Principle for Persuasion Process (OP4), which
features a novel combination of both optimism and pessimism principles. Our
algorithm enjoys sample efficiency by achieving a sublinear $\sqrt{T}$-regret
upper bound. Furthermore, both our algorithm and theory can be applied to MPPs
with large outcome and state spaces via function approximation, and we
showcase such a success in the linear setting.
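
The abstract only alludes to the modified Bellman formulation and to OP4's mix of optimism and pessimism. As a rough, hedged illustration of that general idea (not the paper's actual OP4 algorithm), the toy Python sketch below runs backward induction over a tiny tabular MPP: every candidate recommendation scheme must pass a conservative persuasiveness check with a safety margin (pessimism about the receiver's estimated utilities), while the sender's value estimate is inflated by an exploration bonus (optimism). All quantities and names here (prior, u_sender, u_receiver, P, bonus, margin) and the restriction to deterministic recommendation rules are illustrative assumptions, not taken from the paper.

```python
# Toy optimism-pessimism planning sketch for a tabular, finite-horizon MPP.
# This is NOT the paper's OP4 algorithm; the model arrays, the bonus term,
# the persuasiveness margin, and the restriction to deterministic
# recommendation rules are all illustrative assumptions.
import numpy as np

H, S, O, A = 3, 2, 2, 2          # horizon, states, outcomes, receiver actions
rng = np.random.default_rng(0)

# Estimated model quantities (in the online setting these would be learned).
prior = rng.dirichlet(np.ones(O), size=(H, S))       # prior[h, s] over outcomes
u_sender = rng.random((H, S, O, A))                  # sender utility estimate
u_receiver = rng.random((H, S, O, A))                # receiver utility estimate
P = rng.dirichlet(np.ones(S), size=(H, S, A))        # transition estimate P[h, s, a]
bonus = 0.1                                          # optimism bonus (stand-in for a UCB term)
margin = 0.05                                        # pessimism margin on persuasiveness

# Candidate "direct" signaling schemes: recommend one action per outcome.
schemes = [np.array(rec) for rec in np.ndindex(*(A,) * O)]

def persuasive(h, s, rec):
    """Pessimistic obedience check: conditioned on each recommendation, the
    recommended action must beat every deviation by at least `margin` under
    the receiver's estimated utilities (a crude stand-in for the exact
    Bayesian persuasiveness constraint)."""
    for a in range(A):
        mask = (rec == a)                    # outcomes on which action a is recommended
        if not mask.any():
            continue
        w = prior[h, s] * mask               # unnormalized posterior weights
        gain = w @ u_receiver[h, s, :, a]
        for a_dev in range(A):
            if a_dev != a and w @ u_receiver[h, s, :, a_dev] > gain - margin:
                return False
    return True

V = np.zeros((H + 1, S))                     # optimistic value-to-go
policy = np.zeros((H, S), dtype=int)         # index of the chosen scheme
for h in reversed(range(H)):
    for s in range(S):
        best = -np.inf
        for i, rec in enumerate(schemes):
            if not persuasive(h, s, rec):    # pessimism: drop schemes that may not persuade
                continue
            q = sum(prior[h, s, o] * (u_sender[h, s, o, rec[o]] + bonus
                                      + P[h, s, rec[o]] @ V[h + 1])
                    for o in range(O))
            if q > best:
                best, policy[h, s] = q, i
        V[h, s] = max(best, 0.0)             # fall back to 0 if nothing passes the check

print("optimistic values at h=0:", V[0])
print("chosen scheme index per (h, s):\n", policy)
```

In an online analogue one would expect the bonus and the margin to shrink as the estimates sharpen with more data; the sketch keeps them fixed for simplicity.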
Related papers
- Efficient Learning of POMDPs with Known Observation Model in Average-Reward Setting [56.92178753201331]
We propose the Observation-Aware Spectral (OAS) estimation technique, which enables the POMDP parameters to be learned from samples collected using a belief-based policy.
We show the consistency of the OAS procedure, and we prove a regret guarantee of order $\mathcal{O}(\sqrt{T \log(T)})$ for the proposed OAS-UCRL algorithm.
arXiv Detail & Related papers (2024-10-02T08:46:34Z) - Evaluation of Prosumer Networks for Peak Load Management in Iran: A Distributed Contextual Stochastic Optimization Approach [0.0]
This paper introduces a novel prosumer network framework aimed at mitigating peak loads in Iran.
A cost-oriented integrated prediction and optimization approach is proposed, empowering prosumers to make informed decisions.
Numerical results highlight that integrating prediction with optimization and implementing a contextual information-sharing network among prosumers significantly reduce both peak loads and total costs.
arXiv Detail & Related papers (2024-08-31T16:09:38Z) - DPO Meets PPO: Reinforced Token Optimization for RLHF [36.97894955691627]
We introduce a framework that models RLHF problems as a Markov decision process (MDP)
Under this framework, we introduce an algorithm, dubbed Reinforced Token Optimization (RTO), which learns the token-wise reward function from preference data.
For its practical implementation, RTO innovatively integrates Direct Preference Optimization (DPO) and Proximal Policy Optimization (PPO).
arXiv Detail & Related papers (2024-04-29T17:58:30Z) - Let's reward step by step: Step-Level reward model as the Navigators for
Reasoning [64.27898739929734]
Process-Supervised Reward Model (PRM) furnishes LLMs with step-by-step feedback during the training phase.
We propose a greedy search algorithm that employs the step-level feedback from PRM to optimize the reasoning pathways explored by LLMs.
To explore the versatility of our approach, we develop a novel method to automatically generate a step-level reward dataset for coding tasks and observe similarly improved performance on code generation tasks.
arXiv Detail & Related papers (2023-10-16T05:21:50Z) - Achieving Fairness in Multi-Agent Markov Decision Processes Using
Reinforcement Learning [30.605881670761853]
We propose a Reinforcement Learning approach to achieve fairness in finite-horizon episodic MDPs.
We show that such an approach achieves sub-linear regret in terms of the number of episodes.
arXiv Detail & Related papers (2023-06-01T03:43:53Z) - A Theoretical Analysis of Optimistic Proximal Policy Optimization in
Linear Markov Decision Processes [13.466249082564213]
We propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback.
Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both linear MDPs and adversarial linear MDPs with full information.
arXiv Detail & Related papers (2023-05-15T17:55:24Z) - Making Linear MDPs Practical via Contrastive Representation Learning [101.75885788118131]
It is common to address the curse of dimensionality in Markov decision processes (MDPs) by exploiting low-rank representations.
We consider an alternative definition of linear MDPs that automatically ensures normalization while allowing efficient representation learning.
We demonstrate superior performance over existing state-of-the-art model-based and model-free algorithms on several benchmarks.
arXiv Detail & Related papers (2022-07-14T18:18:02Z) - Stochastic convex optimization for provably efficient apprenticeship
learning [1.0609815608017066]
We consider large-scale Markov decision processes (MDPs) with an unknown cost function.
We employ convex optimization tools to address the problem of imitation learning, which consists of learning a policy from a finite set of expert demonstrations.
arXiv Detail & Related papers (2021-12-31T19:47:57Z) - APS: Active Pretraining with Successor Features [96.24533716878055]
We show that by reinterpreting and combining successor features (Hansen et al.) with nonparametric entropy maximization, the intractable mutual information can be efficiently optimized.
The proposed method, Active Pretraining with Successor Features (APS), explores the environment via nonparametric entropy maximization, and the explored data can be efficiently leveraged to learn behavior.
arXiv Detail & Related papers (2021-08-31T16:30:35Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)