Markov Decision Process modeled with Bandits for Sequential Decision
Making in Linear-flow
- URL: http://arxiv.org/abs/2107.00204v1
- Date: Thu, 1 Jul 2021 03:54:36 GMT
- Title: Markov Decision Process modeled with Bandits for Sequential Decision
Making in Linear-flow
- Authors: Wenjun Zeng and Yi Liu
- Abstract summary: In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence.
We propose to formulate the problem as an MDP with Bandits where Bandits are employed to model the transition probability matrix.
We observe the proposed MDP with Bandits algorithm outperforms Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent Bandits, and interaction Bandits.
- Score: 73.1896399783641
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In membership/subscriber acquisition and retention, we sometimes need to
recommend marketing content for multiple pages in sequence. Unlike the
general sequential decision making process, these use cases have a simpler flow:
upon seeing the recommended content on each page, customers can only respond by
moving forward in the process or dropping out of it until reaching a
termination state. We refer to this type of problem as sequential decision
making in linear-flow. We propose to formulate the problem as an MDP with
Bandits where Bandits are employed to model the transition probability matrix.
At recommendation time, we use Thompson sampling (TS) to sample the transition
probabilities and allocate the best series of actions with an analytical solution
through exact dynamic programming. The way we formulate the problem allows us
to leverage TS's efficiency in balancing exploration and exploitation and the
Bandit framework's convenience in modeling action incompatibility. In the simulation
study, we observe the proposed MDP with Bandits algorithm outperforms
Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent
Bandits, and interaction Bandits. We also find the proposed algorithm's
performance is the most robust to changes in the across-page interdependence
strength.
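To make the formulation concrete, below is a minimal illustrative sketch based only on the abstract, not the authors' code: a Beta posterior per (page, action) pair plays the role of the Bandit model of the transition probabilities, Thompson sampling draws one set of advance probabilities per customer, and an exact backward dynamic program over the linear flow selects the action for each page. The class and parameter names (LinearFlowTS, n_pages, n_actions) are assumptions made for illustration.

```python
import numpy as np

# Hedged sketch of the "MDP with Bandits" idea from the abstract, not the
# authors' implementation. Assumed setup: a linear flow of n_pages pages;
# on page i we show one of n_actions pieces of content, and the customer
# either advances to page i+1 or drops out. A Beta posterior per
# (page, action) models the advance probability (the Bandit part).
class LinearFlowTS:
    def __init__(self, n_pages, n_actions):
        self.alpha = np.ones((n_pages, n_actions))  # Beta posterior successes
        self.beta = np.ones((n_pages, n_actions))   # Beta posterior failures
        self.n_pages = n_pages

    def recommend(self, rng=np.random):
        # Thompson sampling: one draw of every advance probability.
        p = rng.beta(self.alpha, self.beta)      # shape (n_pages, n_actions)
        # Exact backward dynamic programming over the linear flow.
        value_next = 1.0                         # reward for completing the flow
        plan = [0] * self.n_pages
        for i in reversed(range(self.n_pages)):
            q = p[i] * value_next                # value of each action at page i
            plan[i] = int(np.argmax(q))
            value_next = float(q[plan[i]])
        return plan                              # one action per page

    def update(self, page, action, advanced):
        # Bandit posterior update from the observed feedback on one page.
        if advanced:
            self.alpha[page, action] += 1
        else:
            self.beta[page, action] += 1
```

Because this sketch assumes pages are independent, the backward recursion reduces to a per-page argmax; the across-page interdependence and action incompatibility discussed in the paper would enter through the transition model and the dynamic programming step.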
Related papers
- Amortizing intractable inference in large language models [56.92471123778389]
We use amortized Bayesian inference to sample from intractable posterior distributions.
We empirically demonstrate that this distribution-matching paradigm of LLM fine-tuning can serve as an effective alternative to maximum-likelihood training.
As an important application, we interpret chain-of-thought reasoning as a latent variable modeling problem.
arXiv Detail & Related papers (2023-10-06T16:36:08Z)
- Bayesian Learning of Optimal Policies in Markov Decision Processes with Countably Infinite State-Space [0.0]
We study the problem of optimal control of a family of discrete-time countable state-space Markov Decision Processes.
We propose an algorithm based on Thompson sampling with dynamically-sized episodes.
We show that our algorithm can be applied to develop approximately optimal control algorithms.
arXiv Detail & Related papers (2023-06-05T03:57:16Z)
- Variance-Dependent Regret Bounds for Linear Bandits and Reinforcement Learning: Adaptivity and Computational Efficiency [90.40062452292091]
We present the first computationally efficient algorithm for linear bandits with heteroscedastic noise.
Our algorithm is adaptive to the unknown variance of the noise and achieves an $\tilde{O}(d\sqrt{\sum_{k=1}^{K}\sigma_k^2} + d)$ regret.
We also propose a variance-adaptive algorithm for linear mixture Markov decision processes (MDPs) in reinforcement learning.
arXiv Detail & Related papers (2023-02-21T00:17:24Z)
- Policy Gradient With Serial Markov Chain Reasoning [10.152838128195468]
We introduce a new framework that performs decision-making in reinforcement learning as an iterative reasoning process.
We show our framework has several useful properties that are inherently missing from traditional RL.
Our resulting algorithm achieves state-of-the-art performance in popular Mujoco and DeepMind Control benchmarks.
arXiv Detail & Related papers (2022-10-13T06:15:29Z)
- Optimal Admission Control for Multiclass Queues with Time-Varying Arrival Rates via State Abstraction [16.99621896314678]
We consider a novel queuing problem where the decision-maker must choose to accept or reject randomly arriving tasks.
The objective is to decide which tasks to accept so that the total price of tasks processed is maximised over a finite horizon.
We show that the optimal value function has a specific structure, which enables us to solve the hybrid MDP exactly.
arXiv Detail & Related papers (2022-03-14T12:38:13Z)
- Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
We investigate the active pure exploration problem in Markov Decision Processes.
The agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible.
arXiv Detail & Related papers (2021-06-05T09:16:28Z)
- Bandit Linear Optimization for Sequential Decision Making and Extensive-Form Games [102.23975166536326]
Tree-form sequential decision making (TFSDM) extends classical one-shot decision making by modeling tree-form interactions between an agent and a potentially adversarial environment.
It captures the online decision-making problems that each player faces in an extensive-form game, as well as Markov decision processes and partially-observable Markov decision processes where the agent conditions on observed history.
In this paper, we give the first algorithm for the bandit linear optimization problem for TFSDM that offers both (i) linear-time losses and (ii) $O(\sqrt{T})$ cumulative regret.
arXiv Detail & Related papers (2021-03-08T05:00:13Z)
- Adaptive Sampling for Best Policy Identification in Markov Decision Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z)