SlateFree: a Model-Free Decomposition for Reinforcement Learning with
Slate Actions
- URL: http://arxiv.org/abs/2209.01876v1
- Date: Mon, 5 Sep 2022 10:15:16 GMT
- Title: SlateFree: a Model-Free Decomposition for Reinforcement Learning with
Slate Actions
- Authors: Anastasios Giovanidis
- Abstract summary: We consider the problem of sequential recommendations, where at each step an agent proposes some slate of $N$ distinct items to a user.
We propose a novel model-free SARSA and Q-learning algorithm that performs $N$ parallel iterations per step, without any prior user knowledge.
- Score: 5.647516208808729
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We consider the problem of sequential recommendations, where at each step an
agent proposes some slate of $N$ distinct items to a user from a much larger
catalog of size $K \gg N$. The user has unknown preferences towards the
recommendations and the agent takes sequential actions that optimise (in our
case minimise) some user-related cost, with the help of Reinforcement Learning.
The number of possible item combinations for a slate is $\binom{K}{N}$, an enormous
number rendering value iteration methods intractable. We prove that the
slate-MDP can actually be decomposed using just $K$ item-related $Q$ functions
per state, which describe the problem in a more compact and efficient way.
Based on this, we propose a novel model-free SARSA and Q-learning algorithm
that performs $N$ parallel iterations per step, without any prior user
knowledge. We call this method \texttt{SlateFree}, i.e. free-of-slates, and we
show numerically that it converges very fast to the exact optimum for arbitrary
user profiles, and that it outperforms alternatives from the literature.
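
The decomposition idea in the abstract can be illustrated with a minimal tabular sketch. It is not the paper's algorithm: the class `ItemQTable`, its method names, the epsilon-greedy exploration, and the exact form of the update target (observed cost plus the discounted average item value of the greedy next slate) are all assumptions made for illustration. What it does show is the storage of $K$ item values per state instead of $\binom{K}{N}$ slate values, and $N$ item updates per step.

```python
import math
import random


def num_slates(K, N):
    """Number of distinct unordered slates of N items chosen from K."""
    return math.comb(K, N)


class ItemQTable:
    """Hypothetical tabular learner: one Q-value per (state, item) pair."""

    def __init__(self, num_states, K, N, alpha=0.1, gamma=0.9,
                 epsilon=0.1, seed=0):
        self.K, self.N = K, N
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon
        # K values per state, rather than comb(K, N) values per state.
        self.q = [[0.0] * K for _ in range(num_states)]
        self.rng = random.Random(seed)

    def greedy_slate(self, state):
        # Costs are minimised, so take the N items with the smallest Q-value.
        ranked = sorted(range(self.K), key=lambda i: self.q[state][i])
        return ranked[:self.N]

    def select_slate(self, state):
        # Assumed exploration rule: epsilon-greedy over whole slates.
        if self.rng.random() < self.epsilon:
            return self.rng.sample(range(self.K), self.N)
        return self.greedy_slate(state)

    def update(self, state, slate, cost, next_state):
        # Assumed Q-learning-style target: observed cost plus the discounted
        # average item value of the greedy slate in the next state.
        next_slate = self.greedy_slate(next_state)
        next_value = sum(self.q[next_state][j] for j in next_slate) / self.N
        target = cost + self.gamma * next_value
        # One update per item in the proposed slate (N updates per step).
        for i in slate:
            self.q[state][i] += self.alpha * (target - self.q[state][i])
```

For example, with $K=100$ and $N=5$, `num_slates(100, 5)` gives 75,287,520 slates, whereas the table above stores only 100 values per state.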
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Sample Efficient Reinforcement Learning with Partial Dynamics Knowledge [0.704590071265998]
We study the sample complexity of online Q-learning methods when some prior knowledge about the dynamics is available or can be learned efficiently.
We present an optimistic Q-learning algorithm that achieves $\tilde{\mathcal{O}}(\text{Poly}(H)\sqrt{SAT})$ regret under perfect knowledge of $f$.
arXiv Detail & Related papers (2023-12-19T19:53:58Z) - An Oblivious Stochastic Composite Optimization Algorithm for Eigenvalue
Optimization Problems [76.2042837251496]
We introduce two oblivious mirror descent algorithms based on a complementary composite setting.
Remarkably, both algorithms work without prior knowledge of the Lipschitz constant or smoothness of the objective function.
We show how to extend our framework to scale and demonstrate the efficiency and robustness of our methods on large scale semidefinite programs.
arXiv Detail & Related papers (2023-06-30T08:34:29Z) - Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic
Shortest Path [106.37656068276902]
We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem.
We derive complexity bounds when the learner has access to a generative model.
We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_\star$.
arXiv Detail & Related papers (2022-10-10T18:34:32Z) - A Spectral Approach to Item Response Theory [6.5268245109828005]
We propose a new item estimation algorithm for the Rasch model.
The core of our algorithm is the computation of the stationary distribution of a Markov chain defined on an item-item graph.
Experiments on synthetic and real-life datasets show that our algorithm is scalable, accurate, and competitive with the most commonly used methods in the literature.
arXiv Detail & Related papers (2022-10-09T18:57:08Z) - Best Policy Identification in Linear MDPs [70.57916977441262]
We investigate the problem of best policy identification in discounted linear Markov Decision Processes in the fixed confidence setting under a generative model.
The lower bound, given as the solution of an intricate non-convex optimization program, can be used as the starting point to devise such algorithms.
arXiv Detail & Related papers (2022-08-11T04:12:50Z) - Robust Methods for High-Dimensional Linear Learning [0.0]
We propose statistically robust and computationally efficient linear learning methods in the high-dimensional batch setting.
We instantiate our framework on several applications including vanilla sparse, group-sparse and low-rank matrix recovery.
For vanilla $s$-sparsity, we are able to reach the $s\log(d)/n$ rate under heavy-tails and $\eta$-corruption.
arXiv Detail & Related papers (2022-08-10T17:00:41Z) - Multi-block-Single-probe Variance Reduced Estimator for Coupled
Compositional Optimization [49.58290066287418]
We propose a novel method named Multi-block-Single-probe Variance Reduced (MSVR) to alleviate the complexity of compositional problems.
Our results improve upon prior ones in several aspects, including the order of sample complexities and dependence on strong convexity.
arXiv Detail & Related papers (2022-07-18T12:03:26Z) - Multinomial Logit Contextual Bandits: Provable Optimality and
Practicality [15.533842336139063]
We consider a sequential assortment selection problem where the user choice is given by a multinomial logit (MNL) choice model whose parameters are unknown.
We propose upper confidence bound based algorithms for this MNL contextual bandit.
We show that a simple variant of the algorithm achieves the optimal regret for a broad class of important applications.
arXiv Detail & Related papers (2021-03-25T15:42:25Z) - Sparse Regression at Scale: Branch-and-Bound rooted in First-Order
Optimization [6.037383467521294]
We present a new exact MIP framework for $\ell_0$ regularized regression.
Our framework can scale to $p \sim 10^7$, achieving speedups of at least $5000$x.
We open source the implementation through our toolkit L0BnB.
arXiv Detail & Related papers (2020-04-13T18:45:29Z)