Combinatorial Pure Exploration with Full-Bandit or Partial Linear
Feedback
- URL: http://arxiv.org/abs/2006.07905v2
- Date: Tue, 15 Dec 2020 05:59:24 GMT
- Title: Combinatorial Pure Exploration with Full-Bandit or Partial Linear
Feedback
- Authors: Yihan Du, Yuko Kuroki, Wei Chen
- Abstract summary: We first study the problem of combinatorial pure exploration with full-bandit feedback (CPE-BL).
In the generalization to partial linear feedback (CPE-PL), each pull of action $x$ reports a random feedback vector with expectation of $M_x \theta$, where $M_x \in \mathbb{R}^{m_x \times d}$ is a transformation matrix for $x$, and gains a random (possibly nonlinear) reward related to $x$.
For CPE-PL, we develop the first {\em polynomial-time} algorithm, which simultaneously addresses limited feedback, general reward function and combinatorial action space.
- Score: 18.29738891417779
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we first study the problem of combinatorial pure exploration
with full-bandit feedback (CPE-BL), where a learner is given a combinatorial
action space $\mathcal{X} \subseteq \{0,1\}^d$, and in each round the learner
pulls an action $x \in \mathcal{X}$ and receives a random reward with
expectation $x^{\top} \theta$, with $\theta \in \mathbb{R}^d$ a latent and
unknown environment vector. The objective is to identify the optimal action
with the highest expected reward, using as few samples as possible. For CPE-BL,
we design the first {\em polynomial-time adaptive} algorithm, whose sample
complexity matches the lower bound (within a logarithmic factor) for a family
of instances and has a light dependence on $\Delta_{\min}$ (the smallest gap
between the optimal action and sub-optimal actions). Furthermore, we propose a
novel generalization of CPE-BL with flexible feedback structures, called
combinatorial pure exploration with partial linear feedback (CPE-PL), which
encompasses several families of sub-problems including full-bandit feedback,
semi-bandit feedback, partial feedback and nonlinear reward functions. In
CPE-PL, each pull of action $x$ reports a random feedback vector with
expectation of $M_{x} \theta $, where $M_x \in \mathbb{R}^{m_x \times d}$ is a
transformation matrix for $x$, and gains a random (possibly nonlinear) reward
related to $x$. For CPE-PL, we develop the first {\em polynomial-time}
algorithm, which simultaneously addresses limited feedback, general reward
function and combinatorial action space, and provide its sample complexity
analysis. Our empirical evaluation demonstrates that our algorithms run orders
of magnitude faster than the existing ones, and our CPE-BL algorithm is robust
across different $\Delta_{\min}$ settings while our CPE-PL algorithm is the
only one returning correct answers for nonlinear reward functions.
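To make the two feedback models concrete, below is a minimal Python sketch (an illustration under assumptions, not the paper's algorithms): it uses numpy, a small toy instance, Gaussian noise, and a naive uniform-allocation baseline in place of the adaptive CPE-BL algorithm. A full-bandit pull returns a single scalar with expectation $x^{\top} \theta$, while a partial-linear pull returns a vector with expectation $M_x \theta$, here instantiated with a semi-bandit transformation matrix.

```python
# Minimal sketch of the CPE-BL and CPE-PL feedback models (illustrative
# assumptions only: a small toy instance, Gaussian noise, and a naive
# uniform-allocation baseline instead of the paper's adaptive algorithm).
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)

d = 5
theta = rng.normal(size=d)          # latent environment vector, unknown to the learner

# Combinatorial action space X subset of {0,1}^d: here, all subsets of size 2.
actions = [np.array([1.0 if i in s else 0.0 for i in range(d)])
           for s in combinations(range(d), 2)]

def pull_full_bandit(x):
    """CPE-BL: a single scalar observation with expectation x^T theta."""
    return x @ theta + rng.normal(scale=0.1)

def semi_bandit_matrix(x):
    """One choice of M_x (an m_x-by-d matrix): observe the coordinates selected by x."""
    idx = np.flatnonzero(x)
    M = np.zeros((len(idx), len(x)))
    M[np.arange(len(idx)), idx] = 1.0
    return M

def pull_partial_linear(x, M_x):
    """CPE-PL: a feedback vector with expectation M_x theta."""
    return M_x @ theta + rng.normal(scale=0.1, size=M_x.shape[0])

# Naive (non-adaptive) identification under full-bandit feedback: pull every
# action T times, regress mean rewards on the action vectors, and report the
# empirically best action.  The paper's adaptive algorithm needs far fewer
# samples; this baseline only makes the feedback model concrete.
T = 200
X = np.stack(actions)
y = np.array([np.mean([pull_full_bandit(x) for _ in range(T)]) for x in actions])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated best action:", actions[int(np.argmax(X @ theta_hat))])
print("true best action:     ", actions[int(np.argmax(X @ theta))])
print("example CPE-PL feedback:", pull_partial_linear(actions[0], semi_bandit_matrix(actions[0])))
```

The uniform-allocation baseline above is deliberately simple; the point of the paper's adaptive algorithm is to allocate pulls so that the sample complexity nearly matches the lower bound and depends only lightly on $\Delta_{\min}$.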
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Thompson Sampling for Real-Valued Combinatorial Pure Exploration of
Multi-Armed Bandit [65.268245109828]
We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem.
We introduce an algorithm named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that can work even when the size of the action set is exponentially large in $d$.
arXiv Detail & Related papers (2023-08-20T11:56:02Z) - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs)
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2 \varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Introducing the Expohedron for Efficient Pareto-optimal Fairness-Utility
Amortizations in Repeated Rankings [9.066817876491053]
We consider the problem of computing a sequence of rankings that maximizes consumer-side utility while minimizing producer-side individual unfairness of exposure.
We introduce a geometrical object, a polytope that we call expohedron, whose points represent all achievable exposures of items for a Position Based Model.
Our approach compares favorably to linear or quadratic programming baselines in terms of algorithmic complexity and empirical runtime.
Our solution can be expressed as a distribution over only $n$ permutations, instead of the $(n-1)^2 + 1$ achieved with BvN decompositions.
arXiv Detail & Related papers (2022-02-07T14:43:35Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs)
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d (H + d) \epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
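As an illustration of the randomized exploration idea summarized in the last entry above, here is a minimal sketch (an assumption-based toy, not the authors' algorithm, and using a simple linear reward model rather than an episodic MDP): i.i.d. scalar Gaussian noise is added to the regression targets before each regularized least-squares fit, so acting greedily with respect to the perturbed estimate already induces exploration.

```python
# Minimal sketch (not the authors' algorithm) of RLSVI-style exploration via
# target perturbation, illustrated on a simple linear reward model: i.i.d.
# Gaussian noise is added to the observed targets before each regularized
# least-squares fit, so acting greedily w.r.t. the perturbed estimate is
# itself a randomized exploration strategy.
import numpy as np

rng = np.random.default_rng(1)
d, n_actions, sigma, lam = 4, 20, 0.1, 1.0
theta = rng.normal(size=d)                  # unknown reward parameter (assumed linear model)
features = rng.normal(size=(n_actions, d))  # one feature vector per action

X_hist, y_hist = [], []
for t in range(200):
    if X_hist:
        X = np.asarray(X_hist)
        # Perturb the training targets with i.i.d. scalar noise, then solve
        # ridge-regularized least squares on the perturbed data.
        y_pert = np.asarray(y_hist) + rng.normal(scale=sigma, size=len(y_hist))
        theta_tilde = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y_pert)
    else:
        theta_tilde = rng.normal(size=d)    # arbitrary randomized initial estimate
    a = int(np.argmax(features @ theta_tilde))
    reward = features[a] @ theta + rng.normal(scale=sigma)
    X_hist.append(features[a])
    y_hist.append(reward)

print("action chosen most recently:", a, "| true best action:", int(np.argmax(features @ theta)))
```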