Combinatorial Pure Exploration with Full-Bandit or Partial Linear
Feedback
- URL: http://arxiv.org/abs/2006.07905v2
- Date: Tue, 15 Dec 2020 05:59:24 GMT
- Title: Combinatorial Pure Exploration with Full-Bandit or Partial Linear
Feedback
- Authors: Yihan Du, Yuko Kuroki, Wei Chen
- Abstract summary: We first study the problem of combinatorial pure exploration with full-bandit feedback (CPE-BL).
In the generalization to partial linear feedback (CPE-PL), each pull of action $x$ reports a random feedback vector with expectation of $M_x \theta$, where $M_x \in \mathbb{R}^{m_x \times d}$ is a transformation matrix for $x$, and gains a random (possibly nonlinear) reward related to $x$.
For CPE-PL, we develop the first {\em polynomial-time} algorithm, which simultaneously addresses limited feedback, general reward function and combinatorial action space.
- Score: 18.29738891417779
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we first study the problem of combinatorial pure exploration
with full-bandit feedback (CPE-BL), where a learner is given a combinatorial
action space $\mathcal{X} \subseteq \{0,1\}^d$, and in each round the learner
pulls an action $x \in \mathcal{X}$ and receives a random reward with
expectation $x^{\top} \theta$, with $\theta \in \mathbb{R}^d$ a latent and
unknown environment vector. The objective is to identify the optimal action
with the highest expected reward, using as few samples as possible. For CPE-BL,
we design the first {\em polynomial-time adaptive} algorithm, whose sample
complexity matches the lower bound (within a logarithmic factor) for a family
of instances and has a light dependence on $\Delta_{\min}$ (the smallest gap
between the optimal action and sub-optimal actions). Furthermore, we propose a
novel generalization of CPE-BL with flexible feedback structures, called
combinatorial pure exploration with partial linear feedback (CPE-PL), which
encompasses several families of sub-problems including full-bandit feedback,
semi-bandit feedback, partial feedback and nonlinear reward functions. In
CPE-PL, each pull of action $x$ reports a random feedback vector with
expectation of $M_{x} \theta $, where $M_x \in \mathbb{R}^{m_x \times d}$ is a
transformation matrix for $x$, and gains a random (possibly nonlinear) reward
related to $x$. For CPE-PL, we develop the first {\em polynomial-time}
algorithm, which simultaneously addresses limited feedback, general reward
function and combinatorial action space, and provide its sample complexity
analysis. Our empirical evaluation demonstrates that our algorithms run orders
of magnitude faster than the existing ones, and our CPE-BL algorithm is robust
across different $\Delta_{\min}$ settings while our CPE-PL algorithm is the
only one returning correct answers for nonlinear reward functions.
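To make the two feedback models concrete, below is a minimal Python sketch (an illustration under assumptions, not the paper's algorithms): it uses numpy, a small toy instance, Gaussian noise, and a naive uniform-allocation baseline in place of the adaptive CPE-BL algorithm. A full-bandit pull returns a single scalar with expectation $x^{\top} \theta$, while a partial-linear pull returns a vector with expectation $M_x \theta$, here instantiated with a semi-bandit transformation matrix.

```python
# Minimal sketch of the CPE-BL and CPE-PL feedback models (illustrative
# assumptions only: a small toy instance, Gaussian noise, and a naive
# uniform-allocation baseline instead of the paper's adaptive algorithm).
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)

d = 5
theta = rng.normal(size=d)          # latent environment vector, unknown to the learner

# Combinatorial action space X subset of {0,1}^d: here, all subsets of size 2.
actions = [np.array([1.0 if i in s else 0.0 for i in range(d)])
           for s in combinations(range(d), 2)]

def pull_full_bandit(x):
    """CPE-BL: a single scalar observation with expectation x^T theta."""
    return x @ theta + rng.normal(scale=0.1)

def semi_bandit_matrix(x):
    """One choice of M_x (an m_x-by-d matrix): observe the coordinates selected by x."""
    idx = np.flatnonzero(x)
    M = np.zeros((len(idx), len(x)))
    M[np.arange(len(idx)), idx] = 1.0
    return M

def pull_partial_linear(x, M_x):
    """CPE-PL: a feedback vector with expectation M_x theta."""
    return M_x @ theta + rng.normal(scale=0.1, size=M_x.shape[0])

# Naive (non-adaptive) identification under full-bandit feedback: pull every
# action T times, regress mean rewards on the action vectors, and report the
# empirically best action.  The paper's adaptive algorithm needs far fewer
# samples; this baseline only makes the feedback model concrete.
T = 200
X = np.stack(actions)
y = np.array([np.mean([pull_full_bandit(x) for _ in range(T)]) for x in actions])
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print("estimated best action:", actions[int(np.argmax(X @ theta_hat))])
print("true best action:     ", actions[int(np.argmax(X @ theta))])
print("example CPE-PL feedback:", pull_partial_linear(actions[0], semi_bandit_matrix(actions[0])))
```

The uniform-allocation baseline above is deliberately simple; the point of the paper's adaptive algorithm is to allocate pulls so that the sample complexity nearly matches the lower bound and depends only lightly on $\Delta_{\min}$.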
Related papers
- Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Nearly Minimax Optimal Regret for Learning Linear Mixture Stochastic
Shortest Path [80.60592344361073]
We study the Stochastic Shortest Path (SSP) problem with a linear mixture transition kernel.
An agent repeatedly interacts with an environment and seeks to reach a certain goal state while minimizing the cumulative cost.
Existing works often assume a strictly positive lower bound of the iteration cost function or an upper bound of the expected length for the optimal policy.
arXiv Detail & Related papers (2024-02-14T07:52:00Z) - Thompson Sampling for Real-Valued Combinatorial Pure Exploration of
Multi-Armed Bandit [65.268245109828]
We study the real-valued combinatorial pure exploration of the multi-armed bandit (R-CPE-MAB) problem.
We introduce an algorithm named the Generalized Thompson Sampling Explore (GenTS-Explore) algorithm, which is the first algorithm that can work even when the size of the action set is exponentially large in $d$.
arXiv Detail & Related papers (2023-08-20T11:56:02Z) - Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs)
At the core of our algorithm is uncertainty-weighted value-targeted regression with exploration-driven pseudo-reward.
We show that our algorithm only needs to explore $\tilde{O}(d^2 \varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
arXiv Detail & Related papers (2023-03-17T17:53:28Z) - Introducing the Expohedron for Efficient Pareto-optimal Fairness-Utility
Amortizations in Repeated Rankings [9.066817876491053]
We consider the problem of computing a sequence of rankings that maximizes consumer-side utility while minimizing producer-side individual unfairness of exposure.
We introduce a geometrical object, a polytope that we call expohedron, whose points represent all achievable exposures of items for a Position Based Model.
Our approach compares favorably to linear or quadratic programming baselines in terms of algorithmic complexity and empirical runtime.
Our solution can be expressed as a distribution over only $n$ permutations, instead of the $(n-1)^2 + 1$ achieved with BvN decompositions.
arXiv Detail & Related papers (2022-02-07T14:43:35Z) - Reward-Free Model-Based Reinforcement Learning with Linear Function
Approximation [92.99933928528797]
We study model-based reward-free reinforcement learning with linear function approximation for episodic Markov decision processes (MDPs)
In the planning phase, the agent is given a specific reward function and uses samples collected from the exploration phase to learn a good policy.
We show that to obtain an $\epsilon$-optimal policy for an arbitrary reward function, UCRL-RFE needs to sample at most $\tilde{O}(H^4 d (H + d) \epsilon^{-2})$ episodes.
arXiv Detail & Related papers (2021-10-12T23:03:58Z) - Randomized Exploration for Reinforcement Learning with General Value
Function Approximation [122.70803181751135]
We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm.
Our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises.
We complement the theory with an empirical evaluation across known difficult exploration tasks.
arXiv Detail & Related papers (2021-06-15T02:23:07Z)
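As an illustration of the randomized exploration idea summarized in the last entry above, here is a minimal sketch (an assumption-based toy, not the authors' algorithm, and using a simple linear reward model rather than an episodic MDP): i.i.d. scalar Gaussian noise is added to the regression targets before each regularized least-squares fit, so acting greedily with respect to the perturbed estimate already induces exploration.

```python
# Minimal sketch (not the authors' algorithm) of RLSVI-style exploration via
# target perturbation, illustrated on a simple linear reward model: i.i.d.
# Gaussian noise is added to the observed targets before each regularized
# least-squares fit, so acting greedily w.r.t. the perturbed estimate is
# itself a randomized exploration strategy.
import numpy as np

rng = np.random.default_rng(1)
d, n_actions, sigma, lam = 4, 20, 0.1, 1.0
theta = rng.normal(size=d)                  # unknown reward parameter (assumed linear model)
features = rng.normal(size=(n_actions, d))  # one feature vector per action

X_hist, y_hist = [], []
for t in range(200):
    if X_hist:
        X = np.asarray(X_hist)
        # Perturb the training targets with i.i.d. scalar noise, then solve
        # ridge-regularized least squares on the perturbed data.
        y_pert = np.asarray(y_hist) + rng.normal(scale=sigma, size=len(y_hist))
        theta_tilde = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y_pert)
    else:
        theta_tilde = rng.normal(size=d)    # arbitrary randomized initial estimate
    a = int(np.argmax(features @ theta_tilde))
    reward = features[a] @ theta + rng.normal(scale=sigma)
    X_hist.append(features[a])
    y_hist.append(reward)

print("action chosen most recently:", a, "| true best action:", int(np.argmax(features @ theta)))
```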