Hierarchical Adaptive Contextual Bandits for Resource Constraint based
Recommendation
- URL: http://arxiv.org/abs/2004.01136v2
- Date: Mon, 6 Apr 2020 16:56:29 GMT
- Title: Hierarchical Adaptive Contextual Bandits for Resource Constraint based
Recommendation
- Authors: Mengyue Yang, Qingyang Li, Zhiwei Qin, Jieping Ye
- Abstract summary: Contextual multi-armed bandits (MAB) achieve cutting-edge performance on a variety of problems.
In this paper, we propose a hierarchical adaptive contextual bandit method (HATCH) to conduct the policy learning of contextual bandits with a budget constraint.
- Score: 49.69139684065241
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contextual multi-armed bandits (MAB) achieve cutting-edge performance
on a variety of problems. When it comes to real-world scenarios such as
recommendation systems and online advertising, however, it is essential to
consider the resource consumption of exploration. In practice, there is
typically a non-zero cost associated with executing a recommendation (arm) in
the environment, and hence the policy should be learned under a fixed
exploration cost constraint. Learning a globally optimal policy directly is
challenging, since the problem is NP-hard and significantly complicates the
exploration-exploitation trade-off of bandit algorithms. Existing approaches
adopt a greedy policy: they estimate the expected reward and cost of each arm
from historical observations and greedily select the arm with the highest
expected reward/cost ratio until the exploration resource is exhausted.
However, such methods are hard to extend to an infinite time horizon, since
the learning process terminates once the resource runs out. In this paper, we
propose a hierarchical adaptive contextual bandit method (HATCH) for policy
learning of contextual bandits under a budget constraint. HATCH adaptively
allocates the exploration resource based on the remaining resource/time and
the estimated reward distributions across different user contexts. In
addition, we utilize the full contextual feature information to find the best
personalized recommendation. Finally, we present a regret analysis and prove
that HATCH achieves a regret bound as low as $O(\sqrt{T})$. Experimental
results demonstrate the effectiveness and efficiency of the proposed method on
both synthetic data sets and real-world applications.
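For readers skimming the setup, the constrained objective described in the abstract can be written, in our own notation (the paper's may differ), as finding a policy $\pi$ that maximizes $\mathbb{E}[\sum_{t=1}^{T} r_t(x_t, \pi(x_t))]$ subject to $\sum_{t=1}^{T} c_t(x_t, \pi(x_t)) \le B$, where $x_t$ is the user context at round $t$, $r_t$ and $c_t$ are the reward and cost of the selected arm, and $B$ is the total exploration budget. The greedy baseline ranks arms by the estimated ratio $\hat{r}(x_t, a)/\hat{c}(x_t, a)$ at each round and stops once $B$ is spent.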
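For concreteness, below is a minimal simulation sketch contrasting the two ideas the abstract describes: the greedy reward/cost-ratio rule and a HATCH-flavored pacing gate that modulates spending by the remaining budget and time. The linear estimators, the median-based gate, and every name in the snippet are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

# Illustrative budget-constrained contextual bandit loop: greedy
# reward/cost-ratio arm selection plus an adaptive spending gate.
# All modeling choices here are assumptions for exposition, NOT the
# paper's HATCH algorithm.

rng = np.random.default_rng(0)
n_arms, dim = 5, 8
T, budget = 2000, 300.0

# Hidden environment parameters (simulation only).
theta_r_true = rng.normal(size=(n_arms, dim))
theta_c_true = rng.uniform(0.05, 0.5, size=(n_arms, dim))

# Per-arm ridge-regression statistics for reward and cost estimates.
A = np.stack([np.eye(dim) for _ in range(n_arms)])
b_r = np.zeros((n_arms, dim))
b_c = np.zeros((n_arms, dim))

remaining = budget
for t in range(T):
    if remaining <= 0:
        break  # the greedy baseline terminates here; HATCH paces instead
    x = rng.uniform(0.0, 1.0, size=dim)  # user context

    theta_r = np.stack([np.linalg.solve(A[a], b_r[a]) for a in range(n_arms)])
    theta_c = np.stack([np.linalg.solve(A[a], b_c[a]) for a in range(n_arms)])
    r_hat = theta_r @ x
    c_hat = np.clip(theta_c @ x, 1e-3, None)  # keep estimated cost positive
    ratio = r_hat / c_hat
    a = int(np.argmax(ratio))  # greedy reward/cost-ratio choice

    # Adaptive gate (our reading of "allocate the exploration resource based
    # on the remaining resource/time"): compare the actual remaining budget
    # with what a uniform spending schedule would have left.
    pace = remaining / max(budget * (T - t) / T, 1e-9)
    if pace < 1.0 and ratio[a] < np.median(ratio) / pace:
        continue  # behind schedule and no clearly good arm: save the budget

    reward = float(x @ theta_r_true[a] + 0.1 * rng.normal())
    cost = float(x @ theta_c_true[a])
    remaining -= cost

    A[a] += np.outer(x, x)  # rank-one update of the design matrix
    b_r[a] += reward * x
    b_c[a] += cost * x
```

Note the design intent of the gate: when spending is ahead of schedule it is inactive and the policy behaves greedily; as the budget tightens, the required ratio rises, so exploration is reserved for the most promising contexts.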
Related papers
- A Risk-Averse Framework for Non-Stationary Stochastic Multi-Armed Bandits [0.0]
In areas of high volatility like healthcare or finance, a naive reward approach often does not accurately capture the complexity of the learning problem.
We propose a framework of adaptive risk-aware strategies that operate in non-stationary environments.
arXiv Detail & Related papers (2023-10-24T19:29:13Z)
- A Bandit Approach to Online Pricing for Heterogeneous Edge Resource Allocation [8.089950414444115]
Two novel online pricing mechanisms are proposed for heterogeneous edge resource allocation.
The mechanisms operate in real-time and do not require prior knowledge of demand distribution.
The proposed posted pricing schemes allow users to select and pay for their preferred resources, with the platform dynamically adjusting resource prices based on observed historical data.
arXiv Detail & Related papers (2023-02-14T10:21:14Z)
- Improved Policy Evaluation for Randomized Trials of Algorithmic Resource Allocation [54.72195809248172]
We present a new estimator based on a novel concept: retrospectively reshuffling participants across experimental arms at the end of an RCT.
We prove theoretically that such an estimator is more accurate than common estimators based on sample means.
arXiv Detail & Related papers (2023-02-06T05:17:22Z)
- COptiDICE: Offline Constrained Reinforcement Learning via Stationary Distribution Correction Estimation [73.17078343706909]
We study the offline constrained reinforcement learning (RL) problem, in which the agent aims to compute a policy that maximizes expected return while satisfying given cost constraints, learning only from a pre-collected dataset.
We present an offline constrained RL algorithm that optimizes the policy in the space of stationary distributions.
Our algorithm, COptiDICE, directly estimates the stationary distribution corrections of the optimal policy with respect to returns, while constraining the cost upper bound, with the goal of yielding a cost-conservative policy for actual constraint satisfaction.
arXiv Detail & Related papers (2022-04-19T15:55:47Z)
- Online Allocation with Two-sided Resource Constraints [44.5635910908944]
We consider an online allocation problem subject to lower and upper resource constraints, where the requests arrive sequentially.
We propose a new algorithm that obtains a $1-O(\frac{\epsilon}{\alpha-\epsilon})$ competitive ratio against the offline benchmark that knows the entire request sequence ahead of time.
arXiv Detail & Related papers (2021-12-28T02:21:06Z)
- Anti-Concentrated Confidence Bonuses for Scalable Exploration [57.91943847134011]
Intrinsic rewards play a central role in handling the exploration-exploitation trade-off.
We introduce anti-concentrated confidence bounds for efficiently approximating the elliptical bonus (the standard elliptical bonus is sketched after this list).
We develop a practical variant for deep reinforcement learning that is competitive with contemporary intrinsic rewards on Atari benchmarks.
arXiv Detail & Related papers (2021-10-21T15:25:15Z)
- Risk Minimization from Adaptively Collected Data: Guarantees for Supervised and Policy Learning [57.88785630755165]
Empirical risk minimization (ERM) is the workhorse of machine learning, but its model-agnostic guarantees can fail when we use adaptively collected data.
We study a generic importance-sampling-weighted ERM algorithm for using adaptively collected data to minimize the average of a loss function over a hypothesis class (a minimal sketch of the weighting appears after this list).
For policy learning, we provide rate-optimal regret guarantees that close an open gap in the existing literature whenever exploration decays to zero.
arXiv Detail & Related papers (2021-06-03T09:50:13Z)
- Distributionally Robust Batch Contextual Bandits [20.667213458836734]
Policy learning using historical observational data is an important problem that has found widespread applications.
Existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment.
In this paper, we lift this assumption and aim to learn a distributionally robust policy with incomplete observational data.
arXiv Detail & Related papers (2020-06-10T03:11:40Z)
- Average Reward Adjusted Discounted Reinforcement Learning: Near-Blackwell-Optimal Policies for Real-World Applications [0.0]
Reinforcement learning aims at finding the best stationary policy for a given Markov Decision Process.
This paper provides deep theoretical insights into the widely applied standard discounted reinforcement learning framework.
We establish a novel near-Blackwell-optimal reinforcement learning algorithm.
arXiv Detail & Related papers (2020-04-02T08:05:18Z)
- Cost-Sensitive Portfolio Selection via Deep Reinforcement Learning [100.73223416589596]
We propose a cost-sensitive portfolio selection method with deep reinforcement learning.
Specifically, a novel two-stream portfolio policy network is devised to extract both price series patterns and asset correlations.
A new cost-sensitive reward function is developed to maximize the accumulated return and constrain both costs via reinforcement learning.
arXiv Detail & Related papers (2020-03-06T06:28:17Z)
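On the anti-concentrated confidence bonus entry above: the "elliptical bonus" it approximates is, in the linear-bandit setting, usually the quantity $\beta\sqrt{x^\top A^{-1} x}$. The snippet below computes that exact LinUCB-style baseline quantity as a reference point only; it does not implement the paper's anti-concentrated approximation, and the matrix and vector values are illustrative.

```python
import numpy as np

# Reference computation of the elliptical confidence bonus
# beta * sqrt(x^T A^{-1} x). This is the standard quantity that the
# anti-concentrated-bonus entry above approximates at scale, not the
# paper's approximation scheme itself.

def elliptical_bonus(A: np.ndarray, x: np.ndarray, beta: float = 1.0) -> float:
    """Exploration bonus for features x under design matrix A."""
    return beta * float(np.sqrt(x @ np.linalg.solve(A, x)))

A = np.eye(4) + 0.1 * np.ones((4, 4))   # running design matrix (illustrative)
x = np.array([0.5, -0.2, 0.1, 0.9])
print(elliptical_bonus(A, x))           # larger in less-explored directions
```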
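On the risk-minimization entry above: importance-sampling-weighted ERM reweights each adaptively collected sample by the inverse of the probability with which its action was logged, so samples the logging policy under-explored count more. A minimal sketch, with the squared loss, data layout, and function name assumed for illustration:

```python
import numpy as np

# Minimal importance-sampling-weighted ERM on adaptively collected data:
# each sample is weighted by 1 / (logging propensity of its action). This
# is the plain (non-self-normalized) IS estimate, unbiased when the
# propensities are correct.

def is_weighted_erm_loss(preds, targets, propensities):
    """Average squared loss, reweighted by inverse logging propensities."""
    weights = 1.0 / np.asarray(propensities, dtype=float)
    losses = (np.asarray(preds, dtype=float)
              - np.asarray(targets, dtype=float)) ** 2
    return float(np.mean(weights * losses))

# Example: three samples logged with different action probabilities.
preds = [0.8, 0.1, 0.5]
targets = [1.0, 0.0, 0.4]
propensities = [0.9, 0.2, 0.5]  # logging probabilities of each action
print(is_weighted_erm_loss(preds, targets, propensities))
```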
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.