Multi-armed Bandit Algorithms on System-on-Chip: Go Frequentist or
Bayesian?
- URL: http://arxiv.org/abs/2106.02855v1
- Date: Sat, 5 Jun 2021 10:07:31 GMT
- Title: Multi-armed Bandit Algorithms on System-on-Chip: Go Frequentist or
Bayesian?
- Authors: S. V. Sai Santosh and Sumit J. Darak
- Abstract summary: Multi-armed Bandit (MAB) algorithms identify the best arm among multiple arms.
We propose a reconfigurable and intelligent MAB (RI-MAB) framework.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-armed Bandit (MAB) algorithms identify the best arm among multiple arms
via exploration-exploitation trade-off without prior knowledge of arm
statistics. Their usefulness in wireless radio, IoT, and robotics demands
deployment on edge devices; hence, a mapping onto a system-on-chip (SoC) is
desired. Theoretically, the Bayesian approach-based Thompson Sampling (TS)
algorithm offers better performance than the frequentist approach-based Upper
Confidence Bound (UCB) algorithm. However, TS is not synthesizable due to the
Beta function. We address this problem by approximating it via a pseudo-random
number generator-based approach and efficiently realize the TS algorithm on a
Zynq SoC. In practice, the type of arm distribution (e.g., Bernoulli or
Gaussian) is unknown, and hence a single algorithm may not be optimal. We
propose a reconfigurable and intelligent MAB (RI-MAB) framework. Here,
intelligence enables the identification of the appropriate MAB algorithm for a
given environment, and reconfigurability allows on-the-fly switching between
algorithms on the SoC. This eliminates the need for parallel implementations of
the algorithms, resulting in substantial savings in resources and power
consumption. We analyze the functional correctness, area, power, and execution
time of the proposed and existing architectures for various arm distributions,
word-lengths, and hardware-software co-design approaches. We demonstrate the
superiority of RI-MAB over TS-only and UCB-only architectures.
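To make the abstract's two key ideas concrete, here is a minimal software sketch (Python, not the authors' hardware description): a Beta draw built only from uniform PRNG outputs, which is one standard way to realize TS without evaluating the Beta function, followed by a simple selector that switches between TS and UCB the way RI-MAB reconfigures a single datapath. The Gamma-ratio construction and the `ri_mab_select` interface are illustrative assumptions, not the paper's exact scheme.

```python
import math
import random


def beta_sample_prng(successes, failures, rng=random):
    """Draw from Beta(successes + 1, failures + 1) using only uniform
    PRNG outputs. For an integer shape k, Gamma(k, 1) is a sum of k
    Exp(1) variables (-log of uniforms), and Beta(a, b) equals
    Ga / (Ga + Gb). No Beta-function evaluation is needed, which is
    the property that makes a PRNG-based realization synthesizable."""
    a, b = successes + 1, failures + 1
    ga = -sum(math.log(1.0 - rng.random()) for _ in range(a))
    gb = -sum(math.log(1.0 - rng.random()) for _ in range(b))
    return ga / (ga + gb)


def ts_select(successes, failures):
    """Thompson Sampling for Bernoulli arms: sample each arm's
    posterior and play the arm with the largest draw."""
    draws = [beta_sample_prng(s, f) for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)


def ucb_select(means, counts, t):
    """UCB1: play each arm once, then the arm with the largest
    optimistic index (empirical mean plus exploration bonus)."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    return max(range(len(means)),
               key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]))


def ri_mab_select(use_ts, successes, failures, means, counts, t):
    """RI-MAB-style dispatch: only one selector is active at a time,
    so a single datapath suffices instead of parallel TS and UCB cores."""
    if use_ts:
        return ts_select(successes, failures)
    return ucb_select(means, counts, t)
```

In RI-MAB itself the switch would be driven online by the intelligence module from observed rewards; here `use_ts` is simply a flag standing in for that decision.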
Related papers
- GreedyML: A Parallel Algorithm for Maximizing Submodular Functions [2.9998889086656586]
We describe a parallel approximation algorithm for maximizing monotone submodular functions on distributed memory multiprocessors.
Our work is motivated by the need to solve submodular optimization problems on massive data sets, for practical applications in areas such as data summarization, machine learning, and graph sparsification.
arXiv Detail & Related papers (2024-03-15T14:19:09Z) - Best Arm Identification with Fixed Budget: A Large Deviation Perspective [54.305323903582845]
We present sred, a truly adaptive algorithm that can reject arms in any round based on the observed empirical gaps between the rewards of various arms.
arXiv Detail & Related papers (2023-12-19T13:17:43Z) - A Fast Algorithm for PAC Combinatorial Pure Exploration [21.376800678915558]
We consider the problem of Combinatorial Pure Exploration (CPE), which deals with finding a set of arms with a high reward.
Previous algorithms for this problem are highly computationally intensive, making them impractical even for mildly large problems.
We propose a new CPE algorithm in the PAC setting, which is computationally lightweight and can easily be applied to problems with tens of thousands of arms.
arXiv Detail & Related papers (2021-12-08T09:52:56Z) - Reinforcement Learning for Finite-Horizon Restless Multi-Armed
Multi-Action Bandits [8.136957953239254]
We study a finite-horizon restless multi-armed bandit problem with multiple actions, dubbed R(MA)^2B.
The state of each arm evolves according to a controlled Markov decision process (MDP), and the reward of pulling an arm depends on both the current state of the corresponding MDP and the action taken.
Since finding the optimal policy is typically intractable, we propose a computationally appealing index policy which we call Occupancy-Measured-Reward Index Policy.
arXiv Detail & Related papers (2021-09-20T21:40:12Z) - Top-$k$ eXtreme Contextual Bandits with Arm Hierarchy [71.17938026619068]
We study the top-$k$ extreme contextual bandits problem, where the total number of arms can be enormous.
We first propose an algorithm for the non-extreme realizable setting, utilizing the Inverse Gap Weighting strategy (a minimal sketch of this rule appears after this list).
We show that our algorithm has a regret guarantee of $O(k\sqrt{(A-k+1)T\log(|\mathcal{F}|T)})$.
arXiv Detail & Related papers (2021-02-15T19:10:52Z) - Joint Deep Reinforcement Learning and Unfolding: Beam Selection and
Precoding for mmWave Multiuser MIMO with Lens Arrays [54.43962058166702]
Millimeter wave (mmWave) multiuser multiple-input multiple-output (MU-MIMO) systems with discrete lens arrays (DLA) have received great attention.
In this work, we investigate the joint design of a beam precoding matrix for mmWave MU-MIMO systems with DLA.
arXiv Detail & Related papers (2021-01-05T03:55:04Z) - Resource Allocation in Multi-armed Bandit Exploration: Overcoming
Sublinear Scaling with Adaptive Parallelism [107.48538091418412]
We study exploration in multi-armed bandits when we have access to a divisible resource that can be allocated in varying amounts to arm pulls.
We focus in particular on the allocation of distributed computing resources, where we may obtain results faster by allocating more resources per pull.
arXiv Detail & Related papers (2020-10-31T18:19:29Z) - Adaptive Sampling for Best Policy Identification in Markov Decision
Processes [79.4957965474334]
We investigate the problem of best-policy identification in discounted Markov Decision Processes (MDPs) when the learner has access to a generative model.
The advantages of state-of-the-art algorithms are discussed and illustrated.
arXiv Detail & Related papers (2020-09-28T15:22:24Z) - Model Selection in Contextual Stochastic Bandit Problems [51.94632035240787]
We develop a meta-algorithm that selects between base algorithms.
We show through a lower bound that even when one of the base algorithms has $O(\sqrt{T})$ regret, in general it is impossible to get better than $\Omega(\sqrt{T})$ regret.
arXiv Detail & Related papers (2020-03-03T18:46:34Z) - Intelligent and Reconfigurable Architecture for KL Divergence Based
Online Machine Learning Algorithm [0.0]
Online machine learning (OML) algorithms do not need any training phase and can be deployed directly in an unknown environment.
arXiv Detail & Related papers (2020-02-18T16:39:57Z)
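The Inverse Gap Weighting strategy mentioned in the Top-$k$ eXtreme Contextual Bandits entry above admits a compact sketch. The version below is the standard single-action IGW rule from the contextual bandits literature, given only as background: the paper's actual algorithm extends it to selecting $k$ of an enormous number of arms via an arm hierarchy, which this sketch does not attempt. Function names and the `gamma` parameter are illustrative.

```python
import random


def igw_probs(scores, gamma):
    """Inverse Gap Weighting: convert reward estimates into sampling
    probabilities. Each suboptimal action gets probability inversely
    proportional to its gap from the best estimate (scaled by gamma >= 0),
    and the greedy action absorbs the remaining mass."""
    n = len(scores)
    best = max(range(n), key=scores.__getitem__)
    probs = [0.0] * n
    for a in range(n):
        if a != best:
            probs[a] = 1.0 / (n + gamma * (scores[best] - scores[a]))
    probs[best] = 1.0 - sum(probs)  # always >= 1/n, so this is valid
    return probs


def igw_sample(scores, gamma, rng=random):
    """Sample one action according to the IGW distribution."""
    return rng.choices(range(len(scores)), weights=igw_probs(scores, gamma))[0]
```

Here `scores` plays the role of the regressor's reward estimates for the current context; larger `gamma` concentrates play on the greedy action while keeping every action's probability strictly positive.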
This list is automatically generated from the titles and abstracts of the papers in this site.