Can Q-learning solve Multi Armed Bantids?
- URL: http://arxiv.org/abs/2110.10934v1
- Date: Thu, 21 Oct 2021 07:08:30 GMT
- Title: Can Q-learning solve Multi Armed Bantids?
- Authors: Refael Vivanti
- Abstract summary: We show that current reinforcement learning algorithms are not capable of solving Multi-Armed-Bandit problems.
This stems from variance differences between policies, which causes two problems.
We propose the Adaptive Symmetric Reward Noising (ASRN) method, by which we mean equalizing the rewards variance across different policies.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract:   When a reinforcement learning (RL) method has to decide between several
optional policies solely by looking at the received reward, it has to
implicitly optimize a Multi-Armed-Bandit (MAB) problem. This raises the
question: are current RL algorithms capable of solving MAB problems? We claim
that the surprising answer is no. In our experiments we show that in some
situations they fail to solve a basic MAB problem, and in many common
situations they have a hard time: they suffer from regression in results during
training, sensitivity to initialization, and high sample complexity. We claim
that this stems from variance differences between policies, which cause two
problems. The first is the "Boring Policy Trap": each policy has a different
implicit exploration that depends on its reward variance, so leaving a
boring, or low-variance, policy is less likely because of its low implicit
exploration. The second is the "Manipulative Consultant" problem:
value-estimation functions used in deep RL algorithms such as DQN or deep
Actor-Critic methods maximize estimation precision rather than mean rewards and
achieve a better loss on low-variance policies, which causes the network to
converge to a sub-optimal policy. Cognitive experiments on humans showed that
noised reward signals may paradoxically improve performance. We explain this
using the aforementioned problems, claiming that both humans and algorithms may
share similar challenges in decision making.
  Inspired by this result, we propose the Adaptive Symmetric Reward Noising
(ASRN) method, which equalizes the reward variance across
different policies, thus avoiding the two problems without affecting the
environment's mean reward behavior. We demonstrate that the ASRN scheme can
dramatically improve the results.
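
As a rough illustration of the variance-equalizing idea in the abstract, the following Python sketch runs tabular epsilon-greedy Q-learning on a two-armed bandit and can optionally add zero-mean Gaussian noise to the low-variance arm so that both arms end up with the same reward standard deviation. This is not the paper's ASRN algorithm: the arm means and standard deviations, the hyperparameters, and the helper names `add_symmetric_noise` and `run_bandit` are all assumptions made for illustration, and each arm's standard deviation is assumed to be known rather than estimated adaptively.

```python
import random
import statistics


def add_symmetric_noise(reward, arm_std, target_std, rng):
    """Add zero-mean Gaussian noise so the reward std rises to target_std.

    Assumes arm_std (the arm's own reward std) is known; an adaptive
    method would have to estimate it from observed rewards.
    """
    extra_var = max(target_std ** 2 - arm_std ** 2, 0.0)
    if extra_var == 0.0:
        return reward  # already at (or above) the target variance
    return reward + rng.gauss(0.0, extra_var ** 0.5)


def run_bandit(noise_equalized, steps=5000, alpha=0.1, epsilon=0.05, seed=0):
    """Tabular epsilon-greedy Q-learning on a hypothetical two-armed bandit."""
    rng = random.Random(seed)
    # (mean, std) per arm: arm 0 is "boring", arm 1 is better on average but noisy.
    arms = [(0.0, 0.0), (0.2, 1.0)]
    target_std = max(std for _, std in arms)
    q = [0.0, 0.0]      # one value estimate per arm
    pulls = [0, 0]      # how often each arm was chosen
    for _ in range(steps):
        if rng.random() < epsilon:
            a = rng.randrange(2)            # explicit exploration
        else:
            a = 0 if q[0] >= q[1] else 1    # greedy choice
        mean, std = arms[a]
        r = rng.gauss(mean, std)
        if noise_equalized:
            r = add_symmetric_noise(r, std, target_std, rng)
        q[a] += alpha * (r - q[a])          # incremental value update
        pulls[a] += 1
    return q, pulls


if __name__ == "__main__":
    for flag in (False, True):
        q, pulls = run_bandit(noise_equalized=flag)
        print(f"noise_equalized={flag}: q={q}, pulls={pulls}")
```

Because the added noise has zero mean, the expected reward of each arm is unchanged; only the variance of the quieter arm is raised to match the noisier one.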
 
      
Related papers
- Quantile-Optimal Policy Learning under Unmeasured Confounding [55.72891849926314]
 We study quantile-optimal policy learning where the goal is to find a policy whose reward distribution has the largest $\alpha$-quantile for some $\alpha \in (0, 1)$.
Such a problem suffers from three main challenges: (i) nonlinearity of the quantile objective as a functional of the reward distribution, (ii) unobserved confounding, and (iii) insufficient coverage of the offline dataset.
 arXiv  Detail & Related papers  (2025-06-08T13:37:38Z)
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
 We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC) that can be applied for either risk-seeking or risk-averse policy optimization.
 arXiv  Detail & Related papers  (2023-12-07T15:55:58Z)
- Inverse Reinforcement Learning with the Average Reward Criterion [3.719493310637464]
 We study the problem of Inverse Reinforcement Learning (IRL) with an average-reward criterion.
The goal is to recover an unknown policy and a reward function when the agent only has samples of states and actions from an experienced agent.
 arXiv  Detail & Related papers  (2023-05-24T01:12:08Z)
- Mean-Semivariance Policy Optimization via Risk-Averse Reinforcement Learning [12.022303947412917]
 This paper aims at optimizing the mean-semivariance criterion in reinforcement learning w.r.t. steady rewards.
We reveal that the MSV problem can be solved by iteratively solving a sequence of RL problems with a policy-dependent reward function.
We propose two on-policy algorithms based on the policy gradient theory and the trust region method.
 arXiv  Detail & Related papers  (2022-06-15T08:32:53Z)
- CAMEO: Curiosity Augmented Metropolis for Exploratory Optimal Policies [62.39667564455059]
 We consider and study a distribution of optimal policies.
In experimental simulations we show that CAMEO indeed obtains policies that all solve classic control problems.
We further show that the different policies we sample present different risk profiles, corresponding to interesting practical applications in interpretability.
 arXiv  Detail & Related papers  (2022-05-19T09:48:56Z)
- Online Apprenticeship Learning [58.45089581278177]
 In Apprenticeship Learning (AL), we are given a Markov Decision Process (MDP) without access to the cost function.
The goal is to find a policy that matches the expert's performance on some predefined set of cost functions.
We show that the OAL problem can be effectively solved by combining two mirror descent based no-regret algorithms.
 arXiv  Detail & Related papers  (2021-02-13T12:57:51Z)
- Risk-Sensitive Deep RL: Variance-Constrained Actor-Critic Provably Finds Globally Optimal Policy [95.98698822755227]
 We make the first attempt to study risk-sensitive deep reinforcement learning under the average reward setting with the variance risk criteria.
We propose an actor-critic algorithm that iteratively and efficiently updates the policy, the Lagrange multiplier, and the Fenchel dual variable.
 arXiv  Detail & Related papers  (2020-12-28T05:02:26Z)
- Hindsight Experience Replay with Kronecker Product Approximate Curvature [5.441932327359051]
 Hindsight Experience Replay (HER) is one of the efficient algorithms for solving Reinforcement Learning tasks.
But due to its reduced sample efficiency and slower convergence, HER fails to perform effectively.
Natural gradients solve these challenges by converging the model parameters better.
Our proposed method solves the above-mentioned challenges with better sample efficiency, faster convergence, and an increased success rate.
 arXiv  Detail & Related papers  (2020-10-09T20:25:14Z)
- Active Finite Reward Automaton Inference and Reinforcement Learning Using Queries and Counterexamples [31.31937554018045]
 Deep reinforcement learning (RL) methods require intensive data from the exploration of the environment to achieve satisfactory performance.
We propose a framework that enables an RL agent to reason over its exploration process and distill high-level knowledge for effectively guiding its future explorations.
Specifically, we propose a novel RL algorithm that learns high-level knowledge in the form of a finite reward automaton by using the L* learning algorithm.
 arXiv  Detail & Related papers  (2020-06-28T21:13:08Z)
- DDPG++: Striving for Simplicity in Continuous-control Off-Policy Reinforcement Learning [95.60782037764928]
 We show that simple Deterministic Policy Gradient works remarkably well as long as the overestimation bias is controlled.
Second, we pinpoint training instabilities, typical of off-policy algorithms, to the greedy policy update step.
Third, we show that ideas in the propensity estimation literature can be used to importance-sample transitions from replay buffer and update policy to prevent deterioration of performance.
 arXiv  Detail & Related papers  (2020-06-26T20:21:12Z)
- DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction [96.90215318875859]
 We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from corrective feedback.
We propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training.
 arXiv  Detail & Related papers  (2020-03-16T16:18:52Z)
- Relative Importance Sampling for off-Policy Actor-Critic in Deep Reinforcement Learning [32.66049977978746]
 Off-policy learning exhibits greater instability when compared to on-policy learning in reinforcement learning (RL).
We propose a smooth form of importance sampling, specifically relative importance sampling (RIS), which mitigates variance and stabilizes learning.
Our methods performed better than or equal to several state-of-the-art RL benchmarks on OpenAI Gym challenges and synthetic datasets.
 arXiv  Detail & Related papers  (2018-10-30T07:41:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.