On Using Hamiltonian Monte Carlo Sampling for Reinforcement Learning
Problems in High-dimension
- URL: http://arxiv.org/abs/2011.05927v3
- Date: Mon, 28 Mar 2022 17:00:25 GMT
- Title: On Using Hamiltonian Monte Carlo Sampling for Reinforcement Learning
Problems in High-dimension
- Authors: Udari Madhushani, Biswadip Dey, Naomi Ehrich Leonard, Amit Chakraborty
- Abstract summary: Hamiltonian Monte Carlo (HMC) sampling offers a tractable way to generate data for training RL algorithms.
We introduce a framework, called \textit{Hamiltonian $Q$-Learning}, that demonstrates, both theoretically and empirically, that $Q$ values can be learned from a dataset generated by HMC samples of actions, rewards, and state transitions.
- Score: 7.200655637873445
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Value function based reinforcement learning (RL) algorithms, for example,
$Q$-learning, learn optimal policies from datasets of actions, rewards, and
state transitions. However, when the underlying state transition dynamics are
stochastic and evolve on a high-dimensional space, generating independent and
identically distributed (IID) data samples for creating these datasets poses a
significant challenge due to the intractability of the associated normalizing
integral. In these scenarios, Hamiltonian Monte Carlo (HMC) sampling offers a
computationally tractable way to generate data for training RL algorithms. In
this paper, we introduce a framework, called \textit{Hamiltonian $Q$-Learning},
that demonstrates, both theoretically and empirically, that $Q$ values can be
learned from a dataset generated by HMC samples of actions, rewards, and state
transitions. Furthermore, to exploit the underlying low-rank structure of the
$Q$ function, Hamiltonian $Q$-Learning uses a matrix completion algorithm for
reconstructing the updated $Q$ function from $Q$ value updates over a much
smaller subset of state-action pairs. Thus, by providing an efficient way to
apply $Q$-learning in stochastic, high-dimensional settings, the proposed
approach broadens the scope of RL algorithms for real-world applications.
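To make the two ingredients named in the abstract concrete, here is a minimal, hedged sketch in Python: a basic HMC sampler (leapfrog integration plus a Metropolis correction) of the kind used to draw states when the transition density is only known up to its normalizing constant, and a soft-impute style matrix-completion step that rebuilds a low-rank $Q$ matrix from updates computed on a small subset of state-action pairs. The function names, hyperparameters, and toy problem below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def hmc_sample(log_prob, grad_log_prob, x0, n_samples=500,
               step_size=0.1, n_leapfrog=20, rng=None):
    """Basic Hamiltonian Monte Carlo with a leapfrog integrator and a
    Metropolis accept/reject step; only an unnormalized log-density is needed."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float)
    samples = []
    for _ in range(n_samples):
        p = rng.standard_normal(x.shape)                    # resample momentum
        x_new, p_new = x.copy(), p.copy()
        p_new += 0.5 * step_size * grad_log_prob(x_new)     # half momentum step
        for _ in range(n_leapfrog - 1):
            x_new += step_size * p_new
            p_new += step_size * grad_log_prob(x_new)
        x_new += step_size * p_new
        p_new += 0.5 * step_size * grad_log_prob(x_new)     # final half step
        h_old = -log_prob(x) + 0.5 * p @ p                  # Hamiltonian before
        h_new = -log_prob(x_new) + 0.5 * p_new @ p_new      # Hamiltonian after
        if np.log(rng.uniform()) < h_old - h_new:           # Metropolis test
            x = x_new
        samples.append(x.copy())
    return np.array(samples)


def soft_impute(q_observed, mask, rank_penalty=0.5, n_iters=100):
    """Rebuild a low-rank Q matrix from the entries observed on `mask`
    via iterative SVD soft-thresholding (nuclear-norm regularization)."""
    q_hat = np.zeros_like(q_observed)
    for _ in range(n_iters):
        filled = np.where(mask, q_observed, q_hat)          # keep observed entries
        u, s, vt = np.linalg.svd(filled, full_matrices=False)
        s = np.maximum(s - rank_penalty, 0.0)               # shrink singular values
        q_hat = (u * s) @ vt
    return q_hat


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy low-rank "Q matrix" over a discretized state-action grid.
    n_states, n_actions, rank = 100, 20, 3
    q_true = rng.standard_normal((n_states, rank)) @ rng.standard_normal((rank, n_actions))
    # Pretend Q-value updates were computed only on a sparse random subset of
    # state-action pairs (in the paper, the subset comes from HMC samples).
    mask = rng.uniform(size=q_true.shape) < 0.3
    q_rec = soft_impute(np.where(mask, q_true, 0.0), mask)
    print("relative error:", np.linalg.norm(q_rec - q_true) / np.linalg.norm(q_true))

    # HMC draws from an unnormalized 2-D Gaussian "state distribution".
    states = hmc_sample(lambda x: -0.5 * x @ x, lambda x: -x, np.zeros(2), n_samples=200)
    print("HMC sample mean:", states.mean(axis=0))
```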
Related papers
- Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Q-learning, StochDQN, and StochDDQN, all of which integrate this approach for both value-function updates and action selection.
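A minimal tabular sketch of the core idea summarized above: replace the maximization over all $n$ actions, in both the Q-learning target and greedy action selection, with a maximization over a random subset of roughly $\mathcal{O}(\log(n))$ actions. The subset construction and all constants below are illustrative assumptions, not the paper's exact specification.

```python
import math
import numpy as np


def stochastic_max_action(q_row, rng, extra=()):
    """Argmax of Q over a random subset of about log2(n) actions
    (plus any `extra` candidate actions, e.g. the previous action)."""
    n = q_row.shape[0]
    k = max(1, math.ceil(math.log2(n)))
    subset = np.unique(np.concatenate([rng.choice(n, size=k, replace=False),
                                       np.asarray(extra, dtype=int)]))
    return int(subset[np.argmax(q_row[subset])])


def stochastic_q_update(Q, s, a, r, s_next, rng, alpha=0.1, gamma=0.99):
    """One Q-learning step whose bootstrap target also uses the subset max."""
    a_star = stochastic_max_action(Q[s_next], rng)
    Q[s, a] += alpha * (r + gamma * Q[s_next, a_star] - Q[s, a])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = np.zeros((10, 1000))                  # 10 states, 1000 discrete actions
    stochastic_q_update(Q, s=0, a=3, r=1.0, s_next=1, rng=rng)
    print(Q[0, 3])                            # -> 0.1 (alpha * reward)
```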
arXiv Detail & Related papers (2024-05-16T17:58:44Z)
- Monte Carlo Tree Search Boosts Reasoning via Iterative Preference Learning [55.96599486604344]
We introduce an approach aimed at enhancing the reasoning capabilities of Large Language Models (LLMs) through an iterative preference learning process.
We use Monte Carlo Tree Search (MCTS) to iteratively collect preference data, utilizing its look-ahead ability to break down instance-level rewards into more granular step-level signals.
The proposed algorithm employs Direct Preference Optimization (DPO) to update the LLM policy using this newly generated step-level preference data.
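For reference, a hedged sketch of the DPO objective that the summary says is applied to the MCTS-derived step-level preference pairs. The inputs are assumed to be summed log-probabilities of the preferred and dispreferred continuations under the trained policy and a frozen reference policy; how MCTS constructs the pairs is not shown.

```python
import torch
import torch.nn.functional as F


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss: -log sigmoid(beta * (policy log-ratio minus reference log-ratio))."""
    policy_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()


# Dummy summed log-probabilities for a batch of four preference pairs.
loss = dpo_loss(torch.tensor([-5.0, -6.0, -4.5, -7.0]),   # policy, chosen
                torch.tensor([-6.5, -6.2, -5.0, -8.0]),   # policy, rejected
                torch.tensor([-5.5, -6.1, -4.8, -7.2]),   # reference, chosen
                torch.tensor([-6.0, -6.0, -5.1, -7.9]))   # reference, rejected
print(loss.item())
```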
arXiv Detail & Related papers (2024-05-01T11:10:24Z)
- A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation [66.26739783789387]
We propose a new algorithm, Monotonic Q-Learning with Upper Confidence Bound (MQL-UCB) for reinforcement learning.
MQL-UCB achieves minimax optimal regret of $\tilde{O}(d\sqrt{HK})$ when $K$ is sufficiently large, as well as near-optimal policy switching cost.
Our work sheds light on designing provably sample-efficient and deployment-efficient Q-learning with nonlinear function approximation.
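MQL-UCB itself uses nonlinear function approximation and a rare-switching policy update scheme whose details are not given in this summary. As a much simpler stand-in for the optimism-under-uncertainty principle it builds on, here is a tabular Q-learning update with a count-based upper-confidence bonus; the bonus form and constants are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np


def ucb_q_update(Q, counts, s, a, r, s_next, alpha=0.1, gamma=0.99, c=1.0):
    """One optimistic Q-learning step: the usual TD target plus an
    exploration bonus that shrinks as the pair (s, a) is visited more often."""
    counts[s, a] += 1
    bonus = c / np.sqrt(counts[s, a])          # upper-confidence bonus
    target = r + bonus + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])


if __name__ == "__main__":
    Q = np.zeros((5, 3))
    counts = np.zeros((5, 3), dtype=int)
    ucb_q_update(Q, counts, s=0, a=1, r=1.0, s_next=2)
    print(Q[0, 1])                             # -> 0.2 (alpha * (reward + bonus))
```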
arXiv Detail & Related papers (2023-11-26T08:31:57Z)
- Provable and Practical: Efficient Exploration in Reinforcement Learning via Langevin Monte Carlo [104.9535542833054]
We present a scalable and effective exploration strategy based on Thompson sampling for reinforcement learning (RL).
Instead of approximating the posterior, we directly sample the Q function from its posterior distribution using Langevin Monte Carlo.
Our approach achieves better or similar results compared with state-of-the-art deep RL algorithms on several challenging exploration tasks from the Atari57 suite.
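A hedged sketch of the exploration idea described above: approximate Thompson sampling by updating the Q-network parameters with Langevin dynamics (a gradient step on the TD loss plus injected Gaussian noise) and acting greedily under the sampled network. The network size, step size, and temperature are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

q_net = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # toy Q-network


def langevin_q_step(q_net, batch, step_size=1e-3, temperature=1e-4, gamma=0.99):
    """One Langevin step on the TD loss: gradient descent plus Gaussian noise,
    which (approximately) draws the parameters from their posterior."""
    s, a, r, s_next, done = batch
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * q_net(s_next).max(dim=1).values
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = ((q_sa - target) ** 2).mean()
    q_net.zero_grad()
    loss.backward()
    with torch.no_grad():
        for p in q_net.parameters():
            p -= step_size * p.grad                                           # gradient step
            p += (2.0 * step_size * temperature) ** 0.5 * torch.randn_like(p)  # injected noise


def act(q_net, s):
    """Greedy action under the currently sampled network; exploration comes
    from the noise in the parameter updates, not from epsilon-greedy."""
    with torch.no_grad():
        return int(q_net(s.unsqueeze(0)).argmax(dim=1))


if __name__ == "__main__":
    B = 32
    batch = (torch.randn(B, 4), torch.randint(0, 2, (B,)), torch.randn(B),
             torch.randn(B, 4), torch.zeros(B))
    langevin_q_step(q_net, batch)
    print(act(q_net, torch.randn(4)))
```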
arXiv Detail & Related papers (2023-05-29T17:11:28Z)
- An Experimental Design Perspective on Model-Based Reinforcement Learning [73.37942845983417]
In practical applications of RL, it is expensive to observe state transitions from the environment.
We propose an acquisition function that quantifies how much information a state-action pair would provide about the optimal solution to a Markov decision process.
arXiv Detail & Related papers (2021-12-09T23:13:57Z)
- Learning the hypotheses space from data through a U-curve algorithm: a statistically consistent complexity regularizer for Model Selection [0.0]
This paper proposes a data-driven, systematic, consistent, and non-exhaustive approach to Model Selection.
Our main contribution is a data-driven, general learning algorithm for performing regularized Model Selection on $\mathbb{L}(\mathcal{H})$.
A remarkable consequence of this approach is a set of conditions under which a non-exhaustive search of $\mathbb{L}(\mathcal{H})$ can return an optimal solution.
arXiv Detail & Related papers (2021-09-08T18:28:56Z)
- On Function Approximation in Reinforcement Learning: Optimism in the Face of Large State Spaces [208.67848059021915]
We study the exploration-exploitation tradeoff at the core of reinforcement learning.
In particular, we prove that the complexity of the function class $\mathcal{F}$ characterizes the complexity of the learning problem.
Our regret bounds are independent of the number of states.
arXiv Detail & Related papers (2020-11-09T18:32:22Z)