Sampling Efficient Deep Reinforcement Learning through Preference-Guided
Stochastic Exploration
- URL: http://arxiv.org/abs/2206.09627v1
- Date: Mon, 20 Jun 2022 08:23:49 GMT
- Title: Sampling Efficient Deep Reinforcement Learning through Preference-Guided
Stochastic Exploration
- Authors: Wenhui Huang, Cong Zhang, Jingda Wu, Xiangkun He, Jie Zhang and Chen
Lv
- Abstract summary: We propose a preference-guided $\epsilon$-greedy exploration algorithm for Deep Q-network (DQN).
We show that preference-guided exploration motivates the DQN agent to take diverse actions: actions with larger Q-values are sampled more frequently, while actions with smaller Q-values still have a chance to be explored, thus encouraging exploration.
- Score: 8.612437964299414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A large body of practical work built on the Deep Q-network (DQN)
algorithm indicates that the stochastic policy, despite its simplicity, is the
most frequently used exploration approach. However, most existing stochastic
exploration approaches either explore new actions heuristically regardless of
Q-values or inevitably introduce bias into the learning process to couple the
sampling with Q-values. In this paper, we propose a novel preference-guided
$\epsilon$-greedy exploration algorithm that can efficiently learn the action
distribution in line with the landscape of Q-values for DQN without introducing
additional bias. Specifically, we design a dual architecture consisting of two
branches, one of which is a copy of DQN, namely the Q-branch. The other branch,
which we call the preference branch, learns the action preference that the DQN
implicitly follows. We theoretically prove that the policy improvement theorem
holds for the preference-guided $\epsilon$-greedy policy and experimentally
show that the inferred action preference distribution aligns with the landscape
of corresponding Q-values. Consequently, preference-guided $\epsilon$-greedy
exploration motivates the DQN agent to take diverse actions, i.e., actions with
larger Q-values can be sampled more frequently whereas actions with smaller
Q-values still have a chance to be explored, thus encouraging exploration.
We assess the proposed method with four well-known DQN variants in nine
different environments. Extensive results confirm the superiority of our
proposed method in terms of performance and convergence speed.
Index Terms: Preference-guided exploration, stochastic policy, data
efficiency, deep reinforcement learning, deep Q-learning.
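To make the dual architecture and the exploration rule concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract: the Q-branch is a standard DQN head, the preference branch outputs a distribution over actions, and the exploratory draw of $\epsilon$-greedy samples from that distribution instead of a uniform one. The shared encoder, all names, and the way the preference branch is trained are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class DualBranchDQN(nn.Module):
    """Illustrative dual architecture: a Q-branch and a preference branch.

    The shared encoder is an assumption; the abstract only states that the
    network has a Q-branch (a copy of DQN) and a preference branch.
    """

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.q_branch = nn.Linear(hidden, n_actions)      # standard DQN head
        self.pref_branch = nn.Linear(hidden, n_actions)   # action-preference head

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)
        q_values = self.q_branch(h)
        # Normalize the preference logits into a distribution over actions.
        preference = torch.softmax(self.pref_branch(h), dim=-1)
        return q_values, preference


def preference_guided_epsilon_greedy(net: DualBranchDQN,
                                     obs: torch.Tensor,
                                     epsilon: float) -> int:
    """Select an action for a single observation of shape (obs_dim,).

    With probability 1 - epsilon act greedily w.r.t. the Q-branch; with
    probability epsilon sample from the preference distribution instead of
    a uniform one (this reading of the exploration rule is an assumption).
    """
    with torch.no_grad():
        q_values, preference = net(obs)
    if torch.rand(1).item() < epsilon:
        return int(torch.multinomial(preference, num_samples=1).item())
    return int(q_values.argmax(dim=-1).item())
```

In this reading, the Q-branch would be trained with the usual DQN loss while the preference branch learns the action preference that the Q-branch implicitly follows; the exact preference loss and the bias-free coupling argument are specified in the paper, not here.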
Related papers
- Multi-agent Reinforcement Learning with Deep Networks for Diverse Q-Vectors [3.9801926395657325]
This paper proposes a deep Q-network (DQN) algorithm capable of learning various Q-vectors using Max, Nash, and Maximin strategies.
The effectiveness of this approach is demonstrated in an environment where dual robotic arms collaborate to lift a pot.
arXiv Detail & Related papers (2024-06-12T03:30:10Z)
- Q-Probe: A Lightweight Approach to Reward Maximization for Language Models [16.801981347658625]
We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function.
At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few-shot prompting.
arXiv Detail & Related papers (2024-02-22T16:43:16Z)
- On the Convergence and Sample Complexity Analysis of Deep Q-Networks with $\epsilon$-Greedy Exploration [86.71396285956044]
This paper provides a theoretical understanding of Deep Q-Network (DQN) with the $\varepsilon$-greedy exploration in deep reinforcement learning.
arXiv Detail & Related papers (2023-10-24T20:37:02Z)
- DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning [66.85379279041128]
In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking to automatically select exemplars for in-context learning.
DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%.
arXiv Detail & Related papers (2023-10-04T16:44:37Z)
- Careful at Estimation and Bold at Exploration [21.518406902400432]
Policy-based exploration is beneficial for continuous action space in deterministic policy reinforcement learning.
However, policy-based exploration has two prominent issues: aimless exploration and policy divergence.
We introduce a novel exploration strategy to mitigate these issues, separate from the policy gradient.
arXiv Detail & Related papers (2023-08-22T10:52:46Z)
- Quantile Filtered Imitation Learning [49.11859771578969]
Quantile filtered imitation learning (QFIL) is a policy improvement operator designed for offline reinforcement learning.
We prove that QFIL gives us a safe policy improvement step with function approximation.
We see that QFIL performs well on the D4RL benchmark.
arXiv Detail & Related papers (2021-12-02T03:08:23Z)
- Self-correcting Q-Learning [14.178899938667161]
We introduce a new way to address the bias in the form of a "self-correcting algorithm".
Applying this strategy to Q-learning results in Self-correcting Q-learning.
We show theoretically that this new algorithm enjoys the same convergence guarantees as Q-learning while being more accurate.
arXiv Detail & Related papers (2020-12-02T11:36:24Z)
- Counterfactual Variable Control for Robust and Interpretable Question Answering [57.25261576239862]
Deep neural network based question answering (QA) models are neither robust nor explainable in many cases.
In this paper, we inspect such spurious "capability" of QA models using causal inference.
We propose a novel approach called Counterfactual Variable Control (CVC) that explicitly mitigates any shortcut correlation.
arXiv Detail & Related papers (2020-10-12T10:09:05Z)
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network; a hedged sketch of this ensemble-target idea appears after this list.
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
- Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we take advantage of the QA model to extract more appropriate answers, which iteratively refines data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)
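The cross Q-learning entry above describes maintaining a set of parallel models and estimating the Q-value from a randomly selected network. The sketch below is a minimal, assumption-laden illustration of that idea in PyTorch: the ensemble size, the double-Q-style split between action selection and evaluation, and all names are guesses for illustration rather than that paper's exact algorithm.

```python
import random
import torch
import torch.nn as nn


def make_q_net(obs_dim: int, n_actions: int) -> nn.Module:
    # One member of the ensemble: a small MLP Q-network.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))


def cross_q_target(q_ensemble: list[nn.Module],
                   reward: torch.Tensor,
                   next_obs: torch.Tensor,
                   done: torch.Tensor,
                   gamma: float = 0.99) -> torch.Tensor:
    """Compute a bootstrap target with randomly selected ensemble members.

    Assumed reading of the entry above: decoupling action selection from
    value estimation across randomly chosen networks (in the spirit of
    double Q-learning) mitigates overestimation bias. `reward` and `done`
    are float tensors of shape (batch,), `next_obs` has shape (batch, obs_dim).
    """
    selector = random.choice(q_ensemble)   # picks the greedy next action
    evaluator = random.choice(q_ensemble)  # evaluates that action
    with torch.no_grad():
        next_actions = selector(next_obs).argmax(dim=-1, keepdim=True)
        next_q = evaluator(next_obs).gather(-1, next_actions).squeeze(-1)
    return reward + gamma * (1.0 - done) * next_q
```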