Sampling Efficient Deep Reinforcement Learning through Preference-Guided
Stochastic Exploration
- URL: http://arxiv.org/abs/2206.09627v1
- Date: Mon, 20 Jun 2022 08:23:49 GMT
- Title: Sampling Efficient Deep Reinforcement Learning through Preference-Guided
Stochastic Exploration
- Authors: Wenhui Huang, Cong Zhang, Jingda Wu, Xiangkun He, Jie Zhang and Chen
Lv
- Abstract summary: We propose a preference-guided $\epsilon$-greedy exploration algorithm for Deep Q-network (DQN).
We show that preference-guided exploration motivates the DQN agent to take diverse actions: actions with larger Q-values are sampled more frequently, while actions with smaller Q-values still have a chance to be explored, thus encouraging exploration.
- Score: 8.612437964299414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A large body of practical work built on the Deep Q-network (DQN)
algorithm indicates that the stochastic policy, despite its simplicity, is the
most frequently used exploration approach. However, most existing stochastic
exploration approaches either explore new actions heuristically regardless of
Q-values or inevitably introduce bias into the learning process to couple the
sampling with Q-values. In this paper, we propose a novel preference-guided
$\epsilon$-greedy exploration algorithm that can efficiently learn the action
distribution in line with the landscape of Q-values for DQN without introducing
additional bias. Specifically, we design a dual architecture consisting of two
branches, one of which is a copy of DQN, namely the Q-branch. The other branch,
which we call the preference branch, learns the action preference that the DQN
implicitly follows. We theoretically prove that the policy improvement theorem
holds for the preference-guided $\epsilon$-greedy policy and experimentally
show that the inferred action preference distribution aligns with the landscape
of corresponding Q-values. Consequently, preference-guided $\epsilon$-greedy
exploration motivates the DQN agent to take diverse actions, i.e., actions with
larger Q-values can be sampled more frequently whereas actions with smaller
Q-values still have a chance to be explored, thus encouraging exploration.
We assess the proposed method with four well-known DQN variants in nine
different environments. Extensive results confirm the superiority of our
proposed method in terms of performance and convergence speed.
Index Terms: Preference-guided exploration, stochastic policy, data
efficiency, deep reinforcement learning, deep Q-learning.
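To make the dual architecture and the exploration rule concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract: the Q-branch is a standard DQN head, the preference branch outputs a distribution over actions, and the exploratory draw of $\epsilon$-greedy samples from that distribution instead of a uniform one. The shared encoder, all names, and the way the preference branch is trained are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class DualBranchDQN(nn.Module):
    """Illustrative dual architecture: a Q-branch and a preference branch.

    The shared encoder is an assumption; the abstract only states that the
    network has a Q-branch (a copy of DQN) and a preference branch.
    """

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.q_branch = nn.Linear(hidden, n_actions)      # standard DQN head
        self.pref_branch = nn.Linear(hidden, n_actions)   # action-preference head

    def forward(self, obs: torch.Tensor):
        h = self.encoder(obs)
        q_values = self.q_branch(h)
        # Normalize the preference logits into a distribution over actions.
        preference = torch.softmax(self.pref_branch(h), dim=-1)
        return q_values, preference


def preference_guided_epsilon_greedy(net: DualBranchDQN,
                                     obs: torch.Tensor,
                                     epsilon: float) -> int:
    """Select an action for a single observation of shape (obs_dim,).

    With probability 1 - epsilon act greedily w.r.t. the Q-branch; with
    probability epsilon sample from the preference distribution instead of
    a uniform one (this reading of the exploration rule is an assumption).
    """
    with torch.no_grad():
        q_values, preference = net(obs)
    if torch.rand(1).item() < epsilon:
        return int(torch.multinomial(preference, num_samples=1).item())
    return int(q_values.argmax(dim=-1).item())
```

In this reading, the Q-branch would be trained with the usual DQN loss while the preference branch learns the action preference that the Q-branch implicitly follows; the exact preference loss and the bias-free coupling argument are specified in the paper, not here.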
Related papers
- Multi-agent Reinforcement Learning with Deep Networks for Diverse Q-Vectors [3.9801926395657325]
This paper proposes a deep Q-network (DQN) algorithm capable of learning various Q-vectors using Max, Nash, and Maximin strategies.
The effectiveness of this approach is demonstrated in an environment where dual robotic arms collaborate to lift a pot.
arXiv Detail & Related papers (2024-06-12T03:30:10Z)
- Q-Probe: A Lightweight Approach to Reward Maximization for Language Models [16.801981347658625]
We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function.
At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few-shot prompting.
arXiv Detail & Related papers (2024-02-22T16:43:16Z)
- On the Convergence and Sample Complexity Analysis of Deep Q-Networks with $\epsilon$-Greedy Exploration [86.71396285956044]
This paper provides a theoretical understanding of Deep Q-Network (DQN) with the $\varepsilon$-greedy exploration in deep reinforcement learning.
arXiv Detail & Related papers (2023-10-24T20:37:02Z)
- DQ-LoRe: Dual Queries with Low Rank Approximation Re-ranking for In-Context Learning [66.85379279041128]
In this study, we introduce a framework that leverages Dual Queries and Low-rank approximation Re-ranking to automatically select exemplars for in-context learning.
DQ-LoRe significantly outperforms prior state-of-the-art methods in the automatic selection of exemplars for GPT-4, enhancing performance from 92.5% to 94.2%.
arXiv Detail & Related papers (2023-10-04T16:44:37Z)
- Careful at Estimation and Bold at Exploration [21.518406902400432]
Policy-based exploration is beneficial for continuous action space in deterministic policy reinforcement learning.
However, policy-based exploration has two prominent issues: aimless exploration and policy divergence.
We introduce a novel exploration strategy to mitigate these issues, separate from the policy gradient.
arXiv Detail & Related papers (2023-08-22T10:52:46Z)
- Quantile Filtered Imitation Learning [49.11859771578969]
Quantile filtered imitation learning (QFIL) is a policy improvement operator designed for offline reinforcement learning.
We prove that QFIL gives us a safe policy improvement step with function approximation.
We see that QFIL performs well on the D4RL benchmark.
arXiv Detail & Related papers (2021-12-02T03:08:23Z)
- Self-correcting Q-Learning [14.178899938667161]
We introduce a new way to address the bias in the form of a "self-correcting algorithm".
Applying this strategy to Q-learning results in Self-correcting Q-learning.
We show theoretically that this new algorithm enjoys the same convergence guarantees as Q-learning while being more accurate.
arXiv Detail & Related papers (2020-12-02T11:36:24Z)
- Counterfactual Variable Control for Robust and Interpretable Question Answering [57.25261576239862]
Deep neural network based question answering (QA) models are neither robust nor explainable in many cases.
In this paper, we inspect such spurious "capability" of QA models using causal inference.
We propose a novel approach called Counterfactual Variable Control (CVC) that explicitly mitigates any shortcut correlation.
arXiv Detail & Related papers (2020-10-12T10:09:05Z)
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network; a hedged sketch of this ensemble-target idea appears after this list.
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
- Harvesting and Refining Question-Answer Pairs for Unsupervised QA [95.9105154311491]
We introduce two approaches to improve unsupervised Question Answering (QA).
First, we harvest lexically and syntactically divergent questions from Wikipedia to automatically construct a corpus of question-answer pairs (named RefQA).
Second, we take advantage of the QA model to extract more appropriate answers, which iteratively refines data over RefQA.
arXiv Detail & Related papers (2020-05-06T15:56:06Z)
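The cross Q-learning entry above describes maintaining a set of parallel models and estimating the Q-value from a randomly selected network. The sketch below is a minimal, assumption-laden illustration of that idea in PyTorch: the ensemble size, the double-Q-style split between action selection and evaluation, and all names are guesses for illustration rather than that paper's exact algorithm.

```python
import random
import torch
import torch.nn as nn


def make_q_net(obs_dim: int, n_actions: int) -> nn.Module:
    # One member of the ensemble: a small MLP Q-network.
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))


def cross_q_target(q_ensemble: list[nn.Module],
                   reward: torch.Tensor,
                   next_obs: torch.Tensor,
                   done: torch.Tensor,
                   gamma: float = 0.99) -> torch.Tensor:
    """Compute a bootstrap target with randomly selected ensemble members.

    Assumed reading of the entry above: decoupling action selection from
    value estimation across randomly chosen networks (in the spirit of
    double Q-learning) mitigates overestimation bias. `reward` and `done`
    are float tensors of shape (batch,), `next_obs` has shape (batch, obs_dim).
    """
    selector = random.choice(q_ensemble)   # picks the greedy next action
    evaluator = random.choice(q_ensemble)  # evaluates that action
    with torch.no_grad():
        next_actions = selector(next_obs).argmax(dim=-1, keepdim=True)
        next_q = evaluator(next_obs).gather(-1, next_actions).squeeze(-1)
    return reward + gamma * (1.0 - done) * next_q
```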