Accelerating Reinforcement Learning with Value-Conditional State Entropy Exploration
- URL: http://arxiv.org/abs/2305.19476v3
- Date: Thu, 8 Aug 2024 19:48:17 GMT
- Title: Accelerating Reinforcement Learning with Value-Conditional State Entropy Exploration
- Authors: Dongyoung Kim, Jinwoo Shin, Pieter Abbeel, Younggyo Seo
- Abstract summary: A promising technique for exploration is to maximize the entropy of the visited state distribution.
However, this approach tends to struggle in a supervised setup with a task reward, where an agent prefers to visit high-value states.
We present a novel exploration technique that maximizes the value-conditional state entropy.
- Score: 97.19464604735802
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A promising technique for exploration is to maximize the entropy of visited state distribution, i.e., state entropy, by encouraging uniform coverage of visited state space. While it has been effective for an unsupervised setup, it tends to struggle in a supervised setup with a task reward, where an agent prefers to visit high-value states to exploit the task reward. Such a preference can cause an imbalance between the distributions of high-value states and low-value states, which biases exploration towards low-value state regions as a result of the state entropy increasing when the distribution becomes more uniform. This issue is exacerbated when high-value states are narrowly distributed within the state space, making it difficult for the agent to complete the tasks. In this paper, we present a novel exploration technique that maximizes the value-conditional state entropy, which separately estimates the state entropies that are conditioned on the value estimates of each state, then maximizes their average. By only considering the visited states with similar value estimates for computing the intrinsic bonus, our method prevents the distribution of low-value states from affecting exploration around high-value states, and vice versa. We demonstrate that the proposed alternative to the state entropy baseline significantly accelerates various reinforcement learning algorithms across a variety of tasks within MiniGrid, DeepMind Control Suite, and Meta-World benchmarks. Source code is available at https://sites.google.com/view/rl-vcse.
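To make the bonus computation concrete, here is a minimal NumPy sketch of the value-conditional idea: visited states are grouped into value-estimate bins (a simplification of the paper's conditional entropy estimator), and a k-nearest-neighbor distance bonus is computed within each bin so that only states with similar value estimates influence each other's intrinsic reward. The function name, binning scheme, and exact bonus form are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def value_conditional_knn_bonus(states, values, k=5, n_bins=10):
    """Sketch of a value-conditional state entropy bonus: states are grouped
    into value bins and a k-NN distance bonus is computed using only states
    in the same bin, so low- and high-value regions do not dilute each other.
    Simplified illustration, not the paper's exact estimator.

    states: (N, d) array of visited states; values: (N,) value estimates.
    Returns an (N,) array of intrinsic bonuses."""
    edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    bins = np.digitize(values, edges)
    bonuses = np.zeros(len(states))
    for b in np.unique(bins):
        idx = np.where(bins == b)[0]
        if len(idx) <= k:
            continue  # too few same-value states for a k-NN estimate
        group = states[idx]
        # pairwise distances restricted to states in the same value bin
        dists = np.linalg.norm(group[:, None, :] - group[None, :, :], axis=-1)
        kth = np.sort(dists, axis=1)[:, k]  # distance to the k-th nearest neighbor
        bonuses[idx] = np.log(kth + 1.0)    # k-NN entropy-style intrinsic reward
    return bonuses
```

In an agent, a bonus of this kind would typically be added to the task reward as an intrinsic term before the policy update.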
Related papers
- The Limits of Pure Exploration in POMDPs: When the Observation Entropy is Enough [40.82741665804367]
We study a simple approach of maximizing the entropy over observations in place of the true latent states.
We show how knowledge of the latter can be exploited to compute a regularization of the observation entropy that improves performance in a principled way.
arXiv Detail & Related papers (2024-06-18T17:00:13Z) - How to Explore with Belief: State Entropy Maximization in POMDPs [40.82741665804367]
We develop a memory- and computation-efficient policy gradient method to address a first-order relaxation of the objective defined on belief states.
This paper aims to generalize state entropy maximization to more realistic domains that meet the challenges of real-world applications.
arXiv Detail & Related papers (2024-06-04T13:16:34Z) - Modeling State Shifting via Local-Global Distillation for Event-Frame Gaze Tracking [61.44701715285463]
This paper tackles the problem of passive gaze estimation using both event and frame data.
We reformulate gaze estimation as the quantification of the state shifting from the current state to several prior registered anchor states.
To improve the generalization ability, instead of learning a large gaze estimation network directly, we align a group of local experts with a student network.
arXiv Detail & Related papers (2024-03-31T03:30:37Z) - Efficient Reinforcement Learning with Impaired Observability: Learning to Act with Delayed and Missing State Observations [92.25604137490168]
This paper introduces a theoretical investigation into efficient reinforcement learning in control systems where agents must act with delayed and missing state observations.
We present algorithms and establish near-optimal regret upper and lower bounds, of the form $\tilde{\mathcal{O}}(\sqrt{\mathrm{poly}(H)\,SAK})$, for RL in the delayed and missing observation settings.
arXiv Detail & Related papers (2023-06-02T02:46:39Z) - Scaling Marginalized Importance Sampling to High-Dimensional State-Spaces via State Abstraction [5.150752343250592]
We consider the problem of off-policy evaluation (OPE) in reinforcement learning (RL).
We propose to improve the accuracy of OPE estimators by projecting the high-dimensional state-space into a low-dimensional state-space.
arXiv Detail & Related papers (2022-12-14T20:07:33Z) - Distributed Q-Learning with State Tracking for Multi-agent Networked Control [61.63442612938345]
This paper studies distributed Q-learning for Linear Quadratic Regulator (LQR) in a multi-agent network.
We devise a state tracking (ST) based Q-learning algorithm to design optimal controllers for agents.
arXiv Detail & Related papers (2020-12-22T22:03:49Z) - A New Bandit Setting Balancing Information from State Evolution and Corrupted Context [52.67844649650687]
We propose a new sequential decision-making setting combining key aspects of two established online learning problems with bandit feedback.
The optimal action to play at any given moment is contingent on an underlying changing state which is not directly observable by the agent.
We present an algorithm that uses a referee to dynamically combine the policies of a contextual bandit and a multi-armed bandit.
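As a purely illustrative sketch of the "referee" idea (not the algorithm proposed in the paper), one can picture a meta-learner that keeps exponential weights over the two base policies, lets the sampled one act each round, and reweights it from the observed reward:

```python
import numpy as np

class Referee:
    """Hypothetical EXP3-style referee over two base policies
    (a contextual bandit and a multi-armed bandit)."""

    def __init__(self, lr=0.1):
        self.log_w = np.zeros(2)  # log-weights for [contextual, multi-armed]
        self.lr = lr

    def choose(self, rng):
        # softmax over log-weights gives the probability of each base policy acting
        p = np.exp(self.log_w - self.log_w.max())
        p /= p.sum()
        k = int(rng.choice(2, p=p))
        return k, p[k]

    def update(self, k, prob, reward):
        # importance-weighted update: the chosen policy's weight grows
        # in proportion to the reward it actually obtained
        self.log_w[k] += self.lr * reward / max(prob, 1e-8)

# usage: k, prob = referee.choose(rng); run the chosen policy; referee.update(k, prob, reward)
```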
arXiv Detail & Related papers (2020-11-16T14:35:37Z) - Exploring Unknown States with Action Balance [48.330318997735574]
Exploration is a key problem in reinforcement learning.
Next-state bonus methods force the agent to pay excessive attention to exploring known states.
We propose action balance exploration, which balances the frequency of selecting each action at a given state.
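A minimal, hedged sketch of an action-balance-style bonus for a tabular agent, assuming a simple count-based form (the exact formulation in the paper may differ):

```python
import numpy as np
from collections import defaultdict

class ActionBalanceBonus:
    """Hypothetical count-based action-balance bonus: actions selected less
    often in a state receive a larger bonus, pushing per-state action
    selection frequencies toward balance. Assumes hashable (tabular) states."""

    def __init__(self, n_actions, scale=1.0):
        self.counts = defaultdict(lambda: np.zeros(n_actions))
        self.scale = scale

    def bonus(self, state):
        # larger for rarely selected actions in this state
        return self.scale / np.sqrt(self.counts[state] + 1.0)

    def observe(self, state, action):
        self.counts[state][action] += 1.0
```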
arXiv Detail & Related papers (2020-03-10T03:32:28Z) - Witnessing Negative Conditional Entropy [0.0]
We prove the existence of a Hermitian operator for the detection of states having negative conditional entropy for bipartite systems.
We find that for a particular witness, the estimated tight upper bound matches the value of conditional entropy for Werner states.
arXiv Detail & Related papers (2020-01-30T10:08:10Z)