Reward Shaping with Dynamic Trajectory Aggregation
- URL: http://arxiv.org/abs/2104.06163v1
- Date: Tue, 13 Apr 2021 13:07:48 GMT
- Title: Reward Shaping with Dynamic Trajectory Aggregation
- Authors: Takato Okudo and Seiji Yamada
- Abstract summary: Potential-based reward shaping is a basic method for enriching rewards.
SARSA-RS learns the potential function instead of requiring it to be specified by hand.
We propose a trajectory aggregation method that uses a subgoal series.
- Score: 7.6146285961466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning, which acquires a policy that maximizes long-term
rewards, has been actively studied. Unfortunately, this type of learning is often
too slow to use in practical situations because the state-action space becomes
huge in real environments. Rewards are the essential factor in learning
efficiency. Potential-based reward shaping is a basic method for enriching
rewards, but it requires a specific real-valued function, called a potential
function, to be defined for every domain, and it is often difficult to
represent the potential function directly. SARSA-RS learns the potential
function during training instead. However, SARSA-RS can only be applied to
simple environments: its bottleneck is the aggregation of states into abstract
states, since it is almost impossible for designers to build an aggregation
function covering all states. We propose a trajectory aggregation method that
uses a subgoal series. It dynamically aggregates the states visited in an
episode during trial and error, using only the subgoal series and a subgoal
identification function, which keeps designer effort minimal and makes
application to environments with high-dimensional observations possible. We
obtained subgoal series from human participants for the experiments, which we
conducted in three domains: four rooms (discrete states and discrete actions),
pinball (continuous states and discrete actions), and picking (continuous
states and continuous actions). We compared our method with a baseline
reinforcement learning algorithm and other subgoal-based methods, including
random subgoals and naive subgoal-based reward shaping. Our reward shaping
outperformed all other methods in learning efficiency.
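To make the mechanism above concrete, the sketch below combines standard potential-based reward shaping, F(s, s') = γΦ(s') − Φ(s) (Ng et al., 1999), with a potential function defined over abstract states produced by a subgoal-based aggregation of the current trajectory. This is only a minimal, assumption-laden illustration, not the authors' implementation: the aggregation rule, the update rule, and names such as identify_subgoal are hypothetical placeholders.

```python
import numpy as np

# Minimal sketch (assumptions, not the authors' code): a potential function
# over abstract states, where the abstract state is the index of the most
# recently achieved subgoal in a fixed subgoal series.

GAMMA = 0.99  # discount factor
ALPHA = 0.1   # learning rate for the potential values

class SubgoalPotential:
    """Learns one potential value per abstract state (subgoal index)."""

    def __init__(self, num_subgoals):
        # Abstract state z in {0, ..., num_subgoals}: number of subgoals reached so far.
        self.phi = np.zeros(num_subgoals + 1)

    def aggregate(self, achieved_subgoals):
        # Dynamic trajectory aggregation: the abstract state is simply how far
        # along the subgoal series the current episode has progressed.
        return len(achieved_subgoals)

    def shaping_reward(self, z, z_next):
        # Standard potential-based shaping term: F = gamma * phi(z') - phi(z).
        return GAMMA * self.phi[z_next] - self.phi[z]

    def update(self, z, target):
        # Move the potential of abstract state z toward a TD-like target
        # (hypothetical update rule, for illustration only).
        self.phi[z] += ALPHA * (target - self.phi[z])


def identify_subgoal(state, subgoal_series, achieved_subgoals):
    """Hypothetical subgoal identification function: returns the next subgoal
    in the series if the current state matches it, otherwise None."""
    next_idx = len(achieved_subgoals)
    if next_idx < len(subgoal_series) and np.allclose(state, subgoal_series[next_idx]):
        return subgoal_series[next_idx]
    return None
```

In such a sketch the learner would receive r + F in place of the environment reward r, so the optimal policy is preserved as in ordinary potential-based shaping; how SARSA-RS actually updates the potential values is specified in the paper.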
Related papers
- STARC: A General Framework For Quantifying Differences Between Reward
Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z) - Basis for Intentions: Efficient Inverse Reinforcement Learning using
Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z) - Probability Density Estimation Based Imitation Learning [11.262633728487165]
Imitation Learning (IL) is an effective learning paradigm exploiting the interactions between agents and environments.
In this work, a novel reward function based on probability density estimation is proposed for IRL.
We present a "watch-try-learn" style framework named Probability Density Estimation based Imitation Learning (PDEIL)
arXiv Detail & Related papers (2021-12-13T15:55:38Z) - Flow Network based Generative Models for Non-Iterative Diverse Candidate
Generation [110.09855163856326]
This paper is about the problem of learning a policy for generating an object from a sequence of actions.
We propose GFlowNet, based on a view of the generative process as a flow network.
We prove that any global minimum of the proposed objectives yields a policy which samples from the desired distribution.
arXiv Detail & Related papers (2021-06-08T14:21:10Z) - Subgoal-based Reward Shaping to Improve Efficiency in Reinforcement
Learning [7.6146285961466]
We extend potential-based reward shaping and propose a subgoal-based reward shaping method.
Our method makes it easier for human trainers to share their knowledge of subgoals.
arXiv Detail & Related papers (2021-04-13T14:28:48Z) - Replacing Rewards with Examples: Example-Based Policy Search via
Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
arXiv Detail & Related papers (2021-03-23T16:19:55Z) - f-IRL: Inverse Reinforcement Learning via State Marginal Matching [13.100127636586317]
We propose a method for learning the reward function (and the corresponding policy) to match the expert state density.
We present an algorithm, f-IRL, that recovers a stationary reward function from the expert density by gradient descent.
Our method outperforms adversarial imitation learning methods in terms of sample efficiency and the required number of expert trajectories.
arXiv Detail & Related papers (2020-11-09T19:37:48Z) - Provably Efficient Reward-Agnostic Navigation with Linear Value
Iteration [143.43658264904863]
We show how iteration under a more standard notion of low inherent Bellman error, typically employed in least-square value-style algorithms, can provide strong PAC guarantees on learning a near optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z) - Active Preference-Based Gaussian Process Regression for Reward Learning [42.697198807877925]
One common approach is to learn reward functions from collected expert demonstrations.
We present a preference-based learning approach, where as an alternative, the human feedback is only in the form of comparisons between trajectories.
Our approach enables us to tackle both inflexibility and data-inefficiency problems within a preference-based learning framework.
arXiv Detail & Related papers (2020-05-06T03:29:27Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.