Reward Shaping with Dynamic Trajectory Aggregation
- URL: http://arxiv.org/abs/2104.06163v1
- Date: Tue, 13 Apr 2021 13:07:48 GMT
- Title: Reward Shaping with Dynamic Trajectory Aggregation
- Authors: Takato Okudo and Seiji Yamada
- Abstract summary: Potential-based reward shaping is a basic method for enriching rewards.
SARSA-RS learns the potential function instead of requiring it to be specified by hand.
We propose a trajectory aggregation method that uses a subgoal series.
- Score: 7.6146285961466
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning, which acquires a policy that maximizes long-term
rewards, has been actively studied. Unfortunately, this type of learning is often
too slow to use in practical situations because the state-action space becomes
huge in real environments. Rewards are the essential factor in learning
efficiency. Potential-based reward shaping is a basic method for enriching
rewards, but it requires a specific real-valued function, called a potential
function, to be defined for every domain, and it is often difficult to
represent the potential function directly. SARSA-RS learns the potential
function during training instead. However, SARSA-RS can only be applied to
simple environments: its bottleneck is the aggregation of states into abstract
states, since it is almost impossible for designers to build an aggregation
function covering all states. We propose a trajectory aggregation method that
uses a subgoal series. It dynamically aggregates the states visited in an
episode during trial and error, using only the subgoal series and a subgoal
identification function, which keeps designer effort minimal and makes
application to environments with high-dimensional observations possible. We
obtained subgoal series from human participants for the experiments, which we
conducted in three domains: four rooms (discrete states and discrete actions),
pinball (continuous states and discrete actions), and picking (continuous
states and continuous actions). We compared our method with a baseline
reinforcement learning algorithm and other subgoal-based methods, including
random subgoals and naive subgoal-based reward shaping. Our reward shaping
outperformed all other methods in learning efficiency.
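To make the mechanism above concrete, the sketch below combines standard potential-based reward shaping, F(s, s') = γΦ(s') − Φ(s) (Ng et al., 1999), with a potential function defined over abstract states produced by a subgoal-based aggregation of the current trajectory. This is only a minimal, assumption-laden illustration, not the authors' implementation: the aggregation rule, the update rule, and names such as identify_subgoal are hypothetical placeholders.

```python
import numpy as np

# Minimal sketch (assumptions, not the authors' code): a potential function
# over abstract states, where the abstract state is the index of the most
# recently achieved subgoal in a fixed subgoal series.

GAMMA = 0.99  # discount factor
ALPHA = 0.1   # learning rate for the potential values

class SubgoalPotential:
    """Learns one potential value per abstract state (subgoal index)."""

    def __init__(self, num_subgoals):
        # Abstract state z in {0, ..., num_subgoals}: number of subgoals reached so far.
        self.phi = np.zeros(num_subgoals + 1)

    def aggregate(self, achieved_subgoals):
        # Dynamic trajectory aggregation: the abstract state is simply how far
        # along the subgoal series the current episode has progressed.
        return len(achieved_subgoals)

    def shaping_reward(self, z, z_next):
        # Standard potential-based shaping term: F = gamma * phi(z') - phi(z).
        return GAMMA * self.phi[z_next] - self.phi[z]

    def update(self, z, target):
        # Move the potential of abstract state z toward a TD-like target
        # (hypothetical update rule, for illustration only).
        self.phi[z] += ALPHA * (target - self.phi[z])


def identify_subgoal(state, subgoal_series, achieved_subgoals):
    """Hypothetical subgoal identification function: returns the next subgoal
    in the series if the current state matches it, otherwise None."""
    next_idx = len(achieved_subgoals)
    if next_idx < len(subgoal_series) and np.allclose(state, subgoal_series[next_idx]):
        return subgoal_series[next_idx]
    return None
```

In such a sketch the learner would receive r + F in place of the environment reward r, so the optimal policy is preserved as in ordinary potential-based shaping; how SARSA-RS actually updates the potential values is specified in the paper.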
Related papers
- STARC: A General Framework For Quantifying Differences Between Reward
Functions [55.33869271912095]
We provide a class of pseudometrics on the space of all reward functions that we call STARC metrics.
We show that STARC metrics induce both an upper and a lower bound on worst-case regret.
We also identify a number of issues with reward metrics proposed by earlier works.
arXiv Detail & Related papers (2023-09-26T20:31:19Z) - Basis for Intentions: Efficient Inverse Reinforcement Learning using
Past Experience [89.30876995059168]
This paper addresses the problem of inverse reinforcement learning (IRL): inferring the reward function of an agent from observing its behavior.
arXiv Detail & Related papers (2022-08-09T17:29:49Z) - Probability Density Estimation Based Imitation Learning [11.262633728487165]
Imitation Learning (IL) is an effective learning paradigm exploiting the interactions between agents and environments.
In this work, a novel reward function based on probability density estimation is proposed for IRL.
We present a "watch-try-learn" style framework named Probability Density Estimation based Imitation Learning (PDEIL)
arXiv Detail & Related papers (2021-12-13T15:55:38Z) - Flow Network based Generative Models for Non-Iterative Diverse Candidate
Generation [110.09855163856326]
This paper is about the problem of learning a policy for generating an object from a sequence of actions.
We propose GFlowNet, based on a view of the generative process as a flow network.
We prove that any global minimum of the proposed objectives yields a policy which samples from the desired distribution.
arXiv Detail & Related papers (2021-06-08T14:21:10Z) - Subgoal-based Reward Shaping to Improve Efficiency in Reinforcement
Learning [7.6146285961466]
We extend potential-based reward shaping and propose a subgoal-based reward shaping method.
Our method makes it easier for human trainers to share their knowledge of subgoals.
arXiv Detail & Related papers (2021-04-13T14:28:48Z) - Replacing Rewards with Examples: Example-Based Policy Search via
Recursive Classification [133.20816939521941]
In the standard Markov decision process formalism, users specify tasks by writing down a reward function.
In many scenarios, the user is unable to describe the task in words or numbers, but can readily provide examples of what the world would look like if the task were solved.
Motivated by this observation, we derive a control algorithm that aims to visit states that have a high probability of leading to successful outcomes, given only examples of successful outcome states.
arXiv Detail & Related papers (2021-03-23T16:19:55Z) - f-IRL: Inverse Reinforcement Learning via State Marginal Matching [13.100127636586317]
We propose a method for learning the reward function (and the corresponding policy) to match the expert state density.
We present an algorithm, f-IRL, that recovers a stationary reward function from the expert density by gradient descent.
Our method outperforms adversarial imitation learning methods in terms of sample efficiency and the required number of expert trajectories.
arXiv Detail & Related papers (2020-11-09T19:37:48Z) - Provably Efficient Reward-Agnostic Navigation with Linear Value
Iteration [143.43658264904863]
We show how iteration under a more standard notion of low inherent Bellman error, typically employed in least-square value-style algorithms, can provide strong PAC guarantees on learning a near optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z) - Active Preference-Based Gaussian Process Regression for Reward Learning [42.697198807877925]
One common approach is to learn reward functions from collected expert demonstrations.
We present a preference-based learning approach, where as an alternative, the human feedback is only in the form of comparisons between trajectories.
Our approach enables us to tackle both inflexibility and data-inefficiency problems within a preference-based learning framework.
arXiv Detail & Related papers (2020-05-06T03:29:27Z) - Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
We propose a new "reward-free RL" framework to isolate the challenges of exploration.
We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration.
We also give a nearly-matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
arXiv Detail & Related papers (2020-02-07T14:03:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.