A Deep Reinforcement Learning Approach to Marginalized Importance
Sampling with the Successor Representation
- URL: http://arxiv.org/abs/2106.06854v2
- Date: Mon, 13 Nov 2023 21:19:45 GMT
- Title: A Deep Reinforcement Learning Approach to Marginalized Importance
Sampling with the Successor Representation
- Authors: Scott Fujimoto, David Meger, Doina Precup
- Abstract summary: Marginalized importance sampling (MIS) measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution.
We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy.
We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
- Score: 61.740187363451746
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Marginalized importance sampling (MIS), which measures the density ratio
between the state-action occupancy of a target policy and that of a sampling
distribution, is a promising approach for off-policy evaluation. However,
current state-of-the-art MIS methods rely on complex optimization tricks and
succeed mostly on simple toy problems. We bridge the gap between MIS and deep
reinforcement learning by observing that the density ratio can be computed from
the successor representation of the target policy. The successor representation
can be trained through deep reinforcement learning methodology and decouples
the reward optimization from the dynamics of the environment, making the
resulting algorithm stable and applicable to high-dimensional domains. We
evaluate the empirical performance of our approach on a variety of challenging
Atari and MuJoCo environments.
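Below is a minimal sketch, not the authors' implementation, of how the MIS-via-successor-representation idea described in the abstract could be instantiated. It assumes a fixed feature map phi(s, a), a linear parameterization w(s, a) = theta^T phi(s, a) of the density ratio, and a logged dataset of transitions from the sampling distribution; the names SuccessorRepresentation, fit_ratio_weights, and mis_estimate are illustrative rather than taken from the paper.
```python
import torch
import torch.nn as nn

class SuccessorRepresentation(nn.Module):
    """psi(s, a) ~= E_pi[ sum_t gamma^t phi(s_t, a_t) | s_0 = s, a_0 = a ]."""
    def __init__(self, state_dim, action_dim, feature_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feature_dim),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))


def sr_td_update(sr, sr_target, phi, batch, policy, gamma, optimizer):
    """One TD step on the SR: psi(s, a) <- phi(s, a) + gamma * psi(s', pi(s'))."""
    state, action, next_state = batch  # transitions sampled from the logged dataset D
    with torch.no_grad():
        target = phi(state, action) + gamma * sr_target(next_state, policy(next_state))
    loss = ((sr(state, action) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def fit_ratio_weights(sr, phi, states, actions, start_states, policy, gamma, reg=1e-3):
    """Fit theta so that w(s, a) = theta^T phi(s, a) ~= d_pi(s, a) / d_D(s, a), by
    minimizing 0.5 * E_D[(theta^T phi)^2] - (1 - gamma) * E_pi[theta^T psi(s0, a0)],
    which reduces to linear least squares once the SR is available."""
    with torch.no_grad():
        feats = phi(states, actions)                    # features under the sampling distribution D
        A = feats.t() @ feats / feats.shape[0]          # E_D[phi phi^T]
        psi0 = sr(start_states, policy(start_states))   # SR at initial states, actions from pi
        b = (1.0 - gamma) * psi0.mean(dim=0)            # (1 - gamma) * E[psi(s0, a0)]
    eye = torch.eye(A.shape[0])
    return torch.linalg.solve(A + reg * eye, b)         # theta = (A + reg * I)^{-1} b


def mis_estimate(theta, phi, states, actions, rewards, gamma):
    """Marginalized importance sampling estimate of the target policy's return."""
    with torch.no_grad():
        w = phi(states, actions) @ theta                # estimated density ratios
        return (w * rewards).mean().item() / (1.0 - gamma)
```
The step that decouples the ratio fitting from the environment dynamics is the identity E_{d_pi}[theta^T phi(s, a)] = (1 - gamma) E_{s0, a0 ~ pi}[theta^T psi(s0, a0)]: once the successor representation has been trained by TD, estimating the density ratio only requires averages over the logged data and the initial state distribution.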
Related papers
- Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning [11.988291170853806]
We introduce MaxMax Q-Learning (MMQ), which employs an iterative process of sampling and evaluating potential next states.
This approach refines approximations of ideal state transitions, aligning more closely with the optimal joint policy of collaborating agents.
Our results demonstrate that MMQ frequently outperforms existing baselines, exhibiting enhanced convergence and sample efficiency.
arXiv Detail & Related papers (2024-11-17T15:00:39Z)
- Multi-Agent Reinforcement Learning from Human Feedback: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Multi-Agent Reinforcement Learning from Human Feedback (MARLHF), exploring both theoretical foundations and empirical validations.
We define the task as identifying Nash equilibrium from a preference-only offline dataset in general-sum games.
Our findings underscore the multifaceted approach required for MARLHF, paving the way for effective preference-based multi-agent systems.
arXiv Detail & Related papers (2024-09-01T13:14:41Z)
- Mimicking Better by Matching the Approximate Action Distribution [48.95048003354255]
We introduce MAAD, a novel, sample-efficient on-policy algorithm for Imitation Learning from Observations.
We show that it requires considerably fewer interactions to achieve expert performance, outperforming current state-of-the-art on-policy methods.
arXiv Detail & Related papers (2023-06-16T12:43:47Z)
- PACER: A Fully Push-forward-based Distributional Reinforcement Learning Algorithm [28.48626438603237]
PACER consists of a distributional critic, an actor and a sample-based encourager.
Push-forward operator is leveraged in both the critic and actor to model the return distributions and policies respectively.
A sample-based utility value policy gradient is established for the push-forward policy update.
arXiv Detail & Related papers (2023-06-11T09:45:31Z)
- Context-Aware Bayesian Network Actor-Critic Methods for Cooperative Multi-Agent Reinforcement Learning [7.784991832712813]
We introduce a Bayesian network to induce correlations between agents' action selections in their joint policy.
We develop practical algorithms to learn the context-aware Bayesian network policies.
Empirical results on a range of MARL benchmarks show the benefits of our approach.
arXiv Detail & Related papers (2023-06-02T21:22:27Z)
- Offline Reinforcement Learning with Instrumental Variables in Confounded Markov Decision Processes [93.61202366677526]
We study offline reinforcement learning (RL) in the face of unmeasured confounders.
We propose several policy learning methods with finite-sample suboptimality guarantees for finding the optimal in-class policy.
arXiv Detail & Related papers (2022-09-18T22:03:55Z)
- Imitation Learning by State-Only Distribution Matching [2.580765958706854]
Imitation learning from observation describes policy learning in a manner similar to human learning.
We propose a non-adversarial learning-from-observations approach, together with an interpretable convergence and performance metric.
arXiv Detail & Related papers (2022-02-09T08:38:50Z)
- Reliable Off-policy Evaluation for Reinforcement Learning [53.486680020852724]
In a sequential decision-making problem, off-policy evaluation estimates the expected cumulative reward of a target policy.
We propose a novel framework that provides robust and optimistic cumulative reward estimates using one or multiple logged datasets.
arXiv Detail & Related papers (2020-11-08T23:16:19Z)
- Stable Policy Optimization via Off-Policy Divergence Regularization [50.98542111236381]
Trust Region Policy Optimization (TRPO) and Proximal Policy Optimization (PPO) are among the most successful policy gradient approaches in deep reinforcement learning (RL).
We propose a new algorithm which stabilizes the policy improvement through a proximity term that constrains the discounted state-action visitation distribution induced by consecutive policies to be close to one another.
Our proposed method can have a beneficial effect on stability and improve final performance in benchmark high-dimensional control tasks.
arXiv Detail & Related papers (2020-03-09T13:05:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all summaries) and is not responsible for any consequences of its use.