Efficient Imitation under Misspecification
- URL: http://arxiv.org/abs/2503.13162v2
- Date: Wed, 02 Apr 2025 16:32:52 GMT
- Title: Efficient Imitation under Misspecification
- Authors: Nicolas Espinosa-Dice, Sanjiban Choudhury, Wen Sun, Gokul Swamy
- Abstract summary: We consider the problem of imitation learning under misspecification. We study inverse reinforcement learning algorithms that merely perform a computationally efficient local search procedure and that carry strong guarantees in the realizable setting. We prove that in the misspecified setting, it is beneficial to broaden the set of states on which local search is performed to include those reachable by good policies the learner can actually play.
- Score: 17.706710359787056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We consider the problem of imitation learning under misspecification: settings where the learner is fundamentally unable to replicate expert behavior everywhere. This is often true in practice due to differences in observation space and action space expressiveness (e.g. perceptual or morphological differences between robots and humans). Given the learner must make some mistakes in the misspecified setting, interaction with the environment is fundamentally required to figure out which mistakes are particularly costly and lead to compounding errors. However, given the computational cost and safety concerns inherent in interaction, we'd like to perform as little of it as possible while ensuring we've learned a strong policy. Accordingly, prior work has proposed a flavor of efficient inverse reinforcement learning algorithms that merely perform a computationally efficient local search procedure with strong guarantees in the realizable setting. We first prove that under a novel structural condition we term reward-agnostic policy completeness, these sorts of local-search based IRL algorithms are able to avoid compounding errors. We then consider the question of where we should perform local search in the first place, given the learner may not be able to "walk on a tightrope" as well as the expert in the misspecified setting. We prove that in the misspecified setting, it is beneficial to broaden the set of states on which local search is performed to include those reachable by good policies the learner can actually play. We then experimentally explore a variety of sources of misspecification and how offline data can be used to effectively broaden where we perform local search from.
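As a concrete illustration of the proposal, the following is a minimal, self-contained Python sketch, not the authors' algorithm: the tabular MDP, the greedy improvement step, and the reward update rule are all invented stand-ins. It alternates an adversarial reward update with local policy search, drawing reset states from a broadened distribution that mixes expert states with offline states reachable by good, playable policies.

```python
# Toy sketch of local-search IRL with a broadened reset distribution.
# Everything below (dynamics, update rules) is invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 8, 3, 10                            # tiny tabular MDP (assumed)
P = rng.dirichlet(np.ones(S), size=(S, A))    # transition probabilities
expert_states  = [0, 1, 2]                    # states visited by expert demos
offline_states = [0, 1, 2, 5, 6]              # states good playable policies reach

reward = np.zeros(S)                          # adversarially chosen reward
policy = rng.integers(A, size=S)              # deterministic tabular policy

def rollout(policy, s0):
    s, states = s0, [s0]
    for _ in range(H):
        s = rng.choice(S, p=P[s, policy[s]])
        states.append(int(s))
    return states

for it in range(200):
    # Broadened reset distribution: mix expert and offline states.
    pool = expert_states if rng.random() < 0.5 else offline_states
    s0 = int(rng.choice(pool))
    # Local search: greedy one-step policy improvement along a rollout from s0.
    for s in set(rollout(policy, s0)):
        q = np.array([P[s, a] @ reward for a in range(A)])
        policy[s] = int(np.argmax(q))
    # Reward update: raise reward on expert states, lower it where the learner
    # goes (a crude stand-in for the adversarial IRL reward player).
    learner_visits = rollout(policy, int(rng.choice(expert_states)))
    reward[expert_states] += 0.1
    reward[learner_visits] -= 0.1 / len(learner_visits)
```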
Related papers
- Probably Approximately Precision and Recall Learning [62.912015491907994]
Precision and Recall are foundational metrics in machine learning.
One-sided feedback, where only positive examples are observed during training, is inherent in many practical problems.
We introduce a PAC learning framework where each hypothesis is represented by a graph, with edges indicating positive interactions.
arXiv Detail & Related papers (2024-11-20T04:21:07Z) - Is Behavior Cloning All You Need? Understanding Horizon in Imitation Learning [26.53136644321385]
Imitation learning (IL) aims to mimic the behavior of an expert in a sequential decision making task by learning from demonstrations. We show that it is possible to achieve horizon-independent sample complexity in offline IL whenever the range of the cumulative payoffs is controlled. Specializing to deterministic, stationary policies, we show that the gap between offline and online IL is smaller than previously thought.
arXiv Detail & Related papers (2024-07-20T23:31:56Z) - RLIF: Interactive Imitation Learning as Reinforcement Learning [56.997263135104504]
We show how off-policy reinforcement learning can enable improved performance under assumptions that are similar but potentially even more practical than those of interactive imitation learning.
Our proposed method uses reinforcement learning with user intervention signals themselves as rewards.
This relaxes the assumption that intervening experts in interactive imitation learning should be near-optimal and enables the algorithm to learn behaviors that improve over the potentially suboptimal human expert.
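A minimal sketch of the reward relabeling this describes, under assumed data structures (the tuple format and the downstream off-policy learner are placeholders, not the RLIF implementation):

```python
# Treat the *event* of a human intervention as a negative reward and hand the
# relabeled transitions to any off-policy RL learner (sketch, not RLIF code).
def relabel_with_interventions(transitions):
    """transitions: list of (s, a, s_next, intervened) tuples collected while
    running the learner with a human supervisor who can take over."""
    relabeled = []
    for s, a, s_next, intervened in transitions:
        r = -1.0 if intervened else 0.0   # the intervention itself is the signal
        relabeled.append((s, a, r, s_next))
    return relabeled

# The relabeled batch can then feed a standard off-policy learner, e.g.
# replay_buffer.extend(relabel_with_interventions(batch)).
```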
arXiv Detail & Related papers (2023-11-21T21:05:21Z) - A Simple Solution for Offline Imitation from Observations and Examples with Possibly Incomplete Trajectories [122.11358440078581]
Offline imitation is useful in real-world scenarios where arbitrary interactions are costly and expert actions are unavailable.
We propose Trajectory-Aware Learning from Observations (TAILO) to solve MDPs where only task-specific expert states and task-agnostic non-expert state-action pairs are available.
arXiv Detail & Related papers (2023-11-02T15:41:09Z) - Diversity Through Exclusion (DTE): Niche Identification for Reinforcement Learning through Value-Decomposition [63.67574523750839]
We propose a generic reinforcement learning (RL) algorithm that performs better than baseline deep Q-learning algorithms in environments with multiple variably-valued niches.
We show that agents trained this way can escape poor-but-attractive local optima to instead converge to harder-to-discover higher value strategies.
arXiv Detail & Related papers (2023-02-02T16:00:19Z) - Policy learning "without" overlap: Pessimism and generalized empirical Bernstein's inequality [94.89246810243053]
This paper studies offline policy learning, which aims at utilizing observations collected a priori to learn an optimal individualized decision rule.
Existing policy learning methods rely on a uniform overlap assumption, i.e., the propensities of exploring all actions for all individual characteristics must be lower bounded.
We propose Pessimistic Policy Learning (PPL), a new algorithm that optimizes lower confidence bounds (LCBs) instead of point estimates.
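A minimal sketch of what optimizing an LCB instead of a point estimate can look like, assuming an importance-sampling value estimator and a Maurer-Pontil-style empirical Bernstein deviation term; this illustrates the principle, not the paper's PPL algorithm:

```python
# Pessimistic policy selection via lower confidence bounds (illustrative sketch).
import numpy as np

def lcb_policy_value(rewards, propensities, pi_probs, delta=0.05):
    """Inverse-propensity-weighted value of a candidate policy, penalized by an
    empirical-Bernstein-style deviation term so that rarely-explored actions
    (small propensities) are judged pessimistically."""
    x = (pi_probs / propensities) * rewards    # importance-weighted rewards
    n = len(x)
    mean, var = x.mean(), x.var(ddof=1)
    bonus = np.sqrt(2 * var * np.log(2 / delta) / n) \
            + 7 * (x.max() - x.min()) * np.log(2 / delta) / (3 * (n - 1))
    return mean - bonus                        # lower confidence bound

# Pick the candidate with the largest LCB rather than the largest point
# estimate: best = max(candidates, key=lambda pi: lcb_policy_value(...)).
```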
arXiv Detail & Related papers (2022-12-19T22:43:08Z) - Deconfounding Imitation Learning with Variational Inference [19.99248795957195]
Standard imitation learning can fail when the expert demonstrators have different sensory inputs than the imitating agent.
This is because partial observability gives rise to hidden confounders in the causal graph.
We propose to train a variational inference model to infer the expert's latent information and use it to train a latent-conditional policy.
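A minimal PyTorch-style sketch of the idea, with invented shapes and module names rather than the paper's architecture: an inference network encodes a demonstration into a latent that summarizes the expert's hidden information, and the policy conditions on a reparameterized sample of that latent.

```python
import torch, torch.nn as nn

class Deconfounder(nn.Module):
    def __init__(self, obs_dim, act_dim, z_dim=8):
        super().__init__()
        self.encoder = nn.GRU(obs_dim + act_dim, 2 * z_dim, batch_first=True)
        self.policy = nn.Sequential(nn.Linear(obs_dim + z_dim, 64),
                                    nn.ReLU(), nn.Linear(64, act_dim))

    def forward(self, obs_seq, act_seq):
        # q(z | trajectory): infer the expert's hidden information.
        _, h = self.encoder(torch.cat([obs_seq, act_seq], dim=-1))
        mu, log_std = h[-1].chunk(2, dim=-1)
        z = mu + log_std.exp() * torch.randn_like(mu)   # reparameterized sample
        # Latent-conditional policy: pi(a | o, z), with z broadcast over time.
        z_seq = z.unsqueeze(1).expand(-1, obs_seq.size(1), -1)
        return self.policy(torch.cat([obs_seq, z_seq], dim=-1))
```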
arXiv Detail & Related papers (2022-11-04T18:00:02Z) - Offline Policy Optimization with Eligible Actions [34.4530766779594]
Offline policy optimization could have a large impact on many real-world decision-making problems.
Importance sampling and its variants are a commonly used type of estimator in offline policy evaluation.
We propose an algorithm to avoid this overfitting through a new per-state-neighborhood normalization constraint.
arXiv Detail & Related papers (2022-07-01T19:18:15Z) - Data augmentation for efficient learning from parametric experts [88.33380893179697]
We focus on what we call the policy cloning setting, in which we use online or offline queries of an expert to inform the behavior of a student policy.
Our approach, augmented policy cloning (APC), uses synthetic states to induce feedback-sensitivity in a region around sampled trajectories.
We achieve highly data-efficient transfer of behavior from an expert to a student policy for high-degree-of-freedom control problems.
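A minimal sketch of this style of augmentation, where the Gaussian noise model and helper names are assumptions rather than the APC implementation:

```python
# Around each visited state, add synthetic nearby states and label them by
# querying the expert, so the student learns how the expert's actions change
# locally (feedback sensitivity). Illustrative sketch only.
import numpy as np

def augment_with_synthetic_states(states, expert, n_aug=4, sigma=0.05):
    """states: array (N, d) of states visited by the student.
    expert: callable mapping a state to the expert's action."""
    dataset = []
    for s in states:
        dataset.append((s, expert(s)))
        for _ in range(n_aug):
            s_syn = s + sigma * np.random.randn(*s.shape)  # nearby synthetic state
            dataset.append((s_syn, expert(s_syn)))          # expert feedback there
    return dataset   # train the student policy on this augmented dataset
```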
arXiv Detail & Related papers (2022-05-23T16:37:16Z) - MURAL: Meta-Learning Uncertainty-Aware Rewards for Outcome-Driven Reinforcement Learning [65.52675802289775]
We show that an uncertainty-aware classifier can solve challenging reinforcement learning problems.
We propose a novel method for computing the normalized maximum likelihood (NML) distribution.
We show that the resulting algorithm has a number of intriguing connections to both count-based exploration methods and prior algorithms for learning reward functions.
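A minimal sketch of the naive conditional NML computation that MURAL approximates with meta-learning; the refit-per-query loop below is the expensive exact version, and it assumes the labeled data contains both classes:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def cnml_success_probability(X, y, x_query):
    """X: (n, d) states, y: (n,) binary success labels, x_query: (d,) state."""
    likelihoods = []
    for label in (0, 1):
        # Append the query with each hypothetical label and refit.
        clf = LogisticRegression().fit(np.vstack([X, x_query[None]]),
                                       np.append(y, label))
        likelihoods.append(clf.predict_proba(x_query[None])[0, label])
    # Normalize: on novel states both hypothetical fits look plausible, so the
    # probability sits near 0.5, acting like a count-based exploration bonus
    # when used as a reward.
    return likelihoods[1] / (likelihoods[0] + likelihoods[1])
```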
arXiv Detail & Related papers (2021-07-15T08:19:57Z) - DERAIL: Diagnostic Environments for Reward And Imitation Learning [9.099589602551573]
We develop a suite of diagnostic tasks that test individual facets of algorithm performance in isolation.
Results confirm that algorithm performance is highly sensitive to implementation details.
A case study shows how the suite can pinpoint design flaws and rapidly evaluate candidate solutions.
arXiv Detail & Related papers (2020-12-02T18:07:09Z) - Learning to Actively Learn: A Robust Approach [14.46867518436922]
This work proposes a procedure for designing algorithms for adaptive data collection tasks like active learning and pure-exploration multi-armed bandits.
Our adaptive algorithm is learned via adversarial training over equivalence classes of problems derived from information theoretic lower bounds.
We perform synthetic experiments to justify the stability and effectiveness of the training procedure, and then evaluate the method on tasks derived from real data.
arXiv Detail & Related papers (2020-10-29T06:48:22Z) - Provably Efficient Reward-Agnostic Navigation with Linear Value Iteration [143.43658264904863]
We show how, under a more standard notion of low inherent Bellman error, typically employed in least-squares value-iteration-style algorithms, we can provide strong PAC guarantees on learning a near-optimal value function.
We present a computationally tractable algorithm for the reward-free setting and show how it can be used to learn a near-optimal policy for any (linear) reward function.
arXiv Detail & Related papers (2020-08-18T04:34:21Z) - LoCo: Local Contrastive Representation Learning [93.98029899866866]
We show that by overlapping local blocks that are stacked on top of each other, we effectively increase the decoder depth and allow upper blocks to implicitly send feedback to lower blocks.
This simple design closes the performance gap between local learning and end-to-end contrastive learning algorithms for the first time.
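A minimal PyTorch sketch of the overlapping-update idea, a simplification rather than the LoCo training code: each local loss backpropagates through the current block and the one directly below it, so upper stages implicitly send feedback one block down.

```python
import torch, torch.nn as nn

d, n_blocks, n_classes = 32, 4, 10
blocks = nn.ModuleList([nn.Sequential(nn.Linear(d, d), nn.ReLU())
                        for _ in range(n_blocks)])
heads = nn.ModuleList([nn.Linear(d, n_classes) for _ in range(n_blocks)])

def loco_step(x, y, loss_fn=nn.CrossEntropyLoss()):
    """x: (batch, d) features, y: (batch,) integer labels."""
    lower_in = x                             # detached activation two blocks down
    for i in range(n_blocks):
        # Overlap: stage i re-applies block i-1 (when it exists), so that block
        # receives gradients from both its own and the next stage's local loss.
        mid = blocks[i - 1](lower_in) if i > 0 else x
        out = blocks[i](mid)
        loss_fn(heads[i](out), y).backward()
        lower_in = mid.detach()              # cut the graph below the overlap
    # An optimizer.step() over all parameters then applies the local updates.
```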
arXiv Detail & Related papers (2020-08-04T05:41:29Z) - Strictly Batch Imitation Learning by Energy-based Distribution Matching [104.33286163090179]
Consider learning a policy purely on the basis of demonstrated behavior -- that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment.
One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting.
But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient.
We argue that a good solution should be able to explicitly parameterize a policy, implicitly learn from rollout dynamics, and operate in an entirely offline fashion.
arXiv Detail & Related papers (2020-06-25T03:27:59Z) - Guided Uncertainty-Aware Policy Optimization: Combining Learning and Model-Based Strategies for Sample-Efficient Policy Learning [75.56839075060819]
Traditional robotic approaches rely on an accurate model of the environment, a detailed description of how to perform the task, and a robust perception system to keep track of the current state.
Reinforcement learning approaches can operate directly from raw sensory inputs with only a reward signal to describe the task, but are extremely sample-inefficient and brittle.
In this work, we combine the strengths of model-based methods with the flexibility of learning-based methods to obtain a general method that is able to overcome inaccuracies in the robotics perception/actuation pipeline.
arXiv Detail & Related papers (2020-05-21T19:47:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.