Automatic Reward Shaping from Confounded Offline Data
- URL: http://arxiv.org/abs/2505.11478v1
- Date: Fri, 16 May 2025 17:40:01 GMT
- Title: Automatic Reward Shaping from Confounded Offline Data
- Authors: Mingxuan Li, Junzhe Zhang, Elias Bareinboim
- Abstract summary: Building on the well-celebrated Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm robust to confounding biases in observed data. We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed input to the behavioral and target policies mismatch and unobserved confounders exist.
- Score: 69.11672390876763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A key task in Artificial Intelligence is learning effective policies for controlling agents in unknown environments to optimize performance measures. Off-policy learning methods, like Q-learning, allow learners to make optimal decisions based on past experiences. This paper studies off-policy learning from biased data in complex and high-dimensional domains where unobserved confounding cannot be ruled out a priori. Building on the well-celebrated Deep Q-Network (DQN), we propose a novel deep reinforcement learning algorithm robust to confounding biases in observed data. Specifically, our algorithm attempts to find a safe policy for the worst-case environment compatible with the observations. We apply our method to twelve confounded Atari games, and find that it consistently dominates the standard DQN in all games where the observed input to the behavioral and target policies mismatch and unobserved confounders exist.
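The abstract describes the algorithm only at a high level: optimize for the worst-case environment compatible with the confounded observations. As context for that idea, here is a generic sketch of a pessimistic Q-learning target that penalizes the bootstrapped value where data is scarce; the function name, the penalty form, and the count-based uncertainty proxy are illustrative assumptions, not the authors' actual method:

```python
import numpy as np

def pessimistic_q_target(q_next, rewards, dones, gamma=0.99, penalty=0.1, counts=None):
    """Illustrative pessimistic target: subtract an uncertainty penalty
    from the bootstrapped value so the learner plans for a worst case.

    q_next:  (batch, n_actions) Q-values at the next states
    rewards: (batch,) observed rewards
    dones:   (batch,) 1.0 where the episode terminated, else 0.0
    counts:  optional (batch,) visit counts used to scale the penalty
    """
    best_next = q_next.max(axis=1)
    if counts is not None:
        # Larger penalty where data is scarce (count-based uncertainty proxy).
        best_next = best_next - penalty / np.sqrt(np.maximum(counts, 1))
    else:
        best_next = best_next - penalty
    # Standard Q-learning bootstrap, with terminal states masked out.
    return rewards + gamma * (1.0 - dones) * best_next
```

A robust update like this trades some average-case return for a guarantee against environments consistent with the biased data; the paper's construction of the compatible-environment set is the substantive contribution and is not captured by this sketch.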
Related papers
- UNIQ: Offline Inverse Q-learning for Avoiding Undesirable Demonstrations [11.666700714916065]
We address the problem of learning a policy offline that avoids undesirable demonstrations.
We formulate the learning task as maximizing a statistical distance between the learning policy and the undesirable policy.
Our algorithm, UNIQ, tackles these challenges by building on the inverse Q-learning framework.
arXiv Detail & Related papers (2024-10-10T18:52:58Z)
- No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery [53.08822154199948]
Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks.
This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics.
We develop a method that directly trains on scenarios with high learnability.
arXiv Detail & Related papers (2024-08-27T14:31:54Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Privacy Risks in Reinforcement Learning for Household Robots [42.675213619562975]
Privacy emerges as a pivotal concern within the realm of embodied AI, as the robot accesses substantial personal information. This paper proposes an attack on the training process of the value-based algorithm and the gradient-based algorithm, utilizing gradient inversion to reconstruct states, actions, and supervisory signals.
arXiv Detail & Related papers (2023-06-15T16:53:26Z)
- A Unified Framework of Policy Learning for Contextual Bandit with Confounding Bias and Missing Observations [108.89353070722497]
We study the offline contextual bandit problem, where we aim to acquire an optimal policy using observational data.
We present a new algorithm called Causal-Adjusted Pessimistic (CAP) policy learning, which forms the reward function as the solution of an integral equation system.
arXiv Detail & Related papers (2023-03-20T15:17:31Z)
- Statistically Efficient Advantage Learning for Offline Reinforcement Learning in Infinite Horizons [16.635744815056906]
We consider reinforcement learning methods in offline domains without additional online data collection, such as mobile health applications.
The proposed method takes an optimal Q-estimator computed by any existing state-of-the-art RL algorithms as input, and outputs a new policy whose value is guaranteed to converge at a faster rate than the policy derived based on the initial Q-estimator.
arXiv Detail & Related papers (2022-02-26T15:29:46Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids just learning a mediocre behavior on mixed datasets but is also even competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- Combining Online Learning and Offline Learning for Contextual Bandits with Deficient Support [53.11601029040302]
Current offline-policy learning algorithms are mostly based on inverse propensity score (IPS) weighting.
We propose a novel approach that uses a hybrid of offline learning with online exploration.
Our approach determines an optimal policy with theoretical guarantees using the minimal number of online explorations.
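For context on the baseline this entry contrasts with, inverse propensity score (IPS) weighting estimates a target policy's value from logged data by reweighting each observed reward by the ratio of target to logging action probabilities. A minimal sketch, with a variance-taming clip threshold (the function and variable names are illustrative):

```python
import numpy as np

def ips_value(rewards, target_probs, logging_probs, clip=10.0):
    """Off-policy value estimate via clipped inverse propensity scoring.

    rewards:       observed rewards for the logged actions
    target_probs:  probability the target policy takes each logged action
    logging_probs: probability the logging policy took it (the propensity)
    """
    # Importance weights; clipping bounds the variance at the cost of bias.
    weights = np.minimum(target_probs / logging_probs, clip)
    return float(np.mean(weights * rewards))
```

IPS is unbiased only when the logging policy gives every target-policy action nonzero probability; deficient support, the failure mode this paper targets, is precisely when that condition breaks.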
arXiv Detail & Related papers (2021-07-24T05:07:43Z)
- IQ-Learn: Inverse soft-Q Learning for Imitation [95.06031307730245]
Imitation learning from a small amount of expert data can be challenging in high-dimensional environments with complex dynamics.
Behavioral cloning is a simple method that is widely used due to its simplicity of implementation and stable convergence.
We introduce a method for dynamics-aware IL which avoids adversarial training by learning a single Q-function.
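Behavioral cloning, the simple baseline mentioned above, reduces imitation to supervised learning over expert state-action pairs. A minimal discrete-state sketch using a tabulated majority vote rather than a neural network, purely for illustration:

```python
from collections import Counter, defaultdict

def clone_policy(demonstrations):
    """Fit a behavioral-cloning policy on (state, action) pairs by
    picking the most frequent expert action for each state."""
    by_state = defaultdict(Counter)
    for state, action in demonstrations:
        by_state[state][action] += 1
    # Majority vote per state stands in for a fitted classifier.
    return {s: c.most_common(1)[0][0] for s, c in by_state.items()}

demos = [("s0", "left"), ("s0", "left"), ("s0", "right"), ("s1", "up")]
policy = clone_policy(demos)
```

Because it ignores the environment's dynamics, a cloned policy can compound errors once it drifts to states absent from the demonstrations; dynamics-aware approaches like the one in this entry are motivated by exactly that weakness.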
arXiv Detail & Related papers (2021-06-23T03:43:10Z)
- ConQUR: Mitigating Delusional Bias in Deep Q-learning [45.21332566843924]
Delusional bias is a fundamental source of error in approximate Q-learning.
We develop efficient methods to mitigate delusional bias by training Q-approximators with labels that are "consistent" with the underlying greedy policy class.
arXiv Detail & Related papers (2020-02-27T19:22:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.