Policy Learning Using Weak Supervision
- URL: http://arxiv.org/abs/2010.01748v3
- Date: Tue, 2 Nov 2021 13:59:27 GMT
- Title: Policy Learning Using Weak Supervision
- Authors: Jingkang Wang, Hongyi Guo, Zhaowei Zhu, Yang Liu
- Abstract summary: We aim for a unified framework that leverages available, cheap weak supervision to perform policy learning efficiently.
Our approach explicitly punishes a policy for overfitting to the weak supervision.
In addition to theoretical guarantees, extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training show that our method leads to substantial performance improvements.
- Score: 18.540550726629995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most existing policy learning solutions require the learning agents to receive high-quality supervision signals, such as well-designed rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). Such high-quality supervision is usually infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages available, cheap weak supervision to perform policy learning efficiently. To this end, we treat the "weak supervision" as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a "correlated agreement" with the peer agent's policy (instead of simple agreement). Our approach explicitly punishes a policy for overfitting to the weak supervision. In addition to theoretical guarantees, extensive evaluations on tasks including RL with noisy rewards, BC with weak demonstrations, and standard policy co-training show that our method leads to substantial performance improvements, especially when the complexity or the noise of the learning environment is high.
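The "correlated agreement" evaluation can be read as a peer-loss-style correction: credit the policy for agreeing with the weak signal on each matched sample, then subtract its agreement with a randomly re-paired sample, so a policy that blindly copies the weak supervision (or merely its marginal label distribution) earns no extra credit. Below is a minimal PyTorch sketch for the BC-with-weak-demonstrations case; the function name, the weight `xi`, and the use of cross-entropy as the agreement score are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def correlated_agreement_loss(logits: torch.Tensor,
                              weak_actions: torch.Tensor,
                              xi: float = 1.0) -> torch.Tensor:
    """Peer-loss-style BC objective (illustrative sketch, not the paper's code).

    logits:       (B, A) action logits from the learning policy
    weak_actions: (B,)   possibly noisy demonstrated actions (the weak signal)
    xi:           weight on the punishment (peer) term
    """
    # Agreement term: standard cross-entropy against the weak labels;
    # minimizing it rewards agreement with the weak supervision.
    agree = F.cross_entropy(logits, weak_actions)

    # Peer term: re-pair each state with a weak label drawn from a randomly
    # chosen other sample. Agreement here only reflects the marginal label
    # distribution, so it is subtracted to punish blind overfitting.
    perm = torch.randperm(weak_actions.size(0), device=weak_actions.device)
    peer = F.cross_entropy(logits, weak_actions[perm])

    return agree - xi * peer
```

With `xi = 0` this reduces to ordinary behavioral cloning on the weak demonstrations; increasing `xi` strengthens the penalty for fitting the weak labels no better than a chance pairing would.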
Related papers
- Scaling Policy Compliance Assessment in Language Models with Policy Reasoning Traces [12.671657542087624]
Policy Reasoning Traces (PRTs) are specialized generated reasoning chains that serve as a reasoning bridge to improve an LLM's policy compliance assessment capabilities. Our empirical evaluations demonstrate that the use of PRTs in both inference-time and training-time scenarios significantly enhances the performance of open-weight and commercial models.
arXiv Detail & Related papers (2025-09-27T13:10:21Z)
- Guided Policy Optimization under Partial Observability [36.853129816484845]
Reinforcement Learning (RL) in partially observable environments poses significant challenges due to the complexity of learning under uncertainty. We introduce Guided Policy Optimization (GPO), a framework that co-trains a guider and a learner. We theoretically demonstrate that this learning scheme achieves optimality comparable to direct RL, thereby overcoming key limitations inherent in existing approaches.
arXiv Detail & Related papers (2025-05-21T12:01:08Z)
- Provable Zero-Shot Generalization in Offline Reinforcement Learning [55.169228792596805]
We study offline reinforcement learning with the zero-shot generalization (ZSG) property.
Existing work showed that classical offline RL fails to generalize to new, unseen environments.
We show that both PERM and PPPO are capable of finding a near-optimal policy with ZSG.
arXiv Detail & Related papers (2025-03-11T02:44:32Z)
- RILe: Reinforced Imitation Learning [60.63173816209543]
RILe is a novel trainer-student system that learns a dynamic reward function based on the student's performance and alignment with expert demonstrations.
RILe enables better performance in complex settings where traditional methods falter, outperforming existing methods by 2x in complex simulated robot-locomotion tasks.
arXiv Detail & Related papers (2024-06-12T17:56:31Z)
- Belief-Enriched Pessimistic Q-Learning against Adversarial State Perturbations [5.076419064097735]
Recent work shows that a well-trained RL agent can be easily manipulated by strategically perturbing its state observations at the test stage.
Existing solutions either introduce a regularization term to improve the smoothness of the trained policy against perturbations or alternately train the agent's policy and the attacker's policy.
We propose a new robust RL algorithm that derives a pessimistic policy to safeguard against the agent's uncertainty about the true states.
arXiv Detail & Related papers (2024-03-06T20:52:49Z)
- Adversarially Guided Subgoal Generation for Hierarchical Reinforcement Learning [5.514236598436977]
We propose a novel HRL approach that mitigates non-stationarity by adversarially forcing the high-level policy to generate subgoals compatible with the current instantiation of the low-level policy.
Experiments with state-of-the-art algorithms show that our approach significantly improves learning efficiency and overall performance of HRL in various challenging continuous control tasks.
arXiv Detail & Related papers (2022-01-24T12:30:38Z)
- Off-policy Reinforcement Learning with Optimistic Exploration and Distribution Correction [73.77593805292194]
We train a separate exploration policy to maximize an approximate upper confidence bound of the critics in an off-policy actor-critic framework.
To mitigate the off-policy-ness, we adapt the recently introduced DICE framework to learn a distribution correction ratio for off-policy actor-critic training.
arXiv Detail & Related papers (2021-10-22T22:07:51Z)
- Robust Reinforcement Learning on State Observations with Learned Optimal Adversary [86.0846119254031]
We study the robustness of reinforcement learning with adversarially perturbed state observations.
With a fixed agent policy, we demonstrate that an optimal adversary to perturb state observations can be found.
For DRL settings, this leads to a novel empirical adversarial attack on RL agents via a learned adversary that is much stronger than previous ones.
arXiv Detail & Related papers (2021-01-21T05:38:52Z)
- Robust Deep Reinforcement Learning through Adversarial Loss [74.20501663956604]
Recent studies have shown that deep reinforcement learning agents are vulnerable to small adversarial perturbations on the agent's inputs.
We propose RADIAL-RL, a principled framework to train reinforcement learning agents with improved robustness against adversarial attacks.
arXiv Detail & Related papers (2020-08-05T07:49:42Z)
- Robust Deep Reinforcement Learning against Adversarial Perturbations on State Observations [88.94162416324505]
A deep reinforcement learning (DRL) agent observes its states through observations, which may contain natural measurement errors or adversarial noises.
Since the observations deviate from the true states, they can mislead the agent into making suboptimal actions.
We show that naively applying existing techniques on improving robustness for classification tasks, like adversarial training, is ineffective for many RL tasks.
arXiv Detail & Related papers (2020-03-19T17:59:59Z)
- Reward-Conditioned Policies [100.64167842905069]
Imitation learning requires near-optimal expert data.
Can we learn effective policies via supervised learning without demonstrations?
We show how such an approach can be derived as a principled method for policy search.
arXiv Detail & Related papers (2019-12-31T18:07:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.