STRAPPER: Preference-based Reinforcement Learning via Self-training
Augmentation and Peer Regularization
- URL: http://arxiv.org/abs/2307.09692v1
- Date: Wed, 19 Jul 2023 00:31:58 GMT
- Title: STRAPPER: Preference-based Reinforcement Learning via Self-training
Augmentation and Peer Regularization
- Authors: Yachen Kang, Li He, Jinxin Liu, Zifeng Zhuang, Donglin Wang
- Abstract summary: Preference-based reinforcement learning (PbRL) promises to learn a complex reward function from binary human preferences.
We present a self-training method together with our proposed peer regularization, which penalizes the reward model for memorizing uninformative labels and encourages confident predictions.
- Score: 18.811470043767713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preference-based reinforcement learning (PbRL) promises to learn a complex
reward function from binary human preferences. However, such a human-in-the-loop
formulation requires considerable human effort to assign preference labels to
segment pairs, hindering its large-scale application. Recent approaches have
tried to reuse unlabeled segments, which implicitly elucidates the distribution
of segments and thereby reduces the human effort. Consistency regularization
has further been considered to improve the performance of semi-supervised
learning. However, we notice that, unlike in general classification tasks,
PbRL exhibits a unique phenomenon that we define as the similarity trap in this
paper. Intuitively, humans can have diametrically opposite preferences for
similar segment pairs, and this similarity can cause consistency regularization
to fail in PbRL. Because of the similarity trap, consistency regularization
improperly enforces consistent model predictions across segment pairs and thus
reduces the confidence of reward learning, since the augmented distribution does
not match the original one in PbRL. To overcome this issue, we present a
self-training method together with our proposed peer regularization, which
penalizes the reward model for memorizing uninformative labels and encourages
confident predictions. Empirically, we demonstrate that our approach learns
a variety of locomotion and robotic manipulation behaviors well, using
different semi-supervised alternatives combined with peer regularization.
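To make the setting concrete, below is a minimal sketch of how a reward model is typically trained from binary preferences in PbRL (a Bradley-Terry model over segment pairs), together with a peer-regularization-style penalty computed on randomly re-paired labels. The class and function names, the network architecture, and the exact form of the penalty are illustrative assumptions for exposition, not the paper's actual STRAPPER implementation, which additionally uses the self-training augmentation described above.

```python
# Minimal sketch (not the authors' code): Bradley-Terry reward learning from
# binary segment preferences, plus a peer-loss-style penalty that removes the
# payoff for memorizing uninformative labels. Names and penalty form are
# illustrative assumptions.
import torch
import torch.nn as nn


class RewardNet(nn.Module):
    """Maps flattened (observation, action) features to a scalar reward."""

    def __init__(self, obs_act_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, segment: torch.Tensor) -> torch.Tensor:
        # segment: (batch, segment_len, obs_act_dim) -> per-segment return (batch,)
        return self.net(segment).sum(dim=1).squeeze(-1)


def preference_loss(reward_net: RewardNet, seg0: torch.Tensor,
                    seg1: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry cross-entropy; label 1 means seg1 is preferred over seg0."""
    logits = torch.stack([reward_net(seg0), reward_net(seg1)], dim=1)  # (batch, 2)
    return nn.functional.cross_entropy(logits, labels)


def peer_regularized_loss(reward_net: RewardNet, seg0: torch.Tensor,
                          seg1: torch.Tensor, labels: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """Peer-loss-style regularizer (assumed form): subtract the loss on randomly
    shuffled labels so that fitting uninformative labels is not rewarded."""
    base = preference_loss(reward_net, seg0, seg1, labels)
    shuffled = labels[torch.randperm(labels.shape[0])]
    peer = preference_loss(reward_net, seg0, seg1, shuffled)
    return base - alpha * peer
```

Under these assumptions, a training step would call `peer_regularized_loss(net, seg0, seg1, labels)` with `labels` as a `torch.long` tensor of 0/1 preferences and backpropagate the result inside a standard optimizer loop.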
Related papers
- Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple
Logits Retargeting Approach [102.0769560460338]
We develop a simple logits retargeting approach (LORT) that does not require prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
- Time-series Generation by Contrastive Imitation [87.51882102248395]
We study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy.
At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality.
arXiv Detail & Related papers (2023-11-02T16:45:25Z)
- Supervised Contrastive Learning with Heterogeneous Similarity for Distribution Shifts [3.7819322027528113]
We propose a new regularization using the supervised contrastive learning to prevent such overfitting and to train models that do not degrade their performance under the distribution shifts.
Experiments on benchmark datasets that emulate distribution shifts, including subpopulation shift and domain generalization, demonstrate the advantage of the proposed method.
arXiv Detail & Related papers (2023-04-07T01:45:09Z)
- Neighbour Consistency Guided Pseudo-Label Refinement for Unsupervised Person Re-Identification [80.98291772215154]
Unsupervised person re-identification (ReID) aims at learning discriminative identity features for person retrieval without any annotations.
Recent advances accomplish this task by leveraging clustering-based pseudo labels.
We propose a Neighbour Consistency guided Pseudo Label Refinement framework.
arXiv Detail & Related papers (2022-11-30T09:39:57Z)
- Extending Momentum Contrast with Cross Similarity Consistency Regularization [5.085461418671174]
We present Extended Momentum Contrast, a self-supervised representation learning method founded upon the legacy of the momentum-encoder unit proposed in the MoCo family configurations.
Under the cross consistency regularization rule, we argue that semantic representations associated with any pair of images (positive or negative) should preserve their cross-similarity.
We report a competitive performance on the standard Imagenet-1K linear head classification benchmark.
arXiv Detail & Related papers (2022-06-07T20:06:56Z)
- Contrastive Learning for Fair Representations [50.95604482330149]
Trained classification models can unintentionally lead to biased representations and predictions.
Existing debiasing methods for classification models, such as adversarial training, are often expensive to train and difficult to optimise.
We propose a method for mitigating bias by incorporating contrastive learning, in which instances sharing the same class label are encouraged to have similar representations.
arXiv Detail & Related papers (2021-09-22T10:47:51Z)
- Semi-supervised Contrastive Learning with Similarity Co-calibration [72.38187308270135]
We propose a novel training strategy, termed Semi-supervised Contrastive Learning (SsCL).
SsCL combines the well-known contrastive loss in self-supervised learning with the cross entropy loss in semi-supervised learning.
We show that SsCL produces more discriminative representation and is beneficial to few shot learning.
arXiv Detail & Related papers (2021-05-16T09:13:56Z)
- Causally-motivated Shortcut Removal Using Auxiliary Labels [63.686580185674195]
A key challenge to learning such risk-invariant predictors is shortcut learning.
We propose a flexible, causally-motivated approach to address this challenge.
We show both theoretically and empirically that this causally-motivated regularization scheme yields robust predictors.
arXiv Detail & Related papers (2021-05-13T16:58:45Z)
- A Contraction Approach to Model-based Reinforcement Learning [11.701145942745274]
We analyze the error in the cumulative reward using a contraction approach.
We prove that branched rollouts can reduce this error.
In this case, we show that GAN-type learning has an advantage over Behavioral Cloning when its discriminator is well-trained.
arXiv Detail & Related papers (2020-09-18T02:03:14Z)