Reinforcement Learning from Statistical Feedback: the Journey from AB Testing to ANT Testing
- URL: http://arxiv.org/abs/2311.14766v1
- Date: Fri, 24 Nov 2023 07:50:52 GMT
- Title: Reinforcement Learning from Statistical Feedback: the Journey from AB Testing to ANT Testing
- Authors: Feiyang Han and Yimin Wei and Zhaofeng Liu and Yanxing Qi
- Abstract summary: Reinforcement Learning from Human Feedback (RLHF) has played a crucial role in the success of large models such as ChatGPT.
We will attempt to fill this gap with statistical business feedback instead of human feedback, using AB testing.
Statistical inference methods are used to obtain preferences for training the reward network, which fine-tunes the pre-trained model.
- Score: 1.1142354615369272
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning from Human Feedback (RLHF) has played a crucial role
in the success of large models such as ChatGPT. RLHF is a reinforcement
learning framework that incorporates human feedback to improve learning
effectiveness and performance. However, obtaining preference feedback manually
is quite expensive in commercial applications, and valuable statistical
commercial indicators are usually ignored in RLHF, leaving a gap between
commercial targets and model training. In our research, we attempt to fill this
gap with statistical business feedback instead of human feedback, using AB
testing, a well-established statistical method. We propose Reinforcement
Learning from Statistical Feedback (RLSF) based on AB testing. Statistical
inference methods are used to obtain preferences for training the reward
network, which fine-tunes the pre-trained model in a reinforcement learning
framework, achieving greater business value. Furthermore, we extend AB testing,
with two selections at a single time point, to ANT testing, with multiple
selections at different feedback time points. Moreover, we design numerical
experiments to validate the effectiveness of our algorithm framework.
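The core RLSF step described above, deriving a preference label from AB-test statistics rather than a human annotator, can be sketched as follows. This is a minimal illustration assuming a conversion-rate metric and a two-proportion z-test; the function name, metric, and significance threshold are assumptions for the example, not the paper's exact procedure:

```python
import math

def ab_test_preference(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Derive a preference label from AB-test outcomes via a
    two-proportion z-test on a business metric (e.g. conversion rate).
    Returns 1 if variant A is significantly preferred, 0 if B is,
    and None if the test is inconclusive (no label for reward training)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    if se == 0:
        return None
    z = (p_a - p_b) / se
    # two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    if p_value >= alpha:
        return None  # no significant difference -> no preference label
    return 1 if z > 0 else 0
```

Labels produced this way could then feed an ordinary pairwise (Bradley-Terry style) reward-model loss, with the fine-tuned policy optimized against that reward as in standard RLHF.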
Related papers
- BoostAdapter: Improving Vision-Language Test-Time Adaptation via Regional Bootstrapping [64.8477128397529]
We propose a training-required and training-free test-time adaptation framework.
We maintain a light-weight key-value memory for feature retrieval from instance-agnostic historical samples and instance-aware boosting samples.
We theoretically justify the rationality behind our method and empirically verify its effectiveness on both the out-of-distribution and the cross-domain datasets.
arXiv Detail & Related papers (2024-10-20T15:58:43Z)
- Efficient Preference-based Reinforcement Learning via Aligned Experience Estimation
Preference-based reinforcement learning (PbRL) has shown impressive capabilities in training agents without reward engineering.
We propose SEER, an efficient PbRL method that integrates label smoothing and policy regularization techniques.
arXiv Detail & Related papers (2024-05-29T01:49:20Z)
- Towards Understanding the Influence of Reward Margin on Preference Model Performance [8.891183078634786]
This study introduces a novel method to estimate the preference differences without the need for detailed, exhaustive labels from human annotators.
Our experimental results provide empirical evidence that incorporating margin values into the training process significantly improves the effectiveness of reward models.
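The margin idea summarized above can be sketched as a Bradley-Terry style preference loss with an additive margin term. This is a hedged illustration of the general technique, not that paper's exact formulation; the function name is hypothetical:

```python
import math

def margin_preference_loss(r_chosen, r_rejected, margin=0.0):
    """Pairwise preference loss -log sigmoid(r_chosen - r_rejected - margin).
    A positive margin requires the chosen reward to exceed the rejected
    reward by at least that amount before the loss becomes small."""
    x = -(r_chosen - r_rejected - margin)
    # numerically stable softplus(x) = log(1 + exp(x))
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

With `margin=0` this reduces to the standard Bradley-Terry reward-model loss; raising the margin penalizes pairs whose reward gap is too small even when the ordering is correct.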
arXiv Detail & Related papers (2024-04-07T12:10:04Z)
- Sample Efficient Reinforcement Learning from Human Feedback via Active Exploration [29.935758027209292]
Preference-based feedback is important for many applications in reinforcement learning.
In this work, we take advantage of the fact that one can often choose contexts to obtain human feedback.
We show that our method is able to reach better performance with fewer samples of human preferences than multiple baselines.
arXiv Detail & Related papers (2023-12-01T00:54:02Z)
- Responsible Active Learning via Human-in-the-loop Peer Study [88.01358655203441]
We propose a responsible active learning method, namely Peer Study Learning (PSL), to simultaneously preserve data privacy and improve model stability.
We first introduce a human-in-the-loop teacher-student architecture to isolate unlabelled data from the task learner (teacher) on the cloud-side.
During training, the task learner instructs the light-weight active learner, which then provides feedback on the active sampling criterion.
arXiv Detail & Related papers (2022-11-24T13:18:27Z)
- Out-of-Distribution Detection with Hilbert-Schmidt Independence Optimization [114.43504951058796]
Outlier detection tasks have been playing a critical role in AI safety.
Deep neural network classifiers usually tend to incorrectly classify out-of-distribution (OOD) inputs into in-distribution classes with high confidence.
We propose an alternative probabilistic paradigm that is both practically useful and theoretically viable for the OOD detection tasks.
arXiv Detail & Related papers (2022-09-26T15:59:55Z)
- Reward Uncertainty for Exploration in Preference-based Reinforcement Learning [88.34958680436552]
We present an exploration method specifically for preference-based reinforcement learning algorithms.
Our main idea is to design an intrinsic reward by measuring the novelty based on learned reward.
Our experiments show that exploration bonus from uncertainty in learned reward improves both feedback- and sample-efficiency of preference-based RL algorithms.
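The uncertainty-based bonus described above can be illustrated by measuring disagreement across an ensemble of learned reward models; high disagreement marks novel state-action pairs worth exploring. A minimal sketch, with the function name and ensemble interface assumed for the example:

```python
import statistics

def intrinsic_reward(reward_ensemble, state_action):
    """Exploration bonus: the standard deviation of an ensemble of
    learned reward models' predictions on a state-action pair.
    Inputs the models disagree on are treated as novel."""
    predictions = [r(state_action) for r in reward_ensemble]
    return statistics.pstdev(predictions)
```

In practice this bonus would be added to (or temporarily substituted for) the learned reward when collecting trajectories, shrinking naturally as the ensemble converges.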
arXiv Detail & Related papers (2022-05-24T23:22:10Z)
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback [8.409764908043396]
We apply preference modeling and reinforcement learning from human feedback to finetune language models to act as helpful assistants.
We find this alignment training improves performance on almost all NLP evaluations.
We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data.
arXiv Detail & Related papers (2022-04-12T15:02:38Z)
- SURF: Semi-supervised Reward Learning with Data Augmentation for Feedback-efficient Preference-based Reinforcement Learning [168.89470249446023]
We present SURF, a semi-supervised reward learning framework that utilizes a large amount of unlabeled samples with data augmentation.
In order to leverage unlabeled samples for reward learning, we infer pseudo-labels of the unlabeled samples based on the confidence of the preference predictor.
Our experiments demonstrate that our approach significantly improves the feedback-efficiency of the preference-based method on a variety of locomotion and robotic manipulation tasks.
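The confidence-based pseudo-labeling step described above can be sketched as follows. The function name, predictor interface, and threshold value are illustrative assumptions, not SURF's exact implementation:

```python
def pseudo_label(preference_predictor, seg_a, seg_b, threshold=0.9):
    """Assign a pseudo preference label to an unlabeled segment pair
    only when the preference predictor is confident; otherwise skip it.
    The predictor returns P(seg_a preferred over seg_b) in [0, 1]."""
    p = preference_predictor(seg_a, seg_b)
    if p >= threshold:
        return 1          # confidently prefer seg_a
    if p <= 1 - threshold:
        return 0          # confidently prefer seg_b
    return None           # low confidence: do not pseudo-label
```

Pairs that receive a label this way can be mixed into the labeled set for reward learning, while low-confidence pairs are discarded rather than risk training on noisy labels.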
arXiv Detail & Related papers (2022-03-18T16:50:38Z)
- Facial Feedback for Reinforcement Learning: A Case Study and Offline Analysis Using the TAMER Framework [51.237191651923666]
We investigate the potential of agent learning from trainers' facial expressions via interpreting them as evaluative feedback.
With a designed CNN-RNN model, our analysis shows that telling trainers to use facial expressions and competition can improve the accuracy of estimating positive and negative feedback.
Our results with a simulation experiment show that learning solely from predicted feedback based on facial expressions is possible.
arXiv Detail & Related papers (2020-01-23T17:50:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.