Learning New Skills after Deployment: Improving open-domain
internet-driven dialogue with human feedback
- URL: http://arxiv.org/abs/2208.03270v1
- Date: Fri, 5 Aug 2022 16:41:46 GMT
- Title: Learning New Skills after Deployment: Improving open-domain
internet-driven dialogue with human feedback
- Authors: Jing Xu, Megan Ung, Mojtaba Komeili, Kushal Arora, Y-Lan Boureau,
Jason Weston
- Abstract summary: We study how to improve internet-driven conversational skills in a learning framework.
We collect deployment data and various types of human feedback.
We find the recently introduced Director model shows significant improvements over other existing approaches.
- Score: 22.92577324751342
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Frozen models trained to mimic static datasets can never improve their
performance. Models that can employ internet-retrieval for up-to-date
information and obtain feedback from humans during deployment provide the
promise of both adapting to new information, and improving their performance.
In this work we study how to improve internet-driven conversational skills in
such a learning framework. We collect deployment data, which we make publicly
available, of human interactions, and collect various types of human feedback
-- including binary quality measurements, free-form text feedback, and
fine-grained reasons for failure. We then study various algorithms for
improving from such feedback, including standard supervised learning, rejection
sampling, model-guiding and reward-based learning, in order to make
recommendations on which type of feedback and algorithms work best. We find the
recently introduced Director model (Arora et al., '22) shows significant
improvements over other existing approaches.
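To make the comparison of learning algorithms concrete, below is a minimal Python sketch of the model-guiding idea behind Director: a classifier head trained on positively and negatively rated responses reweights the generator's next-token distribution at decoding time. The two-head interface, the mixing weight gamma, and the greedy token choice are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of Director-style guided decoding (illustrative, not the paper's code).
# Assumes a decoder with two heads over the vocabulary:
#   lm_logits  - ordinary next-token logits from the language modeling head
#   clf_logits - per-token logits for "this continuation will be rated positively"
import torch
import torch.nn.functional as F

def directed_next_token(lm_logits: torch.Tensor,
                        clf_logits: torch.Tensor,
                        gamma: float = 1.0) -> torch.Tensor:
    """Combine the generator and classifier heads for one decoding step."""
    lm_logprob = F.log_softmax(lm_logits, dim=-1)    # log P_LM(token | context)
    clf_logprob = F.logsigmoid(clf_logits)           # log P_clf(positive | token, context)
    combined = lm_logprob + gamma * clf_logprob      # product-of-experts in log space
    return combined.argmax(dim=-1)                   # greedy pick (sampling also possible)

# Toy usage with random logits over a 10-token vocabulary.
if __name__ == "__main__":
    lm = torch.randn(1, 10)
    clf = torch.randn(1, 10)
    print(directed_next_token(lm, clf, gamma=2.0))
```

Reward-based reranking (rejection sampling over full candidate responses) is the simpler alternative studied in the paper; the token-level mixing above is what distinguishes the model-guiding family.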
Related papers
- Human-Feedback Efficient Reinforcement Learning for Online Diffusion Model Finetuning [21.707688492630304]
HERO is an online training method that captures human feedback and provides informative learning signals for fine-tuning.
HERO can effectively handle tasks like reasoning, counting, personalization, and reducing NSFW content with only 0.5K online feedback samples.
arXiv Detail & Related papers (2024-10-07T15:12:01Z)
- UltraFeedback: Boosting Language Models with Scaled AI Feedback [99.4633351133207]
We present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset.
Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models.
arXiv Detail & Related papers (2023-10-02T17:40:01Z)
- Bridging the Gap: A Survey on Integrating (Human) Feedback for Natural Language Generation [68.9440575276396]
This survey aims to provide an overview of the recent research that has leveraged human feedback to improve natural language generation.
First, we introduce an encompassing formalization of feedback, and identify and organize existing research into a taxonomy following this formalization.
Second, we discuss how feedback can be described by its format and objective, and cover the two approaches proposed to use feedback (either for training or decoding): directly using the feedback or training feedback models.
Third, we provide an overview of the nascent field of AI feedback, which exploits large language models to make judgments based on a set of principles and minimize the need for human intervention.
arXiv Detail & Related papers (2023-05-01T17:36:06Z)
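As a toy illustration of the survey's "train a feedback model, then use it at decoding time" branch, the sketch below reranks generator candidates with a learned feedback score. The candidate strings and the length-based scorer are placeholders standing in for a real generator and a real trained feedback model.

```python
# Hedged sketch of feedback-model reranking at decoding time (illustrative only).
from typing import Callable, List

def rerank_with_feedback(candidates: List[str],
                         feedback_score: Callable[[str], float]) -> str:
    """Return the candidate the feedback model rates highest."""
    return max(candidates, key=feedback_score)

if __name__ == "__main__":
    # Placeholder generator output and a toy scorer (longer answer = "better").
    candidates = ["I don't know.",
                  "According to the page I retrieved, the festival starts on Friday.",
                  "Maybe."]
    print(rerank_with_feedback(candidates, feedback_score=lambda s: float(len(s))))
```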
- Leveraging Demonstrations to Improve Online Learning: Quality Matters [54.98983862640944]
We show that the degree of improvement must depend on the quality of the demonstration data.
We propose an informed TS algorithm that utilizes the demonstration data in a coherent way through Bayes' rule.
arXiv Detail & Related papers (2023-02-07T08:49:12Z)
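The informed Thompson Sampling approach summarized in the entry above can be pictured in a Beta-Bernoulli bandit, where demonstration outcomes become prior pseudo-counts via Bayes' rule before online feedback arrives. The Beta-Bernoulli setting, the demonstration format, and the greedy choice over posterior samples are illustrative assumptions rather than the paper's exact algorithm.

```python
# Hedged sketch of "informed" Thompson Sampling: demonstrations set the prior.
import random

def informed_thompson_sampling(demo_counts, pull_arm, rounds=1000):
    """demo_counts: list of (successes, failures) per arm seen in demonstrations.
    pull_arm: callable arm_index -> 0/1 reward from the live environment."""
    # Bayes' rule on the demonstrations: Beta(1 + successes, 1 + failures) prior.
    alpha = [1 + s for s, _ in demo_counts]
    beta = [1 + f for _, f in demo_counts]
    for _ in range(rounds):
        samples = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
        arm = samples.index(max(samples))      # act greedily on the posterior sample
        reward = pull_arm(arm)
        alpha[arm] += reward                   # Bayesian update from online feedback
        beta[arm] += 1 - reward
    return alpha, beta

if __name__ == "__main__":
    true_p = [0.3, 0.6, 0.5]
    demos = [(3, 7), (6, 4), (5, 5)]           # hypothetical demonstration outcomes
    print(informed_thompson_sampling(demos, lambda i: int(random.random() < true_p[i])))
```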
- Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
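A rough sketch of the Chain-of-Hindsight recipe described above: paired good and bad responses are verbalized together into a single training string, which is then used for ordinary fine-tuning so the model learns to condition on feedback. The template wording below is an illustrative assumption, not the paper's exact prompt format.

```python
# Hedged sketch of turning paired feedback into a fine-tuning sequence
# (template wording is illustrative, not the paper's exact format).
def chain_of_hindsight_example(prompt: str, good: str, bad: str) -> str:
    return (f"{prompt}\n"
            f"A helpful answer: {good}\n"
            f"An unhelpful answer: {bad}")

if __name__ == "__main__":
    print(chain_of_hindsight_example(
        prompt="How do I boil an egg?",
        good="Cover the egg with water, bring to a boil, then simmer about 9 minutes.",
        bad="Eggs are a kind of food."))
```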
- Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback [8.409764908043396]
We apply preference modeling and reinforcement learning from human feedback to finetune language models to act as helpful assistants.
We find this alignment training improves performance on almost all NLP evaluations.
We explore an iterated online mode of training, where preference models and RL policies are updated on a weekly cadence with fresh human feedback data.
arXiv Detail & Related papers (2022-04-12T15:02:38Z)
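The preference-modeling step mentioned above is commonly implemented with a pairwise (Bradley-Terry style) loss that pushes the reward of the chosen response above that of the rejected one. The tiny linear reward model over random feature vectors below is an illustrative stand-in for a real language-model-based reward model.

```python
# Hedged sketch of pairwise preference modeling for a reward model (illustrative).
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    reward_model = torch.nn.Linear(16, 1)              # stand-in for a real reward model
    chosen, rejected = torch.randn(8, 16), torch.randn(8, 16)
    loss = preference_loss(reward_model(chosen).squeeze(-1),
                           reward_model(rejected).squeeze(-1))
    loss.backward()                                     # gradients flow into the reward model
    print(float(loss))
```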
- Interactive Machine Learning for Image Captioning [8.584932159968002]
We propose an approach to interactive learning for an image captioning model.
We envision a system that exploits human feedback as well as possible by multiplying the feedback using data augmentation methods.
arXiv Detail & Related papers (2022-02-28T09:02:32Z)
- Teaching with Commentaries [108.62722733649542]
We propose a flexible teaching framework using commentaries and learned meta-information.
We find that commentaries can improve training speed and/or performance.
Commentaries can be reused when training new models to obtain performance benefits.
arXiv Detail & Related papers (2020-11-05T18:52:46Z)
- Enhancing Dialogue Generation via Multi-Level Contrastive Learning [57.005432249952406]
We propose a multi-level contrastive learning paradigm to model the fine-grained quality of the responses with respect to the query.
A Rank-aware Calibration (RC) network is designed to construct the multi-level contrastive optimization objectives.
We build a Knowledge Inference (KI) component to capture the keyword knowledge from the reference during training and exploit such information to encourage the generation of informative words.
arXiv Detail & Related papers (2020-09-19T02:41:04Z)
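The multi-level contrastive setup in the last entry can be illustrated with a pairwise ranking objective: responses to the same query carry quality levels, and every higher-quality response is pushed to score above every lower-quality one by a margin. This hinge formulation is an illustrative stand-in for the paper's rank-aware objectives, not its actual implementation.

```python
# Hedged sketch of a multi-level ranking loss over graded responses (illustrative).
import torch
import torch.nn.functional as F

def multilevel_ranking_loss(scores: torch.Tensor, levels: torch.Tensor,
                            margin: float = 0.5) -> torch.Tensor:
    """scores: (N,) model scores; levels: (N,) integer quality ratings (higher is better)."""
    better = levels.unsqueeze(1) > levels.unsqueeze(0)   # better[i, j]: response i outranks j
    gaps = scores.unsqueeze(1) - scores.unsqueeze(0)     # gaps[i, j] = score_i - score_j
    penalties = F.relu(margin - gaps)[better]            # hinge on every ordered pair
    return penalties.mean() if penalties.numel() > 0 else scores.sum() * 0.0

if __name__ == "__main__":
    scores = torch.tensor([0.2, 0.4, 0.9], requires_grad=True)  # model scores for 3 responses
    levels = torch.tensor([0, 2, 1])                            # human quality levels
    loss = multilevel_ranking_loss(scores, levels)
    loss.backward()
    print(float(loss), scores.grad)
```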
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented here and is not responsible for any consequences of its use.