Learning Rewards from Linguistic Feedback
- URL: http://arxiv.org/abs/2009.14715v3
- Date: Sat, 3 Jul 2021 19:03:12 GMT
- Title: Learning Rewards from Linguistic Feedback
- Authors: Theodore R. Sumers, Mark K. Ho, Robert D. Hawkins, Karthik Narasimhan,
Thomas L. Griffiths
- Abstract summary: We explore unconstrained natural language feedback as a learning signal for artificial agents.
We implement three artificial learners: sentiment-based "literal" and "pragmatic" models, and an inference network trained end-to-end to predict latent rewards.
- Score: 30.30912759796109
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore unconstrained natural language feedback as a learning signal for
artificial agents. Humans use rich and varied language to teach, yet most prior
work on interactive learning from language assumes a particular form of input
(e.g., commands). We propose a general framework which does not make this
assumption, using aspect-based sentiment analysis to decompose feedback into
sentiment about the features of a Markov decision process. We then perform an
analogue of inverse reinforcement learning, regressing the sentiment on the
features to infer the teacher's latent reward function. To evaluate our
approach, we first collect a corpus of teaching behavior in a cooperative task
where both teacher and learner are human. We implement three artificial
learners: sentiment-based "literal" and "pragmatic" models, and an inference
network trained end-to-end to predict latent rewards. We then repeat our
initial experiment and pair them with human teachers. All three successfully
learn from interactive human feedback. The sentiment models outperform the
inference network, with the "pragmatic" model approaching human performance.
Our work thus provides insight into the information structure of naturalistic
linguistic feedback as well as methods to leverage it for reinforcement
learning.
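To make the regression step concrete, below is a minimal sketch of the "literal" learner's core idea, assuming the feedback has already been decomposed by an aspect-based sentiment model into per-utterance sentiment scores over MDP features. Ordinary least squares stands in for the paper's inference procedure; the function and toy data are illustrative, not the authors' code.

```python
# Minimal sketch of the sentiment-regression analogue of IRL described in
# the abstract. Assumption: an aspect-based sentiment model has already
# mapped each utterance to (features mentioned, signed sentiment score).
import numpy as np

def infer_reward_weights(feature_matrix: np.ndarray,
                         sentiment: np.ndarray) -> np.ndarray:
    """Regress observed sentiment on MDP features to estimate the
    teacher's latent reward weights.

    feature_matrix: (n_utterances, n_features) -- which features each
        piece of feedback was about (e.g., one-hot aspect indicators).
    sentiment: (n_utterances,) -- signed sentiment score per utterance.
    """
    weights, *_ = np.linalg.lstsq(feature_matrix, sentiment, rcond=None)
    return weights

# Toy example: three utterances about two features. Positive feedback on
# feature 0 and negative on feature 1 yield matching-sign weights.
X = np.array([[1.0, 0.0],   # "great job collecting those"
              [0.0, 1.0],   # "stop picking up the other ones"
              [1.0, 1.0]])  # mixed feedback about both aspects
s = np.array([0.9, -0.8, 0.1])
print(infer_reward_weights(X, s))  # -> [ 0.9 -0.8]
```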
Related papers
- Is Feedback All You Need? Leveraging Natural Language Feedback in Goal-Conditioned Reinforcement Learning [54.31495290436766]
We extend BabyAI to automatically generate language feedback from environment dynamics and goal-condition success.
We modify the Decision Transformer architecture to take advantage of this additional signal.
We find that training with language feedback either in place of or in addition to the return-to-go or goal descriptions improves agents' generalisation performance.
arXiv Detail & Related papers (2023-12-07T22:33:34Z)
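How the modified Decision Transformer ingests the feedback is not spelled out in the summary above. A minimal sketch, assuming the encoded language feedback is simply interleaved as one more per-timestep token alongside (or in place of) the return-to-go token; the function name and signature are illustrative, not the paper's code.

```python
# Hedged sketch: build a Decision-Transformer-style input sequence with an
# extra language-feedback token per timestep. Assumed interface, not the
# paper's implementation.
import torch

def build_dt_sequence(states, actions, feedback_emb, rtg=None):
    """Interleave per-timestep tokens for a Decision-Transformer-style model.

    states:       (T, d) tensor of state embeddings
    actions:      (T, d) tensor of action embeddings
    feedback_emb: (T, d) tensor of encoded language feedback
    rtg:          optional (T, d) return-to-go embeddings
    Returns a (T * k, d) sequence: [feedback, (rtg,) state, action] per step.
    """
    per_step = ([feedback_emb, states, actions] if rtg is None
                else [feedback_emb, rtg, states, actions])
    # Stack to (T, k, d), then flatten time and slot dims into one sequence.
    return torch.stack(per_step, dim=1).reshape(-1, states.shape[-1])
```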
- Yes, this Way! Learning to Ground Referring Expressions into Actions with Intra-episodic Feedback from Supportive Teachers [15.211628096103475]
We present an initial study that evaluates intra-episodic feedback given in a collaborative setting.
Our results show that intra-episodic feedback allows the follower to generalize on aspects of scene complexity.
arXiv Detail & Related papers (2023-05-22T10:01:15Z)
- Training Language Models with Language Feedback at Scale [50.70091340506957]
We introduce Imitation learning from Language Feedback (ILF), a new approach that utilizes more informative language feedback.
ILF consists of three steps that are applied iteratively; the first conditions the language model on the input, an initial LM output, and the feedback to generate refinements.
We show theoretically that ILF can be viewed as Bayesian inference, similar to Reinforcement Learning from Human Feedback (RLHF).
arXiv Detail & Related papers (2023-03-28T17:04:15Z)
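Only the first ILF step is spelled out in the summary above; the pseudocode-level sketch below fills in the remaining two based on one reading of the paper, so the selection and fine-tuning steps should be treated as assumptions, and all interfaces (`lm.generate`, `select_best`, `finetune`) are hypothetical.

```python
# Sketch of one ILF round. Steps 2 and 3 are assumptions inferred from the
# summary, not the paper's verbatim algorithm.
def ilf_round(lm, dataset, select_best, finetune, k=4):
    refinement_data = []
    for x, y0, feedback in dataset:
        # Step 1: condition the LM on the input, an initial LM output,
        # and the feedback to generate candidate refinements.
        candidates = [lm.generate(x, y0, feedback) for _ in range(k)]
        # Step 2 (assumed): keep the refinement that best incorporates
        # the feedback, e.g. as judged by a scoring model.
        refinement_data.append((x, select_best(candidates, feedback)))
    # Step 3 (assumed): fine-tune the LM to imitate the selected
    # refinements, then repeat the round with the improved model.
    return finetune(lm, refinement_data)
```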
- Chain of Hindsight Aligns Language Models with Feedback [62.68665658130472]
We propose a novel technique, Chain of Hindsight, that is easy to optimize and can learn from any form of feedback, regardless of its polarity.
We convert all types of feedback into sequences of sentences, which are then used to fine-tune the model.
By doing so, the model is trained to generate outputs based on feedback, while learning to identify and correct negative attributes or errors.
arXiv Detail & Related papers (2023-02-06T10:28:16Z)
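A minimal illustration of the feedback-to-sequence conversion described above; the tags and template are assumptions, not the paper's verbatim prompts.

```python
def to_hindsight_sequence(prompt: str, good_output: str, bad_output: str) -> str:
    """Convert paired feedback into one fine-tuning sequence. The model
    sees both outputs tagged with their hindsight quality, so it learns
    to produce text that matches a requested polarity."""
    return (f"{prompt}\n"
            f"A good response: {good_output}\n"
            f"A bad response: {bad_output}")

# At inference time, one conditions on the positive tag only, e.g.
# model.generate(f"{question}\nA good response:")  -- `model` is hypothetical.
```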
- Communication Drives the Emergence of Language Universals in Neural Agents: Evidence from the Word-order/Case-marking Trade-off [3.631024220680066]
We propose a new Neural-agent Language Learning and Communication framework (NeLLCom) where pairs of speaking and listening agents first learn a miniature language.
We succeed in replicating the word-order/case-marking trade-off with the new framework, without hard-coding specific biases into the agents.
arXiv Detail & Related papers (2023-01-30T17:22:33Z)
- How to talk so your robot will learn: Instructions, descriptions, and pragmatics [14.289220844201695]
We study how a human might communicate preferences over behaviors.
We show that in traditional reinforcement learning settings, pragmatic social learning can integrate with and accelerate individual learning.
Our findings suggest that social learning from a wider range of language is a promising approach for value alignment and reinforcement learning more broadly.
arXiv Detail & Related papers (2022-06-16T01:33:38Z)
- Training Language Models with Natural Language Feedback [51.36137482891037]
We learn from language feedback on model outputs using a three-step learning algorithm.
In synthetic experiments, we first evaluate whether language models accurately incorporate feedback to produce refinements.
Using only 100 samples of human-written feedback, our learning algorithm fine-tunes a GPT-3 model to roughly human-level summarization ability.
arXiv Detail & Related papers (2022-04-29T15:06:58Z)
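One plausible reading of the selection step in the three-step algorithm above: keep the candidate refinement closest to the feedback in embedding space, on the assumption that a refinement which incorporates the feedback should be semantically similar to it. This is a hedged sketch; `embed` stands in for any sentence-embedding function and is not from the paper's code.

```python
import numpy as np

def pick_refinement(refinements, feedback, embed):
    """Return the refinement with the highest cosine similarity to the
    feedback text (assumed selection criterion, for illustration only)."""
    f = embed(feedback)
    f = f / np.linalg.norm(f)
    scores = []
    for r in refinements:
        e = embed(r)
        scores.append(float(np.dot(e / np.linalg.norm(e), f)))
    return refinements[int(np.argmax(scores))]
```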
- Grounding Hindsight Instructions in Multi-Goal Reinforcement Learning for Robotics [14.863872352905629]
This paper focuses on robotic reinforcement learning with sparse rewards for natural language goal representations.
We first present a mechanism for hindsight instruction replay utilizing expert feedback.
Second, we propose a seq2seq model to generate linguistic hindsight instructions.
arXiv Detail & Related papers (2022-04-08T22:01:36Z)
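A schematic sketch of the hindsight instruction replay idea summarized above: a failed episode is relabeled with an instruction describing what the agent actually achieved, turning a sparse-reward failure into a usable success example. `instruction_model` stands in for the proposed seq2seq generator, and the replay-buffer interface is hypothetical.

```python
def hindsight_relabel(replay_buffer, states, actions, instruction_model):
    # Describe in language the outcome the agent actually reached.
    achieved_instruction = instruction_model.describe(states[-1])
    # Re-store the episode under the achieved goal, rewarding completion
    # only at the final step (sparse reward).
    last = len(actions) - 1
    for t in range(len(actions)):
        replay_buffer.add(state=states[t], action=actions[t],
                          goal=achieved_instruction,
                          reward=1.0 if t == last else 0.0)
```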
- Unsupervised Domain Adaptive Person Re-Identification via Human Learning Imitation [67.52229938775294]
In recent years, researchers have proposed using the teacher-student framework to reduce the domain gap between different person re-identification datasets.
Inspired by recent teacher-student-based methods, we propose to further imitate the human learning process from different aspects.
arXiv Detail & Related papers (2021-11-28T01:14:29Z)
- Bongard-LOGO: A New Benchmark for Human-Level Concept Learning and Reasoning [78.13740873213223]
Bongard problems (BPs) were introduced as an inspirational challenge for visual cognition in intelligent systems.
We propose a new benchmark Bongard-LOGO for human-level concept learning and reasoning.
arXiv Detail & Related papers (2020-10-02T03:19:46Z)