Grounded Language Learning Fast and Slow
- URL: http://arxiv.org/abs/2009.01719v4
- Date: Wed, 14 Oct 2020 14:38:58 GMT
- Title: Grounded Language Learning Fast and Slow
- Authors: Felix Hill, Olivier Tieleman, Tamara von Glehn, Nathaniel Wong, Hamza
Merzic, Stephen Clark
- Abstract summary: We show that an embodied agent can exhibit similar one-shot word learning when trained with conventional reinforcement learning algorithms.
We find that, under certain training conditions, the agent's one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category.
We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may be useful for executing later instructions.
- Score: 23.254765095715054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown that large text-based neural language models, trained
with conventional supervised learning objectives, acquire a surprising
propensity for few- and one-shot learning. Here, we show that an embodied agent
situated in a simulated 3D world, and endowed with a novel dual-coding external
memory, can exhibit similar one-shot word learning when trained with
conventional reinforcement learning algorithms. After a single introduction to
a novel object via continuous visual perception and a language prompt ("This is
a dax"), the agent can re-identify the object and manipulate it as instructed
("Put the dax on the bed"). In doing so, it seamlessly integrates short-term,
within-episode knowledge of the appropriate referent for the word "dax" with
long-term lexical and motor knowledge acquired across episodes (i.e. "bed" and
"putting"). We find that, under certain training conditions and with a
particular memory writing mechanism, the agent's one-shot word-object binding
generalizes to novel exemplars within the same ShapeNet category, and is
effective in settings with unfamiliar numbers of objects. We further show how
dual-coding memory can be exploited as a signal for intrinsic motivation,
stimulating the agent to seek names for objects that may be useful for later
executing instructions. Together, the results demonstrate that deep neural
networks can exploit meta-learning, episodic memory and an explicitly
multi-modal environment to account for 'fast-mapping', a fundamental pillar of
human cognitive development and a potentially transformative capacity for
agents that interact with human users.
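The paper itself ships no code, but the dual-coding idea can be pictured with a minimal sketch, given below. Everything here is a hypothetical illustration, not the authors' implementation: the class and method names (DualCodingMemory, read_with_language) are invented, the dot-product softmax read is one standard attention choice, and the learned writing mechanism and full reinforcement-learning agent from the paper are omitted.

```python
import numpy as np

class DualCodingMemory:
    """Hypothetical sketch of a dual-coding episodic memory.

    Each slot pairs a visual code with the language code observed at
    the same time step, so a later language query ("the dax") can be
    resolved cross-modally to the visual code that was written when
    the agent heard "This is a dax".
    """

    def __init__(self, dim: int):
        self.dim = dim
        self.visual = []    # visual codes, one per written slot
        self.language = []  # language codes, aligned with self.visual

    def write(self, visual_code, language_code):
        # Dual coding: the two modalities are stored side by side.
        self.visual.append(np.asarray(visual_code, dtype=np.float32))
        self.language.append(np.asarray(language_code, dtype=np.float32))

    def read_with_language(self, query):
        # Soft attention over the language column; returns the
        # attention-weighted sum of the paired visual codes.
        if not self.language:
            return np.zeros(self.dim, dtype=np.float32)
        keys = np.stack(self.language)           # (slots, dim)
        scores = keys @ np.asarray(query, dtype=np.float32)
        weights = np.exp(scores - scores.max())  # stable softmax
        weights /= weights.sum()
        return weights @ np.stack(self.visual)   # (dim,)
```

A symmetric read_with_vision would match a visual query against the language column, and the intrinsic-motivation signal mentioned in the abstract could plausibly be derived from how poorly a query matches the memory's contents, rewarding the agent for seeking the missing name.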
Related papers
- ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension [71.03445074045092]
We propose ClawMachine, a new methodology that explicitly notates each entity using token collectives, i.e. groups of visual tokens.
Our method unifies the prompt and answer of visual referential tasks without using additional syntax.
ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency.
arXiv Detail & Related papers (2024-06-17T08:39:16Z)
- An iterated learning model of language change that mixes supervised and unsupervised learning [0.0]
The iterated learning model is an agent model that simulates the transmission of language from generation to generation.
In each iteration, a language tutor exposes a naïve pupil to a limited training set of utterances, each pairing a random meaning with the signal that conveys it.
The transmission bottleneck ensures that tutors must generalize beyond the training set that they experienced.
arXiv Detail & Related papers (2024-05-31T14:14:01Z)
- Generalized Robot 3D Vision-Language Model with Fast Rendering and Pre-Training Vision-Language Alignment [55.11291053011696]
This work presents a framework for 3D scene understanding when labeled scenes are quite limited.
To extract knowledge for novel categories from the pre-trained vision-language models, we propose a hierarchical feature-aligned pre-training and knowledge distillation strategy.
In the limited reconstruction case, our proposed approach, termed WS3D++, ranks 1st on the large-scale ScanNet benchmark.
arXiv Detail & Related papers (2023-12-01T15:47:04Z)
- Communication Drives the Emergence of Language Universals in Neural Agents: Evidence from the Word-order/Case-marking Trade-off [3.631024220680066]
We propose a new Neural-agent Language Learning and Communication framework (NeLLCom) where pairs of speaking and listening agents first learn a miniature language.
We succeed in replicating the trade-off with the new framework without hard-coding specific biases in the agents.
arXiv Detail & Related papers (2023-01-30T17:22:33Z)
- Pretraining on Interactions for Learning Grounded Affordance Representations [22.290431852705662]
We train a neural network to predict objects' trajectories in a simulated interaction.
We show that our network's latent representations differentiate between both observed and unobserved affordances.
Our results suggest a way in which modern deep learning approaches to grounded language learning can be integrated with traditional formal semantic notions of lexical representations.
arXiv Detail & Related papers (2022-07-05T19:19:53Z)
- Intra-agent speech permits zero-shot task acquisition [13.19051572784014]
We take inspiration from processes of "inner speech" in humans to better understand the role of intra-agent speech in embodied behavior.
We develop algorithms that enable visually grounded captioning with little labeled language data.
We incorporate intra-agent speech into an embodied, mobile manipulator agent operating in a 3D virtual world.
arXiv Detail & Related papers (2022-06-07T09:28:10Z)
- Fast Concept Mapping: The Emergence of Human Abilities in Artificial Neural Networks when Learning Embodied and Self-Supervised [0.0]
We introduce a setup in which an artificial agent first learns in a simulated world through self-supervised exploration.
We use a method we call fast concept mapping, which relies on correlated firing patterns of neurons to define and detect semantic concepts.
arXiv Detail & Related papers (2021-02-03T17:19:49Z)
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794]
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
arXiv Detail & Related papers (2020-11-18T20:21:19Z)
- Learning Adaptive Language Interfaces through Decomposition [89.21937539950966]
We introduce a neural semantic parsing system that learns new high-level abstractions through decomposition.
Users interactively teach the system by breaking down high-level utterances describing novel behavior into low-level steps.
arXiv Detail & Related papers (2020-10-11T08:27:07Z)
- COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274]
We propose a new framework for learning Contextualized OBject Embeddings from automatically-transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
arXiv Detail & Related papers (2020-07-14T19:04:08Z)
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks that learn over raw text with guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)