Grounded Language Learning Fast and Slow
- URL: http://arxiv.org/abs/2009.01719v4
- Date: Wed, 14 Oct 2020 14:38:58 GMT
- Title: Grounded Language Learning Fast and Slow
- Authors: Felix Hill, Olivier Tieleman, Tamara von Glehn, Nathaniel Wong, Hamza
Merzic, Stephen Clark
- Abstract summary: We show that an embodied agent can exhibit similar one-shot word learning when trained with conventional reinforcement learning algorithms.
We find that, under certain training conditions, the agent's one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category.
We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may later be useful for executing instructions.
- Score: 23.254765095715054
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has shown that large text-based neural language models, trained
with conventional supervised learning objectives, acquire a surprising
propensity for few- and one-shot learning. Here, we show that an embodied agent
situated in a simulated 3D world, and endowed with a novel dual-coding external
memory, can exhibit similar one-shot word learning when trained with
conventional reinforcement learning algorithms. After a single introduction to
a novel object via continuous visual perception and a language prompt ("This is
a dax"), the agent can re-identify the object and manipulate it as instructed
("Put the dax on the bed"). In doing so, it seamlessly integrates short-term,
within-episode knowledge of the appropriate referent for the word "dax" with
long-term lexical and motor knowledge acquired across episodes (i.e. "bed" and
"putting"). We find that, under certain training conditions and with a
particular memory writing mechanism, the agent's one-shot word-object binding
generalizes to novel exemplars within the same ShapeNet category, and is
effective in settings with unfamiliar numbers of objects. We further show how
dual-coding memory can be exploited as a signal for intrinsic motivation,
stimulating the agent to seek names for objects that may later be useful for executing instructions. Together, the results demonstrate that deep neural
networks can exploit meta-learning, episodic memory and an explicitly
multi-modal environment to account for 'fast-mapping', a fundamental pillar of
human cognitive development and a potentially transformative capacity for
agents that interact with human users.
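To make the core mechanism concrete, below is a minimal sketch of a dual-coding key-value memory in Python/numpy. It is not the paper's architecture (which uses learned encoders and an attention-based reader); the class name, the softmax read, and the novelty bonus are illustrative assumptions chosen to mirror the behaviors the abstract describes: one-shot word-object binding and a memory-derived intrinsic-motivation signal.

```python
import numpy as np

class DualCodingMemory:
    """Illustrative dual-coding external memory: every write stores a
    visual code and a language code for the same time step, so a later
    language query ("dax") can retrieve the paired visual code."""

    def __init__(self):
        self.visual_codes = []    # one visual embedding per written step
        self.language_codes = []  # the language embedding stored alongside it

    def write(self, visual_code, language_code):
        # Storing the two codes jointly is what binds a novel word to the
        # object in view when the agent hears "This is a dax".
        self.visual_codes.append(np.asarray(visual_code, dtype=float))
        self.language_codes.append(np.asarray(language_code, dtype=float))

    def read_referent(self, word_query):
        # Cross-modal soft read: match the query against stored *language*
        # keys, return the weighted sum of the paired *visual* values.
        keys = np.stack(self.language_codes)
        scores = keys @ np.asarray(word_query, dtype=float)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ np.stack(self.visual_codes)

    def novelty_bonus(self, language_code):
        # Toy intrinsic-motivation signal: high reward when the new code
        # resembles nothing already in memory, nudging the agent to seek
        # names it has not yet heard.
        if not self.language_codes:
            return 1.0
        keys = np.stack(self.language_codes)
        q = np.asarray(language_code, dtype=float)
        sims = keys @ q / (np.linalg.norm(keys, axis=1) * np.linalg.norm(q) + 1e-8)
        return float(1.0 - sims.max())

# One-shot episode, schematically: write once, then re-identify.
rng = np.random.default_rng(0)
memory = DualCodingMemory()
dax_visual, dax_word = rng.normal(size=32), rng.normal(size=32)
memory.write(dax_visual, dax_word)          # "This is a dax"
referent = memory.read_referent(dax_word)   # "Put the dax on the bed"
assert np.allclose(referent, dax_visual)    # one slot -> exact recall
```

In the single-slot case the read returns the stored visual code exactly, which is the toy analogue of re-identifying the dax after a single exposure.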
Related papers
- Learning to Model the World with Language [100.76069091703505] (arXiv, 2023-07-31)
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
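A rough sketch of the idea of predicting the future in both modalities, using toy linear dynamics (the shapes, names, and loss here are invented for illustration and are not Dynalang's actual model):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 64  # shared width of latent, image, and text representations (assumed)

# Toy parameters of a linear-recurrent "world model".
W_state, W_img_in, W_txt_in = (0.05 * rng.normal(size=(D, D)) for _ in range(3))
W_img_out, W_txt_out = (0.05 * rng.normal(size=(D, D)) for _ in range(2))

def step(state, img_rep, txt_rep):
    # Fuse both modalities into the next latent state.
    return np.tanh(W_state @ state + W_img_in @ img_rep + W_txt_in @ txt_rep)

def prediction_loss(state, next_img_rep, next_txt_rep):
    # The world-model objective: a single latent must explain the future
    # in *both* modalities (mean-squared error, for simplicity).
    img_err = W_img_out @ state - next_img_rep
    txt_err = W_txt_out @ state - next_txt_rep
    return float(img_err @ img_err + txt_err @ txt_err) / D

state = step(np.zeros(D), rng.normal(size=D), rng.normal(size=D))
print(prediction_loss(state, rng.normal(size=D), rng.normal(size=D)))
```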
- Communication Drives the Emergence of Language Universals in Neural Agents: Evidence from the Word-order/Case-marking Trade-off [3.631024220680066] (arXiv, 2023-01-30)
We propose a new Neural-agent Language Learning and Communication framework (NeLLCom) where pairs of speaking and listening agents first learn a miniature language.
We succeed in replicating the trade-off with the new framework without hard-coding specific biases in the agents.
- Multi-Object Navigation with dynamically learned neural implicit representations [10.182418917501064] (arXiv, 2022-10-11)
We propose to structure neural networks with two neural implicit representations, which are learned dynamically during each episode.
We evaluate the agent on Multi-Object Navigation and show the high impact of using neural implicit representations as a memory source.
- Pretraining on Interactions for Learning Grounded Affordance Representations [22.290431852705662] (arXiv, 2022-07-05)
We train a neural network to predict objects' trajectories in a simulated interaction.
We show that our network's latent representations differentiate between both observed and unobserved affordances.
Our results suggest a way in which modern deep learning approaches to grounded language learning can be integrated with traditional formal semantic notions of lexical representations.
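As a toy illustration of that pretraining signal (entirely synthetic, not the paper's simulator or network): a single predictor trained to forecast pushed-object motion ends up with error statistics that separate objects by affordance.

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(kind, steps=20):
    """Toy interaction data: a pushed object either rolls or stays put."""
    pos, rows = 0.0, []
    for _ in range(steps):
        push = rng.uniform(-1, 1)
        nxt = pos + push if kind == "rollable" else pos
        rows.append((pos, push, nxt))
        pos = nxt
    return np.array(rows)

# Pretraining task: predict the next position from (position, push).
data = np.vstack([simulate("rollable"), simulate("anchored")])
X = np.c_[data[:, :2], np.ones(len(data))]   # inputs plus a bias column
theta, *_ = np.linalg.lstsq(X, data[:, 2], rcond=None)

# A single shared predictor cannot fit both kinds at once, so its
# per-object residual acts as a crude learned "affordance feature".
for kind in ("rollable", "anchored"):
    d = simulate(kind)
    Xk = np.c_[d[:, :2], np.ones(len(d))]
    print(kind, round(float(np.mean((Xk @ theta - d[:, 2]) ** 2)), 3))
```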
- Intra-agent speech permits zero-shot task acquisition [13.19051572784014] (arXiv, 2022-06-07)
We take inspiration from processes of "inner speech" in humans to better understand the role of intra-agent speech in embodied behavior.
We develop algorithms that enable visually grounded captioning with little labeled language data.
We incorporate intra-agent speech into an embodied, mobile manipulator agent operating in a 3D virtual world.
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843] (arXiv, 2022-04-20)
K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
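For flavor, here is a minimal sketch of the knowledge-enrichment step using NLTK's WordNet interface; the prompt template is invented, and K-LITE itself also draws on Wiktionary and uses the enriched text inside full language-image training.

```python
# Requires NLTK and its WordNet data:
#   pip install nltk
#   python -c "import nltk; nltk.download('wordnet')"
from nltk.corpus import wordnet as wn

def enrich_prompt(class_name):
    """Append a WordNet gloss to a class name, in the spirit of K-LITE's
    knowledge-augmented prompts (this exact template is invented)."""
    synsets = wn.synsets(class_name.replace(" ", "_"))
    if not synsets:
        return class_name  # fall back to the bare name for unknown words
    return f"{class_name}, which is {synsets[0].definition()}"

print(enrich_prompt("television"))
# Prints something like: "television, which is broadcasting visual
# images of stationary or moving objects"
```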
- Fast Concept Mapping: The Emergence of Human Abilities in Artificial Neural Networks when Learning Embodied and Self-Supervised [0.0] (arXiv, 2021-02-03)
We introduce a setup in which an artificial agent first learns in a simulated world through self-supervised exploration.
We use a method we call fast concept mapping which uses correlated firing patterns of neurons to define and detect semantic concepts.
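A self-contained toy version of the correlated-firing idea (synthetic activations, invented thresholds): a concept is defined by the units that reliably co-fire on a few labeled frames, then detected by checking that fingerprint on new frames.

```python
import numpy as np

rng = np.random.default_rng(3)
UNITS = 100

def activations(concept_units, n=50):
    """Toy network activity: units tied to a concept fire more often
    when that concept is present (purely synthetic data)."""
    acts = rng.random((n, UNITS)) < 0.1
    acts[:, concept_units] |= rng.random((n, len(concept_units))) < 0.8
    return acts

cup_units = np.arange(0, 8)  # ground-truth "cup" cells (hidden from us)

# Define the concept from a handful of labeled frames: keep units that
# fire on most of them.
fingerprint = activations(cup_units).mean(axis=0) > 0.5

def detect(frame):
    # A frame counts as "cup" if it activates enough fingerprint units.
    return frame[fingerprint].mean() > 0.5

print(detect(activations(cup_units, 1)[0]))        # likely True
print(detect(activations(np.arange(50, 58), 1)[0]))  # likely False
```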
- Neuro-Symbolic Representations for Video Captioning: A Case for Leveraging Inductive Biases for Vision and Language [148.0843278195794] (arXiv, 2020-11-18)
We propose a new model architecture for learning multi-modal neuro-symbolic representations for video captioning.
Our approach uses a dictionary learning-based method of learning relations between videos and their paired text descriptions.
- Learning Adaptive Language Interfaces through Decomposition [89.21937539950966] (arXiv, 2020-10-11)
We introduce a neural semantic parsing system that learns new high-level abstractions through decomposition.
Users interactively teach the system by breaking down high-level utterances describing novel behavior into low-level steps.
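The interactive-teaching loop can be caricatured in a few lines (a toy rule store, not the paper's neural semantic parser): a new high-level utterance is defined by steps the system already understands, then expanded recursively at parse time.

```python
class DecomposingParser:
    """Toy interactive interface (invented API): users teach a new
    high-level command by decomposing it into known lower-level steps."""

    def __init__(self, primitives):
        self.rules = {cmd: [cmd] for cmd in primitives}

    def teach(self, utterance, steps):
        # Every step must already be understood; new abstractions are
        # built strictly from existing ones.
        assert all(s in self.rules for s in steps), "unknown step"
        self.rules[utterance] = steps

    def parse(self, utterance):
        # Recursively expand until only primitive actions remain.
        steps = self.rules[utterance]
        if steps == [utterance]:
            return steps
        return [p for s in steps for p in self.parse(s)]

p = DecomposingParser({"pick(cup)", "move(table)", "release"})
p.teach("put the cup on the table", ["pick(cup)", "move(table)", "release"])
print(p.parse("put the cup on the table"))
```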
- COBE: Contextualized Object Embeddings from Narrated Instructional Video [52.73710465010274] (arXiv, 2020-07-14)
We propose a new framework for learning Contextualized OBject Embeddings from automatically transcribed narrations of instructional videos.
We leverage the semantic and compositional structure of language by training a visual detector to predict a contextualized word embedding of the object and its associated narration.
Our experiments show that our detector learns to predict a rich variety of contextual object information, and that it is highly effective in the settings of few-shot and zero-shot learning.
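Schematically, the training signal resembles a contrastive objective that pulls a detected region's embedding toward the contextualized embedding of its narrated object word; the version below is an invented simplification using random vectors.

```python
import numpy as np

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)

def cobe_style_loss(region_embedding, word_in_context, other_contexts):
    """Contrastive objective in the spirit of COBE (details invented):
    a region embedding should be closer to the contextualized embedding
    of its narrated object word than to unrelated contexts."""
    pos = np.exp(cosine(region_embedding, word_in_context))
    neg = sum(np.exp(cosine(region_embedding, c)) for c in other_contexts)
    return float(-np.log(pos / (pos + neg)))

rng = np.random.default_rng(4)
region, context = rng.normal(size=16), rng.normal(size=16)
print(cobe_style_loss(region, context, [rng.normal(size=16) for _ in range(5)]))
```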
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953] (arXiv, 2020-04-29)
We present two self-supervised tasks learning over raw text with the guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
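As a minimal sketch of what entity-level masking can look like (naive string matching; the real method selects mentions using a knowledge graph and operates on subword-tokenized text):

```python
import re

def mask_entities(text, kg_entities, mask_token="[MASK]"):
    """Entity-level masking (illustrative): replace each knowledge-graph
    entity mention with a single mask token, so the model must recover
    the whole entity rather than one subword at a time."""
    masked, targets = text, []
    for entity in sorted(kg_entities, key=len, reverse=True):
        pattern = re.compile(re.escape(entity), flags=re.IGNORECASE)
        if pattern.search(masked):
            masked = pattern.sub(mask_token, masked)
            targets.append(entity)
    return masked, targets

text = "Marie Curie studied radioactivity in Paris."
print(mask_entities(text, {"Marie Curie", "Paris"}))
# -> ('[MASK] studied radioactivity in [MASK].', ['Marie Curie', 'Paris'])
```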
This list is automatically generated from the titles and abstracts of the papers on this site.