Intra-agent speech permits zero-shot task acquisition
- URL: http://arxiv.org/abs/2206.03139v1
- Date: Tue, 7 Jun 2022 09:28:10 GMT
- Title: Intra-agent speech permits zero-shot task acquisition
- Authors: Chen Yan, Federico Carnevale, Petko Georgiev, Adam Santoro, Aurelia
Guy, Alistair Muldal, Chia-Chun Hung, Josh Abramson, Timothy Lillicrap,
Gregory Wayne
- Abstract summary: We take inspiration from processes of "inner speech" in humans to better understand the role of intra-agent speech in embodied behavior.
We develop algorithms that enable visually grounded captioning with little labeled language data.
We incorporate intra-agent speech into an embodied, mobile manipulator agent operating in a 3D virtual world.
- Score: 13.19051572784014
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human language learners are exposed to a trickle of informative,
context-sensitive language, but a flood of raw sensory data. Through both
social language use and internal processes of rehearsal and practice, language
learners are able to build high-level, semantic representations that explain
their perceptions. Here, we take inspiration from such processes of "inner
speech" in humans (Vygotsky, 1934) to better understand the role of intra-agent
speech in embodied behavior. First, we formally pose intra-agent speech as a
semi-supervised problem and develop two algorithms that enable visually
grounded captioning with little labeled language data. We then experimentally
compute scaling curves over different amounts of labeled data and compare the
data efficiency against a supervised learning baseline. Finally, we incorporate
intra-agent speech into an embodied, mobile manipulator agent operating in a 3D
virtual world, and show that with as few as 150 additional image captions,
intra-agent speech endows the agent with the ability to manipulate and answer
questions about a new object without any related task-directed experience
(zero-shot). Taken together, our experiments suggest that modelling intra-agent
speech is effective in enabling embodied agents to learn new tasks efficiently
and without direct interaction experience.
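The abstract mentions two semi-supervised algorithms for visually grounded captioning but does not detail them. Below is a minimal pseudo-labeling sketch of captioning from few labels, in PyTorch; every class, function, and tensor shape here is a hypothetical illustration for orientation, not the authors' method.
```python
# Hypothetical sketch: semi-supervised visually grounded captioning via
# pseudo-labeling. All names and shapes are illustrative assumptions,
# not the paper's algorithms.
import torch
import torch.nn as nn


class ToyCaptioner(nn.Module):
    """Maps image features to token logits for a fixed-length caption."""

    def __init__(self, feat_dim=512, vocab=1000, max_len=16):
        super().__init__()
        self.max_len, self.vocab = max_len, vocab
        self.head = nn.Linear(feat_dim, vocab * max_len)

    def forward(self, feats):                        # feats: (B, feat_dim)
        return self.head(feats).view(-1, self.max_len, self.vocab)


def semi_supervised_step(model, opt, labeled_feats, labeled_caps,
                         unlabeled_feats, thresh=0.9):
    """One update mixing a few labeled captions with confident pseudo-labels."""
    ce = nn.CrossEntropyLoss(reduction="none")

    # Supervised loss on the small labeled caption pool.
    sup = ce(model(labeled_feats).transpose(1, 2), labeled_caps).mean()

    # Pseudo-label the unlabeled images; keep only high-confidence tokens.
    with torch.no_grad():
        probs = model(unlabeled_feats).softmax(-1)   # (B, max_len, vocab)
        conf, pseudo = probs.max(-1)                 # both (B, max_len)
    mask = (conf > thresh).float()
    unsup = ce(model(unlabeled_feats).transpose(1, 2), pseudo)
    unsup = (unsup * mask).sum() / mask.sum().clamp(min=1.0)

    opt.zero_grad()
    (sup + unsup).backward()
    opt.step()
    return sup.item(), unsup.item()
```
Under a sketch like this, the paper's scaling curves correspond to sweeping the size of the labeled pool (down to as few as 150 captions) and comparing caption quality against a purely supervised baseline.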
Related papers
- Human-oriented Representation Learning for Robotic Manipulation [64.59499047836637]
Humans inherently possess generalizable visual representations that empower them to explore and interact with their environments efficiently in manipulation tasks.
We formalize this idea through the lens of human-oriented multi-task fine-tuning on top of pre-trained visual encoders.
Our Task Fusion Decoder consistently improves the representation of three state-of-the-art visual encoders for downstream manipulation policy-learning.
arXiv Detail & Related papers (2023-10-04T17:59:38Z) - Universal Multimodal Representation for Language Understanding [110.98786673598015]
This work presents new methods that employ visual information as auxiliary signals for general NLP tasks.
For each sentence, we first retrieve a flexible number of images from a light topic-image lookup table extracted over existing sentence-image pairs.
Then, the text and images are encoded by a Transformer encoder and convolutional neural network, respectively.
arXiv Detail & Related papers (2023-01-09T13:54:11Z) - Bootstrapping meaning through listening: Unsupervised learning of spoken
sentence embeddings [4.582129557845177]
This study tackles the unsupervised learning of semantic representations for spoken utterances.
We propose WavEmbed, a sequential autoencoder that predicts hidden units from a dense representation of speech.
We also propose S-HuBERT to induce meaning through knowledge distillation.
arXiv Detail & Related papers (2022-10-23T21:16:09Z) - Know your audience: specializing grounded language models with listener
subtraction [20.857795779760917]
We take inspiration from Dixit to formulate a multi-agent image reference game.
We show that finetuning an attention-based adapter between a CLIP vision encoder and a large language model in this contrastive, multi-agent setting gives rise to context-dependent natural language specialization.
arXiv Detail & Related papers (2022-06-16T17:52:08Z) - Toward a realistic model of speech processing in the brain with
self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute a promising candidate model of speech processing in the brain.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Leveraging Visual Knowledge in Language Tasks: An Empirical Study on
Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
We study whether integrating visual knowledge into a language model can fill the gap left by text-only pre-training.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
arXiv Detail & Related papers (2022-03-14T22:02:40Z) - Self-play for Data Efficient Language Acquisition [20.86261546611472]
We exploit the symmetric nature of communication to improve the efficiency and quality of language acquisition in learning agents.
We show that using self-play as a substitute for direct supervision enables the agent to transfer its knowledge across roles.
arXiv Detail & Related papers (2020-10-10T02:09:19Z) - Grounded Language Learning Fast and Slow [23.254765095715054]
We show that an embodied agent can exhibit one-shot word learning when trained with conventional reinforcement learning algorithms.
We find that, under certain training conditions, the agent's one-shot word-object binding generalizes to novel exemplars within the same ShapeNet category.
We further show how dual-coding memory can be exploited as a signal for intrinsic motivation, stimulating the agent to seek names for objects that may be useful for executing later instructions.
arXiv Detail & Related papers (2020-09-03T14:52:03Z) - On the interaction between supervision and self-play in emergent
communication [82.290338507106]
We investigate the relationship between these two learning signals, supervision on human data and self-play, with the ultimate goal of improving sample efficiency.
We find that first training agents via supervised learning on human data followed by self-play outperforms the converse.
arXiv Detail & Related papers (2020-02-04T02:35:19Z)