Grounding Spatio-Temporal Language with Transformers
- URL: http://arxiv.org/abs/2106.08858v1
- Date: Wed, 16 Jun 2021 15:28:22 GMT
- Title: Grounding Spatio-Temporal Language with Transformers
- Authors: Tristan Karch, Laetitia Teodorescu, Katja Hofmann, Cl\'ement
Moulin-Frier and Pierre-Yves Oudeyer
- Abstract summary: We introduce a novel-temporal language task to learn the meaning of behavioral traces of an embodied agent.
This is achieved by training a function that predicts if a description matches a given history of observations.
To study the role of architectural generalization in this task, we train several models including multimodal Transformer architectures.
- Score: 22.46291815734606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Language is an interface to the outside world. In order for embodied agents
to use it, language must be grounded in other, sensorimotor modalities. While
there is an extended literature studying how machines can learn grounded
language, the topic of how to learn spatio-temporal linguistic concepts is
still largely uncharted. To make progress in this direction, we here introduce
a novel spatio-temporal language grounding task where the goal is to learn the
meaning of spatio-temporal descriptions of behavioral traces of an embodied
agent. This is achieved by training a truth function that predicts if a
description matches a given history of observations. The descriptions involve
time-extended predicates in past and present tense as well as spatio-temporal
references to objects in the scene. To study the role of architectural biases
in this task, we train several models including multimodal Transformer
architectures; the latter implement different attention computations between
words and objects across space and time. We test models on two classes of
generalization: 1) generalization to randomly held-out sentences; 2)
generalization to grammar primitives. We observe that maintaining object
identity in the attention computation of our Transformers is instrumental to
achieving good performance on generalization overall, and that summarizing
object traces in a single token has little influence on performance. We then
discuss how this opens new perspectives for language-guided autonomous embodied
agents. We also release our code under open-source license as well as
pretrained models and datasets to encourage the wider community to build upon
and extend our work in the future.
Related papers
- Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics [25.2461925479135]
Video-Language Critic is a reward model that can be trained on readily available cross-embodiment data.
Our model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse reward only.
arXiv Detail & Related papers (2024-05-30T12:18:06Z) - Visually Grounded Language Learning: a review of language games,
datasets, tasks, and models [60.2604624857992]
Many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality.
In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field.
arXiv Detail & Related papers (2023-12-05T02:17:29Z) - Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world.
Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future.
We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z) - ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous
States in Realistic 3D Scenes [72.83187997344406]
ARNOLD is a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes.
ARNOLD is comprised of 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals.
arXiv Detail & Related papers (2023-04-09T21:42:57Z) - A Linguistic Investigation of Machine Learning based Contradiction
Detection Models: An Empirical Analysis and Future Perspectives [0.34998703934432673]
We analyze two Natural Language Inference data sets with respect to their linguistic features.
The goal is to identify those syntactic and semantic properties that are particularly hard to comprehend for a machine learning model.
arXiv Detail & Related papers (2022-10-19T10:06:03Z) - Is neural language acquisition similar to natural? A chronological
probing study [0.0515648410037406]
We present the chronological probing study of transformer English models such as MultiBERT and T5.
We compare the information about the language learned by the models in the process of training on corpora.
The results show that 1) linguistic information is acquired in the early stages of training 2) both language models demonstrate capabilities to capture various features from various levels of language.
arXiv Detail & Related papers (2022-07-01T17:24:11Z) - Temporal Attention for Language Models [24.34396762188068]
We extend the key component of the transformer architecture, i.e., the self-attention mechanism, and propose temporal attention.
temporal attention can be applied to any transformer model and requires the input texts to be accompanied with their relevant time points.
We leverage these representations for the task of semantic change detection.
Our proposed model achieves state-of-the-art results on all the datasets.
arXiv Detail & Related papers (2022-02-04T11:55:34Z) - Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z) - Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z) - Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores the universal representation learning, i.e., embeddings of different levels of linguistic unit in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models incorporated with appropriate training settings may effectively yield universal representation.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.