Related papers: Grounding Spatio-Temporal Language with Transformers

Grounding Spatio-Temporal Language with Transformers

URL: http://arxiv.org/abs/2106.08858v1
Date: Wed, 16 Jun 2021 15:28:22 GMT
Title: Grounding Spatio-Temporal Language with Transformers
Authors: Tristan Karch, Laetitia Teodorescu, Katja Hofmann, Cl\'ement Moulin-Frier and Pierre-Yves Oudeyer
Abstract summary: We introduce a novel-temporal language task to learn the meaning of behavioral traces of an embodied agent. This is achieved by training a function that predicts if a description matches a given history of observations. To study the role of architectural generalization in this task, we train several models including multimodal Transformer architectures.
Score: 22.46291815734606
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Language is an interface to the outside world. In order for embodied agents to use it, language must be grounded in other, sensorimotor modalities. While there is an extended literature studying how machines can learn grounded language, the topic of how to learn spatio-temporal linguistic concepts is still largely uncharted. To make progress in this direction, we here introduce a novel spatio-temporal language grounding task where the goal is to learn the meaning of spatio-temporal descriptions of behavioral traces of an embodied agent. This is achieved by training a truth function that predicts if a description matches a given history of observations. The descriptions involve time-extended predicates in past and present tense as well as spatio-temporal references to objects in the scene. To study the role of architectural biases in this task, we train several models including multimodal Transformer architectures; the latter implement different attention computations between words and objects across space and time. We test models on two classes of generalization: 1) generalization to randomly held-out sentences; 2) generalization to grammar primitives. We observe that maintaining object identity in the attention computation of our Transformers is instrumental to achieving good performance on generalization overall, and that summarizing object traces in a single token has little influence on performance. We then discuss how this opens new perspectives for language-guided autonomous embodied agents. We also release our code under open-source license as well as pretrained models and datasets to encourage the wider community to build upon and extend our work in the future.

Related papers

Video-Language Critic: Transferable Reward Functions for Language-Conditioned Robotics [25.2461925479135]
Video-Language Critic is a reward model that can be trained on readily available cross-embodiment data. Our model enables 2x more sample-efficient policy training on Meta-World tasks than a sparse reward only.
arXiv Detail & Related papers (2024-05-30T12:18:06Z)
Visually Grounded Language Learning: a review of language games, datasets, tasks, and models [60.2604624857992]
Many Vision+Language (V+L) tasks have been defined with the aim of creating models that can ground symbols in the visual modality. In this work, we provide a systematic literature review of several tasks and models proposed in the V+L field.
arXiv Detail & Related papers (2023-12-05T02:17:29Z)
Learning to Model the World with Language [100.76069091703505]
To interact with humans and act in the world, agents need to understand the range of language that people use and relate it to the visual world. Our key idea is that agents should interpret such diverse language as a signal that helps them predict the future. We instantiate this in Dynalang, an agent that learns a multimodal world model to predict future text and image representations.
arXiv Detail & Related papers (2023-07-31T17:57:49Z)
ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes [72.83187997344406]
ARNOLD is a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes. ARNOLD is comprised of 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals.
arXiv Detail & Related papers (2023-04-09T21:42:57Z)
A Linguistic Investigation of Machine Learning based Contradiction Detection Models: An Empirical Analysis and Future Perspectives [0.34998703934432673]
We analyze two Natural Language Inference data sets with respect to their linguistic features. The goal is to identify those syntactic and semantic properties that are particularly hard to comprehend for a machine learning model.
arXiv Detail & Related papers (2022-10-19T10:06:03Z)
Is neural language acquisition similar to natural? A chronological probing study [0.0515648410037406]
We present the chronological probing study of transformer English models such as MultiBERT and T5. We compare the information about the language learned by the models in the process of training on corpora. The results show that 1) linguistic information is acquired in the early stages of training 2) both language models demonstrate capabilities to capture various features from various levels of language.
arXiv Detail & Related papers (2022-07-01T17:24:11Z)
Temporal Attention for Language Models [24.34396762188068]
We extend the key component of the transformer architecture, i.e., the self-attention mechanism, and propose temporal attention. temporal attention can be applied to any transformer model and requires the input texts to be accompanied with their relevant time points. We leverage these representations for the task of semantic change detection. Our proposed model achieves state-of-the-art results on all the datasets.
arXiv Detail & Related papers (2022-02-04T11:55:34Z)
Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings. We demonstrate that this framework enables effective generalization across different environments. For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages. We infer this distribution from a sample of typologically diverse training languages. We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores the universal representation learning, i.e., embeddings of different levels of linguistic unit in a uniform vector space. We present our approach of constructing analogy datasets in terms of words, phrases and sentences. We empirically verify that well pre-trained Transformer models incorporated with appropriate training settings may effectively yield universal representation.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.