Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense
Spatiotemporal Grounding
- URL: http://arxiv.org/abs/2010.07954v1
- Date: Thu, 15 Oct 2020 18:01:15 GMT
- Title: Room-Across-Room: Multilingual Vision-and-Language Navigation with Dense
Spatiotemporal Grounding
- Authors: Alexander Ku and Peter Anderson and Roma Patel and Eugene Ie and Jason
Baldridge
- Abstract summary: We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation (VLN) dataset.
RxR is multilingual (English, Hindi, and Telugu) and larger (more paths and instructions) than other VLN datasets.
It emphasizes the role of language in VLN by addressing known biases in paths and eliciting more references to visible entities.
- Score: 75.03682706791389
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Room-Across-Room (RxR), a new Vision-and-Language Navigation
(VLN) dataset. RxR is multilingual (English, Hindi, and Telugu) and larger
(more paths and instructions) than other VLN datasets. It emphasizes the role
of language in VLN by addressing known biases in paths and eliciting more
references to visible entities. Furthermore, each word in an instruction is
time-aligned to the virtual poses of instruction creators and validators. We
establish baseline scores for monolingual and multilingual settings and
multitask learning when including Room-to-Room annotations. We also provide
results for a model that learns from synchronized pose traces by focusing only
on portions of the panorama attended to in human demonstrations. The size,
scope and detail of RxR dramatically expand the frontier for research on
embodied language agents in simulated, photo-realistic environments.
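As a rough illustration of how time-aligned pose traces might be consumed, the sketch below masks panorama features down to the view directions an annotator looked at while speaking. The discretization into 36 views, the feature dimension, and all names are illustrative assumptions, not the dataset's actual schema or the baseline model's implementation.
```python
import numpy as np

# Assumed shapes: a panorama discretized into 36 view angles, each with a
# visual feature vector; a pose trace giving, for every instruction word,
# the heading (radians) the annotator was facing when the word was spoken.
NUM_VIEWS, FEAT_DIM = 36, 512

def view_index(heading_rad: float) -> int:
    """Map a heading in radians to one of NUM_VIEWS discretized panorama views."""
    return int((heading_rad % (2 * np.pi)) / (2 * np.pi) * NUM_VIEWS) % NUM_VIEWS

def masked_pano_features(pano_feats: np.ndarray,
                         word_headings: list[float]) -> np.ndarray:
    """Zero out panorama views the annotator never attended to.

    pano_feats:    (NUM_VIEWS, FEAT_DIM) image features for one panorama.
    word_headings: annotator headings time-aligned to each instruction word.
    """
    mask = np.zeros(NUM_VIEWS, dtype=bool)
    for h in word_headings:
        mask[view_index(h)] = True
    return pano_feats * mask[:, None]

# Toy usage: only the views attended to during the demonstration survive.
feats = np.random.rand(NUM_VIEWS, FEAT_DIM).astype(np.float32)
attended = masked_pano_features(feats, word_headings=[0.1, 0.15, 1.6])
```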
Related papers
- Thank You, Stingray: Multilingual Large Language Models Can Not (Yet) Disambiguate Cross-Lingual Word Sense [30.62699081329474]
We introduce a novel benchmark for cross-lingual sense disambiguation, StingrayBench.
We collect false friends in four language pairs, namely Indonesian-Malay, Indonesian-Tagalog, Chinese-Japanese, and English-German.
In our analysis of various models, we observe they tend to be biased toward higher-resource languages.
arXiv Detail & Related papers (2024-10-28T22:09:43Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
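A minimal sketch of this kind of pipeline is shown below. The caption and detection callables stand in for unspecified off-the-shelf vision systems, and the prompt format is invented for illustration; the actual models and prompts used by LangNav may differ.
```python
from typing import Callable, List

# Hypothetical components standing in for off-the-shelf vision systems.
Caption = Callable[[bytes], str]        # image -> one-sentence caption
Detect = Callable[[bytes], List[str]]   # image -> detected object labels

def views_to_text(views: List[bytes], caption: Caption, detect: Detect) -> str:
    """Describe an egocentric panorama as text, one line per view direction."""
    lines = []
    for i, img in enumerate(views):
        objs = ", ".join(detect(img)) or "nothing notable"
        lines.append(f"View {i}: {caption(img)} Visible objects: {objs}.")
    return "\n".join(lines)

def navigation_prompt(instruction: str, scene_text: str) -> str:
    """Assemble a text-only prompt so a language model can choose the next view."""
    return (f"Instruction: {instruction}\n"
            f"Current scene:\n{scene_text}\n"
            "Which view number should the agent move toward?")
```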
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations [98.30038910061894]
Vision-and-Language Navigation (VLN) tasks require an agent to navigate through the environment based on language instructions.
We propose CLEAR: Cross-Lingual and Environment-Agnostic Representations.
Our language and visual representations can be successfully transferred to the Room-to-Room and Cooperative Vision-and-Dialogue Navigation tasks.
arXiv Detail & Related papers (2022-07-05T17:38:59Z)
- The Geometry of Multilingual Language Model Representations [25.880639246639323]
We assess how multilingual language models maintain a shared multilingual representation space while still encoding language-sensitive information in each language.
The means of the language-specific subspaces differ along language-sensitive axes that remain relatively stable through the middle layers, and these axes encode information such as token vocabularies.
We visualize representations projected onto language-sensitive and language-neutral axes, identifying language family and part-of-speech clusters, along with spirals, toruses, and curves representing token position information.
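The sketch below shows one simple way such language-sensitive axes could be estimated and used for projection: take the top principal directions of the per-language mean vectors and project hidden states onto them. This is an assumption-laden toy, not necessarily the paper's exact procedure, and the data here is random.
```python
import numpy as np

def language_axes(reps_by_lang: dict[str, np.ndarray], k: int = 2) -> np.ndarray:
    """Estimate language-sensitive axes as the top principal directions
    of the per-language mean vectors (one simple choice among many)."""
    means = np.stack([r.mean(axis=0) for r in reps_by_lang.values()])
    centered = means - means.mean(axis=0)
    # Right singular vectors of the centered means span the directions
    # along which the language means differ most.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k]                        # (k, hidden_dim)

def project(reps: np.ndarray, axes: np.ndarray) -> np.ndarray:
    """Coordinates of each representation along the given axes."""
    return reps @ axes.T

# Toy example with random "hidden states" for three languages.
rng = np.random.default_rng(0)
reps = {lang: rng.normal(size=(100, 64)) for lang in ("en", "hi", "te")}
axes = language_axes(reps)
en_coords = project(reps["en"], axes)    # (100, 2) language-sensitive coordinates
```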
arXiv Detail & Related papers (2022-05-22T23:58:24Z)
- IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages [87.5457337866383]
We introduce the Image-Grounded Language Understanding Evaluation (IGLUE) benchmark.
IGLUE brings together visual question answering, cross-modal retrieval, grounded reasoning, and grounded entailment tasks across 20 diverse languages.
We find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
arXiv Detail & Related papers (2022-01-27T18:53:22Z)
- Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z)
- UC2: Universal Cross-lingual Cross-modal Vision-and-Language Pre-training [52.852163987208826]
UC2 is the first machine translation-augmented framework for cross-lingual cross-modal representation learning.
We propose two novel pre-training tasks, namely Masked Region-to-Token Modeling (MRTM) and Visual Translation Language Modeling (VTLM).
Our proposed framework achieves new state-of-the-art on diverse non-English benchmarks while maintaining comparable performance to monolingual pre-trained models on English tasks.
arXiv Detail & Related papers (2021-04-01T08:30:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.