Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation
- URL: http://arxiv.org/abs/2409.05583v1
- Date: Mon, 9 Sep 2024 13:12:11 GMT
- Title: Spatially-Aware Speaker for Vision-and-Language Navigation Instruction Generation
- Authors: Muraleekrishna Gopinathan, Martin Masek, Jumana Abu-Khalaf, David Suter
- Abstract summary: SAS (Spatially-Aware Speaker) is an instruction generator that uses both structural and semantic knowledge of the environment to produce richer instructions.
Our method outperforms existing instruction generation models, evaluated using standard metrics.
- Score: 8.931633531104021
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Embodied AI aims to develop robots that can understand and execute human language instructions, as well as communicate in natural languages. On this front, we study the task of generating highly detailed navigational instructions for the embodied robots to follow. Although recent studies have demonstrated significant leaps in the generation of step-by-step instructions from sequences of images, the generated instructions lack variety in terms of their referral to objects and landmarks. Existing speaker models learn strategies to evade the evaluation metrics and obtain higher scores even for low-quality sentences. In this work, we propose SAS (Spatially-Aware Speaker), an instruction generator or Speaker model that utilises both structural and semantic knowledge of the environment to produce richer instructions. For training, we employ a reward learning method in an adversarial setting to avoid systematic bias introduced by language evaluation metrics. Empirically, our method outperforms existing instruction generation models, evaluated using standard metrics. Our code is available at https://github.com/gmuraleekrishna/SAS.
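A minimal sketch of the adversarial reward-learning idea described above: a discriminator learns to tell human instructions from generated ones, and its score replaces a fixed text metric as the speaker's policy-gradient reward. All architectures, sizes, and names below are illustrative assumptions, not the authors' implementation (see the linked repository for the actual code).

```python
# Hypothetical sketch of adversarial reward learning for a speaker model.
# A discriminator scores (instruction, route) pairs; the speaker is trained
# with REINFORCE using that score as its reward, instead of BLEU/CIDEr.
import torch
import torch.nn as nn

VOCAB, HID = 1000, 64

class Speaker(nn.Module):
    """Toy instruction generator conditioned on a route embedding."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.out = nn.Linear(HID, VOCAB)

    def forward(self, tokens, route):
        x, _ = self.rnn(self.emb(tokens), route.unsqueeze(0))
        return self.out(x)                            # next-token logits

class Discriminator(nn.Module):
    """Scores how human-like an instruction looks for a given route."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HID)
        self.rnn = nn.GRU(HID, HID, batch_first=True)
        self.score = nn.Linear(2 * HID, 1)

    def forward(self, tokens, route):
        _, h = self.rnn(self.emb(tokens))
        return torch.sigmoid(self.score(torch.cat([h[-1], route], dim=-1)))

speaker, disc = Speaker(), Discriminator()
opt_s = torch.optim.Adam(speaker.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-3)
route = torch.randn(4, HID)                   # stand-in route features
human = torch.randint(0, VOCAB, (4, 12))      # stand-in human instructions

for step in range(10):
    # 1) Sample instructions from the speaker, keeping log-probabilities.
    tokens, logps = torch.zeros(4, 1, dtype=torch.long), []
    for _ in range(11):
        dist = torch.distributions.Categorical(
            logits=speaker(tokens, route)[:, -1])
        nxt = dist.sample()
        logps.append(dist.log_prob(nxt))
        tokens = torch.cat([tokens, nxt.unsqueeze(1)], dim=1)

    # 2) Discriminator update: human = real, generated = fake.
    d_loss = -(torch.log(disc(human, route) + 1e-8).mean()
               + torch.log(1 - disc(tokens, route) + 1e-8).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 3) Speaker update: REINFORCE with the discriminator score as reward.
    reward = disc(tokens, route).squeeze(-1).detach()
    s_loss = -(torch.stack(logps).sum(0) * reward).mean()
    opt_s.zero_grad(); s_loss.backward(); opt_s.step()
```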
Related papers
- Object-Centric Instruction Augmentation for Robotic Manipulation [29.491990994901666]
We introduce the Object-Centric Instruction Augmentation (OCI) framework to augment semantically rich, information-dense language instructions with position cues.
We utilize a Multi-modal Large Language Model (MLLM) to weave knowledge of object locations into natural language instructions.
We demonstrate that robotic manipulator imitation policies trained with our enhanced instructions outperform those relying solely on traditional language instructions.
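As rough intuition for what "weaving position cues into an instruction" might look like, here is a toy, purely hypothetical template function; the actual OCI framework delegates this to an MLLM.

```python
# Toy stand-in (not the OCI framework) for augmenting an instruction with
# coarse position cues derived from object detections.
from typing import Dict, Tuple

def augment_instruction(instruction: str,
                        positions: Dict[str, Tuple[float, float]]) -> str:
    """Append a coarse position cue for each object the instruction mentions."""
    cues = []
    for obj, (x, y) in positions.items():
        if obj in instruction:
            side = "left" if x < 0.5 else "right"
            cues.append(f"the {obj} is on the {side}, about {y:.1f} m ahead")
    return instruction + " (" + "; ".join(cues) + ")" if cues else instruction

print(augment_instruction(
    "pick up the mug next to the plate",
    {"mug": (0.3, 0.6), "plate": (0.7, 0.6)},
))
```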
arXiv Detail & Related papers (2024-01-05T13:54:45Z) - Accessible Instruction-Following Agent [0.0]
We introduce UVLN, a novel machine-translation-based instruction augmentation framework for cross-lingual vision-language navigation.
We extend the standard VLN training objectives to a multilingual setting via a cross-lingual language encoder.
Experiments on the Room Across Room dataset demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2023-05-08T23:57:26Z) - Contrastive Language, Action, and State Pre-training for Robot Learning [1.1000499414131326]
We introduce a method for unifying language, action, and state information in a shared embedding space to facilitate a range of downstream tasks in robot learning.
Our method, Contrastive Language, Action, and State Pre-training (CLASP), extends the CLIP formulation by incorporating distributional learning, capturing the inherent complexities and one-to-many relationships in behaviour-text alignment.
We demonstrate the utility of our method for the following downstream tasks: zero-shot text-behaviour retrieval, captioning unseen robot behaviours, and learning a behaviour prior to language-conditioned reinforcement learning.
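For reference, the CLIP-style formulation that CLASP extends is a symmetric contrastive (InfoNCE) loss over paired embeddings, sketched below with point embeddings; the distributional component the abstract describes is omitted, and all shapes and encoders are assumptions.

```python
# Minimal CLIP-style alignment of behaviour and text embeddings. CLASP
# additionally models distributions over embeddings to capture one-to-many
# behaviour-text pairings; this sketch keeps point embeddings for brevity.
import torch
import torch.nn.functional as F

def info_nce(behaviour_z, text_z, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    b = F.normalize(behaviour_z, dim=-1)
    t = F.normalize(text_z, dim=-1)
    logits = b @ t.T / temperature        # (N, N) similarity matrix
    labels = torch.arange(len(b))         # positives lie on the diagonal
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

# Toy paired batch: 8 behaviour clips and their 8 captions, pre-encoded.
behaviour_z = torch.randn(8, 128)
text_z = torch.randn(8, 128)
print(info_nce(behaviour_z, text_z).item())
```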
arXiv Detail & Related papers (2023-04-21T07:19:33Z) - Lana: A Language-Capable Navigator for Instruction Following and Generation [70.76686546473994]
LANA is a language-capable navigation agent that can both execute human-written navigation commands and provide route descriptions to humans.
We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performance on both instruction following and route description.
In addition, endowed with language generation capability, LANA can explain its behaviours to humans and assist them in wayfinding.
arXiv Detail & Related papers (2023-03-15T07:21:28Z) - Language-Driven Representation Learning for Robotics [115.93273609767145]
Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks.
We introduce Voltron, a framework for language-driven representation learning from human videos and captions.
We find that Voltron's language-driven learning outperforms the prior state-of-the-art, especially on targeted problems requiring higher-level control.
arXiv Detail & Related papers (2023-02-24T17:29:31Z) - FOAM: A Follower-aware Speaker Model For Vision-and-Language Navigation [45.99831101677059]
We present FOAM, a Follower-Aware speaker Model that is constantly updated given the follower's feedback.
We optimize the speaker using a bi-level optimization framework and obtain its training signals by evaluating the follower on labeled data.
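A crude illustration of the bi-level idea, under assumed toy linear models: the follower takes one differentiable gradient step on speaker-generated data, and its loss on labelled data is backpropagated into the speaker, MAML-style. The paper's actual formulation differs in its details.

```python
# Toy MAML-style approximation of a follower-aware speaker update (an assumed
# simplification, not the paper's exact bi-level optimisation).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
speaker_w = torch.randn(16, 16, requires_grad=True)   # route -> "instruction"
follower_w = torch.randn(16, 4, requires_grad=True)   # instruction -> action
opt = torch.optim.Adam([speaker_w, follower_w], lr=1e-2)

routes = torch.randn(32, 16)                  # unlabelled routes
pseudo_actions = torch.randint(0, 4, (32,))   # placeholder follower targets
val_instr = torch.randn(8, 16)                # labelled instructions
val_actions = torch.randint(0, 4, (8,))       # labelled actions
inner_lr = 0.1

for step in range(200):
    # Inner step: follower adapts to instructions generated by the speaker.
    gen_instr = routes @ speaker_w
    inner_loss = F.cross_entropy(gen_instr @ follower_w, pseudo_actions)
    grad_f, = torch.autograd.grad(inner_loss, follower_w, create_graph=True)
    adapted_f = follower_w - inner_lr * grad_f

    # Outer step: evaluate the adapted follower on labelled data; this loss
    # carries gradients back through the adaptation into the speaker.
    outer_loss = F.cross_entropy(val_instr @ adapted_f, val_actions)
    opt.zero_grad(); outer_loss.backward(); opt.step()
```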
arXiv Detail & Related papers (2022-06-09T06:11:07Z) - Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation [172.15808300686584]
We describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each.
Our approach improves the performance of various follower models and produces accurate navigation instructions.
arXiv Detail & Related papers (2022-03-30T18:15:26Z) - Skill Induction and Planning with Latent Language [94.55783888325165]
We formulate a generative model of action sequences in which goals generate sequences of high-level subtask descriptions.
We describe how to train this model using primarily unannotated demonstrations, parsing them into sequences of named high-level subtasks.
In trained models, the space of natural language commands indexes a library of skills; agents can use these skills to plan by generating high-level instruction sequences tailored to novel goals.
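A toy sketch of language indexing a skill library: a goal expands into natural-language subtask descriptions, each of which selects a low-level skill. The hard-coded lookups stand in for the learned generative model and are purely hypothetical.

```python
# Illustrative "language as the latent plan": goal -> subtask descriptions ->
# skills. Both lookups here are hypothetical stand-ins for learned models.
from typing import Callable, Dict, List

# Skill library: natural-language commands index executable skills.
SKILLS: Dict[str, Callable[[], List[str]]] = {
    "go to the kitchen":  lambda: ["move_fwd", "turn_left", "move_fwd"],
    "pick up the mug":    lambda: ["reach", "grasp"],
    "place mug on table": lambda: ["move_fwd", "release"],
}

def plan_subtasks(goal: str) -> List[str]:
    """Stand-in for the learned goal -> subtask-description generator."""
    return {
        "fetch the mug": ["go to the kitchen", "pick up the mug",
                          "place mug on table"],
    }.get(goal, [])

def execute(goal: str) -> List[str]:
    actions: List[str] = []
    for subtask in plan_subtasks(goal):       # language is the latent plan
        actions += SKILLS[subtask]()          # skills indexed by description
    return actions

print(execute("fetch the mug"))
```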
arXiv Detail & Related papers (2021-10-04T15:36:32Z) - Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language-conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z) - On the Evaluation of Vision-and-Language Navigation Instructions [76.92085026018427]
Vision-and-Language Navigation wayfinding agents can be enhanced by exploiting automatically generated navigation instructions.
Existing instruction generators have not been comprehensively evaluated.
BLEU, ROUGE, METEOR and CIDEr are ineffective for evaluating grounded navigation instructions.
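A quick illustration of why n-gram overlap can mislead for grounded instructions: flipping a single direction word barely moves BLEU, even though the instruction now sends the agent the wrong way. The example below uses NLTK's sentence-level BLEU on invented sentences.

```python
# N-gram metrics are insensitive to small but navigation-critical edits.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

ref = "turn left at the sofa and stop by the red door".split()
good = "turn left at the sofa and stop by the red door".split()
bad = "turn right at the sofa and stop by the red door".split()

smooth = SmoothingFunction().method1
print(sentence_bleu([ref], good, smoothing_function=smooth))  # 1.0
print(sentence_bleu([ref], bad, smoothing_function=smooth))   # ~0.8, still high
```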
arXiv Detail & Related papers (2021-01-26T01:03:49Z) - The Turking Test: Can Language Models Understand Instructions? [45.266428794559495]
We present the Turking Test, which examines a model's ability to follow natural language instructions of varying complexity.
Despite our lenient evaluation methodology, we observe that a large pretrained language model performs poorly across all tasks.
arXiv Detail & Related papers (2020-10-22T18:44:16Z)