Vision-and-Language Navigation Generative Pretrained Transformer
- URL: http://arxiv.org/abs/2405.16994v1
- Date: Mon, 27 May 2024 09:42:04 GMT
- Title: Vision-and-Language Navigation Generative Pretrained Transformer
- Authors: Wen Hanlin
- Abstract summary: The Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT) adopts a transformer decoder model (GPT-2) to model trajectory sequence dependencies, bypassing the need for historical encoding modules.
Performance assessments on the VLN dataset reveal that VLN-GPT surpasses complex state-of-the-art encoder-based models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In the Vision-and-Language Navigation (VLN) field, agents are tasked with navigating real-world scenes guided by linguistic instructions. Enabling the agent to adhere to instructions throughout the process of navigation represents a significant challenge within the domain of VLN. To address this challenge, common approaches often rely on encoders to explicitly record past locations and actions, increasing model complexity and resource consumption. Our proposal, the Vision-and-Language Navigation Generative Pretrained Transformer (VLN-GPT), adopts a transformer decoder model (GPT-2) to model trajectory sequence dependencies, bypassing the need for historical encoding modules. This method allows for direct historical information access through the trajectory sequence, enhancing efficiency. Furthermore, our model separates the training process into offline pre-training with imitation learning and online fine-tuning with reinforcement learning. This distinction allows for more focused training objectives and improved performance. Performance assessments on the VLN dataset reveal that VLN-GPT surpasses complex state-of-the-art encoder-based models.
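A minimal sketch of the core idea in the abstract: a decoder-only (GPT-2 style) transformer reads the instruction and the trajectory as a single causal token sequence, so past observations and actions remain accessible through self-attention rather than through a separate history encoder, and the offline stage is trained by imitation learning on expert actions. The PyTorch module below, its name TrajectoryGPT, the dimensions, and the toy action vocabulary are illustrative assumptions, not the paper's implementation.
```python
import torch
import torch.nn as nn

class TrajectoryGPT(nn.Module):
    """Decoder-only trajectory model (hypothetical sketch, not the authors' code)."""
    def __init__(self, d_model=256, n_heads=8, n_layers=4,
                 n_actions=6, obs_dim=512, instr_vocab=1000):
        super().__init__()
        self.instr_embed = nn.Embedding(instr_vocab, d_model)  # instruction tokens
        self.obs_proj = nn.Linear(obs_dim, d_model)             # visual observation features
        self.act_embed = nn.Embedding(n_actions, d_model)       # previous actions
        self.pos_embed = nn.Embedding(1024, d_model)            # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=4 * d_model,
                                           batch_first=True)
        # A GPT-style (decoder-only) stack: encoder layers plus a causal mask.
        self.decoder = nn.TransformerEncoder(layer, n_layers)
        self.action_head = nn.Linear(d_model, n_actions)

    def forward(self, instr_ids, obs_feats, prev_actions):
        # instr_ids: (B, L_i); obs_feats: (B, T, obs_dim); prev_actions: (B, T)
        B, T, _ = obs_feats.shape
        steps = self.obs_proj(obs_feats) + self.act_embed(prev_actions)
        x = torch.cat([self.instr_embed(instr_ids), steps], dim=1)
        pos = torch.arange(x.size(1), device=x.device)
        x = x + self.pos_embed(pos)
        # Causal mask: each position attends only to earlier positions, so the
        # full history is available without an explicit history-encoding module.
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.decoder(x, mask=mask)
        # Predict an action distribution at every trajectory step.
        return self.action_head(h[:, -T:])

# Offline imitation-learning style step: cross-entropy against expert actions.
model = TrajectoryGPT()
instr = torch.randint(0, 1000, (2, 12))
obs = torch.randn(2, 5, 512)
prev = torch.randint(0, 6, (2, 5))     # a_{t-1} fed as input (a_0 can be a start token)
expert = torch.randint(0, 6, (2, 5))   # a_t, the imitation-learning targets
logits = model(instr, obs, prev)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 6), expert.reshape(-1))
loss.backward()
```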
Related papers
- AIGeN: An Adversarial Approach for Instruction Generation in VLN [35.932836008492174]
We propose AIGeN, a novel architecture inspired by Generative Adversarial Networks (GANs) that produces meaningful and well-formed synthetic instructions to improve navigation agents' performance.
We generate synthetic instructions for 217K trajectories using AIGeN on Habitat-Matterport 3D dataset (HM3D) and show an improvement in the performance of an off-the-shelf VLN method.
arXiv Detail & Related papers (2024-04-15T18:00:30Z) - OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation [96.46961207887722]
OVER-NAV aims to go over and beyond the current arts of IVLN techniques.
To fully exploit the interpreted navigation data, we introduce a structured representation, coded Omnigraph.
arXiv Detail & Related papers (2024-03-26T02:34:48Z) - Continual Vision-and-Language Navigation [18.20829279972436]
Vision-and-Language Navigation (VLN) agents navigate to a destination using natural language instructions and the visual information they observe.
Existing methods for training VLN agents presuppose fixed datasets, leading to a significant limitation.
We present the Continual Vision-and-Language Navigation (CVLN) paradigm, designed to evaluate agents trained through a continual learning process.
arXiv Detail & Related papers (2024-03-22T09:15:36Z) - NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), in which we perform parameter-efficient in-domain training to enable self-guided navigational decisions.
arXiv Detail & Related papers (2024-03-12T07:27:02Z) - VLN-Video: Utilizing Driving Videos for Outdoor Vision-and-Language Navigation [59.3649071376364]
The performance of existing VLN methods is limited by insufficient diversity in navigation environments and limited training data.
We propose VLN-Video, which utilizes the diverse outdoor environments present in driving videos from multiple U.S. cities, augmented with automatically generated navigation instructions.
VLN-Video significantly outperforms previous state-of-the-art models by 2.1% in task completion rate.
arXiv Detail & Related papers (2024-02-05T22:20:19Z) - ViNT: A Foundation Model for Visual Navigation [52.2571739391896]
Visual Navigation Transformer (ViNT) is a foundation model for vision-based robotic navigation.
ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset.
It exhibits positive transfer, outperforming specialist models trained on singular datasets.
arXiv Detail & Related papers (2023-06-26T16:57:03Z) - PASTS: Progress-Aware Spatio-Temporal Transformer Speaker For Vision-and-Language Navigation [6.11362142120604]
Vision-and-language navigation (VLN) is a crucial but challenging cross-modal navigation task.
One powerful technique to enhance the performance in VLN is the use of an independent speaker model to provide pseudo instructions for data augmentation.
We propose a novel progress-aware spatio-temporal transformer speaker (PASTS) model that uses a transformer as the core of the network.
arXiv Detail & Related papers (2023-05-19T02:25:56Z) - Goal-Guided Transformer-Enabled Reinforcement Learning for Efficient Autonomous Navigation [15.501449762687148]
We present a Goal-guided Transformer-enabled reinforcement learning (GTRL) approach for goal-driven navigation.
Our approach motivates the scene representation to concentrate mainly on goal-relevant features, which substantially enhances the data efficiency of the DRL learning process.
Both simulation and real-world experimental results manifest the superiority of our approach in terms of data efficiency, performance, robustness, and sim-to-real generalization.
arXiv Detail & Related papers (2023-01-01T07:14:30Z) - A Recurrent Vision-and-Language BERT for Navigation [54.059606864535304]
We propose a recurrent BERT model that is time-aware for use in vision-and-language navigation.
Our model can replace more complex encoder-decoder models to achieve state-of-the-art results.
arXiv Detail & Related papers (2020-11-26T00:23:00Z) - Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training [150.35927365127176]
We present the first pre-training and fine-tuning paradigm for vision-and-language navigation (VLN) tasks.
By training on a large number of image-text-action triplets in a self-supervised manner, the pre-trained model provides generic representations of visual environments and language instructions.
It learns more effectively in new tasks and generalizes better in a previously unseen environment.
arXiv Detail & Related papers (2020-02-25T03:08:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.