NavAI: A Generalizable LLM Framework for Navigation Tasks in Virtual Reality Environments
- URL: http://arxiv.org/abs/2601.03251v1
- Date: Tue, 06 Jan 2026 18:54:54 GMT
- Authors: Xue Qin, Matthew DiGiovanni,
- Abstract summary: NavAI is a generalizable large language model (LLM)-based navigation framework that supports both basic actions and complex goal-directed tasks. We evaluate NavAI in three distinct VR environments through goal-oriented and exploratory tasks. Results show that it achieves high accuracy, with an 89% success rate in goal-oriented tasks.
- Score: 0.6732076464377242
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Navigation is one of the fundamental tasks for automated exploration in Virtual Reality (VR). Existing technologies primarily focus on path optimization in 360-degree image datasets and 3D simulators, which cannot be directly applied to immersive VR environments. To address this gap, we present NavAI, a generalizable large language model (LLM)-based navigation framework that supports both basic actions and complex goal-directed tasks across diverse VR applications. We evaluate NavAI in three distinct VR environments through goal-oriented and exploratory tasks. Results show that it achieves high accuracy, with an 89% success rate in goal-oriented tasks. Our analysis also highlights current limitations of relying entirely on LLMs, particularly in scenarios that require dynamic goal assessment. Finally, we discuss the limitations observed during the experiments and offer insights for future research directions.
Related papers
- OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [54.661157616245966]
Open-world navigation requires robots to make decisions in complex everyday environments.
Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language.
We propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models.
(2026-03-05T17:02:22Z)
- 3DGSNav: Enhancing Vision-Language Model Reasoning for Object Navigation via Active 3D Gaussian Splatting [12.057873540714098]
3DGSNav is a novel framework that embeds 3D Gaussian Splatting (3DGS) as persistent memory for vision-language models (VLMs) to enhance spatial reasoning.
3DGSNav incrementally constructs a 3DGS representation of the environment, enabling trajectory-guided free-viewpoint rendering of frontier-aware first-person views.
During navigation, a real-time object detector filters potential targets, while VLM-driven active viewpoint switching performs target re-verification.
(2026-02-12T16:41:26Z)
- History-Augmented Vision-Language Models for Frontier-Based Zero-Shot Object Navigation [5.343932820859596]
This paper introduces a novel zero-shot ObjectNav framework that pioneers the use of dynamic, history-aware prompting.
Our core innovation lies in providing the VLM with action history context, enabling it to generate semantic guidance scores for navigation actions.
We also introduce a VLM-assisted waypoint generation mechanism for refining the final approach to detected objects.
(2025-06-19T21:50:16Z)
- SemNav: A Model-Based Planner for Zero-Shot Object Goal Navigation Using Vision-Foundation Models [10.671262416557704]
Vision Foundation Models (VFMs) offer powerful capabilities for visual understanding and reasoning.
We present a zero-shot object goal navigation framework that integrates the perceptual strength of VFMs with a model-based planner.
We evaluate our approach on the HM3D dataset using the Habitat simulator and demonstrate that our method achieves state-of-the-art performance.
(2025-06-04T03:04:54Z)
- AdaVLN: Towards Visual Language Navigation in Continuous Indoor Environments with Moving Humans [2.940962519388297]
We propose an extension to the task, termed Adaptive Visual Language Navigation (AdaVLN).
AdaVLN requires robots to navigate complex 3D indoor environments populated with dynamically moving human obstacles.
We evaluate several baseline models on this task, analyze the unique challenges introduced by AdaVLN, and demonstrate its potential to bridge the sim-to-real gap in VLN research.
(2024-11-27T17:36:08Z)
- DivScene: Towards Open-Vocabulary Object Navigation with Large Vision Language Models in Diverse Scenes [76.24687327731031]
We first study the challenge of open-vocabulary object navigation by introducing DivScene.
Our dataset provides a much greater diversity of target objects and scene types than existing datasets.
We fine-tuned LVLMs to predict the next action with CoT explanations.
(2024-10-03T17:49:28Z)
- Navigation with VLM framework: Towards Going to Any Language [4.368039454973151]
Vision Language Models (VLMs) have demonstrated remarkable capabilities to reason with both language and visual data.
We introduce Navigation with VLM (NavVLM), a training-free framework that harnesses open-source VLMs to enable robots to navigate effectively.
(2024-09-18T02:29:00Z)
- NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning Disentangled Reasoning [97.88246428240872]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we fulfill parameter-efficient in-domain training to enable self-guided navigational decision.
(2024-03-12T07:27:02Z)
- CorNav: Autonomous Agent with Self-Corrected Planning for Zero-Shot Vision-and-Language Navigation [73.78984332354636]
CorNav is a novel zero-shot framework for vision-and-language navigation.
It incorporates environmental feedback for refining future plans and adjusting its actions.
It consistently outperforms all baselines in a zero-shot multi-task setting.
(2023-06-17T11:44:04Z)
- Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration for Zero-Shot Object Navigation [58.3480730643517]
We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON).
Our approach makes use of Large Language Models (LLMs) for this task.
We achieve state-of-the-art zero-shot object navigation results on RoboTHOR with a success rate (SR) improvement of over 27% over the current baseline.
(2023-03-06T20:19:19Z)
- ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation [75.13546386761153]
We present a novel zero-shot object navigation method, Exploration with Soft Commonsense constraints (ESC).
ESC transfers commonsense knowledge in pre-trained models to open-world object navigation without any navigation experience.
Experiments on MP3D, HM3D, and RoboTHOR benchmarks show that our ESC method improves significantly over baselines.
(2023-01-30T18:37:32Z)
- Improving Target-driven Visual Navigation with Attention on 3D Spatial Relationships [52.72020203771489]
We investigate target-driven visual navigation using deep reinforcement learning (DRL) in 3D indoor scenes.
Our proposed method combines visual features and 3D spatial representations to learn navigation policy.
Our experiments, performed in AI2-THOR, show that our model outperforms the baselines in both SR and SPL metrics.
(2020-04-29T08:46:38Z)