An Embodied AR Navigation Agent: Integrating BIM with Retrieval-Augmented Generation for Language Guidance
- URL: http://arxiv.org/abs/2508.16602v1
- Date: Sun, 10 Aug 2025 15:13:23 GMT
- Title: An Embodied AR Navigation Agent: Integrating BIM with Retrieval-Augmented Generation for Language Guidance
- Authors: Hsuan-Kung Yang, Tsu-Ching Hsiao, Ryoichiro Oka, Ryuya Nishino, Satoko Tofukuji, Norimasa Kobori
- Abstract summary: We propose an embodied AR navigation system that supports flexible, language-driven goal retrieval and route planning. The system orchestrates three language agents, Triage, Search, and Response, built on large language models. A real-world user study yields a System Usability Scale (SUS) score of 80.5, indicating excellent usability.
- Score: 8.217670177708632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Delivering intelligent and adaptive navigation assistance in augmented reality (AR) requires more than visual cues, as it demands systems capable of interpreting flexible user intent and reasoning over both spatial and semantic context. Prior AR navigation systems often rely on rigid input schemes or predefined commands, which limit the utility of rich building data and hinder natural interaction. In this work, we propose an embodied AR navigation system that integrates Building Information Modeling (BIM) with a multi-agent retrieval-augmented generation (RAG) framework to support flexible, language-driven goal retrieval and route planning. The system orchestrates three language agents, Triage, Search, and Response, built on large language models (LLMs), which enables robust interpretation of open-ended queries and spatial reasoning using BIM data. Navigation guidance is delivered through an embodied AR agent, equipped with voice interaction and locomotion, to enhance user experience. A real-world user study yields a System Usability Scale (SUS) score of 80.5, indicating excellent usability, and comparative evaluations show that the embodied interface can significantly improve users' perception of system intelligence. These results underscore the importance and potential of language-grounded reasoning and embodiment in the design of user-centered AR navigation systems.
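The three-agent flow named in the abstract (Triage, Search, Response) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper builds the agents on LLMs and BIM data, whereas the sketch below swaps in plain keyword matching over a hand-written room list so it runs without any model or BIM file. The names `Room`, `ROOMS`, `NAV_WORDS`, `triage`, `search`, and `respond` are all hypothetical, and route planning is left out.

```python
from dataclasses import dataclass


@dataclass
class Room:
    name: str
    floor: int
    description: str


# Hypothetical stand-in for room records extracted from a BIM model.
ROOMS = [
    Room("Cafeteria", 1, "food lunch coffee seating"),
    Room("Meeting Room A", 2, "meeting projector whiteboard"),
    Room("Restroom", 1, "toilet washroom"),
]

# Assumed cue words that mark a query as a navigation request.
NAV_WORDS = {"where", "find", "go", "take", "navigate", "nearest"}


def tokens(text):
    """Lowercase the text and strip basic punctuation from each word."""
    return {w.strip("?.,!").lower() for w in text.split()}


def triage(query):
    """Stub for the Triage agent: route the query to navigation or small talk."""
    return "navigate" if tokens(query) & NAV_WORDS else "chat"


def search(query, rooms=ROOMS):
    """Stub for the Search agent: pick the room with the most keyword overlap."""
    q = tokens(query)

    def score(room):
        return len(q & tokens(room.name + " " + room.description))

    best = max(rooms, key=score)
    return best if score(best) > 0 else None


def respond(query):
    """Stub for the Response agent: turn the retrieval result into a reply."""
    if triage(query) != "navigate":
        return "I can help you find places in this building."
    room = search(query)
    if room is None:
        return "Sorry, I could not find a matching place."
    return f"Heading to {room.name} on floor {room.floor}."


print(respond("Where can I get coffee?"))
```

In a fuller system, the keyword overlap in `search` would presumably be replaced by LLM-driven retrieval over BIM attributes, and the reply would be handed to the embodied AR agent together with a planned route.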
Related papers
- OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [54.661157616245966]
Open-world navigation requires robots to make decisions in complex everyday environments. Recent advances in vision-language navigation (VLN) and vision-language-action (VLA) models enable end-to-end policies conditioned on natural language. We propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision-language prior models.
arXiv Detail & Related papers (2026-03-05T17:02:22Z)
- NaviSense: A Multimodal Assistive Mobile application for Object Retrieval by Persons with Visual Impairment [21.405966774051326]
'NaviSense' is a mobile assistive system that combines conversational AI, vision-language models, augmented reality (AR), and LiDAR. Users specify objects via natural language and receive continuous spatial feedback to navigate toward the target. NaviSense significantly reduced object retrieval time and was preferred over existing tools.
arXiv Detail & Related papers (2025-09-23T05:45:11Z)
- Natural Language-Driven Viewpoint Navigation for Volume Exploration via Semantic Block Representation [7.16051391212397]
We propose a novel framework that leverages natural language interaction to enhance volumetric data exploration. Our approach encodes volumetric blocks to capture and differentiate underlying structures. It further incorporates a CLIP Score mechanism, which provides semantic information to the blocks to guide navigation.
arXiv Detail & Related papers (2025-08-09T04:44:59Z)
- NavigScene: Bridging Local Perception and Global Navigation for Beyond-Visual-Range Autonomous Driving [10.597463021650382]
NavigScene is an auxiliary navigation-guided natural language dataset that simulates a human-like driving environment within autonomous driving systems. We develop three paradigms to leverage NavigScene: (1) Navigation-guided Reasoning, which enhances vision-language models by incorporating navigation context into the prompting approach; (2) Navigation-guided Preference Optimization, a reinforcement learning method that extends Direct Preference Optimization to improve vision-language model responses; and (3) Navigation-guided Vision-Language-Action model, which integrates navigation guidance and vision-language models with conventional driving models through feature fusion.
arXiv Detail & Related papers (2025-07-07T17:37:01Z)
- Unifying Large Language Model and Deep Reinforcement Learning for Human-in-Loop Interactive Socially-aware Navigation [16.789333617628138]
Social robot navigation planners face two major challenges: managing real-time user inputs and ensuring socially compliant behaviors. We introduce SALM, an interactive, human-in-loop Socially-Aware navigation Large Language Model framework. A memory mechanism archives temporal data for continuous refinement, while a multi-step graph-of-thoughts inference-based large language feedback model adaptively fuses the strengths of both planning approaches.
arXiv Detail & Related papers (2024-03-22T23:12:28Z)
- Large Language User Interfaces: Voice Interactive User Interfaces powered by LLMs [5.06113628525842]
We present a framework that can serve as an intermediary between a user and their user interface (UI).
We employ a system that stands upon textual semantic mappings of UI components, in the form of annotations.
Our engine can classify the most appropriate application, extract relevant parameters, and subsequently execute precise predictions of the user's expected actions.
arXiv Detail & Related papers (2024-02-07T21:08:49Z)
- Large Language Models for Information Retrieval: A Survey [58.30439850203101]
Information retrieval has evolved from term-based methods to its integration with advanced neural models.
Recent research has sought to leverage large language models (LLMs) to improve IR systems.
We delve into the confluence of LLMs and IR systems, including crucial aspects such as query rewriters, retrievers, rerankers, and readers.
arXiv Detail & Related papers (2023-08-14T12:47:22Z)
- KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803]
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location in real scenes by following natural language instructions.
Most of the previous approaches utilize the entire features or object-centric features to represent navigable candidates.
We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
arXiv Detail & Related papers (2023-03-28T08:00:46Z)
- AVLEN: Audio-Visual-Language Embodied Navigation in 3D Environments [60.98664330268192]
We present AVLEN -- an interactive agent for Audio-Visual-Language Embodied Navigation.
The goal of AVLEN is to localize an audio event via navigating the 3D visual world.
To realize these abilities, AVLEN uses a multimodal hierarchical reinforcement learning backbone.
arXiv Detail & Related papers (2022-10-14T16:35:06Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.