Guide-LLM: An Embodied LLM Agent and Text-Based Topological Map for Robotic Guidance of People with Visual Impairments
- URL: http://arxiv.org/abs/2410.20666v1
- Date: Mon, 28 Oct 2024 01:58:21 GMT
- Title: Guide-LLM: An Embodied LLM Agent and Text-Based Topological Map for Robotic Guidance of People with Visual Impairments
- Authors: Sangmim Song, Sarath Kodagoda, Amal Gunatilake, Marc G. Carmichael, Karthick Thiyagarajan, Jodi Martin,
- Abstract summary: Guide-LLM is a text-based agent designed to assist persons with visual impairments (PVI) in navigating large indoor environments.
Our approach features a novel text-based topological map that enables the LLM to plan global paths.
Simulated experiments demonstrate the system's efficacy in guiding PVI, underscoring its potential as a significant advancement in assistive technology.
- Score: 1.18749525824656
- License:
- Abstract: Navigation presents a significant challenge for persons with visual impairments (PVI). While traditional aids such as white canes and guide dogs are invaluable, they fall short in delivering detailed spatial information and precise guidance to desired locations. Recent developments in large language models (LLMs) and vision-language models (VLMs) offer new avenues for enhancing assistive navigation. In this paper, we introduce Guide-LLM, an embodied LLM-based agent designed to assist PVI in navigating large indoor environments. Our approach features a novel text-based topological map that enables the LLM to plan global paths using a simplified environmental representation, focusing on straight paths and right-angle turns to facilitate navigation. Additionally, we utilize the LLM's commonsense reasoning for hazard detection and personalized path planning based on user preferences. Simulated experiments demonstrate the system's efficacy in guiding PVI, underscoring its potential as a significant advancement in assistive technology. The results highlight Guide-LLM's ability to offer efficient, adaptive, and personalized navigation assistance, pointing to promising advancements in this field.
Related papers
- Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning [48.33405770713208]
We propose an end-to-end framework for aerial VLN tasks, where the large language model (LLM) is introduced as our agent for action prediction.
We develop a novel Semantic-Topo-Metric Representation (STMR) to enhance the spatial reasoning ability of LLMs.
Experiments conducted in real and simulation environments have successfully proved the effectiveness and robustness of our method.
arXiv Detail & Related papers (2024-10-11T03:54:48Z) - NavGPT-2: Unleashing Navigational Reasoning Capability for Large Vision-Language Models [30.685419129265252]
We bridge the divide between VLN-specialized models and LLM-based navigation paradigms.
We exploit a way to incorporate LLMs and navigation policy networks for effective action predictions and navigational reasoning.
arXiv Detail & Related papers (2024-07-17T07:44:26Z) - Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for continuous vision-language navigation (VLN) task.
Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z) - MC-GPT: Empowering Vision-and-Language Navigation with Memory Map and Reasoning Chains [4.941781282578696]
In the Vision-and-Language Navigation (VLN) task, the agent is required to navigate to a destination following a natural language instruction.
While learning-based approaches have been a major solution to the task, they suffer from high training costs and lack of interpretability.
Recently, Large Language Models (LLMs) have emerged as a promising tool for VLN due to their strong generalization capabilities.
arXiv Detail & Related papers (2024-05-17T08:33:27Z) - TINA: Think, Interaction, and Action Framework for Zero-Shot Vision Language Navigation [11.591176410027224]
This paper presents a Vision-Language Navigation (VLN) agent based on Large Language Models (LLMs)
We propose the Thinking, Interacting, and Action framework to compensate for the shortcomings of LLMs in environmental perception.
Our approach also outperformed some supervised learning-based methods, highlighting its efficacy in zero-shot navigation.
arXiv Detail & Related papers (2024-03-13T05:22:39Z) - NavCoT: Boosting LLM-Based Vision-and-Language Navigation via Learning
Disentangled Reasoning [101.56342075720588]
Vision-and-Language Navigation (VLN), as a crucial research problem of Embodied AI, requires an embodied agent to navigate through complex 3D environments following natural language instructions.
Recent research has highlighted the promising capacity of large language models (LLMs) in VLN by improving navigational reasoning accuracy and interpretability.
This paper introduces a novel strategy called Navigational Chain-of-Thought (NavCoT), where we fulfill parameter-efficient in-domain training to enable self-guided navigational decision.
arXiv Detail & Related papers (2024-03-12T07:27:02Z) - VELMA: Verbalization Embodiment of LLM Agents for Vision and Language
Navigation in Street View [81.58612867186633]
Vision and Language Navigation(VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
arXiv Detail & Related papers (2023-07-12T11:08:24Z) - NavGPT: Explicit Reasoning in Vision-and-Language Navigation with Large
Language Models [17.495162643127003]
We introduce the NavGPT to reveal the reasoning capability of GPT models in complex embodied scenes.
NavGPT takes the textual descriptions of visual observations, navigation history, and future explorable directions as inputs to reason the agent's current status.
We show that NavGPT is capable of generating high-quality navigational instructions from observations and actions along a path.
arXiv Detail & Related papers (2023-05-26T14:41:06Z) - Can an Embodied Agent Find Your "Cat-shaped Mug"? LLM-Guided Exploration
for Zero-Shot Object Navigation [58.3480730643517]
We present LGX, a novel algorithm for Language-Driven Zero-Shot Object Goal Navigation (L-ZSON)
Our approach makes use of Large Language Models (LLMs) for this task.
We achieve state-of-the-art zero-shot object navigation results on RoboTHOR with a success rate (SR) improvement of over 27% over the current baseline.
arXiv Detail & Related papers (2023-03-06T20:19:19Z) - Visual-Language Navigation Pretraining via Prompt-based Environmental
Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.