Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired
- URL: http://arxiv.org/abs/2502.07183v2
- Date: Wed, 12 Feb 2025 09:07:32 GMT
- Title: Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired
- Authors: ByungOk Han, Woo-han Yun, Beom-Su Seo, Jaehong Kim
- Abstract summary: We introduce the Space-Aware Instruction Tuning (SAIT) dataset and the Space-Aware Benchmark (SA-Bench).
Our automated data generation pipeline focuses on the virtual path to the destination in 3D space and the surroundings.
We propose an evaluation protocol to assess VLM effectiveness in delivering walking guidance.
- Score: 0.2410625015892047
- Abstract: Guide dog robots offer promising solutions to enhance mobility and safety for visually impaired individuals, addressing the limitations of traditional guide dogs, particularly in perceptual intelligence and communication. With the emergence of Vision-Language Models (VLMs), robots are now capable of generating natural language descriptions of their surroundings, aiding in safer decision-making. However, existing VLMs often struggle to accurately interpret and convey spatial relationships, which is crucial for navigation in complex environments such as street crossings. We introduce the Space-Aware Instruction Tuning (SAIT) dataset and the Space-Aware Benchmark (SA-Bench) to address the limitations of current VLMs in understanding physical environments. Our automated data generation pipeline focuses on the virtual path to the destination in 3D space and the surroundings, enhancing environmental comprehension and enabling VLMs to provide more accurate guidance to visually impaired individuals. We also propose an evaluation protocol to assess VLM effectiveness in delivering walking guidance. Comparative experiments demonstrate that our space-aware instruction-tuned model outperforms state-of-the-art algorithms. We have fully open-sourced the SAIT dataset and SA-Bench, along with the related code, at https://github.com/byungokhan/Space-awareVLM
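The paper defines its own data schema and evaluation protocol; as a rough, hypothetical sketch of what a space-aware instruction sample centered on the virtual path and surroundings might look like, the snippet below builds one record using made-up field names (`path_points_3d`, `goal`, `target_response`) and simple egocentric geometry. It is not the SAIT format itself.

```python
# Hypothetical sketch only: the SAIT schema and pipeline are defined in the
# paper/repo; the field names and geometry below are illustrative assumptions.
import json
import math

def make_sait_style_sample(image_id, path_points, obstacles, destination):
    """Turn a 3D virtual path and nearby obstacles into one instruction-tuning record."""
    # Egocentric direction to the destination (robot at origin, x forward, y left).
    dx, dy = destination[0], destination[1]
    bearing_deg = math.degrees(math.atan2(dy, dx))
    distance_m = math.hypot(dx, dy)

    # Describe obstacles relative to the virtual path (left / right / on the path).
    surroundings = []
    for name, (ox, oy) in obstacles.items():
        side = "on the path" if abs(oy) < 0.5 else ("to the left" if oy > 0 else "to the right")
        surroundings.append(f"{name} about {math.hypot(ox, oy):.1f} m ahead, {side}")

    return {
        "image": image_id,
        "instruction": "Describe the path to the destination and any hazards for a blind pedestrian.",
        "path_points_3d": path_points,   # virtual path in the robot/camera frame
        "goal": {"bearing_deg": round(bearing_deg, 1), "distance_m": round(distance_m, 1)},
        "target_response": (
            f"The destination is {distance_m:.0f} meters ahead, "
            f"{'slightly left' if bearing_deg > 5 else 'slightly right' if bearing_deg < -5 else 'straight ahead'}. "
            + "; ".join(surroundings) + "."
        ),
    }

sample = make_sait_style_sample(
    image_id="crossing_0001.jpg",
    path_points=[[0, 0, 0], [4, 0.2, 0], [9, 0.5, 0]],
    obstacles={"a parked bicycle": (3.0, -1.2), "a pedestrian": (6.0, 0.1)},
    destination=(9.0, 0.5),
)
print(json.dumps(sample, indent=2))
```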
Related papers
- Visual Language Models as Operator Agents in the Space Domain [36.943670587532026]
Vision-Language Models (VLMs) can enhance autonomous control and decision-making in space missions.
In the software context, we employ VLMs to interpret visual screenshots of the graphical user interface to perform complex orbital maneuvers.
In the hardware context, we integrate VLMs with robotic systems equipped with cameras to inspect and diagnose physical space objects, such as satellites.
arXiv Detail & Related papers (2025-01-14T03:03:37Z)
- Seeing with Partial Certainty: Conformal Prediction for Robotic Scene Recognition in Built Environments [6.295098866364597]
We introduce Seeing with Partial Certainty (SwPC), a framework designed to measure and align uncertainty in VLM-based place recognition.
SwPC is built on the theory of conformal prediction to provide statistical guarantees on place recognition while minimizing requests for human help.
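SwPC's exact construction is given in that paper; the snippet below is only a generic split conformal prediction sketch for a place classifier, with synthetic scores standing in for VLM outputs, to illustrate how a calibrated threshold yields prediction sets and a trigger for requesting human help.

```python
# Generic split-conformal sketch (not SwPC's exact procedure): calibrate a
# threshold on held-out softmax scores, then emit prediction *sets*; an
# ambiguous set can trigger a request for human help.
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """cal_probs: (n, k) softmax scores from a VLM-based place recognizer."""
    n = len(cal_labels)
    # Nonconformity score: 1 - probability assigned to the true class.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile gives roughly (1 - alpha) coverage.
    return np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

def prediction_set(probs, q):
    """Return all place labels whose nonconformity is within the threshold."""
    return np.where(1.0 - probs <= q)[0]

rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=200)      # stand-in for VLM scores
cal_labels = rng.integers(0, 5, size=200)
q = calibrate_threshold(cal_probs, cal_labels, alpha=0.1)

test_probs = rng.dirichlet(np.ones(5))
places = prediction_set(test_probs, q)
if len(places) != 1:
    print(f"Ambiguous ({len(places)} candidates) -> ask a human for help")
else:
    print(f"Confident place label: {places[0]}")
```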
arXiv Detail & Related papers (2025-01-09T03:50:00Z)
- Aerial Vision-and-Language Navigation via Semantic-Topo-Metric Representation Guided LLM Reasoning [48.33405770713208]
We propose an end-to-end framework for aerial VLN tasks, where the large language model (LLM) is introduced as our agent for action prediction.
We develop a novel Semantic-Topo-Metric Representation (STMR) to enhance the spatial reasoning ability of LLMs.
Experiments conducted in both real and simulated environments demonstrate the effectiveness and robustness of our method.
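The exact STMR format is defined in that paper; as a loose illustration of the general idea of feeding semantic, topological, and metric map information to an LLM as text, the sketch below serializes a few made-up landmarks into a prompt.

```python
# Hypothetical sketch: how a semantic-topological-metric map *might* be
# serialized into an LLM prompt for action prediction. The actual STMR
# representation is different and richer; names here are illustrative.
import math

landmarks = {                    # semantic label -> (x, y) in meters, agent at origin
    "red rooftop": (12.0, 3.0),
    "parking lot": (25.0, -8.0),
    "white tower": (40.0, 15.0),
}
edges = [("red rooftop", "parking lot"), ("parking lot", "white tower")]  # topology

def serialize_stmr(landmarks, edges, goal):
    lines = ["Landmarks (bearing, range from agent):"]
    for name, (x, y) in landmarks.items():
        lines.append(f"- {name}: {math.degrees(math.atan2(y, x)):.0f} deg, {math.hypot(x, y):.0f} m")
    lines.append("Connections: " + "; ".join(f"{a} <-> {b}" for a, b in edges))
    lines.append(f"Instruction: fly to the {goal}. Output one action: "
                 "forward / turn_left / turn_right / ascend / descend / stop.")
    return "\n".join(lines)

print(serialize_stmr(landmarks, edges, goal="white tower"))  # text sent to the LLM agent
```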
arXiv Detail & Related papers (2024-10-11T03:54:48Z)
- SpatialBot: Precise Spatial Understanding with Vision Language Models [12.67089704185187]
Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding.
They still struggle with spatial understanding, which is the foundation of Embodied AI.
In this paper, we propose SpatialBot for better spatial understanding by feeding both RGB and depth images.
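SpatialBot's actual interface for consuming depth is described in that paper and its repository; the sketch below only illustrates the general benefit of having depth alongside RGB, answering a metric "how far" question for an image region via a hypothetical helper.

```python
# Illustrative only: one simple way to pair RGB with depth, answering metric
# "how far" questions from a depth map for a region the VLM refers to.
import numpy as np

def region_distance(depth_m, box):
    """Median depth (meters) inside a pixel box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    patch = depth_m[y1:y2, x1:x2]
    valid = patch[np.isfinite(patch) & (patch > 0)]   # drop holes in the depth map
    return float(np.median(valid)) if valid.size else float("nan")

# Synthetic 480x640 depth map with a "pedestrian" region about 3.2 m away.
depth = np.full((480, 640), 10.0)
depth[200:400, 300:380] = 3.2

box_from_vlm = (300, 200, 380, 400)   # region the VLM grounded its answer to
print(f"The detected object is about {region_distance(depth, box_from_vlm):.1f} m away.")
```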
arXiv Detail & Related papers (2024-06-19T15:41:30Z)
- SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models [68.13636352687257]
We introduce Spatial Region GPT (SpatialRGPT) to enhance VLMs' spatial perception and reasoning capabilities.
During inference, when provided with user-specified region proposals, SpatialRGPT can accurately perceive their relative directions and distances.
Our results demonstrate that SpatialRGPT significantly enhances performance in spatial reasoning tasks, both with and without local region prompts.
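As a plain-geometry illustration of the kind of region-to-region relation such a model is expected to express (not the model itself), the snippet below derives relative direction and distance from two assumed 3D region centroids.

```python
# Sketch of the *kind* of answer a region-prompted spatial VLM is trained to
# give: relative direction and metric distance between two user-specified
# regions. This is plain geometry on assumed 3D centroids.
import numpy as np

def relation(center_a, center_b):
    """Describe region B relative to region A in a camera frame (x right, y down, z forward)."""
    a, b = np.asarray(center_a, float), np.asarray(center_b, float)
    d = b - a
    horiz = "to the right of" if d[0] > 0 else "to the left of"
    depth = "behind" if d[2] > 0 else "in front of"
    return f"Region B is {horiz} and {depth} Region A, about {np.linalg.norm(d):.1f} m away."

print(relation(center_a=[0.5, 0.0, 2.0], center_b=[-0.8, 0.1, 3.5]))
```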
arXiv Detail & Related papers (2024-06-03T17:59:06Z)
- PIVOT: Iterative Visual Prompting Elicits Actionable Knowledge for VLMs [140.14239499047977]
Vision language models (VLMs) have shown impressive capabilities across a variety of tasks, from logical reasoning to visual understanding.
We propose a novel visual prompting approach for VLMs that we call Prompting with Iterative Visual Optimization (PIVOT).
We find, perhaps surprisingly, that our approach enables zero-shot control of robotic systems without any robot training data, navigation in a variety of environments, and other capabilities.
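A schematic of an iterative visual prompting loop, with the VLM call stubbed out and the image-annotation step omitted: candidate action points are sampled, the "VLM" selects the best, and the proposal distribution is refit. Details differ from the paper's implementation.

```python
# Schematic PIVOT-style loop: sample candidate 2D action points, "annotate"
# them, ask a VLM to pick the best candidates, refit the proposal
# distribution, and iterate. The VLM call is a stub; prompts and rendering
# are omitted.
import numpy as np

rng = np.random.default_rng(0)
goal = np.array([420.0, 120.0])           # unknown to the "robot"; used only by the stub

def vlm_pick_best(candidates, k=3):
    """Stub: in the real method the annotated image plus instruction is sent
    to the VLM, which answers with the labels of the best candidate arrows."""
    d = np.linalg.norm(candidates - goal, axis=1)
    return candidates[np.argsort(d)[:k]]

mean, std = np.array([320.0, 240.0]), np.array([150.0, 150.0])   # image-space proposal
for step in range(4):
    candidates = rng.normal(mean, std, size=(8, 2))               # sample & annotate
    best = vlm_pick_best(candidates)                              # VLM selects labels
    mean, std = best.mean(axis=0), best.std(axis=0) + 1e-3        # refit and shrink
    print(f"iter {step}: mean={mean.round(1)}, std={std.round(1)}")

print("final action point:", mean.round(1))
```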
arXiv Detail & Related papers (2024-02-12T18:33:47Z)
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities [59.39858959066982]
Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics.
We develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images.
By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA.
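A toy illustration of the flavor of such automatic generation: given assumed 3D object centroids, template questions with quantitative and qualitative answers are emitted. The actual SpatialVLM pipeline lifts detections from single real images to metric 3D at vastly larger scale.

```python
# Toy illustration of automatic spatial-VQA generation from 3D object
# centroids; object names, positions, and templates here are made up.
import itertools
import numpy as np

objects = {                    # name -> (x, y, z) centroid in meters, camera frame
    "chair": (0.6, 0.0, 2.1),
    "table": (-0.4, 0.0, 2.8),
    "lamp":  (1.5, -0.3, 4.0),
}

qa_pairs = []
for (na, pa), (nb, pb) in itertools.combinations(objects.items(), 2):
    gap = float(np.linalg.norm(np.subtract(pa, pb)))
    # Quantitative question.
    qa_pairs.append((f"How far is the {na} from the {nb}?", f"About {gap:.1f} meters."))
    # Qualitative question (camera frame: smaller z means closer).
    closer = na if pa[2] < pb[2] else nb
    qa_pairs.append((f"Which is closer to the camera, the {na} or the {nb}?", f"The {closer}."))

for q, a in qa_pairs:
    print(q, "->", a)
```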
arXiv Detail & Related papers (2024-01-22T18:01:01Z)
- Voila-A: Aligning Vision-Language Models with User's Gaze Attention [56.755993500556734]
We introduce gaze information as a proxy for human attention to guide Vision-Language Models (VLMs).
We propose a novel approach, Voila-A, for gaze alignment to enhance the interpretability and effectiveness of these models in real-world applications.
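Voila-A's alignment mechanism is its own; as a generic sketch of using gaze as an attention proxy, the snippet below simply re-weights image patch features by a pooled gaze heatmap before they would reach the language model.

```python
# Generic illustration of a gaze-as-attention proxy: patch features are
# re-weighted by pooled gaze density. Voila-A's actual alignment differs;
# this is only a sketch with assumed shapes.
import numpy as np

def gaze_weighted_patches(patch_feats, gaze_heatmap, grid=(16, 16), alpha=0.5):
    """patch_feats: (grid_h*grid_w, d); gaze_heatmap: (H, W) non-negative density."""
    gh, gw = grid
    H, W = gaze_heatmap.shape
    # Pool the gaze density over each patch cell.
    pooled = gaze_heatmap.reshape(gh, H // gh, gw, W // gw).mean(axis=(1, 3)).ravel()
    weights = 1.0 + alpha * pooled / (pooled.max() + 1e-8)     # boost fixated patches
    return patch_feats * weights[:, None]

rng = np.random.default_rng(0)
feats = rng.normal(size=(256, 64))                         # 16x16 patches, 64-dim features
heat = np.zeros((224, 224)); heat[84:112, 126:154] = 1.0   # user looks mid-right
print(gaze_weighted_patches(feats, heat).shape)            # (256, 64)
```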
arXiv Detail & Related papers (2023-12-22T17:34:01Z)
- Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
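As a minimal sketch of the general prompt-tuning idea (learnable prompt vectors prepended to frozen language embeddings, tuned per navigation task), not the paper's exact architecture:

```python
# Minimal prompt-tuning sketch: only the prepended prompt vectors are trained,
# so adaptation to a new task (VLN, REVERIE, ...) touches few parameters.
# Dimensions and structure are illustrative assumptions.
import torch
import torch.nn as nn

class PromptedTextEncoder(nn.Module):
    def __init__(self, vocab=1000, dim=256, n_prompts=8):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.embed.weight.requires_grad_(False)                          # frozen embeddings
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)  # tuned per task

    def forward(self, token_ids):                        # token_ids: (B, L)
        tok = self.embed(token_ids)                      # (B, L, dim)
        p = self.prompts.unsqueeze(0).expand(tok.size(0), -1, -1)
        return torch.cat([p, tok], dim=1)                # (B, n_prompts + L, dim)

enc = PromptedTextEncoder()
out = enc(torch.randint(0, 1000, (2, 12)))
print(out.shape)   # torch.Size([2, 20, 256])
```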
arXiv Detail & Related papers (2022-03-08T11:01:24Z)
- Trajectory annotation using sequences of spatial perception [0.0]
In the near future, more and more machines will perform tasks in the vicinity of human spaces.
This work builds a foundation to address this task.
We propose an unsupervised learning approach based on a neural autoencoding that learns semantically meaningful continuous encodings of prototypical trajectory data.
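A tiny sketch of the general idea, assuming fixed-length (x, y) trajectory segments: an autoencoder learns a continuous code per segment that could later serve as a semantic annotation. The features and architecture are illustrative, not the paper's.

```python
# Toy trajectory autoencoder: compresses flattened (x, y) segments into a
# continuous latent code. Illustrative sketch only.
import torch
import torch.nn as nn

SEQ, LATENT = 20, 8                       # 20 (x, y) points per segment

class TrajAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(SEQ * 2, 64), nn.ReLU(), nn.Linear(64, LATENT))
        self.dec = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, SEQ * 2))

    def forward(self, x):                 # x: (B, SEQ, 2)
        z = self.enc(x.flatten(1))
        return self.dec(z).view(-1, SEQ, 2), z

model = TrajAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
trajs = torch.cumsum(torch.randn(256, SEQ, 2) * 0.1, dim=1)   # random-walk segments

for epoch in range(5):                    # brief training loop
    recon, _ = model(trajs)
    loss = nn.functional.mse_loss(recon, trajs)
    opt.zero_grad(); loss.backward(); opt.step()
print(f"final reconstruction MSE: {loss.item():.4f}")
```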
arXiv Detail & Related papers (2020-04-11T12:22:27Z)