Generating Contextually-Relevant Navigation Instructions for Blind and Low Vision People
- URL: http://arxiv.org/abs/2407.08219v1
- Date: Thu, 11 Jul 2024 06:40:36 GMT
- Title: Generating Contextually-Relevant Navigation Instructions for Blind and Low Vision People
- Authors: Zain Merchant, Abrar Anwar, Emily Wang, Souti Chattopadhyay, Jesse Thomason
- Abstract summary: Navigating unfamiliar environments presents significant challenges for blind and low-vision (BLV) individuals.
We construct a dataset of images and goals across different scenarios such as searching through kitchens or navigating outdoors.
- Score: 9.503205949175966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Navigating unfamiliar environments presents significant challenges for blind and low-vision (BLV) individuals. In this work, we construct a dataset of images and goals across different scenarios such as searching through kitchens or navigating outdoors. We then investigate how grounded instruction generation methods can provide contextually-relevant navigational guidance to users in these instances. Through a sighted user study, we demonstrate that large pretrained language models can produce correct and useful instructions perceived as beneficial for BLV users. We also conduct a survey and interview with 4 BLV users and gather insights into their preferences for different instructions depending on the scenario.
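The abstract does not specify a particular model or prompt, but the kind of grounded instruction generation it describes can be approximated by prompting an off-the-shelf vision-language model with a scene image and a user goal. The sketch below is illustrative only; the model name, prompt wording, and client usage are assumptions, not the authors' setup.

```python
# Illustrative sketch: prompting a pretrained vision-language model with a scene
# image and a goal to produce a navigation instruction for a BLV user.
# Model name, prompt wording, and client usage are assumptions, not the paper's setup.
import base64
from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def navigation_instruction(image_path: str, goal: str) -> str:
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable vision-language model could be substituted
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": (
                    "You are assisting a blind or low-vision user. "
                    f"Their goal: {goal}. Using the attached photo of their "
                    "surroundings, give one short, step-by-step instruction that "
                    "relies on non-visual cues (distances, clock directions, "
                    "landmarks they can touch or hear)."
                )},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

# Example with hypothetical inputs:
# print(navigation_instruction("kitchen.jpg", "find the coffee maker"))
```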
Related papers
- Sightation Counts: Leveraging Sighted User Feedback in Building a BLV-aligned Dataset of Diagram Descriptions [5.6629291915019975]
We ask sighted individuals to assess -- rather than produce -- diagram descriptions generated by vision-language models (VLMs).
We release Sightation, a collection of diagram description datasets spanning 5k diagrams and 137k samples for completion, preference, retrieval, question answering, and reasoning training purposes.
arXiv Detail & Related papers (2025-03-17T16:52:46Z) - Can LVLMs and Automatic Metrics Capture Underlying Preferences of Blind and Low-Vision Individuals for Navigational Aid? [16.31494394717809]
Blind and Low-Vision (BLV) people need assistance understanding their surroundings, especially in unfamiliar environments.
The preferences of BLV users for diverse types and styles of responses from Large Vision-Language Models (LVLMs) have not yet been studied.
We first construct the Eye4B dataset, consisting of 1.1k human-validated, curated outdoor/indoor scenes with 5-10 relevant requests per scene.
Then, we conduct an in-depth user study with eight BLV users to evaluate their preferences on six LVLMs from four perspectives: Afraidness, Nonactionability, Sufficiency, and Conciseness.
arXiv Detail & Related papers (2025-02-15T10:17:52Z) - Guide-LLM: An Embodied LLM Agent and Text-Based Topological Map for Robotic Guidance of People with Visual Impairments [1.18749525824656]
Guide-LLM is a text-based agent designed to assist persons with visual impairments (PVI) in navigating large indoor environments.
Our approach features a novel text-based topological map that enables the LLM to plan global paths.
Simulated experiments demonstrate the system's efficacy in guiding PVI, underscoring its potential as a significant advancement in assistive technology.
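As a rough illustration of the text-based topological map idea, the sketch below serializes a small node graph into plain text and composes a planning prompt for an LLM. The node names, edges, and prompt format are invented for illustration and are not Guide-LLM's actual representation.

```python
# Hypothetical sketch of a text-based topological map for LLM path planning.
# Node names, edges, and prompt wording are invented; they are not Guide-LLM's representation.
topological_map = {
    "lobby":     {"neighbors": ["hallway_a"], "description": "main entrance, reception desk on the right"},
    "hallway_a": {"neighbors": ["lobby", "elevator", "cafeteria"], "description": "long corridor with handrail on the left"},
    "elevator":  {"neighbors": ["hallway_a"], "description": "elevator bank, call button at waist height"},
    "cafeteria": {"neighbors": ["hallway_a"], "description": "open seating area, audible coffee machines"},
}

def map_as_text(topo: dict) -> str:
    """Serialize the graph so an LLM can reason over it as plain text."""
    lines = []
    for node, info in topo.items():
        lines.append(f"- {node}: {info['description']}; connects to {', '.join(info['neighbors'])}")
    return "\n".join(lines)

def planning_prompt(topo: dict, start: str, goal: str) -> str:
    return (
        "You are guiding a person with a visual impairment through a building.\n"
        "Topological map:\n"
        f"{map_as_text(topo)}\n"
        f"Plan a route from '{start}' to '{goal}' as an ordered list of nodes, "
        "then describe each transition with tactile and auditory landmarks."
    )

print(planning_prompt(topological_map, "lobby", "cafeteria"))
# The resulting string would be sent to an LLM of choice for global path planning.
```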
arXiv Detail & Related papers (2024-10-28T01:58:21Z) - Navigation Instruction Generation with BEV Perception and Large Language Models [60.455964599187205]
We propose BEVInstructor, which incorporates Bird's Eye View (BEV) features into Multi-Modal Large Language Models (MLLMs) for instruction generation.
Specifically, BEVInstructor constructs a PerspectiveBEV for comprehending 3D environments by fusing BEV and perspective features.
Based on these perspective-BEV prompts, BEVInstructor further adopts an instance-guided iterative refinement pipeline that progressively improves the instructions.
arXiv Detail & Related papers (2024-07-21T08:05:29Z) - A Multi-Modal Foundation Model to Assist People with Blindness and Low Vision in Environmental Interaction [25.6637754177118]
People with blindness and low vision (pBLV) encounter substantial challenges in comprehensive scene recognition and precise object identification.
We present a pioneering approach that leverages a large vision-language model to enhance visual perception for pBLV.
arXiv Detail & Related papers (2023-10-31T06:56:51Z) - LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
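The pipeline described here, turning captions and object detections into text an LLM can consume, can be sketched roughly as below. The detection format, direction bins, and wording are illustrative assumptions rather than LangNav's actual implementation.

```python
# Rough illustration of converting off-the-shelf perception outputs (a caption
# plus object detections with bearings) into a natural-language observation.
# The data format and wording are assumptions, not LangNav's implementation.
from typing import List, Dict

def bearing_to_phrase(bearing_deg: float) -> str:
    """Map a relative bearing (degrees, 0 = straight ahead) to a coarse direction."""
    if -30 <= bearing_deg <= 30:
        return "ahead"
    if 30 < bearing_deg <= 90:
        return "to the right"
    if -90 <= bearing_deg < -30:
        return "to the left"
    return "behind"

def verbalize_view(caption: str, detections: List[Dict]) -> str:
    """Compose a text description an LLM-based navigation agent could consume."""
    parts = [f"Scene: {caption}."]
    for det in detections:
        parts.append(f"There is a {det['label']} {bearing_to_phrase(det['bearing_deg'])}, "
                     f"about {det['distance_m']:.0f} meters away.")
    return " ".join(parts)

# Hypothetical perception outputs:
caption = "a hallway with a glass door at the far end"
detections = [
    {"label": "glass door", "bearing_deg": 5, "distance_m": 8},
    {"label": "trash can", "bearing_deg": 50, "distance_m": 2},
]
print(verbalize_view(caption, detections))
```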
arXiv Detail & Related papers (2023-10-11T20:52:30Z) - VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation (VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
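The two-in-context-example setup mentioned above might be assembled roughly as in the sketch below; the example pairs, action set, and formatting are invented for illustration and are not VELMA's actual prompt.

```python
# Hypothetical sketch of a two-shot prompt for verbalized navigation: each
# in-context example pairs a text observation with the action taken.
# Examples, action vocabulary, and formatting are invented for illustration.
IN_CONTEXT_EXAMPLES = [
    ("Instruction: Walk to the corner with the pharmacy.\n"
     "Observation: You are at an intersection; a pharmacy sign is on your right.",
     "Action: turn right"),
    ("Instruction: Continue straight past the bus stop.\n"
     "Observation: A bus stop shelter is directly ahead on the sidewalk.",
     "Action: go forward"),
]

def build_prompt(instruction: str, observation: str) -> str:
    shots = "\n\n".join(f"{obs}\n{act}" for obs, act in IN_CONTEXT_EXAMPLES)
    return (f"{shots}\n\n"
            f"Instruction: {instruction}\n"
            f"Observation: {observation}\n"
            "Action:")

prompt = build_prompt(
    "Turn left at the cafe and stop at the blue door.",
    "A cafe awning is visible to your left; the street continues ahead.",
)
print(prompt)  # the completion (e.g., "turn left") would come from the LLM agent
```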
arXiv Detail & Related papers (2023-07-12T11:08:24Z) - Bridging the visual gap in VLN via semantically richer instructions [3.5789352263336847]
We show that state-of-the-art models are not severely affected when they receive just limited or even no visual data.
We propose a new data augmentation method that fosters the inclusion of more explicit visual information.
arXiv Detail & Related papers (2022-10-27T15:58:07Z) - Counterfactual Cycle-Consistent Learning for Instruction Following and Generation in Vision-Language Navigation [172.15808300686584]
We describe an approach that learns the two tasks simultaneously and exploits their intrinsic correlations to boost the training of each.
Our approach improves the performance of various follower models and produces accurate navigation instructions.
arXiv Detail & Related papers (2022-03-30T18:15:26Z) - Contrastive Instruction-Trajectory Learning for Vision-Language Navigation [66.16980504844233]
A vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction.
Previous works fail to discriminate the similarities and discrepancies across instruction-trajectory pairs and ignore the temporal continuity of sub-instructions.
We propose a Contrastive Instruction-Trajectory Learning framework that explores invariance across similar data samples and variance across different ones to learn distinctive representations for robust navigation.
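A generic, InfoNCE-style instruction-trajectory contrastive loss (not the paper's exact objective, which also models sub-instructions and their temporal continuity) might look like the sketch below.

```python
# Generic InfoNCE-style contrastive loss over paired instruction and trajectory
# embeddings. A simplified stand-in, not the paper's exact objective.
import torch
import torch.nn.functional as F

def instruction_trajectory_contrastive_loss(instr_emb: torch.Tensor,
                                            traj_emb: torch.Tensor,
                                            temperature: float = 0.07) -> torch.Tensor:
    """instr_emb, traj_emb: (batch, dim); row i of each is a matched pair."""
    instr = F.normalize(instr_emb, dim=-1)
    traj = F.normalize(traj_emb, dim=-1)
    logits = instr @ traj.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric loss: match instructions to trajectories and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Example with random embeddings standing in for encoder outputs:
loss = instruction_trajectory_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```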
arXiv Detail & Related papers (2021-12-08T06:32:52Z) - Know What and Know Where: An Object-and-Room Informed Sequential BERT for Indoor Vision-Language Navigation [120.90387630691816]
Vision-and-Language Navigation (VLN) requires an agent to navigate to a remote location on the basis of natural-language instructions and a set of photo-realistic panoramas.
Most existing methods take words in instructions and discrete views of each panorama as the minimal unit of encoding.
We propose an object-informed sequential BERT to encode visual perceptions and linguistic instructions at the same fine-grained level.
arXiv Detail & Related papers (2021-04-09T02:44:39Z) - Counterfactual Vision-and-Language Navigation via Adversarial Path Sampling [65.99956848461915]
Vision-and-Language Navigation (VLN) is a task where agents must decide how to move through a 3D environment to reach a goal.
One of the problems of the VLN task is data scarcity since it is difficult to collect enough navigation paths with human-annotated instructions for interactive environments.
We propose an adversarial-driven counterfactual reasoning model that can consider effective conditions instead of low-quality augmented data.
arXiv Detail & Related papers (2019-11-17T18:02:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.