StreetReaderAI: Making Street View Accessible Using Context-Aware Multimodal AI
- URL: http://arxiv.org/abs/2508.08524v4
- Date: Fri, 26 Sep 2025 13:19:50 GMT
- Title: StreetReaderAI: Making Street View Accessible Using Context-Aware Multimodal AI
- Authors: Jon E. Froehlich, Alexander Fiannaca, Nimer Jaber, Victor Tsaran, Shaun Kane,
- Abstract summary: We introduce StreetReaderAI, the first-ever accessible street view tool. With StreetReaderAI, blind users can virtually examine destinations, engage in open-world exploration, or virtually tour any of the over 220 billion images. Our findings demonstrate the value of an accessible street view in supporting POI investigations and remote route planning.
- Score: 44.37880707956907
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Interactive streetscape mapping tools such as Google Street View (GSV) and Meta Mapillary enable users to virtually navigate and experience real-world environments via immersive 360° imagery but remain fundamentally inaccessible to blind users. We introduce StreetReaderAI, the first-ever accessible street view tool, which combines context-aware, multimodal AI, accessible navigation controls, and conversational speech. With StreetReaderAI, blind users can virtually examine destinations, engage in open-world exploration, or virtually tour any of the over 220 billion images and 100+ countries where GSV is deployed. We iteratively designed StreetReaderAI with a mixed-visual ability team and performed an evaluation with eleven blind users. Our findings demonstrate the value of an accessible street view in supporting POI investigations and remote route planning. We close by enumerating key guidelines for future work.
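The abstract does not detail StreetReaderAI's internals, but a minimal sketch of how a context-aware prompt for a multimodal scene describer might be assembled from the user's virtual position is shown below. The `GeoContext` and `build_prompt` names are illustrative assumptions, not the paper's API, and the actual multimodal model call is omitted.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GeoContext:
    """Hypothetical geographic context attached to the current panorama."""
    lat: float
    lng: float
    heading_deg: float                        # direction the virtual camera faces
    nearby_places: List[str] = field(default_factory=list)

def build_prompt(ctx: GeoContext, user_question: str) -> str:
    """Compose a text prompt pairing the user's question with geo-context.

    In a real system this prompt would be sent to a multimodal model together
    with the current street view image; that call is left out of this sketch.
    """
    places = "; ".join(ctx.nearby_places) or "none known"
    return (
        f"You are assisting a blind user exploring street view imagery.\n"
        f"Camera position: ({ctx.lat:.5f}, {ctx.lng:.5f}), facing {ctx.heading_deg:.0f} deg.\n"
        f"Nearby places: {places}.\n"
        f"Describe only what is visible and relevant to the question.\n"
        f"Question: {user_question}"
    )

ctx = GeoContext(47.60621, -122.33207, 90.0, ["coffee shop", "bus stop", "crosswalk"])
print(build_prompt(ctx, "Is there a safe place to cross the street ahead?"))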
Related papers
- Thinking in 360°: Humanoid Visual Search in the Wild [52.29500214210115]
Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) movements to efficiently search for visual information in 360°. We propose humanoid visual search, where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. Our experiments first reveal that even top-tier proprietary models falter, achieving only 30% success in object and path search.
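As a rough illustration of the "rotate the head in a panorama" setup, the sketch below crops a yaw-centred slice out of an equirectangular image. It is a simplification (no gnomonic projection), and the function name and defaults are assumptions rather than the paper's method.

```python
import numpy as np

def yaw_crop(pano: np.ndarray, yaw_deg: float, fov_deg: float = 90.0) -> np.ndarray:
    """Return the horizontal slice of an equirectangular panorama centred on yaw_deg.

    A crude stand-in for 'rotating the head': the panorama's width spans 360
    degrees, so a field of view maps to a contiguous (wrapping) column range.
    """
    h, w = pano.shape[:2]
    center = int((yaw_deg % 360.0) / 360.0 * w)
    half = int(fov_deg / 360.0 * w / 2)
    cols = np.arange(center - half, center + half) % w   # wrap around the seam
    return pano[:, cols]

# Example: a dummy 512x1024 grayscale panorama, looking 90 degrees to the right.
view = yaw_crop(np.zeros((512, 1024)), yaw_deg=90.0)
print(view.shape)  # (512, 256)
```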
arXiv Detail & Related papers (2025-11-25T14:30:10Z) - mmWalk: Towards Multi-modal Multi-view Walking Assistance [44.184803877778556]
mmWalk is a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for outdoor safe navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. We generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance.
arXiv Detail & Related papers (2025-10-13T15:25:52Z) - OpenFly: A Comprehensive Platform for Aerial Vision-Language Navigation [49.697035403548966]
Vision-Language Navigation (VLN) aims to guide agents by leveraging language instructions and visual cues, playing a pivotal role in embodied AI. We propose OpenFly, a platform comprising various rendering engines, a versatile toolchain, and a large-scale benchmark for aerial VLN. We construct a large-scale aerial VLN dataset with 100k trajectories, covering diverse heights and lengths across 18 scenes.
arXiv Detail & Related papers (2025-02-25T09:57:18Z) - WildVis: Open Source Visualizer for Million-Scale Chat Logs in the Wild [88.05964311416717]
We introduce WildVis, an interactive tool that enables fast, versatile, and large-scale conversation analysis.
WildVis provides search and visualization capabilities in the text and embedding spaces based on a list of criteria.
We demonstrate WildVis' utility through three case studies: facilitating misuse research, visualizing and comparing topic distributions across datasets, and characterizing user-specific conversation patterns.
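As a loose illustration of visualization in the embedding space, the sketch below projects conversation embeddings to 2D with plain PCA; the embeddings are random placeholders and this is not WildVis's actual pipeline (which might use UMAP, t-SNE, or another method).

```python
import numpy as np

def pca_2d(embeddings: np.ndarray) -> np.ndarray:
    """Project high-dimensional conversation embeddings to 2D for a scatter plot."""
    x = embeddings - embeddings.mean(axis=0)   # centre before computing components
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    return x @ vt[:2].T                        # keep the top two principal axes

embeddings = np.random.rand(100, 64)           # stand-in for real chat-log embeddings
coords = pca_2d(embeddings)
print(coords.shape)                            # (100, 2)
```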
arXiv Detail & Related papers (2024-09-05T17:59:15Z) - Generating Contextually-Relevant Navigation Instructions for Blind and Low Vision People [9.503205949175966]
Navigating unfamiliar environments presents significant challenges for blind and low-vision (BLV) individuals.
We construct a dataset of images and goals across different scenarios such as searching through kitchens or navigating outdoors.
arXiv Detail & Related papers (2024-07-11T06:40:36Z) - OpenStreetView-5M: The Many Roads to Global Visual Geolocation [16.468438245804684]
We introduce OpenStreetView-5M, a large-scale, open-access dataset comprising over 5.1 million geo-referenced street view images.
In contrast to existing benchmarks, we enforce a strict train/test separation, allowing us to evaluate the relevance of learned geographical features.
To demonstrate the utility of our dataset, we conduct an extensive benchmark of various state-of-the-art image encoders, spatial representations, and training strategies.
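Visual geolocation benchmarks of this kind are commonly scored by the great-circle distance between predicted and ground-truth coordinates; a standard haversine implementation (not taken from the paper) is sketched below.

```python
import math

def haversine_km(lat1: float, lng1: float, lat2: float, lng2: float) -> float:
    """Great-circle distance in kilometres between two (lat, lng) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Example: a prediction roughly 1 km north of the true position.
print(round(haversine_km(48.8566, 2.3522, 48.8656, 2.3522), 2))
```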
arXiv Detail & Related papers (2024-04-29T17:06:44Z) - Visualizing Routes with AI-Discovered Street-View Patterns [4.153397474276339]
We propose a solution of using semantic latent vectors for quantifying visual appearance features.
We calculate image similarities among a large set of street-view images and then discover spatial imagery patterns.
We present VivaRoutes, an interactive visualization prototype, to show how visualizations leveraged with these discovered patterns can help users effectively and interactively explore multiple routes.
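A minimal sketch of the "latent vectors, pairwise similarity, pattern grouping" idea follows, assuming per-image latent vectors are already available; the greedy thresholding used here is an illustrative stand-in, not the paper's pattern-discovery method.

```python
import numpy as np

def similar_groups(latents: np.ndarray, threshold: float = 0.9):
    """Group street-view images whose latent vectors are mutually similar.

    Greedy grouping: each image joins the first existing group whose seed
    image it resembles above the cosine-similarity threshold.
    """
    z = latents / np.linalg.norm(latents, axis=1, keepdims=True)
    sims = z @ z.T                    # pairwise cosine similarities
    groups = []                       # each group is a list of image indices
    for i in range(len(z)):
        for g in groups:
            if sims[i, g[0]] >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

latents = np.random.rand(6, 16)       # stand-in for semantic latent vectors
print(similar_groups(latents, threshold=0.95))
```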
arXiv Detail & Related papers (2024-03-30T17:32:26Z) - To use or not to use proprietary street view images in (health and place) research? That is the question [0.20999222360659603]
This article questions the current practices in using Google Street View images from a European viewpoint.
Our concern lies with Google's terms of service, which restrict bulk image downloads and the generation of street view image-based indices.
arXiv Detail & Related papers (2024-02-18T08:26:22Z) - VELMA: Verbalization Embodiment of LLM Agents for Vision and Language Navigation in Street View [81.58612867186633]
Vision and Language Navigation(VLN) requires visual and natural language understanding as well as spatial and temporal reasoning capabilities.
We show that VELMA is able to successfully follow navigation instructions in Street View with only two in-context examples.
We further finetune the LLM agent on a few thousand examples and achieve 25%-30% relative improvement in task completion over the previous state-of-the-art for two datasets.
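The sketch below illustrates what verbalizing a navigation step into an LLM prompt with a couple of in-context examples might look like; the observation fields, action set, and example wording are assumptions, and the LLM call itself is left out.

```python
from typing import List

ACTIONS = ["forward", "turn_left", "turn_right", "stop"]  # assumed discrete action set

def verbalize(landmarks: List[str], instruction: str) -> str:
    """Turn the current observation and the instruction into plain text."""
    seen = ", ".join(landmarks) if landmarks else "nothing notable"
    return f"You see: {seen}. Instruction: {instruction}. Next action:"

IN_CONTEXT = [
    ("You see: a bakery on the left. Instruction: pass the bakery and stop. Next action:", "forward"),
    ("You see: a bakery behind you. Instruction: pass the bakery and stop. Next action:", "stop"),
]

def build_llm_prompt(landmarks: List[str], instruction: str) -> str:
    """Prepend two in-context examples, then the current verbalized step.

    The resulting prompt would be sent to an LLM that replies with one action
    from ACTIONS; that call is omitted in this sketch.
    """
    shots = "\n".join(f"{q} {a}" for q, a in IN_CONTEXT)
    return shots + "\n" + verbalize(landmarks, instruction)

print(build_llm_prompt(["a pharmacy on the right"], "turn right at the pharmacy"))
```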
arXiv Detail & Related papers (2023-07-12T11:08:24Z) - OVIS: Open-Vocabulary Visual Instance Search via Visual-Semantic Aligned Representation Learning [79.49199857462087]
We introduce the task of open-vocabulary visual instance search (OVIS).
Given an arbitrary textual search query, OVIS aims to return a ranked list of visual instances.
We propose to address such a search challenge via visual-semantic aligned representation learning (ViSA).
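A minimal sketch of ranking visual instances against a free-form text query in a shared embedding space follows; the embeddings here are random placeholders and the ranking code is not the paper's ViSA training procedure, only the retrieval step it enables.

```python
import numpy as np

def rank_instances(query_vec: np.ndarray, instance_vecs: np.ndarray):
    """Return (index, score) pairs sorted by cosine similarity to the text query."""
    q = query_vec / np.linalg.norm(query_vec)
    v = instance_vecs / np.linalg.norm(instance_vecs, axis=1, keepdims=True)
    scores = v @ q
    order = np.argsort(-scores)
    return [(int(i), float(scores[i])) for i in order]

query_vec = np.random.rand(32)         # stand-in for an encoded text query
instance_vecs = np.random.rand(5, 32)  # stand-in for encoded visual instances
print(rank_instances(query_vec, instance_vecs)[:3])
```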
arXiv Detail & Related papers (2021-08-08T18:13:53Z) - Deep Learning for Embodied Vision Navigation: A Survey [108.13766213265069]
"Embodied visual navigation" problem requires an agent to navigate in a 3D environment mainly rely on its first-person observation.
This paper attempts to establish an outline of the current works in the field of embodied visual navigation by providing a comprehensive literature survey.
arXiv Detail & Related papers (2021-07-07T12:09:04Z) - Pathdreamer: A World Model for Indoor Navigation [62.78410447776939]
We introduce Pathdreamer, a visual world model for agents navigating in novel indoor environments.
Given one or more previous visual observations, Pathdreamer generates plausible high-resolution 360° visual observations.
In regions of high uncertainty, Pathdreamer can predict diverse scenes, allowing an agent to sample multiple realistic outcomes.
arXiv Detail & Related papers (2021-05-18T18:13:53Z)