Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments
- URL: http://arxiv.org/abs/2503.23105v1
- Date: Sat, 29 Mar 2025 14:46:45 GMT
- Title: Open-Vocabulary Semantic Segmentation with Uncertainty Alignment for Robotic Scene Understanding in Indoor Building Environments
- Authors: Yifan Xu, Vineet Kamat, Carol Menassa,
- Abstract summary: We propose an open-vocabulary scene semantic segmentation and detection pipeline leveraging Vision Language Models (VLMs) and Large Language Models (LLMs)<n>Our approach follows a 'Segment Detect Select' framework for open-vocabulary scene classification, enabling adaptive and intuitive navigation for assistive robots in built environments.
- Score: 6.295098866364597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The global rise in the number of people with physical disabilities, in part due to improvements in post-trauma survivorship and longevity, has amplified the demand for advanced assistive technologies to improve mobility and independence. Autonomous assistive robots, such as smart wheelchairs, require robust capabilities in spatial segmentation and semantic recognition to navigate complex built environments effectively. Place segmentation involves delineating spatial regions like rooms or functional areas, while semantic recognition assigns semantic labels to these regions, enabling accurate localization to user-specific needs. Existing approaches often utilize deep learning; however, these close-vocabulary detection systems struggle to interpret intuitive and casual human instructions. Additionally, most existing methods ignore the uncertainty of the scene recognition problem, leading to low success rates, particularly in ambiguous and complex environments. To address these challenges, we propose an open-vocabulary scene semantic segmentation and detection pipeline leveraging Vision Language Models (VLMs) and Large Language Models (LLMs). Our approach follows a 'Segment Detect Select' framework for open-vocabulary scene classification, enabling adaptive and intuitive navigation for assistive robots in built environments.
Related papers
- From Open-Vocabulary to Vocabulary-Free Semantic Segmentation [78.62232202171919]
Open-vocabulary semantic segmentation enables models to identify novel object categories beyond their training data.<n>Current approaches still rely on manually specified class names as input, creating an inherent bottleneck in real-world applications.<n>This work proposes a Vocabulary-Free Semantic pipeline, eliminating the need for predefined class vocabularies.
arXiv Detail & Related papers (2025-02-17T15:17:08Z) - Exploring Emerging Trends and Research Opportunities in Visual Place Recognition [28.76562316749074]
Visual-based recognition is a long-standing challenge in computer vision and robotics communities.
Visual place recognition is vital for most localization implementations.
Researchers have recently turned their attention to vision-language models.
arXiv Detail & Related papers (2024-11-18T11:36:17Z) - Embodied-RAG: General Non-parametric Embodied Memory for Retrieval and Generation [69.01029651113386]
Embodied-RAG is a framework that enhances the model of an embodied agent with a non-parametric memory system.<n>At its core, Embodied-RAG's memory is structured as a semantic forest, storing language descriptions at varying levels of detail.<n>We demonstrate that Embodied-RAG effectively bridges RAG to the robotics domain, successfully handling over 250 explanation and navigation queries.
arXiv Detail & Related papers (2024-09-26T21:44:11Z) - Sign language recognition based on deep learning and low-cost handcrafted descriptors [0.0]
It is important to consider as many linguistic parameters as possible in gesture execution to avoid ambiguity between words.
It is essential to ensure that the chosen technology is realistic, avoiding expensive, intrusive, or low-mobility sensors.
We propose an efficient sign language recognition system that utilizes low-cost sensors and techniques.
arXiv Detail & Related papers (2024-08-14T00:56:51Z) - Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information [68.10033984296247]
This paper explores the domain of active localization, emphasizing the importance of viewpoint selection to enhance localization accuracy.
Our contributions involve using a data-driven approach with a simple architecture designed for real-time operation, a self-supervised data training method, and the capability to consistently integrate our map into a planning framework tailored for real-world robotics applications.
arXiv Detail & Related papers (2024-07-22T12:32:09Z) - O2V-Mapping: Online Open-Vocabulary Mapping with Neural Implicit Representation [9.431926560072412]
We propose O2V-mapping, which utilizes voxel-based language and geometric features to create an open-vocabulary field.
Experiments on open-vocabulary object localization and semantic segmentation demonstrate that O2V-mapping achieves online construction of language scenes.
arXiv Detail & Related papers (2024-04-10T08:54:43Z) - Mapping High-level Semantic Regions in Indoor Environments without
Object Recognition [50.624970503498226]
The present work proposes a method for semantic region mapping via embodied navigation in indoor environments.
To enable region identification, the method uses a vision-to-language model to provide scene information for mapping.
By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location.
arXiv Detail & Related papers (2024-03-11T18:09:50Z) - A Threefold Review on Deep Semantic Segmentation: Efficiency-oriented,
Temporal and Depth-aware design [77.34726150561087]
We conduct a survey on the most relevant and recent advances in Deep Semantic in the context of vision for autonomous vehicles.
Our main objective is to provide a comprehensive discussion on the main methods, advantages, limitations, results and challenges faced from each perspective.
arXiv Detail & Related papers (2023-03-08T01:29:55Z) - SCIM: Simultaneous Clustering, Inference, and Mapping for Open-World
Semantic Scene Understanding [34.19666841489646]
We show how a robot can autonomously discover novel semantic classes and improve accuracy on known classes when exploring an unknown environment.
We develop a general framework for mapping and clustering that we then use to generate a self-supervised learning signal to update a semantic segmentation model.
In particular, we show how clustering parameters can be optimized during deployment and that fusion of multiple observation modalities improves novel object discovery compared to prior work.
arXiv Detail & Related papers (2022-06-21T18:41:51Z) - Visual-Language Navigation Pretraining via Prompt-based Environmental
Self-exploration [83.96729205383501]
We introduce prompt-based learning to achieve fast adaptation for language embeddings.
Our model can adapt to diverse vision-language navigation tasks, including VLN and REVERIE.
arXiv Detail & Related papers (2022-03-08T11:01:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.