Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time
Visual Scene Understanding
- URL: http://arxiv.org/abs/2309.15065v2
- Date: Tue, 5 Mar 2024 15:05:23 GMT
- Title: Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time
Visual Scene Understanding
- Authors: Christina Kassab, Matias Mattamala, Lintong Zhang, and Maurice Fallon
- Abstract summary: LEXIS is a real-time indoor Simultaneous Localization and Mapping system.
It harnesses the open-vocabulary nature of Large Language Models to create a unified approach to scene understanding and place recognition.
It successfully categorizes rooms with varying layouts and dimensions and outperforms the state-of-the-art (SOTA).
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Versatile and adaptive semantic understanding would enable autonomous systems
to comprehend and interact with their surroundings. Existing fixed-class models
limit the adaptability of indoor mobile and assistive autonomous systems. In
this work, we introduce LEXIS, a real-time indoor Simultaneous Localization and
Mapping (SLAM) system that harnesses the open-vocabulary nature of Large
Language Models (LLMs) to create a unified approach to scene understanding and
place recognition. The approach first builds a topological SLAM graph of the
environment (using visual-inertial odometry) and embeds Contrastive
Language-Image Pretraining (CLIP) features in the graph nodes. We use this
representation for flexible room classification and segmentation, serving as a
basis for room-centric place recognition. This allows loop closure searches to
be directed towards semantically relevant places. Our proposed system is
evaluated using both public, simulated data and real-world data, covering
office and home environments. It successfully categorizes rooms with varying
layouts and dimensions and outperforms the state-of-the-art (SOTA). For place
recognition and trajectory estimation tasks we achieve performance equivalent to
the SOTA, while utilizing the same pre-trained model throughout. Lastly, we
demonstrate the system's potential for planning.
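To make the core step concrete, here is a minimal sketch of embedding a graph node's keyframe with CLIP and classifying it zero-shot against an open vocabulary of room labels. It assumes the open_clip package; the label list, prompts, and node handling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: CLIP features per SLAM graph node + open-vocabulary
# room classification. Illustrative only, not the LEXIS codebase.
import torch
import open_clip

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

ROOM_LABELS = ["kitchen", "office", "bedroom", "corridor"]  # assumed labels

@torch.no_grad()
def embed_image(pil_image):
    """CLIP image embedding to store in a topological graph node."""
    feats = model.encode_image(preprocess(pil_image).unsqueeze(0))
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def classify_node(node_embedding, labels=ROOM_LABELS):
    """Zero-shot room label via cosine similarity against text prompts."""
    text = tokenizer([f"a photo of a {label}" for label in labels])
    tfeat = model.encode_text(text)
    tfeat = tfeat / tfeat.norm(dim=-1, keepdim=True)
    sims = (node_embedding @ tfeat.T).squeeze(0)
    return labels[sims.argmax().item()]
```

Because the labels are only compared against the node embedding at query time, the vocabulary can be swapped without retraining, which is what makes this style of room classification open-vocabulary.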
Related papers
- Learning Where to Look: Self-supervised Viewpoint Selection for Active Localization using Geometrical Information [68.10033984296247]
This paper explores the domain of active localization, emphasizing the importance of viewpoint selection to enhance localization accuracy.
Our contributions include a data-driven approach with a simple architecture designed for real-time operation, a self-supervised training method, and the capability to consistently integrate our map into a planning framework tailored for real-world robotics applications.
arXiv Detail & Related papers (2024-07-22T12:32:09Z)
- Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition [57.97930719585095]
We introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales.
Our approach is evaluated on various skeleton/language backbones and three large-scale datasets.
The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains.
arXiv Detail & Related papers (2024-06-19T08:22:32Z)
- Open-World Semantic Segmentation Including Class Similarity [31.799000996671975]
This paper tackles open-world semantic segmentation, i.e., interpreting image data in which objects occur that have not been seen during training.
We propose a novel approach that performs accurate closed-world semantic segmentation and can identify new categories without requiring any additional training data.
arXiv Detail & Related papers (2024-03-12T11:11:19Z)
- Mapping High-level Semantic Regions in Indoor Environments without Object Recognition [50.624970503498226]
The present work proposes a method for semantic region mapping via embodied navigation in indoor environments.
To enable region identification, the method uses a vision-to-language model to provide scene information for mapping.
By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location.
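A minimal sketch of the kind of map this describes, holding a distribution over region labels at each grid cell; the grid size, label set, and update rule here are assumptions rather than the paper's method.

```python
# Semantic map as per-cell label counts, normalized on read. Illustrative.
import numpy as np

LABELS = ["kitchen", "hallway", "living room"]  # assumed label set
grid = np.ones((64, 64, len(LABELS)))           # uniform initial counts

def update_cell(ix, iy, label_scores):
    """Accumulate egocentric label evidence projected into a global cell."""
    grid[ix, iy] += np.asarray(label_scores)

def region_posterior(ix, iy):
    """Normalize accumulated counts into a distribution over region labels."""
    counts = grid[ix, iy]
    return counts / counts.sum()
```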
arXiv Detail & Related papers (2024-03-11T18:09:50Z)
- Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z)
- Closing the Loop: Graph Networks to Unify Semantic Objects and Visual Features for Multi-object Scenes [2.236663830879273]
Loop Closure Detection (LCD) minimizes drift by recognizing previously visited places.
Visual Bag-of-Words (vBoW) has been an LCD algorithm of choice for many state-of-the-art SLAM systems.
This paper proposes SymbioLCD2, which creates a unified graph structure to integrate semantic objects and visual features symbiotically.
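For context, a minimal sketch of the classic vBoW scoring that such LCD systems build on (not SymbioLCD2's unified graph); the vocabulary is assumed to be pre-trained binary (uint8) cluster centres over ORB descriptors.

```python
# Generic visual bag-of-words place scoring. Illustrative only.
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=500)

def bow_histogram(gray_image, vocabulary):
    """Quantize ORB descriptors against a (k, 32) uint8 visual vocabulary."""
    _, desc = orb.detectAndCompute(gray_image, None)
    if desc is None:
        return np.zeros(len(vocabulary))
    # Hamming distance from each descriptor to each visual word via XOR.
    dists = np.unpackbits(desc[:, None, :] ^ vocabulary[None, :, :],
                          axis=2).sum(axis=2)
    hist = np.bincount(dists.argmin(axis=1), minlength=len(vocabulary))
    return hist / max(hist.sum(), 1)

def lcd_score(hist_a, hist_b):
    """Cosine similarity of place histograms; high scores are LCD candidates."""
    denom = np.linalg.norm(hist_a) * np.linalg.norm(hist_b) + 1e-9
    return float(hist_a @ hist_b / denom)
```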
arXiv Detail & Related papers (2022-09-24T00:42:33Z)
- PLD-SLAM: A Real-Time Visual SLAM Using Points and Line Segments in Dynamic Scenes [0.0]
This paper proposes a real-time stereo indirect visual SLAM system, PLD-SLAM, which combines point and line features.
We also present a novel global gray similarity (GGS) algorithm to achieve reasonable selection and efficient loop closure detection.
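The GGS algorithm itself is not defined in this summary, so the following is only a generic stand-in suggested by the name: a whole-image grayscale similarity on downsampled frames, usable to pre-filter loop closure candidates.

```python
# Generic global grayscale similarity; NOT the paper's GGS definition.
import cv2
import numpy as np

def global_gray_similarity(img_a, img_b, size=(32, 32)):
    """Normalized cross-correlation of downsampled grayscale frames (BGR input)."""
    a = cv2.resize(cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY), size).astype(np.float32)
    b = cv2.resize(cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY), size).astype(np.float32)
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b) + 1e-9
    return float((a * b).sum() / denom)  # 1.0 = identical global appearance
```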
arXiv Detail & Related papers (2022-07-22T07:40:00Z)
- Retrieval and Localization with Observation Constraints [12.010135672015704]
We propose an integrated visual re-localization method called RLOCS.
It combines image retrieval, semantic consistency and geometry verification to achieve accurate estimations.
Our method achieves performance improvements on challenging localization benchmarks.
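As a hedged sketch of the retrieve-then-verify pattern such re-localization systems follow (the semantic-consistency stage is omitted, and the descriptors, matcher interface, and thresholds are assumptions):

```python
# Generic retrieval + geometric verification for re-localization. Illustrative.
import cv2
import numpy as np

def retrieve(query_desc, db_descs, k=5):
    """Rank database frames by cosine similarity of global descriptors."""
    sims = db_descs @ query_desc / (
        np.linalg.norm(db_descs, axis=1) * np.linalg.norm(query_desc) + 1e-9)
    return np.argsort(-sims)[:k]

def geometric_ok(query_kpts, db_kpts, matches, min_inliers=30):
    """Accept a candidate if enough local matches survive RANSAC."""
    if len(matches) < 4:  # findHomography needs at least 4 correspondences
        return False
    src = np.float32([query_kpts[m.queryIdx].pt for m in matches])
    dst = np.float32([db_kpts[m.trainIdx].pt for m in matches])
    _, mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return mask is not None and int(mask.sum()) >= min_inliers
```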
arXiv Detail & Related papers (2021-08-19T06:14:33Z)
- Goal-Oriented Gaze Estimation for Zero-Shot Learning [62.52340838817908]
We introduce a novel goal-oriented gaze estimation module (GEM) to improve the discriminative attribute localization.
We aim to predict the actual human gaze location to get the visual attention regions for recognizing a novel object guided by attribute description.
This work suggests the promising benefits of collecting human gaze datasets and automatic gaze estimation algorithms for high-level computer vision tasks.
arXiv Detail & Related papers (2021-03-05T02:14:57Z)
- Spatial Language Understanding for Object Search in Partially Observed Cityscale Environments [21.528770932332474]
We introduce the spatial language observation space and formulate a model under the framework of Partially Observable Markov Decision Processes (POMDPs).
We propose a convolutional neural network model that learns to predict the language provider's relative frame of reference (FoR) given environment context.
We demonstrate the generalizability of our FoR prediction model and object search system through cross-validation over areas of five cities, each with a 40,000 m$^2$ footprint.
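A worked miniature of the belief update at the heart of such a POMDP formulation; the Gaussian likelihood below is a placeholder standing in for the paper's learned FoR-conditioned observation model.

```python
# Bayes belief update over object location given a spatial-language
# observation. The likelihood model here is an assumed placeholder.
import numpy as np

# Uniform prior belief over the object's location on a 40x40 city grid.
belief = np.full((40, 40), 1.0 / 1600)

def update_belief(prior, likelihood):
    """Posterior = prior * P(observation | location), renormalized."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

# Placeholder likelihood for a phrase like "behind the cafe": mass placed
# past a landmark at (20, 20) along an assumed frame of reference (+y).
ys, xs = np.mgrid[0:40, 0:40]
likelihood = np.exp(-((xs - 20) ** 2 + (ys - 26) ** 2) / 20.0)
belief = update_belief(belief, likelihood)
```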
arXiv Detail & Related papers (2020-12-04T16:27:59Z)
- Spatial Pyramid Based Graph Reasoning for Semantic Segmentation [67.47159595239798]
We apply graph convolution to the semantic segmentation task and propose an improved Laplacian.
The graph reasoning is directly performed in the original feature space organized as a spatial pyramid.
We achieve comparable performance with lower computational and memory overhead.
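The "improved Laplacian" is not specified in this summary; the sketch below shows the standard symmetric normalized Laplacian and one propagation step that graph-reasoning modules of this kind start from.

```python
# Standard normalized graph Laplacian and one graph-convolution step.
import numpy as np

def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2} for a dense adjacency matrix A."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-9)))
    return np.eye(len(adj)) - d_inv_sqrt @ adj @ d_inv_sqrt

def graph_reason(features, adj, weight):
    """One propagation step over node features (e.g., spatial-pyramid cells)."""
    propagate = np.eye(len(adj)) - normalized_laplacian(adj)  # D^{-1/2} A D^{-1/2}
    return np.maximum(propagate @ features @ weight, 0.0)     # ReLU activation
```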
arXiv Detail & Related papers (2020-03-23T12:28:07Z)