Mapping the Unseen: Unified Promptable Panoptic Mapping with Dynamic Labeling using Foundation Models
- URL: http://arxiv.org/abs/2405.02162v3
- Date: Thu, 10 Oct 2024 16:03:42 GMT
- Title: Mapping the Unseen: Unified Promptable Panoptic Mapping with Dynamic Labeling using Foundation Models
- Authors: Mohamad Al Mdfaa, Raghad Salameh, Sergey Zagoruyko, Gonzalo Ferrer
- Abstract summary: We introduce the Unified Promptable Panoptic Mapping (UPPM) method.
UPPM incorporates a dynamic labeling strategy into traditional panoptic mapping techniques.
Results show that UPPM can accurately reconstruct scenes and segment objects while generating rich semantic labels.
- Score: 3.127265144073288
- Abstract: In the field of robotics and computer vision, efficient and accurate semantic mapping remains a significant challenge due to the growing demand for intelligent machines that can comprehend and interact with complex environments. Conventional panoptic mapping methods, however, are limited by predefined semantic classes, thus making them ineffective for handling novel or unforeseen objects. In response to this limitation, we introduce the Unified Promptable Panoptic Mapping (UPPM) method. UPPM utilizes recent advances in foundation models to enable real-time, on-demand label generation using natural language prompts. By incorporating a dynamic labeling strategy into traditional panoptic mapping techniques, UPPM provides significant improvements in adaptability and versatility while maintaining high performance levels in map reconstruction. We demonstrate our approach on real-world and simulated datasets. Results show that UPPM can accurately reconstruct scenes and segment objects while generating rich semantic labels through natural language interactions. A series of ablation experiments validated the advantages of foundation model-based labeling over fixed label sets.
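The abstract describes dynamic labeling only at a high level. As a rough illustration, a minimal Python sketch of such an update loop might look as follows; `segment_with_prompt` and `associate` are hypothetical stand-ins for the promptable foundation model and the data-association step, not interfaces from the paper.
```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class MapInstance:
    """One reconstructed object in the panoptic map."""
    instance_id: int
    label_votes: Counter = field(default_factory=Counter)

    @property
    def label(self) -> str:
        # Dynamic labeling: the label is the current consensus, not a
        # fixed class, so it can be refined by later prompts.
        return self.label_votes.most_common(1)[0][0]

def update_map(instances, frame, prompt, segment_with_prompt, associate):
    """Fuse one frame's promptable segmentations into the map.

    segment_with_prompt(frame, prompt) -> iterable of (mask, label) pairs
    is a stand-in for a promptable foundation model; associate(mask) -> id
    is a stand-in for geometric data association. Both are hypothetical.
    """
    for mask, label in segment_with_prompt(frame, prompt):
        inst_id = associate(mask)
        inst = instances.setdefault(inst_id, MapInstance(inst_id))
        inst.label_votes[label] += 1
```
The key point is that labels accumulate as votes rather than being drawn from a fixed taxonomy, so the map can absorb novel categories introduced by new prompts.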
Related papers
- Flex: End-to-End Text-Instructed Visual Navigation with Foundation Models [59.892436892964376]
We investigate the minimal data requirements and architectural adaptations necessary to achieve robust closed-loop performance with vision-based control policies.
Our findings are synthesized in Flex (Fly-lexically), a framework that uses pre-trained Vision Language Models (VLMs) as frozen patch-wise feature extractors.
We demonstrate the effectiveness of this approach on quadrotor fly-to-target tasks, where agents trained via behavior cloning successfully generalize to real-world scenes.
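A minimal PyTorch sketch of this frozen-backbone recipe, assuming a backbone that returns patch-wise features; the pooling, head size, and MSE behavior-cloning loss are illustrative choices, not Flex's exact design.
```python
import torch
import torch.nn as nn

class PatchFeaturePolicy(nn.Module):
    def __init__(self, backbone: nn.Module, feat_dim: int, act_dim: int):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():
            p.requires_grad = False          # frozen VLM: no gradient updates
        self.head = nn.Sequential(           # only this small head is trained
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, act_dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patches = self.backbone(images)  # (B, num_patches, feat_dim)
        return self.head(patches.mean(dim=1))  # pooled patches -> action

def bc_step(model, opt, images, expert_actions):
    """One behavior-cloning step: regress the expert's action."""
    loss = nn.functional.mse_loss(model(images), expert_actions)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```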
arXiv Detail & Related papers (2024-10-16T19:59:31Z)
- Mapping High-level Semantic Regions in Indoor Environments without Object Recognition [50.624970503498226]
The present work proposes a method for semantic region mapping via embodied navigation in indoor environments.
To enable region identification, the method uses a vision-to-language model to provide scene information for mapping.
By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location.
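A minimal sketch of such a map, assuming a fixed 2D grid and a hypothetical set of region labels; the multiplicative update below is one standard way to fuse per-view label scores into a per-cell distribution.
```python
import numpy as np

REGIONS = ["kitchen", "bedroom", "bathroom", "hallway"]  # hypothetical labels
grid = np.full((100, 100, len(REGIONS)), 1.0 / len(REGIONS))  # uniform prior

def fuse_observation(grid, cells, label_probs):
    """Bayes-style multiplicative update of the cells visible in this view.

    cells: (N, 2) global-frame grid indices covered by the egocentric view.
    label_probs: per-region scores from the vision-to-language model.
    """
    for r, c in cells:
        posterior = grid[r, c] * label_probs
        grid[r, c] = posterior / posterior.sum()

# Example: the current view is scored as mostly "kitchen".
fuse_observation(grid, np.array([[50, 50], [50, 51]]),
                 np.array([0.7, 0.1, 0.1, 0.1]))
print(grid[50, 50])  # distribution now peaked on "kitchen"
```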
arXiv Detail & Related papers (2024-03-11T18:09:50Z)
- Joint-Embedding Masked Autoencoder for Self-supervised Learning of Dynamic Functional Connectivity from the Human Brain [18.165807360855435]
Graph Neural Networks (GNNs) have shown promise in learning dynamic functional connectivity for distinguishing phenotypes from human brain networks.
We introduce the Spatio-Temporal Joint Embedding Masked Autoencoder (ST-JEMA), drawing inspiration from the Joint Embedding Predictive Architecture (JEPA) in computer vision.
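For context, a minimal sketch of the JEPA-style objective the summary alludes to: predict the latent embeddings of masked inputs from visible context, with the loss in embedding space. The encoders, shapes, and zero-masking are assumptions, not ST-JEMA's exact formulation.
```python
import torch
import torch.nn as nn

def jepa_loss(context_encoder: nn.Module, target_encoder: nn.Module,
              predictor: nn.Module, x: torch.Tensor,
              mask: torch.Tensor) -> torch.Tensor:
    """x: (B, T, D) connectivity features; mask: (B, T) bool, True = masked."""
    with torch.no_grad():                    # target branch gets no gradients
        targets = target_encoder(x)          # (B, T, H) latent targets
    visible = x.masked_fill(mask.unsqueeze(-1), 0.0)  # hide masked timepoints
    preds = predictor(context_encoder(visible))       # (B, T, H) predictions
    # Loss only on masked positions, in embedding space rather than raw data.
    return nn.functional.mse_loss(preds[mask], targets[mask])
```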
arXiv Detail & Related papers (2024-03-11T04:49:41Z)
- Background Activation Suppression for Weakly Supervised Object Localization and Semantic Segmentation [84.62067728093358]
Weakly supervised object localization and semantic segmentation aim to localize objects using only image-level labels.
A new paradigm has emerged that generates a foreground prediction map to achieve pixel-level localization.
This paper presents two striking experimental observations about the object localization learning process.
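A minimal sketch of the foreground-prediction-map paradigm mentioned above, with toy values; the simple thresholding rule is illustrative, not the paper's method.
```python
import numpy as np

def box_from_foreground_map(fg: np.ndarray, thresh: float = 0.5):
    """fg: (H, W) foreground scores in [0, 1] -> (r0, c0, r1, c1) or None."""
    rows, cols = np.where(fg >= thresh)
    if rows.size == 0:
        return None                          # nothing predicted as foreground
    return rows.min(), cols.min(), rows.max(), cols.max()

fg = np.zeros((8, 8))
fg[2:5, 3:7] = 0.9                           # toy foreground blob
print(box_from_foreground_map(fg))           # -> (2, 3, 4, 6)
```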
arXiv Detail & Related papers (2023-09-22T15:44:10Z)
- A Multi-label Classification Approach to Increase Expressivity of EMG-based Gesture Recognition [4.701158597171363]
The aim of this study is to efficiently increase the expressivity of surface electromyography-based (sEMG) gesture recognition systems.
We use a problem transformation approach in which actions are split into two biomechanically independent components.
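A minimal sketch of this problem transformation, with hypothetical component sets (the paper's actual components and classifiers differ): each sub-classifier solves a small independent problem, and the pair of predictions recombines into a richer composite label.
```python
import numpy as np

WRIST = ["rest", "flexion", "extension"]     # hypothetical component 1
GRASP = ["open", "closed", "pinch"]          # hypothetical component 2

def predict_gesture(emg_window: np.ndarray, wrist_clf, grasp_clf) -> str:
    """Each classifier solves one small, independent sub-problem."""
    w = WRIST[wrist_clf(emg_window)]
    g = GRASP[grasp_clf(emg_window)]
    return f"{w}+{g}"                        # 3 x 3 = 9 composite gestures

# With toy stand-in classifiers:
print(predict_gesture(np.zeros(64), lambda x: 1, lambda x: 2))  # flexion+pinch
```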
arXiv Detail & Related papers (2023-09-13T20:21:41Z)
- Knowledge-augmented Frame Semantic Parsing with Hybrid Prompt-tuning [17.6573121083417]
We propose a Knowledge-Augmented Frame Semantic Parsing Architecture (KAF-SPA) to enhance semantic representation.
A Memory-based Knowledge Extraction Module (MKEM) is devised to select accurate frame knowledge and construct the continuous templates.
We also design a Task-oriented Knowledge Probing Module (TKPM) using hybrid prompts to incorporate the selected knowledge into the PLMs and adapt PLMs to the tasks of frame and argument identification.
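A minimal sketch of the hybrid-prompt idea, assuming a frozen PLM that accepts input embeddings; the continuous-template shape and initialization are illustrative, not KAF-SPA's exact design.
```python
import torch
import torch.nn as nn

class HybridPrompt(nn.Module):
    """Learnable continuous template prepended to discrete prompt tokens."""
    def __init__(self, n_soft: int, hidden: int):
        super().__init__()
        self.soft = nn.Parameter(torch.randn(n_soft, hidden) * 0.02)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        """token_embeds: (B, T, H) embeddings of the discrete prompt part."""
        soft = self.soft.unsqueeze(0).expand(token_embeds.size(0), -1, -1)
        return torch.cat([soft, token_embeds], dim=1)  # (B, n_soft + T, H)
```
Only the continuous template is trained; the PLM's own weights stay frozen, which is what makes the prompt "hybrid" (discrete tokens plus learned vectors).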
arXiv Detail & Related papers (2023-03-25T06:41:19Z)
- Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels [60.675714333081466]
Multi-label recognition (MLR) with incomplete labels is very challenging.
Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations.
We advocate remedying the deficiency of label supervision for MLR with incomplete labels by deriving a structured semantic prior.
arXiv Detail & Related papers (2023-03-23T12:39:20Z)
- UIA-ViT: Unsupervised Inconsistency-Aware Method based on Vision Transformer for Face Forgery Detection [52.91782218300844]
We propose a novel Unsupervised Inconsistency-Aware method based on Vision Transformer, called UIA-ViT.
Due to the self-attention mechanism, the attention map among patch embeddings naturally represents the consistency relation, making the Vision Transformer well suited to consistency representation learning.
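A minimal sketch of that observation: a dot-product attention map over patch embeddings, whose entries can be read as pairwise consistency scores. This is illustrative only, not UIA-ViT's training objective.
```python
import torch

def patch_attention(patch_embeds: torch.Tensor) -> torch.Tensor:
    """patch_embeds: (B, N, D) -> (B, N, N) row-normalized attention map."""
    d = patch_embeds.size(-1)
    scores = patch_embeds @ patch_embeds.transpose(1, 2) / d ** 0.5
    # Entry (i, j) can be read as how consistent patch i is with patch j;
    # forged regions tend to attend weakly to pristine ones.
    return scores.softmax(dim=-1)
```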
arXiv Detail & Related papers (2022-10-23T15:24:47Z)
- Graph Adaptive Semantic Transfer for Cross-domain Sentiment Classification [68.06496970320595]
Cross-domain sentiment classification (CDSC) aims to use the transferable semantics learned from the source domain to predict the sentiment of reviews in the unlabeled target domain.
We present Graph Adaptive Semantic Transfer (GAST) model, an adaptive syntactic graph embedding method that is able to learn domain-invariant semantics from both word sequences and syntactic graphs.
arXiv Detail & Related papers (2022-05-18T07:47:01Z)
- Lightweight Object-level Topological Semantic Mapping and Long-term Global Localization based on Graph Matching [19.706907816202946]
We present a novel lightweight object-level mapping and localization method with high accuracy and robustness.
We use object-level features with both semantic and geometric information to model landmarks in the environment.
Based on the proposed map, robust localization is achieved by constructing a novel local semantic scene graph descriptor.
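A minimal sketch of one plausible local semantic scene graph descriptor, built from a landmark's class plus the classes and distances of its nearest neighbors; the paper's actual descriptor and matching scheme differ in the details.
```python
import numpy as np

def descriptor(idx, classes, positions, k=3):
    """Class of landmark `idx` plus classes/distances of its k neighbors."""
    d = np.linalg.norm(positions - positions[idx], axis=1)
    neighbors = np.argsort(d)[1:k + 1]       # skip self (distance 0)
    return (classes[idx],
            tuple(sorted((classes[j], round(float(d[j]), 1))
                         for j in neighbors)))

classes = ["chair", "table", "chair", "lamp"]
positions = np.array([[0., 0.], [1., 0.], [0., 1.], [2., 2.]])
print(descriptor(0, classes, positions, k=2))
# -> ('chair', (('chair', 1.0), ('table', 1.0)))
```
Localization then reduces to matching query descriptors against the map's descriptors, which is cheap because each descriptor summarizes only a small local neighborhood.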
arXiv Detail & Related papers (2022-01-16T05:47:07Z)
- Generating Synthetic Data for Task-Oriented Semantic Parsing with Hierarchical Representations [0.8203855808943658]
In this work, we explore the possibility of generating synthetic data for neural semantic parsing.
Specifically, we first extract masked templates from the existing labeled utterances, and then fine-tune BART to generate synthetic utterances conditioned on the extracted templates.
We show the potential of our approach when evaluated on the Facebook TOP dataset for the navigation domain.
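A minimal sketch of the template-extraction step, with a simplified slot format; the paper then fine-tunes BART to fill such masks with novel values.
```python
def extract_template(utterance: str, slots: dict) -> str:
    """Replace each slot's surface value with a named mask token."""
    template = utterance
    for name, value in slots.items():
        template = template.replace(value, f"<mask:{name}>")
    return template

print(extract_template("drive to the coffee shop on main street",
                       {"destination": "the coffee shop",
                        "street": "main street"}))
# -> "drive to <mask:destination> on <mask:street>"
```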
arXiv Detail & Related papers (2020-11-03T22:55:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information listed here and is not responsible for any consequences of its use.