Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models
- URL: http://arxiv.org/abs/2409.15451v1
- Date: Mon, 23 Sep 2024 18:26:19 GMT
- Title: Tag Map: A Text-Based Map for Spatial Reasoning and Navigation with Large Language Models
- Authors: Mike Zhang, Kaixian Qu, Vaishakh Patil, Cesar Cadena, Marco Hutter
- Abstract summary: Large Language Models (LLMs) have emerged as a tool for robots to generate task plans using common sense reasoning.
Recent works have shifted from explicit maps with fixed semantic classes to implicit open vocabulary maps.
We propose an explicit text-based map that can represent thousands of semantic classes while easily integrating with LLMs.
- Score: 15.454856838083511
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Large Language Models (LLMs) have emerged as a tool for robots to generate task plans using common sense reasoning. For the LLM to generate actionable plans, scene context must be provided, often through a map. Recent works have shifted from explicit maps with fixed semantic classes to implicit open vocabulary maps based on queryable embeddings capable of representing any semantic class. However, embeddings cannot directly report the scene context as they are implicit, requiring further processing for LLM integration. To address this, we propose an explicit text-based map that builds upon large-scale image recognition models to represent thousands of semantic classes and, being text-based, integrates easily with LLMs. We study how entities in our map can be localized and show through evaluations that our text-based map localizations perform comparably to those from open vocabulary maps while using two to four orders of magnitude less memory. Real-robot experiments demonstrate the grounding of an LLM with the text-based map to solve user tasks.
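To make the idea concrete, here is a minimal Python sketch of how such an explicit text-based map could be organized, assuming an off-the-shelf image recognition (tagging) model supplies a list of tag strings per frame; the names TagMap, Viewpoint, add_observation, localize, and to_prompt_context are illustrative and not taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Viewpoint:
    """A mapped camera frame, reduced to an ID and a world-frame position."""
    frame_id: int
    position: tuple  # (x, y, z)

class TagMap:
    """Hypothetical text-based map: each recognized tag (a plain string such
    as "sofa") is stored with the viewpoints in which it was observed, rather
    than as a dense per-point embedding."""

    def __init__(self):
        self.tag_to_viewpoints = defaultdict(list)

    def add_observation(self, tags, viewpoint):
        """Record the tags reported by the image recognition model for one frame."""
        for tag in tags:
            self.tag_to_viewpoints[tag].append(viewpoint)

    def localize(self, query_tag):
        """Coarse localization: positions of all viewpoints that observed the tag."""
        return [vp.position for vp in self.tag_to_viewpoints.get(query_tag, [])]

    def to_prompt_context(self):
        """Render the map as plain text suitable for inclusion in an LLM prompt."""
        return "\n".join(
            f"{tag}: seen from {len(vps)} viewpoint(s)"
            for tag, vps in sorted(self.tag_to_viewpoints.items())
        )

# Example usage with made-up values:
tag_map = TagMap()
tag_map.add_observation(["sofa", "coffee table"], Viewpoint(0, (1.0, 2.0, 0.0)))
tag_map.add_observation(["sofa", "lamp"], Viewpoint(1, (1.5, 2.2, 0.0)))
print(tag_map.localize("sofa"))       # [(1.0, 2.0, 0.0), (1.5, 2.2, 0.0)]
print(tag_map.to_prompt_context())    # text that can be pasted into an LLM prompt
```

Because each entry is only a tag string plus viewpoint references, the storage cost per map element is small compared to keeping a dense embedding vector per point, which is consistent with the two-to-four orders of magnitude memory savings reported in the abstract.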
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
Current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- Vision Language Models Can Parse Floor Plan Maps [5.902912356816188]
Vision language models (VLMs) can simultaneously reason about images and texts to tackle many tasks.
This paper focuses on map parsing, a novel task that is unexplored within the VLM context.
arXiv Detail & Related papers (2024-09-19T15:36:28Z)
- Mapping High-level Semantic Regions in Indoor Environments without Object Recognition [50.624970503498226]
The present work proposes a method for semantic region mapping via embodied navigation in indoor environments.
To enable region identification, the method uses a vision-to-language model to provide scene information for mapping.
By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location; a rough sketch of this idea follows this entry.
arXiv Detail & Related papers (2024-03-11T18:09:50Z)
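As a rough, generic illustration of the mechanism summarized in the entry above (a semantic map kept as a distribution over possible region labels at each location), the sketch below accumulates projected egocentric label probabilities on a global 2D grid; the label set, grid size, and the name RegionLabelGrid are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

REGION_LABELS = ["kitchen", "bedroom", "bathroom", "living room"]  # example label set

class RegionLabelGrid:
    """Global 2D grid where each cell accumulates evidence for region labels."""

    def __init__(self, size=(100, 100), num_labels=len(REGION_LABELS)):
        self.counts = np.zeros((*size, num_labels))  # unnormalized label evidence

    def integrate(self, cell_xy, label_probs):
        """Fuse one egocentric prediction (already projected into the global
        frame) into the grid cell it falls in."""
        x, y = cell_xy
        self.counts[x, y] += label_probs

    def distribution(self, cell_xy):
        """Normalized label distribution at a cell (uniform if unobserved)."""
        x, y = cell_xy
        total = self.counts[x, y].sum()
        if total == 0:
            return np.full(len(REGION_LABELS), 1.0 / len(REGION_LABELS))
        return self.counts[x, y] / total

# Example: two observations that both favor "kitchen" at the same cell.
grid = RegionLabelGrid()
grid.integrate((10, 12), np.array([0.7, 0.1, 0.1, 0.1]))
grid.integrate((10, 12), np.array([0.6, 0.2, 0.1, 0.1]))
print(dict(zip(REGION_LABELS, grid.distribution((10, 12)).round(2))))
```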
- Sat2Cap: Mapping Fine-Grained Textual Descriptions from Satellite Images [12.356676398446215]
We propose a weakly supervised approach for creating maps using free-form textual descriptions.
We train a contrastive learning framework called Sat2Cap on a new large-scale dataset with 6.1M pairs of overhead and ground-level images.
arXiv Detail & Related papers (2023-07-29T06:23:51Z)
- BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs [101.50522135049198]
BuboGPT is a multi-modal LLM with visual grounding that can perform cross-modal interaction between vision, audio and language.
Our contributions are two-fold: 1) an off-the-shelf visual grounding module based on SAM that extracts entities in a sentence and finds the corresponding masks in the image.
Our experiments show that BuboGPT achieves impressive multi-modal understanding and visual grounding abilities during its interaction with humans.
arXiv Detail & Related papers (2023-07-17T15:51:47Z)
- Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both fine-grained object details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z)
- Visual Language Maps for Robot Navigation [30.33041779258644]
Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data.
We propose VLMaps, a spatial map representation that directly fuses pretrained visual-language features with a 3D reconstruction of the physical world.
arXiv Detail & Related papers (2022-10-11T18:13:20Z)
- Open-vocabulary Queryable Scene Representations for Real World Planning [56.175724306976505]
Large language models (LLMs) have unlocked new capabilities of task planning from human instructions.
However, prior attempts to apply LLMs to real-world robotic tasks are limited by the lack of grounding in the surrounding scene.
We develop NLMap, an open-vocabulary and queryable scene representation to address this problem.
arXiv Detail & Related papers (2022-09-20T17:29:56Z)
- Synthetic Map Generation to Provide Unlimited Training Data for Historical Map Text Detection [5.872532529455414]
We propose a method to automatically generate an unlimited amount of annotated historical map images for training text detection models.
We show that the state-of-the-art text detection models can benefit from the synthetic historical maps.
arXiv Detail & Related papers (2021-12-12T00:27:03Z)
- Rethinking Localization Map: Towards Accurate Object Perception with Self-Enhancement Maps [78.2581910688094]
This work introduces a novel self-enhancement method to harvest accurate object localization maps and object boundaries with only category labels as supervision.
In particular, the proposed Self-Enhancement Maps achieve the state-of-the-art localization accuracy of 54.88% on ILSVRC.
arXiv Detail & Related papers (2020-06-09T12:35:55Z)