Interactive Navigation in Environments with Traversable Obstacles Using
Large Language and Vision-Language Models
- URL: http://arxiv.org/abs/2310.08873v3
- Date: Wed, 13 Mar 2024 02:53:30 GMT
- Title: Interactive Navigation in Environments with Traversable Obstacles Using
Large Language and Vision-Language Models
- Authors: Zhen Zhang, Anran Lin, Chun Wai Wong, Xiangyu Chu, Qi Dou, and K. W.
Samuel Au
- Abstract summary: This paper proposes an interactive navigation framework using large language and vision-language models.
We create an action-aware costmap to perform effective path planning without fine-tuning.
All experimental results demonstrated the proposed framework's effectiveness and adaptability to diverse environments.
- Score: 14.871309526022516
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes an interactive navigation framework using large
language and vision-language models, allowing robots to navigate in
environments with traversable obstacles. We utilize a large language model
(GPT-3.5) and an open-set vision-language model (Grounding DINO) to create an
action-aware costmap for effective path planning without fine-tuning. With the
large models, we can achieve an end-to-end system from textual instructions
like "Can you pass through the curtains to deliver medicines to me?" to
bounding boxes (e.g., curtains) with action-aware attributes. These boxes and
attributes are used to segment LiDAR point clouds into traversable and
untraversable parts, from which an action-aware costmap is constructed to
generate a feasible path. The pre-trained large models have strong
generalization ability and require no additional annotated data for training,
allowing fast deployment in interactive navigation tasks. For verification, we
instruct the robot to traverse multiple types of traversable objects, such as
curtains and grass; traversing curtains in a medical scenario was also tested.
All experimental results demonstrated the proposed framework's effectiveness
and adaptability to diverse environments.
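The abstract describes a concrete pipeline: instruction, LLM parsing, open-set detection, point-cloud segmentation, and costmap construction. Below is a minimal sketch of that pipeline under stated assumptions; the `query_llm` and `detect_boxes` placeholders, the projection helper, the grid size, resolution, and cost values are all illustrative and not taken from the paper.

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Box:
    """Axis-aligned 2D image box (stand-in for a detector output)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def contains(self, u: float, v: float) -> bool:
        return self.x1 <= u <= self.x2 and self.y1 <= v <= self.y2


def query_llm(instruction: str) -> dict:
    """Placeholder for a GPT-3.5 call that extracts the target object and
    whether the requested action makes it traversable, e.g.
    "Can you pass through the curtains ...?" ->
    {"object": "curtains", "traversable": True}. Swap in your LLM client."""
    raise NotImplementedError


def detect_boxes(image, text_query: str) -> List[Box]:
    """Placeholder for open-set detection (e.g. Grounding DINO) returning
    image-space boxes for the queried object."""
    raise NotImplementedError


def split_point_cloud(points_xyz: np.ndarray,
                      boxes: List[Box],
                      project: Callable[[np.ndarray], Tuple[float, float]]):
    """Label LiDAR points whose image projection falls inside a box of a
    traversable object; everything else stays untraversable."""
    mask = np.zeros(len(points_xyz), dtype=bool)
    for i, p in enumerate(points_xyz):
        u, v = project(p)  # assumed LiDAR-to-camera projection helper
        mask[i] = any(b.contains(u, v) for b in boxes)
    return points_xyz[mask], points_xyz[~mask]


def action_aware_costmap(traversable_pts: np.ndarray,
                         untraversable_pts: np.ndarray,
                         size: Tuple[int, int] = (200, 200),
                         resolution: float = 0.05,
                         soft_cost: int = 20,
                         hard_cost: int = 254) -> np.ndarray:
    """Rasterise the two point sets into a 2D costmap: obstacles the robot
    may pass through get a low (soft) cost, the rest are lethal."""
    grid = np.zeros(size, dtype=np.uint8)
    for (x, y, _z) in traversable_pts:
        i, j = int(x / resolution), int(y / resolution)
        if 0 <= i < size[0] and 0 <= j < size[1]:
            grid[i, j] = soft_cost
    for (x, y, _z) in untraversable_pts:  # lethal cost overrides soft cost
        i, j = int(x / resolution), int(y / resolution)
        if 0 <= i < size[0] and 0 <= j < size[1]:
            grid[i, j] = hard_cost
    return grid  # feed to any grid planner (e.g. A*) to generate a path
```

A grid planner can then search the returned costmap, treating soft-cost cells as passable; this reflects the abstract's idea of planning through traversable obstacles rather than around them.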
Related papers
- IN-Sight: Interactive Navigation through Sight [20.184155117341497]
IN-Sight is a novel approach to self-supervised path planning.
It calculates traversability scores and incorporates them into a semantic map.
To precisely navigate around obstacles, IN-Sight employs a local planner.
arXiv Detail & Related papers (2024-08-01T07:27:54Z)
- Constrained Robotic Navigation on Preferred Terrains Using LLMs and Speech Instruction: Exploiting the Power of Adverbs [29.507826791509384]
This paper explores leveraging large language models for map-free off-road navigation using generative AI.
It proposes a method in which a robot receives verbal instructions that are converted to text with Whisper, and a large language model extracts landmarks, preferred terrains, and crucial adverbs that are translated into speed settings for constrained navigation (see the sketch after this list).
arXiv Detail & Related papers (2024-04-02T20:46:13Z)
- Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning [73.0990339667978]
Navigation in unfamiliar environments presents a major challenge for robots.
We use language models to bias exploration of novel real-world environments.
We evaluate LFG in challenging real-world environments and simulated benchmarks.
arXiv Detail & Related papers (2023-10-16T06:21:06Z)
- Multimodal Large Language Model for Visual Navigation [20.53387240108225]
Our approach aims to fine-tune large language models for visual navigation without extensive prompt engineering.
Our design involves a simple text prompt, current observations, and a history collector model that gathers information from previous observations as input.
We train our model using human demonstrations and collision signals from the Habitat-Matterport 3D dataset.
arXiv Detail & Related papers (2023-10-12T19:01:06Z)
- LangNav: Language as a Perceptual Representation for Navigation [63.90602960822604]
We explore the use of language as a perceptual representation for vision-and-language navigation (VLN).
Our approach uses off-the-shelf vision systems for image captioning and object detection to convert an agent's egocentric panoramic view at each time step into natural language descriptions.
arXiv Detail & Related papers (2023-10-11T20:52:30Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [38.503337052122234]
Large language models (LLMs) are shown to possess a wealth of actionable knowledge that can be extracted for robot manipulation.
We aim to synthesize robot trajectories for a variety of manipulation tasks given an open-set of instructions and an open-set of objects.
We demonstrate how the proposed framework can benefit from online experiences by efficiently learning a dynamics model for scenes that involve contact-rich interactions.
arXiv Detail & Related papers (2023-07-12T07:40:48Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
- Polyline Based Generative Navigable Space Segmentation for Autonomous Visual Navigation [57.3062528453841]
We propose a representation-learning-based framework to enable robots to learn the navigable space segmentation in an unsupervised manner.
We show that the proposed PSV-Nets can learn the visual navigable space with high accuracy, even without a single label.
arXiv Detail & Related papers (2021-10-29T19:50:48Z)
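For the "Exploiting the Power of Adverbs" entry above, the following is a minimal sketch of the speech-to-constraint pipeline it summarises (Whisper transcription, LLM extraction, adverb-to-speed mapping). The prompt wording, the `query_llm` placeholder, and the speed table are illustrative assumptions, not the authors' implementation.

```python
import json
import whisper  # openai-whisper: speech-to-text

# Illustrative adverb-to-speed mapping (m/s); values are assumptions.
ADVERB_SPEEDS = {"slowly": 0.3, "carefully": 0.3, "normally": 0.6, "quickly": 1.0}


def query_llm(prompt: str) -> str:
    """Placeholder for a large-language-model call that returns a JSON
    string; swap in your own provider's client here."""
    raise NotImplementedError


def parse_speech_instruction(audio_path: str) -> dict:
    # 1) Speech -> text with Whisper.
    text = whisper.load_model("base").transcribe(audio_path)["text"]
    # 2) LLM extracts landmarks, preferred terrains, and adverbs.
    prompt = (
        "Extract JSON with keys 'landmarks', 'terrains', 'adverbs' "
        f"from this navigation instruction: {text!r}"
    )
    parsed = json.loads(query_llm(prompt))
    # 3) Translate adverbs into a speed constraint for the planner.
    speeds = [ADVERB_SPEEDS[a] for a in parsed["adverbs"] if a in ADVERB_SPEEDS]
    parsed["max_speed"] = min(speeds) if speeds else 0.6  # assumed default
    return parsed
```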
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.