Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation Using Vision Language Models
- URL: http://arxiv.org/abs/2310.07937v3
- Date: Tue, 06 May 2025 14:06:58 GMT
- Title: Co-NavGPT: Multi-Robot Cooperative Visual Semantic Navigation Using Vision Language Models
- Authors: Bangguo Yu, Qihao Yuan, Kailai Li, Hamidreza Kasaei, Ming Cao
- Abstract summary: Co-NavGPT is a novel framework that integrates a Vision Language Model (VLM) as a global planner. Co-NavGPT aggregates sub-maps from multiple robots with diverse viewpoints into a unified global map. The VLM uses this information to assign frontiers across the robots, facilitating coordinated and efficient exploration.
- Score: 8.668211481067457
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual target navigation is a critical capability for autonomous robots operating in unknown environments, particularly in human-robot interaction scenarios. While classical and learning-based methods have shown promise, most existing approaches lack common-sense reasoning and are typically designed for single-robot settings, leading to reduced efficiency and robustness in complex environments. To address these limitations, we introduce Co-NavGPT, a novel framework that integrates a Vision Language Model (VLM) as a global planner to enable common-sense multi-robot visual target navigation. Co-NavGPT aggregates sub-maps from multiple robots with diverse viewpoints into a unified global map, encoding robot states and frontier regions. The VLM uses this information to assign frontiers across the robots, facilitating coordinated and efficient exploration. Experiments on the Habitat-Matterport 3D (HM3D) dataset demonstrate that Co-NavGPT outperforms existing baselines in terms of success rate and navigation efficiency, without requiring task-specific training. Ablation studies further confirm the importance of semantic priors from the VLM. We also validate the framework in real-world scenarios using quadrupedal robots. Supplementary video and code are available at: https://sites.google.com/view/co-navgpt2.
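The abstract above sketches a concrete pipeline: fuse per-robot sub-maps into one global map, extract frontier regions, and let a VLM assign one frontier to each robot. A minimal illustration of that flow is given below; the occupancy encoding, the plain-text prompt, and the `query_vlm` stub are assumptions for illustration, not the authors' implementation (which is linked from the project page).

```python
# Minimal sketch of Co-NavGPT-style global frontier assignment (illustrative only).
# Assumptions: the occupancy codes, the query_vlm() stub, and the plain-text prompt
# format are placeholders, not the authors' implementation.
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1

def merge_submaps(submaps):
    """Fuse per-robot occupancy grids: any robot's observation overrides unknown."""
    merged = np.full(submaps[0].shape, UNKNOWN)
    for m in submaps:
        merged = np.where(m != UNKNOWN, m, merged)
    return merged

def frontier_cells(grid):
    """Free cells with at least one unknown 4-neighbour."""
    cells = []
    h, w = grid.shape
    for y in range(h):
        for x in range(w):
            if grid[y, x] != FREE:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and grid[ny, nx] == UNKNOWN:
                    cells.append((y, x))
                    break
    return cells

def query_vlm(prompt):
    """Placeholder for the VLM call; a real system would send the map image + prompt."""
    return "robot 0 -> frontier 0\nrobot 1 -> frontier 1"

def assign_frontiers(robot_positions, frontiers):
    prompt = "Assign one frontier to each robot for efficient exploration.\n"
    prompt += "\n".join(f"robot {i} at {p}" for i, p in enumerate(robot_positions))
    prompt += "\n" + "\n".join(f"frontier {j} at {f}" for j, f in enumerate(frontiers))
    reply = query_vlm(prompt)
    assignment = {}
    for line in reply.splitlines():
        r, f = line.split("->")
        assignment[int(r.split()[1])] = int(f.split()[1])
    return assignment

if __name__ == "__main__":
    a = np.full((6, 6), UNKNOWN); a[1:4, 1:4] = FREE
    b = np.full((6, 6), UNKNOWN); b[3:5, 3:5] = FREE
    merged = merge_submaps([a, b])
    frontiers = frontier_cells(merged)
    print(assign_frontiers([(2, 2), (4, 4)], frontiers[:2]))
```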
Related papers
- General-Purpose Robotic Navigation via LVLM-Orchestrated Perception, Reasoning, and Acting [9.157222032441531]
Agentic Robotic Navigation Architecture (ARNA) is a general-purpose navigation framework that equips an LVLM-based agent with a library of perception, reasoning, and navigation tools. At runtime, the agent autonomously defines and executes task-specific navigation that iteratively queries the robotic modules, reasons over multimodal inputs, and selects appropriate navigation actions. ARNA achieves state-of-the-art performance, demonstrating effective exploration, navigation, and embodied question answering without relying on handcrafted plans, fixed input representations, or pre-existing maps.
arXiv Detail & Related papers (2025-06-20T20:06:14Z)
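The ARNA entry above describes an LVLM agent that iteratively queries a library of robot tools before acting. The sketch below shows such a loop under heavy assumptions: the tool names, the `query_lvlm` stub, and the stopping convention are hypothetical placeholders rather than ARNA's actual interface.

```python
# Illustrative sketch of an ARNA-style agent loop: an LVLM repeatedly picks a tool,
# observes its output, and finally emits a navigation action. All tools and the
# query_lvlm() function below are stubs, not ARNA's real modules.

def detect_objects(state):      # placeholder perception tool
    return "detected: door, chair"

def query_map(state):           # placeholder reasoning tool
    return "nearest unexplored region: corridor east"

def move_to(state):             # placeholder navigation tool
    return "action: navigate(corridor east)"

TOOLS = {"detect_objects": detect_objects, "query_map": query_map, "move_to": move_to}

def query_lvlm(context):
    """Placeholder for the LVLM; returns the next tool to call, then an action."""
    script = ["detect_objects", "query_map", "move_to"]
    return script[min(len(context), len(script) - 1)]

def run_agent(task, state, max_steps=5):
    context = []
    for _ in range(max_steps):
        tool = query_lvlm(context)
        result = TOOLS[tool](state)
        context.append((tool, result))
        if result.startswith("action:"):
            return result, context
    return "action: stop", context

print(run_agent("find the kitchen", state={}))
```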
- NavBench: A Unified Robotics Benchmark for Reinforcement Learning-Based Autonomous Navigation [16.554282855005766]
We present NavBench, a benchmark for training and evaluating Reinforcement Learning-based navigation policies. Our framework standardizes task definitions, enabling different robots to tackle various navigation challenges. By ensuring consistency between simulation and real-world deployment, NavBench simplifies the development of RL-based navigation strategies.
arXiv Detail & Related papers (2025-05-20T15:48:23Z)
- SayCoNav: Utilizing Large Language Models for Adaptive Collaboration in Decentralized Multi-Robot Navigation [10.877873071364148]
We present SayCoNav, a new approach that leverages large language models (LLMs) to automatically generate a collaboration strategy among a team of robots. We evaluate SayCoNav on Multi-Object Navigation (MultiON) tasks, which require the team of robots to utilize their complementary strengths to efficiently search for multiple different objects in unknown environments.
arXiv Detail & Related papers (2025-05-19T20:58:06Z)
- Deploying Foundation Model-Enabled Air and Ground Robots in the Field: Challenges and Opportunities [65.98704516122228]
The integration of foundation models (FMs) into robotics has enabled robots to understand natural language and reason about the semantics in their environments. This paper addresses the deployment of FM-enabled robots in the field, where missions often require a robot to operate in large-scale and unstructured environments. We present the first demonstration of large-scale LLM-enabled robot planning in unstructured environments with several kilometers of missions.
arXiv Detail & Related papers (2025-05-14T15:28:43Z)
- VL-Nav: Real-time Vision-Language Navigation with Spatial Reasoning [11.140494493881075]
We present a novel vision-language navigation (VL-Nav) system that integrates efficient spatial reasoning on low-power robots. Unlike prior methods that rely on a single image-level feature similarity to guide a robot, our method integrates pixel-wise vision-language features with curiosity-driven exploration. VL-Nav achieves an overall success rate of 86.3%, outperforming previous methods by 44.15%.
arXiv Detail & Related papers (2025-02-02T21:44:15Z)
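The VL-Nav entry above combines pixel-wise vision-language similarity with curiosity-driven exploration. A toy version of that goal-scoring step is sketched below; the maps, the weighting, and the candidate set are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch: score candidate goal points by pixel-wise vision-language similarity
# plus a curiosity term that favours rarely visited cells. All inputs are synthetic
# stand-ins for the real feature maps used in VL-Nav.
import numpy as np

def score_goals(similarity, visit_counts, candidates, curiosity_weight=0.5):
    """similarity: HxW per-pixel text-image similarity in [0, 1];
    visit_counts: HxW number of times each cell was observed;
    candidates: list of (row, col) candidate goal points."""
    curiosity = 1.0 / (1.0 + visit_counts)          # less-visited cells score higher
    combined = similarity + curiosity_weight * curiosity
    scores = [combined[r, c] for r, c in candidates]
    return candidates[int(np.argmax(scores))], scores

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.random((8, 8))                        # stand-in for CLIP-style similarity
    visits = rng.integers(0, 5, size=(8, 8))
    goals = [(1, 1), (3, 6), (6, 2)]
    best, scores = score_goals(sim, visits, goals)
    print("best goal:", best, "scores:", np.round(scores, 2))
```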
- LPAC: Learnable Perception-Action-Communication Loops with Applications to Coverage Control [80.86089324742024]
We propose a learnable Perception-Action-Communication (LPAC) architecture for the coverage control problem.
A convolutional neural network (CNN) processes localized perception; a graph neural network (GNN) facilitates robot communication.
Evaluations show that the LPAC models outperform standard decentralized and centralized coverage control algorithms.
arXiv Detail & Related papers (2024-01-10T00:08:00Z)
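The LPAC entry above pairs a CNN for localized perception with a GNN for inter-robot communication. The following PyTorch sketch shows one such perception-communication-action pass; the layer sizes, the mean-aggregation rule, and the random inputs are assumptions chosen for brevity, not the authors' architecture.

```python
# Schematic LPAC-style loop (perception -> communication -> action) in plain PyTorch.
# Everything below (dimensions, aggregation, inputs) is an illustrative assumption.
import torch
import torch.nn as nn

class LPACSketch(nn.Module):
    def __init__(self, feat_dim=16):
        super().__init__()
        # Perception: a small CNN encodes each robot's local map patch.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, feat_dim),
        )
        # Communication: one round of message passing over the robot graph.
        self.msg = nn.Linear(feat_dim, feat_dim)
        # Action: per-robot velocity command from own and aggregated features.
        self.action = nn.Linear(2 * feat_dim, 2)

    def forward(self, local_maps, adjacency):
        # local_maps: (N, 1, H, W) patches; adjacency: (N, N) 0/1 communication graph.
        feats = self.cnn(local_maps)                          # (N, feat_dim)
        messages = self.msg(feats)                            # (N, feat_dim)
        deg = adjacency.sum(dim=1, keepdim=True).clamp(min=1)
        aggregated = adjacency @ messages / deg               # mean over neighbours
        return self.action(torch.cat([feats, aggregated], dim=1))  # (N, 2)

if __name__ == "__main__":
    maps = torch.randn(3, 1, 16, 16)
    adj = torch.tensor([[0., 1., 0.], [1., 0., 1.], [0., 1., 0.]])
    print(LPACSketch()(maps, adj))
```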
- From Simulations to Reality: Enhancing Multi-Robot Exploration for Urban Search and Rescue [46.377510400989536]
We present a novel hybrid algorithm for efficient multi-robot exploration in unknown environments with limited communication and no global positioning information.
We redefine the local best and global best positions to suit scenarios without continuous target information.
The presented work holds promise for enhancing multi-robot exploration in scenarios with limited information and communication capabilities.
arXiv Detail & Related papers (2023-11-28T17:05:25Z)
- NoMaD: Goal Masked Diffusion Policies for Navigation and Exploration [57.15811390835294]
This paper describes how we can train a single unified diffusion policy to handle both goal-directed navigation and goal-agnostic exploration.
We show that this unified policy results in better overall performance when navigating to visually indicated goals in novel environments.
Our experiments, conducted on a real-world mobile robot platform, show effective navigation in unseen environments in comparison with five alternative methods.
arXiv Detail & Related papers (2023-10-11T21:07:14Z)
- SayNav: Grounding Large Language Models for Dynamic Planning to Navigation in New Environments [14.179677726976056]
SayNav is a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks.
SayNav achieves state-of-the-art results and even outperforms an oracle-based baseline with strong ground-truth assumptions by more than 8% in terms of success rate.
arXiv Detail & Related papers (2023-09-08T02:24:37Z)
- Target Search and Navigation in Heterogeneous Robot Systems with Deep Reinforcement Learning [3.3167319223959373]
We design a heterogeneous robot system consisting of a UAV and a UGV for search and rescue missions in unknown environments.
The system is able to search for targets and navigate to them in a maze-like mine environment with the policies learned through deep reinforcement learning algorithms.
arXiv Detail & Related papers (2023-08-01T07:09:14Z)
- Learning Hierarchical Interactive Multi-Object Search for Mobile Manipulation [10.21450780640562]
We introduce a novel interactive multi-object search task in which a robot has to open doors to navigate rooms and search inside cabinets and drawers to find target objects.
These new challenges require combining manipulation and navigation skills in unexplored environments.
We present HIMOS, a hierarchical reinforcement learning approach that learns to compose exploration, navigation, and manipulation skills.
arXiv Detail & Related papers (2023-07-12T12:25:33Z)
- ETPNav: Evolving Topological Planning for Vision-Language Navigation in Continuous Environments [56.194988818341976]
Vision-language navigation is a task that requires an agent to follow instructions to navigate in environments.
We propose ETPNav, which focuses on two critical skills: 1) the capability to abstract environments and generate long-range navigation plans, and 2) the ability of obstacle-avoiding control in continuous environments.
ETPNav yields more than 10% and 20% improvements over prior state-of-the-art on R2R-CE and RxR-CE datasets.
arXiv Detail & Related papers (2023-04-06T13:07:17Z)
- Audio Visual Language Maps for Robot Navigation [30.33041779258644]
We propose Audio-Visual-Language Maps (AVLMaps), a unified 3D spatial map representation for storing cross-modal information from audio, visual, and language cues.
AVLMaps integrate the open-vocabulary capabilities of multimodal foundation models pre-trained on Internet-scale data by fusing their features into a centralized 3D voxel grid.
In the context of navigation, we show that AVLMaps enable robot systems to index goals in the map based on multimodal queries, e.g., textual descriptions, images, or audio snippets of landmarks.
arXiv Detail & Related papers (2023-03-13T23:17:51Z)
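The AVLMaps entry above fuses multimodal features into a shared 3D voxel grid and retrieves goals by querying it. A toy voxel-feature map with cosine-similarity lookup is sketched below; the running-average fusion and the 16-dimensional random features stand in for the real encoders and fusion rule.

```python
# Toy sketch of the AVLMaps idea: fuse per-observation features into a 3D voxel grid
# and retrieve a goal location by similarity to a query embedding. The random features
# and the averaging fusion are assumptions, not the paper's method.
import numpy as np

class VoxelFeatureMap:
    def __init__(self, voxel_size=0.5):
        self.voxel_size = voxel_size
        self.features = {}                       # voxel index -> fused feature vector

    def _key(self, xyz):
        return tuple(np.floor(np.asarray(xyz) / self.voxel_size).astype(int))

    def insert(self, xyz, feature):
        key = self._key(xyz)
        old = self.features.get(key)
        # Simple running average as the fusion rule (the paper's fusion may differ).
        self.features[key] = feature if old is None else 0.5 * (old + feature)

    def query(self, query_feature):
        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
        best = max(self.features.items(), key=lambda kv: cosine(kv[1], query_feature))
        return np.array(best[0]) * self.voxel_size   # voxel corner in metres

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    amap = VoxelFeatureMap()
    for _ in range(20):                              # fake observations
        amap.insert(rng.uniform(0, 5, size=3), rng.normal(size=16))
    print("goal location:", amap.query(rng.normal(size=16)))
```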
- GNM: A General Navigation Model to Drive Any Robot [67.40225397212717]
A general goal-conditioned model for vision-based navigation can be trained on data obtained from many distinct but structurally similar robots.
We analyze the necessary design decisions for effective data sharing across robots.
We deploy the trained GNM on a range of new robots, including an underactuated quadrotor.
arXiv Detail & Related papers (2022-10-07T07:26:41Z)
- LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action [76.71101507291473]
We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories.
We show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data.
arXiv Detail & Related papers (2022-07-10T10:41:50Z)
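The LM-Nav entry above composes three pre-trained models without fine-tuning: a language model extracts landmarks, an image-language model grounds them in a topological graph, and a navigation policy executes the route. The sketch below mirrors that composition with stub functions in place of GPT-3, CLIP, and ViNG.

```python
# Sketch of LM-Nav-style model composition: LLM -> landmark phrases, VLM -> landmark
# grounding in a topological graph, navigation policy -> execution. Both model calls
# below are stubs; the real system uses GPT-3, CLIP, and ViNG.

def extract_landmarks(instruction):
    """Placeholder for the LLM step (landmark phrase extraction)."""
    return ["stop sign", "blue mailbox", "picnic table"]

def clip_score(landmark, node_image):
    """Placeholder for image-text similarity; here just a keyword match."""
    return 1.0 if landmark.split()[-1] in node_image else 0.0

def plan_route(instruction, graph_nodes):
    """graph_nodes: mapping node_id -> short image description (stand-in for a photo)."""
    route = []
    for landmark in extract_landmarks(instruction):
        best = max(graph_nodes, key=lambda n: clip_score(landmark, graph_nodes[n]))
        route.append((landmark, best))
    return route                                # handed to the navigation policy

nodes = {0: "road with stop sign", 1: "mailbox on a lawn", 2: "park picnic table"}
print(plan_route("go past the stop sign and the mailbox to the picnic table", nodes))
```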
- Socially Compliant Navigation Dataset (SCAND): A Large-Scale Dataset of Demonstrations for Social Navigation [92.66286342108934]
Social navigation is the capability of an autonomous agent, such as a robot, to navigate in a 'socially compliant' manner in the presence of other intelligent agents such as humans.
Our dataset contains 8.7 hours, 138 trajectories, and 25 miles of socially compliant, human-teleoperated driving demonstrations.
arXiv Detail & Related papers (2022-03-28T19:09:11Z)
- Decentralized Global Connectivity Maintenance for Multi-Robot Navigation: A Reinforcement Learning Approach [12.649986200029717]
This work investigates how to navigate a multi-robot team in unknown environments while maintaining connectivity.
We propose a reinforcement learning approach to develop a decentralized policy, which is shared among multiple robots.
We validate the effectiveness of the proposed approach by comparing different combinations of connectivity constraints and behavior cloning.
arXiv Detail & Related papers (2021-09-17T13:20:19Z)
- Collaborative Visual Navigation [69.20264563368762]
We propose a large-scale 3D dataset, CollaVN, for multi-agent visual navigation (MAVN).
Diverse MAVN variants are explored to make our problem more general.
A memory-augmented communication framework is proposed. Each agent is equipped with a private, external memory to persistently store communication information.
arXiv Detail & Related papers (2021-07-02T15:48:16Z)