LOVON: Legged Open-Vocabulary Object Navigator
- URL: http://arxiv.org/abs/2507.06747v1
- Date: Wed, 09 Jul 2025 11:02:46 GMT
- Title: LOVON: Legged Open-Vocabulary Object Navigator
- Authors: Daojie Peng, Jiahang Cao, Qiang Zhang, Jun Ma
- Abstract summary: We propose a novel framework that integrates large language models for hierarchical task planning with open-vocabulary visual detection models. To tackle real-world challenges including visual jittering, blind zones, and temporary target loss, we design dedicated solutions. We also develop a functional execution logic for the robot that guarantees LOVON's capabilities in autonomous navigation, task adaptation, and robust task completion.
- Score: 9.600429521100041
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object navigation in open-world environments remains a formidable and pervasive challenge for robotic systems, particularly when it comes to executing long-horizon tasks that require both open-world object detection and high-level task planning. Traditional methods often struggle to integrate these components effectively, and this limits their capability to deal with complex, long-range navigation missions. In this paper, we propose LOVON, a novel framework that integrates large language models (LLMs) for hierarchical task planning with open-vocabulary visual detection models, tailored for effective long-range object navigation in dynamic, unstructured environments. To tackle real-world challenges including visual jittering, blind zones, and temporary target loss, we design dedicated solutions such as Laplacian Variance Filtering for visual stabilization. We also develop a functional execution logic for the robot that guarantees LOVON's capabilities in autonomous navigation, task adaptation, and robust task completion. Extensive evaluations demonstrate the successful completion of long-sequence tasks involving real-time detection, search, and navigation toward open-vocabulary dynamic targets. Furthermore, real-world experiments across different legged robots (Unitree Go2, B2, and H1-2) showcase the compatibility and appealing plug-and-play feature of LOVON.
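The abstract names Laplacian Variance Filtering as the mechanism for visual stabilization but gives no implementation detail. For reference, the sketch below shows how a Laplacian-variance blur filter is commonly implemented with OpenCV; the `JitterFilter` class, the threshold value, and the fall-back-to-last-sharp-frame behavior are illustrative assumptions, not details taken from the LOVON paper.

```python
# Minimal sketch of a Laplacian-variance frame filter for visual stabilization.
# The threshold and the reuse of the last sharp frame are illustrative
# assumptions, not details from the LOVON paper.
from typing import Optional

import cv2
import numpy as np


def laplacian_variance(frame: np.ndarray) -> float:
    """Variance of the Laplacian: low values indicate a blurred or jittery frame."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()


class JitterFilter:
    """Suppress blurred frames so the detector only sees sharp observations."""

    def __init__(self, blur_threshold: float = 100.0):  # threshold is a guess
        self.blur_threshold = blur_threshold
        self.last_sharp_frame: Optional[np.ndarray] = None

    def __call__(self, frame: np.ndarray) -> Optional[np.ndarray]:
        if laplacian_variance(frame) >= self.blur_threshold:
            self.last_sharp_frame = frame
            return frame
        # Frame is too blurry (e.g. during fast gait phases): fall back to the
        # most recent sharp frame, or return None if none has been seen yet.
        return self.last_sharp_frame
```

Under this reading, only frames that pass the filter would be forwarded to the open-vocabulary detector, which matches the abstract's stated goal of suppressing visual jittering during locomotion.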
Related papers
- Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System [7.266794815157721]
We propose a hierarchical framework integrating a prompted Large Language Model (LLM) and a fine-tuned Vision Language Model (VLM). The LLM decomposes tasks and constructs a global semantic map, while the VLM extracts task-specified semantic labels and 2D spatial information from aerial images to support local planning. This is the first demonstration of an aerial-ground heterogeneous system integrating VLM-based perception with LLM-driven task reasoning and motion planning.
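The summary only sketches the LLM/VLM division of labor at a high level. Below is a minimal, hypothetical illustration of what LLM-driven task decomposition could look like in code; the prompt wording, the subtask schema, and the `llm` callable are all assumptions rather than the authors' actual interface.

```python
# Hypothetical sketch of LLM-driven task decomposition as described in the
# summary. The prompt text and subtask fields are assumptions; `llm` stands in
# for whatever chat-completion backend the system actually uses.
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Subtask:
    action: str  # e.g. "navigate", "search", "inspect"
    target: str  # open-vocabulary object or region label
    agent: str   # "aerial" or "ground" in an aerial-ground team


def decompose(instruction: str, llm: Callable[[str], str]) -> List[Subtask]:
    """Ask the LLM for a JSON plan, then parse it into typed subtasks."""
    prompt = (
        "Decompose the following instruction into an ordered JSON list of "
        'subtasks, each with "action", "target", and "agent" fields.\n'
        f"Instruction: {instruction}"
    )
    plan = json.loads(llm(prompt))
    return [Subtask(**step) for step in plan]
```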
arXiv Detail & Related papers (2025-06-05T13:27:41Z)
- ATLASv2: LLM-Guided Adaptive Landmark Acquisition and Navigation on the Edge [0.5243460995467893]
ATLASv2 is a novel system that integrates a fine-tuned TinyLLM, real-time object detection, and efficient path planning. We evaluate ATLASv2 in real-world environments, including a handcrafted home and office setting constructed with diverse objects and landmarks. Results show that ATLASv2 effectively interprets natural language instructions, decomposes them into low-level actions, and executes tasks with high success rates.
arXiv Detail & Related papers (2025-04-15T00:55:57Z)
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks specified by human commands is a long-standing goal of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e., how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z)
- Cognitive Planning for Object Goal Navigation using Generative AI Models [0.979851640406258]
We present a novel framework for solving the object goal navigation problem that generates efficient exploration strategies.
Our approach enables a robot to navigate unfamiliar environments by leveraging Large Language Models (LLMs) and Large Vision-Language Models (LVLMs).
arXiv Detail & Related papers (2024-03-30T10:54:59Z)
- Generalizable Long-Horizon Manipulations with Large Language Models [91.740084601715]
This work introduces a framework harnessing the capabilities of Large Language Models (LLMs) to generate primitive task conditions for generalizable long-horizon manipulations.
We create a challenging robotic manipulation task suite based on PyBullet for long-horizon task evaluation.
arXiv Detail & Related papers (2023-10-03T17:59:46Z)
- Learning Hierarchical Interactive Multi-Object Search for Mobile Manipulation [10.21450780640562]
We introduce a novel interactive multi-object search task in which a robot has to open doors to navigate rooms and search inside cabinets and drawers to find target objects.
These new challenges require combining manipulation and navigation skills in unexplored environments.
We present HIMOS, a hierarchical reinforcement learning approach that learns to compose exploration, navigation, and manipulation skills.
arXiv Detail & Related papers (2023-07-12T12:25:33Z)
- Long-HOT: A Modular Hierarchical Approach for Long-Horizon Object Transport [83.06265788137443]
We address key challenges in long-horizon embodied exploration and navigation by proposing a new object transport task and a novel modular framework for temporally extended navigation.
Our first contribution is the design of a novel Long-HOT environment focused on deep exploration and long-horizon planning.
We propose a modular hierarchical transport policy (HTP) that builds a topological graph of the scene to perform exploration with the help of weighted frontiers.
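The summary mentions exploration driven by weighted frontiers on a topological graph without specifying the weighting. A minimal sketch of one common frontier-scoring heuristic is shown below; the gain-versus-cost formula and its weights are assumptions for illustration, not the HTP policy itself.

```python
# Minimal sketch of weighted-frontier selection on a topological scene graph.
# The scoring formula (information gain minus a travel-cost penalty) and the
# weight values are illustrative assumptions, not the paper's exact policy.
from dataclasses import dataclass
from typing import List


@dataclass
class Frontier:
    node_id: int            # node in the topological scene graph
    unexplored_area: float  # estimated unseen area reachable from this node
    path_cost: float        # travel cost from the robot's current node


def select_frontier(frontiers: List[Frontier],
                    gain_weight: float = 1.0,
                    cost_weight: float = 0.5) -> Frontier:
    """Pick the frontier with the best gain-versus-cost trade-off."""
    return max(frontiers,
               key=lambda f: gain_weight * f.unexplored_area
               - cost_weight * f.path_cost)
```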
arXiv Detail & Related papers (2022-10-28T05:30:49Z)
- Embodied BERT: A Transformer Model for Embodied, Language-guided Visual Task Completion [69.04196388421649]
We present Embodied BERT (EmBERT), a transformer-based model which can attend to high-dimensional, multi-modal inputs across long temporal horizons for language-conditioned task completion.
We achieve competitive performance on the ALFRED benchmark, and EmBERT marks the first transformer-based model to successfully handle the long-horizon, dense, multi-modal histories of ALFRED.
arXiv Detail & Related papers (2021-08-10T21:24:05Z)
- Simultaneous Navigation and Construction Benchmarking Environments [73.0706832393065]
We need intelligent robots for mobile construction, the process of navigating in an environment and modifying its structure according to a geometric design.
In this task, a major robot vision and learning challenge is how to realize the design exactly without GPS.
We benchmark the performance of a handcrafted policy with basic localization and planning, and state-of-the-art deep reinforcement learning methods.
arXiv Detail & Related papers (2021-03-31T00:05:54Z)
- Modeling Long-horizon Tasks as Sequential Interaction Landscapes [75.5824586200507]
We present a deep learning network that learns dependencies and transitions across subtasks solely from a set of demonstration videos.
We show that these subtask transitions can be represented by symbols that are learned and predicted directly from image observations.
We evaluate our framework on two long horizon tasks: (1) block stacking of puzzle pieces being executed by humans, and (2) a robot manipulation task involving pick and place of objects and sliding a cabinet door with a 7-DoF robot arm.
arXiv Detail & Related papers (2020-06-08T18:07:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.