AirHunt: Bridging VLM Semantics and Continuous Planning for Efficient Aerial Object Navigation
- URL: http://arxiv.org/abs/2601.12742v1
- Date: Mon, 19 Jan 2026 05:50:03 GMT
- Title: AirHunt: Bridging VLM Semantics and Continuous Planning for Efficient Aerial Object Navigation
- Authors: Xuecheng Chen, Zongzhuo Liu, Jianfa Ma, Bang Du, Tiantian Zhang, Xueqian Wang, Boyu Zhou,
- Abstract summary: AirHunt is an aerial object navigation system that efficiently locates open-set objects with zero-shot generalization in outdoor environments.<n>AirHunt features a dual-pathway asynchronous architecture that establishes a synergistic interface between VLM semantic reasoning and path planning.<n>We evaluate AirHunt across diverse object navigation tasks and environments, demonstrating a higher success rate with lower navigation error and reduced flight time.
- Score: 13.973823761671673
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large Vision-Language Models (VLMs) have provided rich semantic understanding that empowers drones to search for open-set objects via natural language instructions. However, prior systems struggle to integrate VLMs into practical aerial systems due to orders-of-magnitude frequency mismatch between VLM inference and real-time planning, as well as VLMs' limited 3D scene understanding. They also lack a unified mechanism to balance semantic guidance with motion efficiency in large-scale environments. To address these challenges, we present AirHunt, an aerial object navigation system that efficiently locates open-set objects with zero-shot generalization in outdoor environments by seamlessly fusing VLM semantic reasoning with continuous path planning. AirHunt features a dual-pathway asynchronous architecture that establishes a synergistic interface between VLM reasoning and path planning, enabling continuous flight with adaptive semantic guidance that evolves through motion. Moreover, we propose an active dual-task reasoning module that exploits geometric and semantic redundancy to enable selective VLM querying, and a semantic-geometric coherent planning module that dynamically reconciles semantic priorities and motion efficiency in a unified framework, enabling seamless adaptation to environmental heterogeneity. We evaluate AirHunt across diverse object navigation tasks and environments, demonstrating a higher success rate with lower navigation error and reduced flight time compared to state-of-the-art methods. Real-world experiments further validate AirHunt's practical capability in complex and challenging environments. Code and dataset will be made publicly available before publication.
Related papers
- OpenFrontier: General Navigation with Visual-Language Grounded Frontiers [54.661157616245966]
Open-world navigation requires robots to make decisions in complex everyday environments.<n>Recent advances in vision--language navigation (VLN) and vision--language--action (VLA) models enable end-to-end policies conditioned on natural language.<n>We propose OpenFrontier, a training-free navigation framework that seamlessly integrates diverse vision--language prior models.
arXiv Detail & Related papers (2026-03-05T17:02:22Z) - SGDrive: Scene-to-Goal Hierarchical World Cognition for Autonomous Driving [52.02379432801349]
We propose SGDrive, a novel framework that structures the VLM's representation learning around driving-specific knowledge hierarchies.<n>Built upon a pre-trained VLM backbone, SGDrive decomposes driving understanding into a scene-agent-goal hierarchy that mirrors human driving cognition.
arXiv Detail & Related papers (2026-01-09T08:55:42Z) - Aerial Vision-Language Navigation with a Unified Framework for Spatial, Temporal and Embodied Reasoning [5.517595398768408]
We present a unified aerial VLN framework that operates solely on ego monocular RGB observations and natural language instructions.<n>This task holds promise for real-world applications such as low-altitude inspection, search-and-rescue, and autonomous aerial delivery.
arXiv Detail & Related papers (2025-12-09T14:25:24Z) - Hierarchical Task Offloading and Trajectory Optimization in Low-Altitude Intelligent Networks Via Auction and Diffusion-based MARL [37.79695337425523]
Low-altitude intelligent networks (LAINs) can support mission-critical applications such as disaster response, environmental monitoring, and real-time sensing.<n>These systems face key challenges, including energy-constrained UAVs, task arrivals, and heterogeneous computing resources.<n>We propose an integrated air-ground collaborative network and formulate a time-dependent integer nonlinear programming problem that jointly optimize UAV trajectory planning and task offloading decisions.
arXiv Detail & Related papers (2025-12-05T08:14:45Z) - AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios [64.51320327698231]
We introduce AerialMind, the first large-scale RMOT benchmark in UAV scenarios.<n>We develop an innovative semi-automated collaborative agent-based labeling assistant framework.<n>We also propose HawkEyeTrack, a novel method that collaboratively enhances vision-language representation learning.
arXiv Detail & Related papers (2025-11-26T04:44:27Z) - Executable Analytic Concepts as the Missing Link Between VLM Insight and Precise Manipulation [70.8381970762877]
Vision-Language Models (VLMs) have demonstrated remarkable capabilities in semantic reasoning and task planning.<n>We introduce GRACE, a novel framework that grounds VLM-based reasoning through executable analytic concepts.<n>G GRACE provides a unified and interpretable interface between high-level instruction understanding and low-level robot control.
arXiv Detail & Related papers (2025-10-09T09:08:33Z) - UAV-ON: A Benchmark for Open-World Object Goal Navigation with Aerial Agents [17.86691411018085]
UAV-ON is a benchmark for large-scale Object Goal Navigation (NavObject) by aerial agents in open-world environments.<n>It comprises 14 high-fidelity Unreal Engine environments with diverse semantic regions and complex spatial layouts.<n>It defines 1270 annotated target objects, each characterized by an instance-level instruction that encodes category, physical footprint, and visual descriptors.
arXiv Detail & Related papers (2025-08-01T03:23:06Z) - Hierarchical Language Models for Semantic Navigation and Manipulation in an Aerial-Ground Robotic System [8.88014241557266]
Heterogeneous multirobot systems show great potential in complex tasks requiring coordinated hybrid cooperation.<n>Existing methods that rely on static or task-specific models often lack generalizability across diverse tasks and dynamic environments.<n>We propose a hierarchical multimodal framework that integrates a prompted large language model (LLM) with a fine-tuned vision-language model (VLM)
arXiv Detail & Related papers (2025-06-05T13:27:41Z) - Adaptive Interactive Navigation of Quadruped Robots using Large Language Models [14.14967096139099]
We present a primitive tree for task planning with large language models (LLMs)<n>We adopt reinforcement learning to pre-train a comprehensive skill library containing versatile locomotion and interaction behaviors for motion planning.<n> integrated with the tree structure, the replanning mechanism allows for convenient node addition and pruning.
arXiv Detail & Related papers (2025-03-29T02:17:52Z) - Navigating Motion Agents in Dynamic and Cluttered Environments through LLM Reasoning [69.5875073447454]
This paper advances motion agents empowered by large language models (LLMs) toward autonomous navigation in dynamic and cluttered environments.<n>Our training-free framework supports multi-agent coordination, closed-loop replanning, and dynamic obstacle avoidance without retraining or fine-tuning.
arXiv Detail & Related papers (2025-03-10T13:39:09Z) - Large-scale Autonomous Flight with Real-time Semantic SLAM under Dense
Forest Canopy [48.51396198176273]
We propose an integrated system that can perform large-scale autonomous flights and real-time semantic mapping in challenging under-canopy environments.
We detect and model tree trunks and ground planes from LiDAR data, which are associated across scans and used to constrain robot poses as well as tree trunk models.
A drift-compensation mechanism is designed to minimize the odometry drift using semantic SLAM outputs in real time, while maintaining planner optimality and controller stability.
arXiv Detail & Related papers (2021-09-14T07:24:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.