Fly0: Decoupling Semantic Grounding from Geometric Planning for Zero-Shot Aerial Navigation
- URL: http://arxiv.org/abs/2602.15875v1
- Date: Mon, 02 Feb 2026 09:06:50 GMT
- Title: Fly0: Decoupling Semantic Grounding from Geometric Planning for Zero-Shot Aerial Navigation
- Authors: Zhenxing Xu, Brikit Lu, Weidong Bao, Zhengqiu Zhu, Junsong Zhang, Hui Yan, Wenhao Lu, Ji Wang,
- Abstract summary: Current Visual-Language Navigation (VLN) methodologies face a trade-off between semantic understanding and control precision. We propose Fly0, a framework that decouples semantic reasoning from geometric planning. Fly0 reduces computational overhead and improves system stability.
- Score: 14.466092698477858
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current Visual-Language Navigation (VLN) methodologies face a trade-off between semantic understanding and control precision. While Multimodal Large Language Models (MLLMs) offer superior reasoning, deploying them as low-level controllers leads to high latency, trajectory oscillations, and poor generalization due to weak geometric grounding. To address these limitations, we propose Fly0, a framework that decouples semantic reasoning from geometric planning. The proposed method operates through a three-stage pipeline: (1) an MLLM-driven module for grounding natural language instructions into 2D pixel coordinates; (2) a geometric projection module that utilizes depth data to localize targets in 3D space; and (3) a geometric planner that generates collision-free trajectories. This mechanism enables robust navigation even when visual contact is lost. By eliminating the need for continuous inference, Fly0 reduces computational overhead and improves system stability. Extensive experiments in simulation and real-world environments demonstrate that Fly0 outperforms state-of-the-art baselines, improving the Success Rate by over 20% and reducing Navigation Error (NE) by approximately 50% in unstructured environments. Our code is available at https://github.com/xuzhenxing1/Fly0.
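The decoupling described in the abstract can be sketched in a few lines. Stage 2 (the geometric projection module) amounts to back-projecting the MLLM-grounded pixel, together with its depth reading, into a 3D target via a standard pinhole camera model. The intrinsics and function name below are illustrative assumptions, not the paper's actual API:

```python
import numpy as np

def pixel_to_camera_frame(u, v, depth, fx, fy, cx, cy):
    """Back-project a grounded pixel (u, v) with metric depth into a 3D
    point in the camera frame (pinhole model). This corresponds to stage
    2 of the pipeline: the MLLM supplies (u, v) once, and the geometric
    planner (stage 3) then works purely on the resulting 3D target,
    without further MLLM inference.
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.array([x, y, depth])
```

Because the 3D target persists after this projection, the planner can keep tracking it even when the object leaves the camera's field of view, which is how the framework avoids continuous MLLM inference.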
Related papers
- LoGoPlanner: Localization Grounded Navigation Policy with Metric-aware Visual Geometry [41.054069737969876]
Trajectory planning in unstructured environments is a fundamental and challenging capability for mobile robots. We introduce LoGoPlanner, a localization-grounded, end-to-end navigation framework. We evaluate LoGoPlanner in both simulation and real-world settings, where its fully end-to-end design reduces cumulative error.
arXiv Detail & Related papers (2025-12-22T18:03:08Z) - D3D-VLP: Dynamic 3D Vision-Language-Planning Model for Embodied Grounding and Navigation [66.7166217399105]
Embodied agents face a critical dilemma: end-to-end models lack interpretability and explicit 3D reasoning. Our model introduces two key innovations: 1) a Dynamic 3D Chain-of-Thought (3D CoT) that unifies planning, grounding, navigation, and question answering within a single 3D-VLM and CoT pipeline; 2) a Synergistic Learning from Fragmented Supervision (SLFS) strategy, which uses a masked autoregressive loss to learn from massive, partially annotated hybrid data.
arXiv Detail & Related papers (2025-12-14T09:53:15Z) - Let Language Constrain Geometry: Vision-Language Models as Semantic and Spatial Critics for 3D Generation [34.44214123004662]
We propose VLM3D, a framework for differentiable semantic and spatial critics. Our core contribution is a dual-language critic signal derived from the VLM's yes/no log-odds. VLM3D establishes a principled and general path to inject the VLM's rich, language-grounded understanding of both semantics and space into diverse 3D generative pipelines.
arXiv Detail & Related papers (2025-11-18T09:05:26Z) - Beyond Flatlands: Unlocking Spatial Intelligence by Decoupling 3D Reasoning from Numerical Regression [12.590536117486257]
Existing Vision Language Models (VLMs) struggle to comprehend real-world 3D spatial intelligence. GEODE augments the main VLM with two specialized, plug-and-play modules. The synergy of these modules allows our 1.5B-parameter model to function as a high-level semantic dispatcher.
arXiv Detail & Related papers (2025-11-14T12:42:07Z) - VLN-Zero: Rapid Exploration and Cache-Enabled Neurosymbolic Vision-Language Planning for Zero-Shot Transfer in Robot Navigation [52.00474922315126]
We present VLN-Zero, a vision-language navigation framework for unseen environments. We use vision-language models to efficiently construct symbolic scene graphs and enable zero-shot neurosymbolic navigation. VLN-Zero achieves a 2x higher success rate compared to state-of-the-art zero-shot models, outperforms most fine-tuned baselines, and reaches goal locations in half the time.
arXiv Detail & Related papers (2025-09-23T03:23:03Z) - TANGO: Traversability-Aware Navigation with Local Metric Control for Topological Goals [10.69725316052444]
We present a novel RGB-only, object-level topometric navigation pipeline that enables zero-shot, long-horizon robot navigation. Our approach integrates global topological path planning with local metric trajectory control, allowing the robot to navigate towards object-level sub-goals while avoiding obstacles. We demonstrate the effectiveness of our method in both simulated environments and real-world tests, highlighting its robustness and deployability.
arXiv Detail & Related papers (2025-09-10T15:43:32Z) - DAgger Diffusion Navigation: DAgger Boosted Diffusion Policy for Vision-Language Navigation [73.80968452950854]
Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural language instructions through free-form 3D spaces. Existing VLN-CE approaches typically use a two-stage waypoint planning framework. We propose DAgger Diffusion Navigation (DifNav) as an end-to-end optimized VLN-CE policy.
arXiv Detail & Related papers (2025-08-13T02:51:43Z) - Affordances-Oriented Planning using Foundation Models for Continuous Vision-Language Navigation [64.84996994779443]
We propose a novel Affordances-Oriented Planner for continuous vision-language navigation (VLN) task.
Our AO-Planner integrates various foundation models to achieve affordances-oriented low-level motion planning and high-level decision-making.
Experiments on the challenging R2R-CE and RxR-CE datasets show that AO-Planner achieves state-of-the-art zero-shot performance.
arXiv Detail & Related papers (2024-07-08T12:52:46Z) - ParaPoint: Learning Global Free-Boundary Surface Parameterization of 3D Point Clouds [52.03819676074455]
ParaPoint is an unsupervised neural learning pipeline for achieving global free-boundary surface parameterization.
This work makes the first attempt to investigate neural point cloud parameterization that pursues both global mappings and free boundaries.
arXiv Detail & Related papers (2024-03-15T14:35:05Z) - Learning Forward Dynamics Model and Informed Trajectory Sampler for Safe Quadruped Navigation [1.2783783498844021]
A typical SOTA system is composed of four main modules -- mapper, global planner, local planner, and command-tracking controller.
We build a robust and safe local planner which is designed to generate a velocity plan to track a coarsely planned path from the global planner.
Using our framework, a quadruped robot can autonomously navigate in various complex environments without a collision and generate a smoother command plan compared to the baseline method.
arXiv Detail & Related papers (2022-04-19T04:01:44Z) - IDEA-Net: Dynamic 3D Point Cloud Interpolation via Deep Embedding Alignment [58.8330387551499]
We formulate the problem as estimation of point-wise trajectories (i.e., smooth curves)
We propose IDEA-Net, an end-to-end deep learning framework, which disentangles the problem under the assistance of the explicitly learned temporal consistency.
We demonstrate the effectiveness of our method on various point cloud sequences and observe large improvement over state-of-the-art methods both quantitatively and visually.
arXiv Detail & Related papers (2022-03-22T10:14:08Z) - Cylindrical and Asymmetrical 3D Convolution Networks for LiDAR Segmentation [81.02742110604161]
State-of-the-art methods for large-scale driving-scene LiDAR segmentation often project the point clouds to 2D space and then process them via 2D convolution.
We propose a new framework for outdoor LiDAR segmentation, where cylindrical partition and asymmetrical 3D convolution networks are designed to explore the 3D geometric pattern.
Our method achieves the 1st place in the leaderboard of Semantic KITTI and outperforms existing methods on nuScenes with a noticeable margin, about 4%.
arXiv Detail & Related papers (2020-11-19T18:53:11Z)
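The cylindrical partition in the entry above can be illustrated with a minimal sketch: Cartesian LiDAR points are binned in (radius, azimuth, height) rather than on a Cartesian grid, which better matches the angular scan pattern of a spinning LiDAR. The bin resolutions and function name here are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def cylindrical_voxel_index(points, rho_res=0.5, theta_bins=360, z_res=0.2):
    """Map Cartesian points (N, 3) to cylindrical voxel indices.

    Unlike a uniform Cartesian grid, binning by radius and azimuth keeps
    the point density per voxel roughly comparable across range for a
    spinning LiDAR, which is the geometric intuition behind the
    cylindrical partition.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)        # radial distance from sensor
    theta = np.arctan2(y, x)              # azimuth angle in (-pi, pi]
    rho_idx = np.floor(rho / rho_res).astype(int)
    theta_idx = np.floor((theta + np.pi) / (2 * np.pi) * theta_bins).astype(int) % theta_bins
    z_idx = np.floor(z / z_res).astype(int)
    return np.stack([rho_idx, theta_idx, z_idx], axis=1)
```

A per-voxel feature extractor (e.g. the asymmetrical 3D convolutions the paper describes) would then operate on features scattered into this cylindrical grid.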
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.