Fugu-MT Paper Translation (Abstract): OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms

Paper overview: OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms

arxiv url: http://arxiv.org/abs/2603.17351v1
Date: Wed, 18 Mar 2026 04:26:30 GMT
Status: Translation complete
System update date: 2026-03-21 18:33:56.939169
Title: OmniVLN: Omnidirectional 3D Perception and Token-Efficient LLM Reasoning for Visual-Language Navigation across Air and Ground Platforms
Authors: Zhongyuang Liu, Min He, Shaonan Yu, Xinhang Xu, Muqing Cao, Jianping Li, Jianfei Yang, Lihua Xie
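Abstract summary: Language-guided embodied navigation requires an agent to interpret object-referential instructions, search across multiple rooms, localize the referenced target, and execute reliable motion toward it. OmniVLN is a zero-shot visual-language navigation framework that couples omnidirectional 3D perception with token-efficient hierarchical reasoning for both aerial and ground robots. Experiments show that the proposed hierarchical interface improves spatial referring accuracy from 77.27% to 93.18%, reduces cumulative prompt tokens by up to 61.7% in cluttered multi-room settings, and improves navigation success by up to 11.68% over a flat-list baseline.
Reference score (attention score computed independently by this site): 33.40889181799252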
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Language-guided embodied navigation requires an agent to interpret object-referential instructions, search across multiple rooms, localize the referenced target, and execute reliable motion toward it. Existing systems remain limited in real indoor environments because narrow field-of-view sensing exposes only a partial local scene at each step, often forcing repeated rotations, delaying target discovery, and producing fragmented spatial understanding; meanwhile, directly prompting LLMs with dense 3D maps or exhaustive object lists quickly exceeds the context budget. We present OmniVLN, a zero-shot visual-language navigation framework that couples omnidirectional 3D perception with token-efficient hierarchical reasoning for both aerial and ground robots. OmniVLN fuses a rotating LiDAR and panoramic vision into a hardware-agnostic mapping stack, incrementally constructs a five-layer Dynamic Scene Graph (DSG) from mesh geometry to room- and building-level structure, and stabilizes high-level topology through persistent-homology-based room partitioning and hybrid geometric/VLM relation verification. For navigation, the global DSG is transformed into an agent-centric 3D octant representation with multi-resolution spatial attention prompting, enabling the LLM to progressively filter candidate rooms, infer egocentric orientation, localize target objects, and emit executable navigation primitives while preserving fine local detail and compact long-range memory. Experiments show that the proposed hierarchical interface improves spatial referring accuracy from 77.27% to 93.18%, reduces cumulative prompt tokens by up to 61.7% in cluttered multi-room settings, and improves navigation success by up to 11.68% over a flat-list baseline. We will release the code and an omnidirectional multimodal dataset to support reproducible research.
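Illustrative note: the abstract mentions an agent-centric 3D octant representation but gives no construction details. The Python sketch below shows only one plausible reading of that idea: moving object positions into the agent's frame and binning them by the signs of their coordinates. The function names, the sign-bit encoding, and the yaw-only pose are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import defaultdict


def to_agent_frame(point, agent_xyz, agent_yaw):
    """Translate a world-frame point to the agent and rotate by -yaw about z.

    Assumes a yaw-only pose (hypothetical simplification): x is forward,
    y is left, z is up in the agent frame.
    """
    dx = point[0] - agent_xyz[0]
    dy = point[1] - agent_xyz[1]
    dz = point[2] - agent_xyz[2]
    c, s = math.cos(-agent_yaw), math.sin(-agent_yaw)
    return (c * dx - s * dy, s * dx + c * dy, dz)


def octant_index(p):
    """Map an agent-frame point to 0..7 from sign bits.

    bit 0 = in front (x >= 0), bit 1 = to the left (y >= 0), bit 2 = above (z >= 0).
    This encoding is an assumption chosen for the sketch.
    """
    x, y, z = p
    return (1 if x >= 0 else 0) | ((1 if y >= 0 else 0) << 1) | ((1 if z >= 0 else 0) << 2)


def group_by_octant(objects, agent_xyz, agent_yaw):
    """Group named objects (name -> world xyz) by the octant they fall in."""
    groups = defaultdict(list)
    for name, pos in objects.items():
        local = to_agent_frame(pos, agent_xyz, agent_yaw)
        groups[octant_index(local)].append(name)
    return dict(groups)
```

Under these assumptions, a prompt could list a handful of octant summaries rather than every object with raw coordinates, which is the kind of compaction the abstract attributes to its hierarchical interface. The actual OmniVLN representation, including its multi-resolution attention, may differ substantially.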


Related papers