Fugu-MT Paper Translation (Abstract): Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision

Paper summary: Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision

arxiv url: http://arxiv.org/abs/2512.10956v1
Date: Thu, 11 Dec 2025 18:59:56 GMT
Status: Translation complete
Last updated in system: 2025-12-12 16:15:42.582902
Title: Empowering Dynamic Urban Navigation with Stereo and Mid-Level Vision
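Title (reference translation): Dynamic Urban Navigation with Stereo and Mid-Level Vision
Authors: Wentao Zhou, Xuweiyi Chen, Vignesh Rajagopal, Jeffrey Chen, Rohan Chandra, Zezhou Cheng
Abstract summary: We show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Mid-level vision enables StereoWalker to match state-of-the-art performance using only 1.5% of the training data and to surpass the state of the art using the full data.
Reference score (independently computed attention measure): 13.586199223564273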
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The success of foundation models in language and vision motivated research in fully end-to-end robot navigation foundation models (NFMs). NFMs directly map monocular visual input to control actions and ignore mid-level vision modules (tracking, depth estimation, etc.) entirely. While the assumption that vision capabilities will emerge implicitly is compelling, it requires large amounts of pixel-to-action supervision that are difficult to obtain. The challenge is especially pronounced in dynamic and unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity in monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We present StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is straightforward: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. We also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos to support training of StereoWalker and to facilitate future research. Through our experiments, we find that mid-level vision enables StereoWalker to achieve performance comparable to the state of the art using only 1.5% of the training data, and to surpass the state of the art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.
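Abstract (reference translation): The success of foundation models in language and vision has motivated research on fully end-to-end robot navigation foundation models (NFMs). NFMs map monocular visual input directly to control actions and completely ignore mid-level vision modules (tracking, depth estimation, and so on). The assumption that vision capabilities will emerge implicitly is persuasive, but it requires large amounts of pixel-to-action supervision that are hard to obtain. This challenge is especially pronounced in dynamic, unstructured settings, where robust navigation requires precise geometric and dynamic understanding, while the depth-scale ambiguity of monocular views further limits accurate spatial reasoning. In this paper, we show that relying on monocular vision and ignoring mid-level vision priors is inefficient. We propose StereoWalker, which augments NFMs with stereo inputs and explicit mid-level vision such as depth estimation and dense pixel tracking. Our intuition is simple: stereo inputs resolve the depth-scale ambiguity, and modern mid-level vision models provide reliable geometric and motion structure in dynamic scenes. To support the training of StereoWalker and future research, we also curate a large stereo navigation dataset with automatic action annotation from Internet stereo videos. Our experiments show that mid-level vision lets StereoWalker reach performance comparable to the state of the art using only 1.5% of the training data, and surpass the state of the art using the full data. We also observe that stereo vision yields higher navigation performance than monocular input.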
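The abstract's claim that stereo inputs resolve the depth-scale ambiguity can be illustrated with textbook rectified-stereo geometry. The sketch below is not StereoWalker's method or code; the focal length, baseline, and 3D points are hypothetical values chosen only for illustration. It shows two points at different depths that project to the same pixel in a single camera, yet produce different disparities in a stereo pair, from which metric depth follows as Z = f * B / d.

```python
def project(point_xyz, focal_px):
    """Pinhole projection: camera-frame point -> pixel offsets from the principal point."""
    x, y, z = point_xyz
    return (focal_px * x / z, focal_px * y / z)


def disparity(point_xyz, focal_px, baseline_m):
    """Horizontal disparity (pixels) of a point seen by a rectified stereo pair."""
    return focal_px * baseline_m / point_xyz[2]


def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Metric depth Z = f * B / d for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


# Hypothetical camera parameters (assumptions, not taken from the paper).
focal_px = 700.0     # focal length in pixels
baseline_m = 0.125   # distance between the two cameras in metres

# Two points on the same viewing ray: the second is the first scaled by 2.
near = (0.5, 0.0, 2.0)
far = (1.0, 0.0, 4.0)

# Monocular view: both points land on the same pixel, so depth is ambiguous.
pixel_near = project(near, focal_px)
pixel_far = project(far, focal_px)

# Stereo view: the disparities differ, so metric depth can be recovered.
d_near = disparity(near, focal_px, baseline_m)
d_far = disparity(far, focal_px, baseline_m)
z_near = depth_from_disparity(focal_px, baseline_m, d_near)
z_far = depth_from_disparity(focal_px, baseline_m, d_far)
```

In practice the disparity would come from a stereo matching model rather than from known 3D points; the sketch only shows why a second calibrated view fixes the scale that a single view leaves undetermined.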

Related papers