Fugu-MT Paper Translation (Abstract): Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields

Paper abstract: Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields

arxiv url: http://arxiv.org/abs/2510.03104v1
Date: Fri, 03 Oct 2025 15:32:56 GMT
Status: Translation complete
System update date: 2025-10-06 16:35:52.456574
Title: Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields
Title (reference translation): Geometry Meets Vision: Revisiting Pretrained Semantics in Distilled Fields
Authors: Zhiting Mei, Ola Shorinwa, Anirudha Majumdar
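Abstract summary: This paper proposes a novel framework for inverting radiance fields without an initial guess, consisting of two core components: coarse inversion using distilled semantics, and fine inversion using photometric-based optimization. The results suggest that visual-only features offer greater versatility for a broader range of downstream tasks, although geometry-grounded features contain more geometric detail.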
Reference score (independently computed attention score): 11.251320289181338
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Semantic distillation in radiance fields has spurred significant advances in open-vocabulary robot policies, e.g., in manipulation and navigation, founded on pretrained semantics from large vision models. While prior work has demonstrated the effectiveness of visual-only semantic features (e.g., DINO and CLIP) in Gaussian Splatting and neural radiance fields, the potential benefit of geometry-grounding in distilled fields remains an open question. In principle, visual-geometry features seem very promising for spatial tasks such as pose estimation, prompting the question: Do geometry-grounded semantic features offer an edge in distilled fields? Specifically, we ask three critical questions: First, does spatial-grounding produce higher-fidelity geometry-aware semantic features? We find that image features from geometry-grounded backbones contain finer structural details compared to their counterparts. Secondly, does geometry-grounding improve semantic object localization? We observe no significant difference in this task. Thirdly, does geometry-grounding enable higher-accuracy radiance field inversion? Given the limitations of prior work and their lack of semantics integration, we propose a novel framework SPINE for inverting radiance fields without an initial guess, consisting of two core components: coarse inversion using distilled semantics, and fine inversion using photometric-based optimization. Surprisingly, we find that the pose estimation accuracy decreases with geometry-grounded features. Our results suggest that visual-only features offer greater versatility for a broader range of downstream tasks, although geometry-grounded features contain more geometric detail. Notably, our findings underscore the necessity of future research on effective strategies for geometry-grounding that augment the versatility and performance of pretrained semantic features.
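Abstract (reference translation): Semantic distillation in radiance fields has driven significant advances in open-vocabulary robot policies, for example in manipulation and navigation, built on pretrained semantics from large vision models. Prior work has demonstrated the effectiveness of visual-only semantic features (e.g., DINO and CLIP) in Gaussian Splatting and neural radiance fields, but the potential benefit of geometry-grounding in distilled fields remains an open question. In principle, visual-geometry features seem very promising for spatial tasks such as pose estimation. First, does spatial-grounding produce higher-fidelity geometry-aware semantic features? We find that image features from geometry-grounded backbones contain finer structural details than their counterparts. Second, does geometry-grounding improve semantic object localization? We observe no significant difference in this task. Third, does geometry-grounding enable higher-accuracy radiance field inversion? Given the limitations of prior work and its lack of semantics integration, we propose SPINE, a novel framework for inverting radiance fields without an initial guess, consisting of two core components: coarse inversion using distilled semantics, and fine inversion using photometric-based optimization. Surprisingly, we find that pose estimation accuracy decreases with geometry-grounded features. Our results suggest that visual-only features offer greater versatility for a broader range of downstream tasks, although geometry-grounded features contain more geometric detail. Notably, our findings underscore the need for future research on effective geometry-grounding strategies that improve the versatility and performance of pretrained semantic features.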

Related papers