Fugu-MT Paper Translation (Abstract): Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

Paper overview: Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation

arxiv url: http://arxiv.org/abs/2604.21713v1
Date: Thu, 23 Apr 2026 14:20:44 GMT
Status: Translation complete
Last updated in system: 2026-04-24 14:40:06.590675
Title: Unlocking the Power of Critical Factors for 3D Visual Geometry Estimation
Authors: Guangkai Xu, Hua Geng, Huanyi Zheng, Songyi Yin, Yanlong Sun, Hao Chen, Chunhua Shen
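Abstract summary: This paper investigates the critical factors that drive model performance through rigorous ablation studies. It introduces two enhancements that integrate the advantages of optimization-based methods and high-resolution inputs. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE, the paper's resolution-enhanced feed-forward model, achieves strong and robust performance across diverse benchmarks.
Reference score (independently computed attention metric): 43.14437643346991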
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Feed-forward visual geometry estimation has recently made rapid progress. However, an important gap remains: multi-frame models usually produce better cross-frame consistency, yet they often underperform strong per-frame methods on single-frame accuracy. This observation motivates our systematic investigation into the critical factors driving model performance through rigorous ablation studies, which reveals several key insights: 1) Scaling up data diversity and quality unlocks further performance gains even in state-of-the-art visual geometry estimation methods; 2) Commonly adopted confidence-aware loss and gradient-based loss mechanisms may unintentionally hinder performance; 3) Joint supervision through both per-sequence and per-frame alignment improves results, while local region alignment surprisingly degrades performance. Furthermore, we introduce two enhancements to integrate the advantages of optimization-based methods and high-resolution inputs: a consistency loss function that enforces alignment between depth maps, camera parameters, and point maps, and an efficient architectural design that leverages high-resolution information. We integrate these designs into CARVE, a resolution-enhanced model for feed-forward visual geometry estimation. Experiments on point cloud reconstruction, video depth estimation, and camera pose/intrinsic estimation show that CARVE achieves strong and robust performance across diverse benchmarks.
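The abstract does not give the form of the consistency loss, so the following is only a minimal sketch of one way depth maps, camera parameters, and point maps could be tied together. It assumes a pinhole camera, a camera-to-world pose matrix, and a plain L1 penalty. The function names `unproject_depth` and `consistency_loss` are hypothetical and not taken from the paper.

```python
import numpy as np


def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) to world-space points (H, W, 3).

    Assumes a pinhole intrinsic matrix K (3x3) and a 4x4 camera-to-world pose.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel column/row grids
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T       # camera-space rays with z = 1
    cam_points = rays * depth[..., None]     # scale each ray by its depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return cam_points @ R.T + t              # rotate and translate into world space


def consistency_loss(depth, K, cam_to_world, point_map):
    """Mean L1 gap between the unprojected depth and a predicted point map.

    This is an illustrative choice of penalty, not the paper's definition.
    """
    lifted = unproject_depth(depth, K, cam_to_world)
    return float(np.mean(np.abs(lifted - point_map)))
```

In this sketch the loss is zero when the point map equals the depth map lifted through the predicted intrinsics and pose, and it grows as the three predictions disagree.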
