Fugu-MT Paper Translation (Abstract): Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis

Paper summary: Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis

arxiv url: http://arxiv.org/abs/2512.11574v1
Date: Fri, 12 Dec 2025 14:03:16 GMT
Status: Translation complete
Last updated in system: 2025-12-15 15:48:11.791082
Title: Evaluating Foundation Models' 3D Understanding Through Multi-View Correspondence Analysis
Title (reference translation): Evaluating the 3D Understanding of Foundation Models via Multi-View Correspondence Analysis
Authors: Valentina Lilova, Toyesh Chakravorty, Julian I. Bibo, Emma Boccaletti, Brandon Li, Lívia Baxová, Cees G. M. Snoek, Mohammadreza Salehi
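Abstract summary: This paper proposes a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. The authors benchmark 8 state-of-the-art foundation models and show that DINO-based encoders remain competitive across large viewpoint shifts.
Reference score (independently computed attention metric): 38.10984626023432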
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Benchmarking 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream finetuning with linear heads or task-specific decoders, making it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. In this work, we introduce a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images from objects in specific angles (keys), we benchmark the performance of segmenting novel views (queries) and report the scores in 4 categories of easy, medium, hard, and extreme based on the key-query view contrast. We benchmark 8 state-of-the-art foundation models and show DINO-based encoders remain competitive across large viewpoint shifts, while 3D-aware models like VGGT require dedicated multi-view adjustments. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval .
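Abstract (reference translation): Benchmarking the 3D spatial understanding of foundation models is essential for real-world applications such as robotics and autonomous driving. Existing evaluations often rely on downstream finetuning with linear heads or task-specific decoders, which makes it difficult to isolate the intrinsic 3D reasoning ability of pretrained encoders. We therefore propose a novel benchmark for in-context 3D scene understanding that requires no finetuning and directly probes the quality of dense visual features. Building on the Hummingbird framework, which evaluates in-context 2D scene understanding, we extend the setup to the 3D Multi-View ImageNet (MVImgNet) dataset. Given a set of images of objects at specific angles (keys), we benchmark segmentation performance on novel views (queries) and report scores in four categories (easy, medium, hard, and extreme) based on the key-query contrast. We benchmark 8 state-of-the-art foundation models and show that DINO-based encoders remain competitive across large viewpoint shifts. Our code is publicly available at https://github.com/ToyeshC/open-hummingbird-3d-eval.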
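The abstract does not spell out how novel views are segmented without finetuning. As a rough illustration only, the sketch below assumes a Hummingbird-style setup in which each query patch takes its label from a similarity-weighted vote over its nearest key-view patch features, and the result is scored with mean IoU. The function names, the toy 2-D features, and the `k` and `temperature` values are all hypothetical and are not taken from the paper or its repository.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def transfer_labels(key_feats, key_labels, query_feats, k=3, temperature=0.1):
    """Label each query patch by a softmax-weighted vote of its k nearest key patches.

    Hypothetical helper: no encoder weights are updated, only frozen features
    from the key views are compared against features from the query views.
    """
    predictions = []
    for q in query_feats:
        sims = [(cosine(q, f), lbl) for f, lbl in zip(key_feats, key_labels)]
        top = sorted(sims, key=lambda s: s[0], reverse=True)[:k]
        votes = defaultdict(float)
        for sim, lbl in top:
            votes[lbl] += math.exp(sim / temperature)
        predictions.append(max(votes, key=votes.get))
    return predictions


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy stand-ins for patch features: label 0 = background, 1 = object.
key_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
key_labels = [0, 0, 1, 1]
query_feats = [[0.8, 0.2], [0.2, 0.8], [1.0, 0.1]]
query_labels = [0, 1, 0]

pred = transfer_labels(key_feats, key_labels, query_feats, k=3)
print(pred, mean_iou(pred, query_labels, num_classes=2))
```

In a setup like this, the easy, medium, hard, and extreme categories would correspond to running the same procedure while the query views move progressively further from the key views; the thresholds that separate the categories are not given in the abstract.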

Related papers