Fugu-MT Paper Translation (Summary): SegVGGT: Joint 3D Reconstruction and Instance Segmentation from Multi-View Images

Paper summary: SegVGGT: Joint 3D Reconstruction and Instance Segmentation from Multi-View Images

arxiv url: http://arxiv.org/abs/2603.19926v1
Date: Fri, 20 Mar 2026 13:10:46 GMT
Status: Translation complete
Last updated in system: 2026-03-23 19:48:39.14763
Title: SegVGGT: Joint 3D Reconstruction and Instance Segmentation from Multi-View Images
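Title (reference translation): SegVGGT: 3D Reconstruction and Instance Segmentation from Multi-View Images
Authors: Jinyuan Qu, Hongyang Li, Lei Zhang
Abstract summary: SegVGGT is a unified end-to-end framework that simultaneously performs feed-forward 3D reconstruction and instance segmentation. The method deeply integrates instance identification into the visual geometry grounded transformer. Experiments show that SegVGGT achieves state-of-the-art performance on ScanNetv2 and ScanNet200.
Reference score (independently computed attention metric): 11.617237358347777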
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: 3D instance segmentation methods typically rely on high-quality point clouds or posed RGB-D scans, requiring complex multi-stage processing pipelines, and are highly sensitive to reconstruction noise. While recent feed-forward transformers have revolutionized multi-view 3D reconstruction, they remain decoupled from high-level semantic understanding. In this work, we present SegVGGT, a unified end-to-end framework that simultaneously performs feed-forward 3D reconstruction and instance segmentation directly from multi-view RGB images. By introducing object queries that interact with multi-level geometric features, our method deeply integrates instance identification into the visual geometry grounded transformer. To address the severe attention dispersion problem caused by the massive number of global image tokens, we propose the Frame-level Attention Distribution Alignment (FADA) strategy. FADA explicitly guides object queries to attend to instance-relevant frames during training, providing structured supervision without extra inference overhead. Extensive experiments demonstrate that SegVGGT achieves state-of-the-art performance on ScanNetv2 and ScanNet200, outperforming both recent joint models and RGB-D-based approaches, while exhibiting strong generalization capabilities on ScanNet++.
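Abstract (reference translation): 3D instance segmentation methods typically rely on high-quality point clouds or RGB-D scans, require complex multi-stage processing pipelines, and are highly sensitive to reconstruction noise. Recent feed-forward transformers have revolutionized multi-view 3D reconstruction, but they remain decoupled from high-level semantic understanding. This paper proposes SegVGGT, a unified end-to-end framework that simultaneously performs feed-forward 3D reconstruction and instance segmentation directly from multi-view RGB images. By introducing object queries that interact with multi-level geometric features, the method deeply integrates instance identification into the visual geometry grounded transformer. To address the attention dispersion problem caused by the large number of global image tokens, the paper proposes the Frame-level Attention Distribution Alignment (FADA) strategy. FADA explicitly guides object queries toward instance-relevant frames during training and provides structured supervision without additional inference overhead. Extensive experiments show that SegVGGT achieves state-of-the-art performance on ScanNetv2 and ScanNet200, outperforming both recent joint models and RGB-D-based approaches, and exhibits strong generalization capability on ScanNet++.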
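
The abstract does not give a formula for FADA, so the following is only a minimal sketch of one plausible reading: sum each object query's token-level attention into per-frame masses and penalize their divergence from a target distribution over the frames where the instance is visible. The function names, the visible-pixel target, and the choice of KL divergence are all assumptions made for illustration, not the authors' implementation.

```python
import math

def frame_attention(attn, tokens_per_frame):
    """Sum one query's token-level attention weights into per-frame masses.

    Assumes `attn` is already normalized over all tokens and that every
    frame contributes the same number of tokens (hypothetical layout).
    """
    n_frames = len(attn) // tokens_per_frame
    return [
        sum(attn[f * tokens_per_frame:(f + 1) * tokens_per_frame])
        for f in range(n_frames)
    ]

def target_distribution(visible_pixels):
    """Normalize per-frame visible-pixel counts of an instance (assumed target)."""
    total = sum(visible_pixels)
    if total == 0:
        raise ValueError("instance is not visible in any frame")
    return [v / total for v in visible_pixels]

def alignment_loss(target, predicted, eps=1e-8):
    """KL(target || predicted) over frames; frames with zero target mass are skipped."""
    return sum(
        t * math.log((t + eps) / (p + eps))
        for t, p in zip(target, predicted)
        if t > 0
    )

# Toy example: 3 frames with 2 tokens each, instance visible in frames 1 and 2.
attn = [0.1, 0.1, 0.3, 0.3, 0.1, 0.1]
predicted = frame_attention(attn, tokens_per_frame=2)
target = target_distribution([0, 300, 100])
loss = alignment_loss(target, predicted)
```

Because a term like this would only be evaluated during training, it is consistent with the abstract's claim that FADA adds no inference overhead.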

Related papers