Fugu-MT Paper Translation (Abstract): AVGGT: Rethinking Global Attention for Accelerating VGGT

Paper abstract: AVGGT: Rethinking Global Attention for Accelerating VGGT

arxiv url: http://arxiv.org/abs/2512.02541v1
Date: Tue, 02 Dec 2025 09:08:18 GMT
Status: Translation complete
System update date: 2025-12-03 21:04:45.795588
Title: AVGGT: Rethinking Global Attention for Accelerating VGGT
Title (reference translation): AVGGT: Rethinking Global Attention to Accelerate VGGT
Authors: Xianbing Sun, Zhikai Zhu, Zhengyu Lou, Bo Yang, Jinyang Tang, Liqing Zhang, He Wang, Jianfu Zhang
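Abstract summary: VGGT and $π^3$ show strong multi-view 3D performance, but their heavy reliance on global self-attention makes them computationally expensive. The authors conduct an in-depth investigation of the global attention modules in VGGT and $π^3$ to better understand their roles. They propose a two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention.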
Reference score (independently computed attention metric): 16.56994879750844
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Since DUSt3R, models such as VGGT and $π^3$ have shown strong multi-view 3D performance, but their heavy reliance on global self-attention results in high computational cost. Existing sparse-attention variants offer partial speedups, yet lack a systematic analysis of how global attention contributes to multi-view reasoning. In this paper, we first conduct an in-depth investigation of the global attention modules in VGGT and $π^3$ to better understand their roles. Our analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers do not form meaningful correspondences, middle layers perform cross-view alignment, and last layers provide only minor refinements. Guided by these findings, we propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens with diagonal preservation and a mean-fill component. We instantiate this strategy on VGGT and $π^3$ and evaluate across standard pose and point-map benchmarks. Our method achieves up to $8$-$10\times$ speedup in inference time while matching or slightly improving the accuracy of the original models, and remains robust even in extremely dense multi-view settings where prior sparse-attention baselines fail.
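Abstract (reference translation): Since DUSt3R, models such as VGGT and $π^3$ have delivered strong multi-view 3D performance, but their dependence on global self-attention makes them computationally expensive. Existing sparse-attention variants provide partial speedups but lack a systematic analysis of how global attention contributes to multi-view reasoning. This paper first investigates the global attention modules in VGGT and $π^3$ in detail to better understand their roles. The analysis reveals a clear division of roles in the alternating global-frame architecture: early global layers form no meaningful correspondences, middle layers perform cross-view alignment, and the last layers provide only minor refinements. Based on these findings, the authors propose a training-free two-step acceleration scheme: (1) converting early global layers into frame attention, and (2) subsampling global attention by subsampling K/V over patch tokens, with diagonal preservation and a mean-fill component. The strategy is instantiated on VGGT and $π^3$ and evaluated on standard pose and point-map benchmarks. The method achieves up to $8$-$10\times$ faster inference while matching or slightly improving the accuracy of the original models, and it remains robust in extremely dense multi-view settings where prior sparse-attention baselines fail.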
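
The two steps named in the abstract can be illustrated with a small standard-library Python sketch. It is not the authors' implementation. The abstract does not define "diagonal preservation" or "mean-fill", so the sketch assumes one plausible reading: each query always keeps its own token among the keys/values (the diagonal of the attention matrix), and the tokens dropped by subsampling are summarized by a single mean token. All function names and the `stride` parameter are hypothetical. Q/K/V projections, multiple heads, and non-patch tokens are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(a * b for a, b in zip(query, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(v[j] for v in vectors) / count for j in range(len(vectors[0]))]


def global_attention(frames):
    """Baseline: every token attends to the tokens of all frames."""
    tokens = [tok for frame in frames for tok in frame]
    return [[attend(tok, tokens, tokens) for tok in frame] for frame in frames]


def frame_attention(frames):
    """Step 1: a token attends only to tokens of its own frame."""
    return [[attend(tok, frame, frame) for tok in frame] for frame in frames]


def subsampled_global_attention(frames, stride):
    """Step 2 (assumed form): attend to a strided subset of K/V tokens.

    Assumptions, not taken from the paper:
      - every `stride`-th token is kept as a key/value;
      - the query's own token is always kept (diagonal preservation);
      - the dropped tokens are replaced by one mean token (mean-fill).
    """
    tokens = [tok for frame in frames for tok in frame]
    kept = list(range(0, len(tokens), stride))
    kept_set = set(kept)
    dropped = [i for i in range(len(tokens)) if i not in kept_set]
    fill = [mean_vector([tokens[i] for i in dropped])] if dropped else []

    outputs, index = [], 0
    for frame in frames:
        row = []
        for tok in frame:
            indices = kept if index in kept_set else kept + [index]
            kv = [tokens[i] for i in indices] + fill
            row.append(attend(tok, kv, kv))
            index += 1
        outputs.append(row)
    return outputs
```

With `stride=1` nothing is dropped, so the sketch reduces to ordinary global attention. Larger strides shrink the key/value set each query scans roughly by a factor of `stride`, which is where a speedup would come from under these assumptions.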

Related papers