Fugu-MT Paper Translation (Abstract): GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

Paper overview: GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning

arxiv url: http://arxiv.org/abs/2605.22558v1
Date: Thu, 21 May 2026 14:40:03 GMT
Status: Translation complete
Last updated in system: 2026-05-22 20:14:18.58569
Title: GeoWeaver: Grounding Visual Tokens with Geometric Evidence before Scene Reasoning
Title (reference translation): GeoWeaver: Grounding visual tokens with geometric evidence before scene reasoning
Authors: Deshui Miao, Xingsen Huang, Yameng Gu, Xin Li, Haijun Zhang, Ming-Hsuan Yang
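Abstract summary: Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. These approaches typically treat geometric cues as a shared signal across all visual tokens. The paper introduces GeoWeaver, a framework that treats geometry as a representational prerequisite for spatio-temporal reasoning.
Reference score (independently computed attention score): 45.229974852899716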
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. While these approaches demonstrate the importance of geometry for spatial intelligence, they typically treat geometric cues as a shared signal across all visual tokens. We note that this overlooks a finer-grained challenge: different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, we introduce GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, enabling each visual token to retrieve the most relevant geometric abstractions. The selected evidence is incorporated into visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning. Extensive evaluations on spatial reasoning benchmarks demonstrate that GeoWeaver consistently enhances geometry-aware reasoning while retaining general multimodal capabilities. This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation on which large language models perform reasoning. All source code and models will be released at https://github.com/yahooo-m/GeoWeaver .
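Abstract (reference translation): Spatio-temporal reasoning in vision-language models requires visual representations that preserve physical geometry rather than merely semantic appearance. Recent multimodal models incorporate geometric information through structural branches, 3D-aware supervision, reasoning-stage fusion, or long-horizon memory. These approaches demonstrate the importance of geometry for spatial intelligence, but they typically treat geometric cues as a shared signal across all visual tokens. Different visual tokens require different geometric evidence depending on their spatial roles. To address this limitation, the paper introduces GeoWeaver, a pre-reasoning geometric grounding framework that treats geometry as a representational prerequisite for spatio-temporal reasoning. GeoWeaver constructs a multi-level geometry bank from a frozen geometry encoder and performs token-adaptive geometric evidence allocation, so that each visual token can retrieve the most relevant geometric abstractions. The selected evidence is incorporated into the visual tokens via a residual grounding operation prior to language modeling, yielding geometry-grounded representations for downstream reasoning. Extensive evaluations on spatial reasoning benchmarks show that GeoWeaver consistently enhances geometry-aware reasoning while retaining general multimodal capabilities. This indicates that geometric information yields the greatest benefit not as a late-fusion auxiliary signal but as a fundamental prerequisite that shapes the representational foundation on which large language models perform reasoning. All source code and models will be released at https://github.com/yahooo-m/GeoWeaver .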
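
The abstract describes three steps: building a multi-level geometry bank, letting each visual token retrieve the evidence most relevant to it, and adding that evidence back as a residual before language modeling. The following is a minimal sketch of that general idea only, not the authors' implementation. The function name `ground_tokens`, the `scale` parameter, the dot-product attention, and the max-logit level scoring are all assumptions made for illustration; the abstract does not specify any of them, and the bank here is random data rather than the output of a frozen geometry encoder.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def ground_tokens(tokens, geometry_bank, scale=1.0):
    """Hypothetical token-adaptive grounding step (not the paper's code).

    tokens:        (N, D) visual token features.
    geometry_bank: list of (M_l, D) arrays, one per abstraction level.
    scale:         assumed weight on the residual update.
    Returns the grounded tokens (N, D) and per-token level weights (N, L).
    """
    dim = tokens.shape[1]
    level_evidence = []
    level_scores = []
    for level in geometry_bank:
        # Each token attends over the entries of this level.
        logits = tokens @ level.T / np.sqrt(dim)       # (N, M_l)
        attn = softmax(logits, axis=-1)
        level_evidence.append(attn @ level)            # (N, D)
        # Assumption: the best match in a level scores the token's affinity to it.
        level_scores.append(logits.max(axis=-1))       # (N,)
    evidence = np.stack(level_evidence, axis=1)        # (N, L, D)
    level_weights = softmax(np.stack(level_scores, axis=1), axis=-1)  # (N, L)
    # Mix the levels per token, then add the result as a residual.
    mixed = (level_weights[..., None] * evidence).sum(axis=1)         # (N, D)
    return tokens + scale * mixed, level_weights


rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
bank = [rng.normal(size=(m, 8)) for m in (16, 8, 4)]  # three assumed levels
grounded, weights = ground_tokens(tokens, bank)
```

Because the update is residual, setting `scale=0.0` returns the input tokens unchanged, which is one way to see how such a step could leave the original visual representation intact when no geometric evidence is added.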

Related papers