Fugu-MT Paper Translation (Abstract): LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning

Paper overview: LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning

arxiv url: http://arxiv.org/abs/2603.12166v1
Date: Thu, 12 Mar 2026 17:01:23 GMT
Status: Translation complete
Last updated in system: 2026-03-13 14:46:26.240773
Title: LatentGeo: Learnable Auxiliary Constructions in Latent Space for Multimodal Geometric Reasoning
Title (reference translation): LatentGeo: Learnable auxiliary constructions in latent space for multimodal geometric reasoning
Authors: Haiying Xu, Zihan Wang, Song Dai, Zhengxuan Zhang, Kairan Dou, Xuming Hu
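Abstract summary: The paper proposes a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions.
Reference score (independently computed attention score): 32.39048489202347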
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Despite recent advances in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems apply. Existing approaches predominantly rely on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to faithfully represent complex spatial relationships, incur representation mismatch between discrete symbols and continuous geometric structures, or rely on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations through auxiliary visual supervision, followed by LaGDPO, a latent-aware reinforcement learning procedure that stabilizes latent representations during policy optimization while improving end-task correctness. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and conduct experiments on GeoAux and MathVerse. Results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. Extensive analyses and ablation studies further validate the effectiveness of each component in our framework.
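Abstract (reference translation): Despite recent progress in multimodal reasoning, representing auxiliary geometric constructions remains a fundamental challenge for multimodal large language models (MLLMs). Such constructions are absent from the original diagram and must be introduced before theorems can be applied. Existing approaches rely mainly on explicit construction paradigms, including text-based geometric specification, visual-token interleaving during reasoning, and tool-augmented geometric execution. However, these methods either fail to represent complex spatial relationships faithfully, suffer from a representation mismatch between discrete symbols and continuous geometric structures, or depend on external capabilities that hinder end-to-end optimization. To address these limitations, we propose LatentGeo, a framework that learns continuous latent visual representations to internalize auxiliary geometric constructions without pixel-level rendering or external executors. We design a three-stage curriculum that progressively aligns and internalizes these latent representations, followed by LaGDPO, a latent-aware reinforcement learning procedure. To systematically evaluate construction-centric representation quality, we introduce GeoAux, a new benchmark targeting visually dependent geometry problems, and run experiments on GeoAux and MathVerse. The results show that LatentGeo achieves substantial gains on geometric reasoning tasks, particularly those requiring auxiliary constructions. Extensive analyses and ablation studies further validate the effectiveness of each component of the framework.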


Related papers