Fugu-MT Paper Translation (Abstract): CapGeo: A Caption-Assisted Approach to Geometric Reasoning

Paper overview: CapGeo: A Caption-Assisted Approach to Geometric Reasoning

arxiv url: http://arxiv.org/abs/2510.09302v1
Date: Fri, 10 Oct 2025 11:47:54 GMT
Status: Translation complete
Last updated in system: 2025-10-14 00:38:48.88582
Title: CapGeo: A Caption-Assisted Approach to Geometric Reasoning
Authors: Yuying Li, Siyi Qian, Hao Liang, Leqi Zheng, Ruichuan An, Yongzhen Guo, Wentao Zhang
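Abstract summary: CapGeo is a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions. The authors also propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs.
Reference score (attention score computed independently by this site): 10.716955074782902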
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Geometric reasoning remains a core challenge for Multimodal Large Language Models (MLLMs). Even the most advanced closed-source systems, such as GPT-O3 and Gemini-2.5-Pro, still struggle to solve geometry problems reliably, despite exhibiting strong textual reasoning abilities on tasks like the International Mathematical Olympiad (IMO). This gap suggests that the bottleneck lies in understanding geometric diagrams rather than reasoning itself. Since geometric figures can often be faithfully described in concise textual form, converting visual content into captions offers a promising direction. Motivated by this insight, we introduce CapGeo, a caption-assisted reasoning framework that bridges visual and textual modalities. Experiments show substantial improvements when models are equipped with captions: Qwen2.5-VL-72B improves from 8.6% (vision-only) to 59.0%, while Claude-Opus-4 rises from 44.8% to 73.0%. To systematically evaluate and identify high-quality geometric captioning models, we further propose CapGeo-Bench, a dataset of 4,641 curated figure-caption pairs. Crucially, CapGeo-Bench incorporates a keypoint-based evaluation metric that correlates strongly with downstream CapGeo performance, enabling reliable assessment of geometric captioning ability. Together, our framework and benchmark highlight a new pathway toward advancing geometric reasoning in MLLMs.
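As a quick arithmetic check on the accuracies quoted in the abstract, the sketch below computes the absolute gain in percentage points for each model. Only the four accuracy values come from the abstract; the dictionary layout and the helper name `absolute_gain` are illustrative assumptions, not part of the paper.

```python
# Accuracies (%) quoted in the abstract, stored as (baseline, with captions).
# The data layout and helper name are hypothetical, chosen for illustration.
reported = {
    "Qwen2.5-VL-72B": (8.6, 59.0),
    "Claude-Opus-4": (44.8, 73.0),
}


def absolute_gain(before: float, after: float) -> float:
    """Return the gain in percentage points, rounded to one decimal place."""
    return round(after - before, 1)


gains = {model: absolute_gain(*scores) for model, scores in reported.items()}

for model, gain in gains.items():
    print(f"{model}: +{gain} percentage points")
```

With these inputs the gains are 50.4 points for Qwen2.5-VL-72B and 28.2 points for Claude-Opus-4.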

Related papers