Fugu-MT Paper Translation (Abstract): Generalizable Geometric Image Caption Synthesis

Paper summary: Generalizable Geometric Image Caption Synthesis

arxiv url: http://arxiv.org/abs/2509.15217v1
Date: Thu, 18 Sep 2025 17:59:11 GMT
Status: Translation complete
Last updated in system: 2025-09-19 17:26:53.393763
Title: Generalizable Geometric Image Caption Synthesis
Title (reference translation): Generalizable Geometric Image Caption Synthesis
Authors: Yue Xin, Wenyuan Wang, Rui Pan, Ruida Wang, Howard Meng, Renjie Pi, Shizhe Diao, Tong Zhang
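Abstract summary: This paper introduces Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By adopting RLVR to refine captions for geometric images, the pipeline captures the key features of geometry problem-solving. Even in out-of-distribution scenarios, the generated dataset enhances the general reasoning capabilities of multimodal large language models.
Reference score (independently computed attention score): 33.54322399613445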
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal large language models have various practical applications that demand strong reasoning abilities. Despite recent advancements, these models still struggle to solve complex geometric problems. A key challenge stems from the lack of high-quality image-text pair datasets for understanding geometric images. Furthermore, most template-based data synthesis pipelines typically fail to generalize to questions beyond their predefined templates. In this paper, we bridge this gap by introducing a complementary process of Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By adopting RLVR to refine captions for geometric images synthesized from 50 basic geometric relations and using reward signals derived from mathematical problem-solving tasks, our pipeline successfully captures the key features of geometry problem-solving. This enables better task generalization and yields non-trivial improvements. Furthermore, even in out-of-distribution scenarios, the generated dataset enhances the general reasoning capabilities of multimodal large language models, yielding accuracy improvements of $2.8\%\text{-}4.8\%$ in statistics, arithmetic, algebraic, and numerical tasks with non-geometric input images of MathVista and MathVerse, along with $2.4\%\text{-}3.9\%$ improvements in Art, Design, Tech, and Engineering tasks in MMMU.
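Abstract (reference translation): Multimodal large language models have a variety of practical applications that require strong reasoning abilities. Despite recent progress, these models still struggle to solve complex geometric problems. A key challenge is the lack of high-quality image-text pair datasets for understanding geometric images. In addition, most template-based data synthesis pipelines fail to generalize to questions beyond their predefined templates. This paper bridges the gap by introducing a complementary process of Reinforcement Learning with Verifiable Rewards (RLVR) into the data generation pipeline. By using RLVR to refine captions for geometric images synthesized from 50 basic geometric relations, with reward signals derived from mathematical problem-solving tasks, the pipeline captures the key features of geometry problem-solving. This improves task generalization and yields non-trivial improvements. Furthermore, even in out-of-distribution scenarios, the generated dataset strengthens the general reasoning capabilities of multimodal large language models, yielding accuracy improvements of $2.8\%\text{-}4.8\%$ on statistics, arithmetic, algebraic, and numerical tasks with non-geometric input images in MathVista and MathVerse, and improvements of $2.4\%\text{-}3.9\%$ on Art, Design, Tech, and Engineering tasks in MMMU.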

Related papers