Fugu-MT Paper Translation (Abstract): Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation

Paper overview: Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation

arxiv url: http://arxiv.org/abs/2508.13587v1
Date: Tue, 19 Aug 2025 07:40:18 GMT
Status: Translation complete
Last updated in system: 2025-08-20 15:36:31.832625
Title: Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
Title (reference translation): Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
Authors: Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, Lin Ma
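Abstract summary: This paper proposes Multimodal Structured Reinforcement Learning (MSRL). The authors construct the largest training corpus to date, containing 3 million chart-code pairs derived from real-world arXiv tables. MSRL significantly breaks through the SFT plateau, improving high-level metrics by 6.2% and 9.9% on the ChartMimic and ReachQA benchmarks, respectively.
Reference score (attention score computed independently by this site): 12.822184232115333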
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring in-depth understanding of information-rich images and generation of structured outputs remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to generate structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies that appropriately reward structured outputs. We systematically investigate the performance plateau in SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation, which substantially breaks through this plateau. We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables to mitigate simplistic patterns of prior synthetic data. Despite reaching state-of-the-art performance, our experiments show that scaling SFT data eventually hits a plateau where further increases yield negligible improvements. Our MSRL method leverages a multi-granularity structured reward system using multimodal textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details. At the visual level, model-based rewards assess structural similarity by rendering generated code into images and employing an evaluator model. We implement this within a two-stage curriculum for training stability. Results demonstrate that MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks respectively, achieving competitive performance with advanced closed-source models.
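Abstract (reference translation): Reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, but its application to tasks that require deep understanding of information-rich images and the generation of structured outputs remains underexplored. Chart-to-code generation exemplifies this challenge: it requires complex reasoning over visual charts in order to produce structured code. Supervised fine-tuning (SFT) alone is often insufficient, which highlights the need for effective RL strategies that appropriately reward structured outputs. Through large-scale experiments, the authors systematically investigate the performance plateau in SFT and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation. They construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables, to mitigate the simplistic patterns of earlier synthetic data. Their experiments show that, although the resulting model reaches state-of-the-art performance, scaling SFT data eventually hits a plateau where further increases yield negligible improvements. MSRL uses a multi-granularity structured reward system that draws on both textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details. At the visual level, model-based rewards assess structural similarity by rendering the generated code into images and scoring them with an evaluator model. This is implemented within a two-stage curriculum for training stability. MSRL significantly breaks through the SFT plateau, improving high-level metrics by 6.2% and 9.9% on the ChartMimic and ReachQA benchmarks, respectively, and achieves performance competitive with advanced closed-source models.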
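
The two reward levels described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the rules, the evaluator model, the weighting, or how the two-stage curriculum schedules the rewards. The function names `extract_details`, `textual_reward`, and `combined_reward`, the F1-overlap rule, and the `visual_weight` parameter are all assumptions made for illustration. Rendering and the evaluator model are left to a caller-supplied `visual_scorer` callable, which is also hypothetical.

```python
import ast
from typing import Callable, Set


def extract_details(code: str) -> Set[str]:
    """Collect called function/method names and string literals from Python source."""
    details: Set[str] = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                details.add(f"call:{func.attr}")
            elif isinstance(func, ast.Name):
                details.add(f"call:{func.id}")
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            details.add(f"text:{node.value}")
    return details


def textual_reward(generated: str, reference: str) -> float:
    """Assumed rule-based reward: F1 overlap of code details.

    Returns 0.0 when the generated code does not parse.
    The reference code is assumed to be valid Python.
    """
    try:
        gen = extract_details(generated)
    except SyntaxError:
        return 0.0
    ref = extract_details(reference)
    overlap = len(gen & ref)
    if overlap == 0:
        return 0.0
    precision = overlap / len(gen)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def combined_reward(
    generated: str,
    reference: str,
    visual_scorer: Callable[[str, str], float],
    visual_weight: float = 0.5,  # assumed weighting, not taken from the paper
) -> float:
    """Blend the textual reward with a caller-supplied visual score.

    `visual_scorer` stands in for "render the code, then ask an evaluator
    model"; its output is clamped to [0, 1] before blending.
    """
    text_score = textual_reward(generated, reference)
    visual_score = max(0.0, min(1.0, visual_scorer(generated, reference)))
    return (1.0 - visual_weight) * text_score + visual_weight * visual_score
```

In this sketch, passing `visual_weight=0.0` gives a text-only reward and a larger value mixes in the visual score. Whether the paper's two-stage curriculum works this way is not stated in the abstract.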

Related papers