Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
- URL: http://arxiv.org/abs/2508.13587v1
- Date: Tue, 19 Aug 2025 07:40:18 GMT
- Title: Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
- Authors: Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, Lin Ma
- Abstract summary: We propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation. We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables. MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks respectively.
- Score: 12.822184232115333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring in-depth understanding of information-rich images and generation of structured outputs remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to generate structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies that appropriately reward structured outputs. We systematically investigate the performance plateau in SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation, which substantially breaks through this plateau. We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables to mitigate simplistic patterns of prior synthetic data. Despite reaching state-of-the-art performance, our experiments show that scaling SFT data eventually hits a plateau where further increases yield negligible improvements. Our MSRL method leverages a multi-granularity structured reward system using multimodal textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details. At the visual level, model-based rewards assess structural similarity by rendering generated code into images and employing an evaluator model. We implement this within a two-stage curriculum for training stability. Results demonstrate that MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks respectively, achieving competitive performance with advanced closed-source models.
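The multi-granularity reward described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the Dice-style overlap used as the rule-based textual score, the `visual_scorer` callable (a stand-in for rendering the code and querying an evaluator model), the weighting, and the assumption that the first curriculum stage uses textual rewards only are all hypothetical choices made for illustration.

```python
import ast
from typing import Callable, Optional, Set


def extract_details(code: str) -> Set[str]:
    """Collect called function/method names and string literals from plotting code."""
    details: Set[str] = set()
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Call):
            func = node.func
            if isinstance(func, ast.Attribute):
                details.add(f"call:{func.attr}")
            elif isinstance(func, ast.Name):
                details.add(f"call:{func.id}")
        elif isinstance(node, ast.Constant) and isinstance(node.value, str):
            details.add(f"text:{node.value}")
    return details


def textual_reward(generated: str, reference: str) -> float:
    """Hypothetical rule-based reward: Dice overlap of extracted code details."""
    try:
        gen = extract_details(generated)
    except SyntaxError:
        return 0.0  # code that does not parse earns no textual reward
    ref = extract_details(reference)  # the reference is assumed to be valid Python
    if not gen and not ref:
        return 1.0
    return 2 * len(gen & ref) / (len(gen) + len(ref))


def structured_reward(
    generated: str,
    reference: str,
    stage: int,
    visual_scorer: Optional[Callable[[str, str], float]] = None,
    visual_weight: float = 0.5,
) -> float:
    """Combine textual and visual feedback; the stage split is an assumption."""
    text_r = textual_reward(generated, reference)
    if stage == 1 or visual_scorer is None:
        return text_r  # assumed stage 1: textual reward only
    # Assumed stage 2: blend in a model-based score in [0, 1] from the
    # hypothetical evaluator (which would render both programs and compare them).
    vis_r = visual_scorer(generated, reference)
    return (1 - visual_weight) * text_r + visual_weight * vis_r
```

In an RL loop, `structured_reward` would be called once per sampled program, with `stage` switched from 1 to 2 at whatever point the curriculum dictates. The visual scorer is left as a callable so that any renderer and evaluator model can be plugged in.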
Related papers
- Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation [11.18352269863283]
Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images. Existing approaches largely rely on supervised fine-tuning, encouraging surface-level token imitation. We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision.
arXiv Detail & Related papers (2026-02-11T14:08:06Z)
- ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch [57.01439313241121]
We introduce Rollout Posterior Entropy (RPE), a novel metric that quantifies chart complexity. We also develop truth-anchored inverse QA synthesis to guarantee reasoning rigor. To further elevate difficulty and reasoning depth, we filter samples based on model fail-rate and distill high-quality Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2026-01-20T05:11:44Z)
- Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization [50.13408999553116]
We propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation. Our method uses a novel multi-objective reward that jointly optimizes textual accuracy, code validity, and visualization quality. Our results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation.
arXiv Detail & Related papers (2026-01-08T04:29:07Z)
- Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View [10.95044674432639]
We propose two difficulty-aware sampling strategies for multimodal reasoning. We show that Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation. We also show that Cross-Modality Attention Balance (CMAB) assesses cross-modal interaction complexity.
arXiv Detail & Related papers (2025-11-10T05:31:59Z)
- ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension [15.798942458550515]
This study proposes an automated multi-stage code-driven pipeline for generating visual reasoning datasets. We construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples.
arXiv Detail & Related papers (2025-11-04T09:45:34Z)
- Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models [53.03670032402846]
We address the task of table image to code generation, with the goal of automating the reconstruction of high-quality, publication-ready tables from visual inputs. A central challenge of this task lies in accurately handling complex tables: those with large sizes, deeply nested structures, and semantically rich or irregular cell content. We propose a reinforced multimodal large language model (MLLM) framework, where a pre-trained MLLM is fine-tuned on a large-scale table-to-LaTeX dataset.
arXiv Detail & Related papers (2025-09-22T11:13:48Z)
- On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [61.607788999847564]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for Large Language Models (LLMs). We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of the model. We propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token.
arXiv Detail & Related papers (2025-08-07T17:59:04Z)
- Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering [75.12322966980003]
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains. Most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning. Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering. We propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA.
arXiv Detail & Related papers (2025-06-11T12:03:52Z)
- Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs). We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset. We design a set of graph-centric rewards, including three recall-based variants: Hard Recall, Hard Recall+Relax, and Soft Recall.
arXiv Detail & Related papers (2025-04-18T10:46:22Z)
- Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement [16.22363384653305]
Multimodal Large Language Models (MLLMs) perform fine-grained visual parsing, precise code synthesis, and robust cross-modal reasoning. We propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning. Our framework significantly enhances the performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code.
arXiv Detail & Related papers (2025-04-03T07:51:20Z)
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning. We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
- MTLSO: A Multi-Task Learning Approach for Logic Synthesis Optimization [19.13500546022262]
MTLSO is a Multi-Task Learning approach for Logic Synthesis Optimization.
We introduce an auxiliary task of binary multi-label graph classification alongside the primary regression task.
We also employ a hierarchical graph representation learning strategy to improve the model's capacity for learning expressive graph-level representations.
arXiv Detail & Related papers (2024-09-09T21:20:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.