Related papers: Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation

Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation

URL: http://arxiv.org/abs/2508.13587v1
Date: Tue, 19 Aug 2025 07:40:18 GMT
Title: Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
Authors: Lei Chen, Xuanle Zhao, Zhixiong Zeng, Jing Huang, Liming Zheng, Yufeng Zhong, Lin Ma,
Abstract summary: We propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation.<n>We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables.<n>MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks respectively.
Score: 12.822184232115333
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While reinforcement learning (RL) has proven highly effective for general reasoning in vision-language models, its application to tasks requiring in-depth understanding of information-rich images and generation of structured outputs remains underexplored. Chart-to-code generation exemplifies this challenge, demanding complex reasoning over visual charts to generate structured code. Supervised fine-tuning (SFT) alone is often insufficient, highlighting the need for effective RL strategies that appropriately reward structured outputs. We systematically investigate the performance plateau in SFT through large-scale experiments and propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation, which substantially breaks through this plateau. We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables to mitigate simplistic patterns of prior synthetic data. Despite reaching state-of-the-art performance, our experiments show that scaling SFT data eventually hits a plateau where further increases yield negligible improvements. Our MSRL method leverages a multi-granularity structured reward system using multimodal textual and visual feedback. At the textual level, rule-based rewards validate fine-grained code details. At the visual level, model-based rewards assess structural similarity by rendering generated code into images and employing an evaluator model. We implement this within a two-stage curriculum for training stability. Results demonstrate that MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks respectively, achieving competitive performance with advanced closed-source models.

Related papers

Chart Specification: Structural Representations for Incentivizing VLM Reasoning in Chart-to-Code Generation [11.18352269863283]
Vision-Language Models (VLMs) have shown promise in generating plotting code from chart images.<n>Existing approaches largely rely on supervised fine-tuning, encouraging surface-level token imitation.<n>We propose Chart Specification, a structured intermediate representation that shifts training from text imitation to semantically grounded supervision.
arXiv Detail & Related papers (2026-02-11T14:08:06Z)
ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch [57.01439313241121]
We introduce Rollout Posterior Entropy (RPE), a novel metric that quantifies chart complexity.<n>We also develop truth-anchored inverse QA synthesis to guarantee reasoning rigor.<n>To further elevate difficulty and reasoning depth, we filter samples based on model fail-rate and distill high-quality Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2026-01-20T05:11:44Z)
Aligning Text, Code, and Vision: A Multi-Objective Reinforcement Learning Framework for Text-to-Visualization [50.13408999553116]
We propose RL-Text2Vis, the first reinforcement learning framework for Text2Vis generation.<n>Our method uses a novel multi-objective reward that jointly optimize textual accuracy, code validity, and visualization quality.<n>Our results establish GRPO as an effective strategy for structured, multimodal reasoning in visualization generation.
arXiv Detail & Related papers (2026-01-08T04:29:07Z)
Revisiting the Data Sampling in Multimodal Post-training from a Difficulty-Distinguish View [10.95044674432639]
We propose two difficulty-aware sampling strategies for multimodal reasoning.<n>We show that Progressive Image Semantic Masking (PISM) quantifies sample hardness through systematic image degradation.<n>We also show that Cross-Modality Attention Balance (CMAB) assesses cross-modal interaction complexity.
arXiv Detail & Related papers (2025-11-10T05:31:59Z)
ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension [15.798942458550515]
This study proposes an automated multi-stage code-driven pipeline for generating visual reasoning datasets.<n>We construct ChartM$3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples.
arXiv Detail & Related papers (2025-11-04T09:45:34Z)
Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models [53.03670032402846]
We address the task of table image to code generation, with the goal of automating the reconstruction of high-quality, publication-ready tables from visual inputs.<n>A central challenge of this task lies in accurately handling complex tables -- those with large sizes, deeply nested structures, and semantically rich or irregular cell content.<n>We propose a reinforced multimodal large language model (MLLM) framework, where a pre-trained MLLM is fine-tuned on a large-scale table-to-La dataset.
arXiv Detail & Related papers (2025-09-22T11:13:48Z)
On the Generalization of SFT: A Reinforcement Learning Perspective with Reward Rectification [61.607788999847564]
We present a simple yet theoretically motivated improvement to Supervised Fine-Tuning (SFT) for the Large Language Model (LLM)<n>We reveal that standard SFT gradients implicitly encode a problematic reward structure that may severely restrict the generalization capabilities of model.<n>We propose Dynamic Fine-Tuning (DFT), stabilizing gradient updates for each token by dynamically rescaling the objective function with the probability of this token.
arXiv Detail & Related papers (2025-08-07T17:59:04Z)
Learning Efficient and Generalizable Graph Retriever for Knowledge-Graph Question Answering [75.12322966980003]
Large Language Models (LLMs) have shown strong inductive reasoning ability across various domains.<n>Most existing RAG pipelines rely on unstructured text, limiting interpretability and structured reasoning.<n>Recent studies have explored integrating knowledge graphs with LLMs for knowledge graph question answering.<n>We propose RAPL, a novel framework for efficient and effective graph retrieval in KGQA.
arXiv Detail & Related papers (2025-06-11T12:03:52Z)
Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs)<n>We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset.<n>We design a set of graph-centric rewards, including three recall-based variants -- Hard Recall, Hard Recall+Relax, and Soft Recall.
arXiv Detail & Related papers (2025-04-18T10:46:22Z)
Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement [16.22363384653305]
Multimodal Large Language Models (MLLMs) perform fine-grained visual parsing, precise code synthesis, and robust cross-modal reasoning.<n>We propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning.<n>Our framework significantly enhances the performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code.
arXiv Detail & Related papers (2025-04-03T07:51:20Z)
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning.<n>We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
MTLSO: A Multi-Task Learning Approach for Logic Synthesis Optimization [19.13500546022262]
MTLSO is a Multi-Task Learning approach for Logic Synthesis Optimization. We introduce an auxiliary task of binary multi-label graph classification alongside the primary regression task. We also employ a hierarchical graph representation learning strategy to improve the model's capacity for learning expressive graph-level representations.
arXiv Detail & Related papers (2024-09-09T21:20:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.