Related papers: Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement

Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement

URL: http://arxiv.org/abs/2504.02906v2
Date: Wed, 20 Aug 2025 14:56:28 GMT
Title: Boosting Chart-to-Code Generation in MLLM via Dual Preference-Guided Refinement
Authors: Zhihan Zhang, Yixin Cao, Lizi Liao,
Abstract summary: Multimodal Large Language Models (MLLMs) perform fine-grained visual parsing, precise code synthesis, and robust cross-modal reasoning.<n>We propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning.<n>Our framework significantly enhances the performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code.
Score: 16.22363384653305
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Translating chart images into executable plotting scripts-referred to as the chart-to-code generation task-requires Multimodal Large Language Models (MLLMs) to perform fine-grained visual parsing, precise code synthesis, and robust cross-modal reasoning. However, this task is inherently under-constrained: multiple valid code implementations can produce the same visual chart, and evaluation must consider both code correctness and visual fidelity across diverse dimensions. This makes it difficult to learn accurate and generalizable mappings through standard supervised fine-tuning. To address these challenges, we propose a dual preference-guided refinement framework that combines a feedback-driven, dual-modality reward mechanism with iterative preference learning. Our approach introduces a structured variant generation strategy and a visual reward model to efficiently produce high-quality, aspect-aware preference pairs-making preference collection scalable and supervision more targeted. These preferences are used in an offline reinforcement learning setup to optimize the model toward multi-dimensional fidelity. Experimental results show that our framework significantly enhances the performance of general-purpose open-source MLLMs, enabling them to generate high-quality plotting code that rivals specialized chart-centric models and even some proprietary systems. The code and datasets are publicly available at https://github.com/Zhihan72/Chart2Code.

Related papers

From One-to-One to Many-to-Many: Dynamic Cross-Layer Injection for Deep Vision-Language Fusion [91.35078719566472]
Vision-Language Models (VLMs) create a severe visual feature bottleneck by using a crude, asymmetric connection.<n>We introduce Cross-Layer Injection (CLI), a novel and lightweight framework that forges a dynamic many-to-many bridge between the two modalities.
arXiv Detail & Related papers (2026-01-15T18:59:10Z)
Chart2Code-MoLA: Efficient Multi-Modal Code Generation via Adaptive Expert Routing [20.521717930460692]
C2C-MoLA is a framework that synergizes Mixture of Experts (MoE) with Low-Rank Adaptation (LoRA)<n>LoRA enables parameter-efficient updates for resource-conscious tuning.<n>Experiments on Chart2Code-160k show that the proposed model improves generation accuracy by up to 17%.
arXiv Detail & Related papers (2025-11-28T16:23:04Z)
VinciCoder: Unifying Multimodal Code Generation via Coarse-to-fine Visual Reinforcement Learning [13.193184888476404]
We introduce textbfciCoder, a unified multimodal code generation model.<n>We begin by constructing a large-scale Supervised Finetuning (SFT) corpus comprising 1.6M image-code pairs.<n>We then introduce a Visual Reinforcement Learning (ViRL) strategy, which employs a coarse-to-fine reward mechanism to improve visual fidelity.
arXiv Detail & Related papers (2025-11-01T04:05:26Z)
Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models [53.03670032402846]
We address the task of table image to code generation, with the goal of automating the reconstruction of high-quality, publication-ready tables from visual inputs.<n>A central challenge of this task lies in accurately handling complex tables -- those with large sizes, deeply nested structures, and semantically rich or irregular cell content.<n>We propose a reinforced multimodal large language model (MLLM) framework, where a pre-trained MLLM is fine-tuned on a large-scale table-to-La dataset.
arXiv Detail & Related papers (2025-09-22T11:13:48Z)
Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation [12.822184232115333]
We propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation.<n>We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables.<n>MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA benchmarks respectively.
arXiv Detail & Related papers (2025-08-19T07:40:18Z)
VisCodex: Unified Multimodal Code Generation via Merging Vision and Coding Models [82.05514464090172]
Multimodal large language models (MLLMs) have significantly advanced the integration of visual and textual understanding.<n>However, their ability to generate code from multimodal inputs remains limited.<n>We introduce VisCodex, a unified framework that seamlessly merges vision and coding language models.
arXiv Detail & Related papers (2025-08-13T17:00:44Z)
ScreenCoder: Advancing Visual-to-Code Generation for Front-End Automation via Modular Multimodal Agents [40.697759330690815]
ScreenCoder is a modular multi-agent framework that decomposes the task into three interpretable stages: grounding, planning, and generation.<n>By assigning these distinct responsibilities to specialized agents, our framework achieves significantly higher robustness and fidelity than end-to-end approaches.<n>Our approach achieves state-of-the-art performance in layout accuracy, structural coherence, and code correctness.
arXiv Detail & Related papers (2025-07-30T16:41:21Z)
Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction [13.728393452963942]
multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities.<n>This paper proposes ChartIR, an iterative refinement method based on structured instruction.<n> Experimental results show that, compared to other method, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.
arXiv Detail & Related papers (2025-06-15T14:10:16Z)
LayoutCoT: Unleashing the Deep Reasoning Potential of Large Language Models for Layout Generation [3.1627400208503653]
Conditional layout generation aims to automatically generate visually appealing and semantically coherent layouts from user-defined constraints.<n>We propose a novel approach that leverages the reasoning capabilities of Large Language Models (LLMs) through a combination of Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) techniques.<n>We conduct extensive experiments on five public datasets spanning three conditional layout generation tasks.
arXiv Detail & Related papers (2025-04-15T03:12:01Z)
Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding [14.75820681491341]
Existing benchmarks reveal reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning. We propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics representations. Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance.
arXiv Detail & Related papers (2025-04-14T00:07:39Z)
TabGLM: Tabular Graph Language Model for Learning Transferable Representations Through Multi-Modal Consistency Minimization [2.1067477213933503]
TabGLM (Tabular Graph Language Model) is a novel multi-modal architecture designed to model both structural and semantic information from a table.<n>It transforms each row of a table into a fully connected graph and serialized text, which are encoded using a graph neural network (GNN) and a text encoder, respectively.<n> Evaluations across 25 benchmark datasets demonstrate substantial performance gains.
arXiv Detail & Related papers (2025-02-26T05:32:45Z)
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling [100.33658998796064]
We build a vision-language model (VLM) based multi-agent framework for effective automatic chart generation.<n>We propose METAL, a multi-agent framework that decomposes the task of chart generation into the iterative collaboration among specialized agents.
arXiv Detail & Related papers (2025-02-24T21:01:39Z)
ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation [90.82566869965011]
textbfChartCoder is the first dedicated chart-to-code MLLM.<n>We introduce textbfChart2Code-160k, the first large-scale and diverse dataset for chart-to-code generation.<n> Experiments demonstrate that ChartCoder, with only 7B parameters, surpasses existing open-source MLLMs on chart-to-code benchmarks.
arXiv Detail & Related papers (2025-01-11T17:52:22Z)
Multimodal Graph Constrastive Learning and Prompt for ChartQA [11.828192162922436]
ChartQA presents significant challenges due to the complex distribution of chart elements and the implicit patterns embedded within the underlying data.<n>We have developed a joint multimodal scene graph for charts, explicitly representing the relationships between chart elements and their associated patterns.
arXiv Detail & Related papers (2025-01-08T06:27:07Z)
Improving Autoregressive Visual Generation with Cluster-Oriented Token Prediction [52.09472099976885]
IAR is an Improved AutoRegressive Visual Generation Method that enhances the training efficiency and generation quality of LLM-based visual generation models.<n>Our method consistently enhances the model training efficiency and performance from 100M to 1.4B, reducing the training time by half while achieving the same FID.
arXiv Detail & Related papers (2025-01-01T15:58:51Z)
ChartAdapter: Large Vision-Language Model for Chart Summarization [13.499376163294816]
ChartAdapter is a lightweight transformer module designed to bridge the gap between charts and textual summaries.<n>By integrating ChartAdapter with an LLM, we enable end-to-end training and efficient chart summarization.
arXiv Detail & Related papers (2024-12-30T05:07:34Z)
On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts. We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning [83.58521787193293]
We present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) reduce the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, and (2) reduce lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module.
arXiv Detail & Related papers (2024-04-25T14:23:24Z)
ChartLlama: A Multimodal LLM for Chart Understanding and Generation [70.1393163657813]
We create a high-quality instruction-tuning dataset leveraging GPT-4. Next, we introduce ChartLlama, a multi-modal large language model that we've trained using our created dataset.
arXiv Detail & Related papers (2023-11-27T15:20:23Z)
DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning [62.51232333352754]
Text-to-image (T2I) generation has seen significant growth over the past few years. Despite this, there has been little work on generating diagrams with T2I models. We present DiagrammerGPT, a novel two-stage text-to-diagram generation framework. We show that our framework produces more accurate diagrams, outperforming existing T2I models.
arXiv Detail & Related papers (2023-10-18T17:37:10Z)
Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs. We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets. We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.