Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction
- URL: http://arxiv.org/abs/2506.14837v1
- Date: Sun, 15 Jun 2025 14:10:16 GMT
- Title: Improved Iterative Refinement for Chart-to-Code Generation via Structured Instruction
- Authors: Chengzhi Xu, Yuyang Wang, Lai Wei, Lichao Sun, Weiran Huang
- Abstract summary: Multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. This paper proposes ChartIR, an iterative refinement method based on structured instruction. Experimental results show that, compared to other methods, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.
- Score: 13.728393452963942
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, multimodal large language models (MLLMs) have attracted increasing research attention due to their powerful visual understanding capabilities. While they have achieved impressive results on various vision tasks, their performance on chart-to-code generation remains suboptimal. This task requires MLLMs to generate executable code that can reproduce a given chart, demanding not only precise visual understanding but also accurate translation of visual elements into structured code. Directly prompting MLLMs to perform this complex task often yields unsatisfactory results. To address this challenge, we propose ChartIR, an iterative refinement method based on structured instruction. First, we distinguish two tasks: visual understanding and code translation. To accomplish the visual understanding component, we design two types of structured instructions: description and difference. The description instruction captures the visual elements of the reference chart, while the difference instruction characterizes the discrepancies between the reference chart and the generated chart. These instructions effectively transform visual features into language representations, thereby facilitating the subsequent code translation process. Second, we decompose the overall chart generation pipeline into two stages: initial code generation and iterative refinement, enabling progressive enhancement of the final output. Experimental results show that, compared to other methods, our method achieves superior performance on both the open-source model Qwen2-VL and the closed-source model GPT-4o.
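The abstract outlines a two-stage pipeline: description-guided initial code generation, followed by difference-guided iterative refinement. Below is a minimal sketch of such a loop, assuming a generic multimodal model hook; the `call_mllm` and `render` helpers and the prompt wordings are illustrative placeholders, not the paper's actual implementation or prompts.

```python
import subprocess
import tempfile
from pathlib import Path
from typing import Callable

# Hypothetical model hook: (list of image paths, text instruction) -> model output text.
MLLM = Callable[[list[Path], str], str]


def render(code: str, out_png: Path) -> bool:
    """Run candidate matplotlib code with a non-interactive backend; True if a figure was saved."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "chart.py"
        script.write_text(
            "import matplotlib\nmatplotlib.use('Agg')\n"
            + code
            + f"\nimport matplotlib.pyplot as plt\nplt.savefig(r'{out_png}')\n"
        )
        proc = subprocess.run(["python", str(script)], capture_output=True)
    return proc.returncode == 0 and out_png.exists()


def chart_to_code(reference_png: Path, call_mllm: MLLM, rounds: int = 3) -> str:
    # Stage 1: a "description" instruction captures the chart's visual elements,
    # which then conditions the initial code generation.
    description = call_mllm(
        [reference_png],
        "Describe this chart's type, axes, data series, colors, legend, and text labels.",
    )
    code = call_mllm(
        [reference_png],
        f"Chart description:\n{description}\n\nWrite matplotlib code that reproduces this chart.",
    )
    # Stage 2: iterative refinement guided by a "difference" instruction that
    # compares the reference chart with the chart rendered from the current code.
    for i in range(rounds):
        candidate = Path(f"candidate_{i}.png")
        if not render(code, candidate):
            # Non-executable code: ask the model to repair it before comparing charts.
            code = call_mllm(
                [reference_png],
                f"The following code failed to run:\n{code}\nFix it so it reproduces the chart.",
            )
            continue
        diff = call_mllm(
            [reference_png, candidate],
            "List the visual differences between the reference chart (first image) "
            "and the generated chart (second image).",
        )
        code = call_mllm(
            [reference_png, candidate],
            f"Current code:\n{code}\n\nDifferences to fix:\n{diff}\n\n"
            "Revise the code to remove these differences.",
        )
    return code
```

In this sketch the refinement step always runs for a fixed number of rounds; a practical variant might stop early once the difference instruction reports no remaining discrepancies.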
Related papers
- Enhancing Chart-to-Code Generation in Multimodal Large Language Models via Iterative Dual Preference Learning [16.22363384653305]
We introduce Chart2Code, a novel iterative dual preference learning framework for chart-to-code generation. We find that Chart2Code consistently improves out-of-distribution chart-to-code generation quality. Our framework paves the way for future advancements in chart comprehension.
arXiv Detail & Related papers (2025-04-03T07:51:20Z)
- ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation [62.88742217569754]
ChartCoder is the first dedicated chart-to-code MLLM. We introduce Chart2Code-160k, the first large-scale and diverse dataset for chart-to-code generation. Experiments demonstrate that ChartCoder, with only 7B parameters, surpasses existing open-source MLLMs on chart-to-code benchmarks.
arXiv Detail & Related papers (2025-01-11T17:52:22Z)
- MetaMorph: Multimodal Understanding and Generation via Instruction Tuning [57.35160715164359]
Visual-Predictive Instruction Tuning (VPiT) is a simple and effective extension to visual instruction tuning. VPiT teaches an LLM to predict discrete text tokens and continuous visual tokens from any input sequence of image and text data. We train our MetaMorph model and achieve competitive performance on both visual understanding and generation.
arXiv Detail & Related papers (2024-12-18T18:58:50Z)
- Distill Visual Chart Reasoning Ability from LLMs to MLLMs [38.62832112530892]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs).
We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs.
We employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs.
arXiv Detail & Related papers (2024-10-24T14:50:42Z)
- On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts. We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning [28.204261069650897]
We introduce ChartInstruct: a novel chart-specific vision-language instruction-following dataset comprising 191K instructions generated with 71K charts.
In experiments on four downstream tasks, we show the effectiveness of our model, achieving a new set of state-of-the-art results.
arXiv Detail & Related papers (2024-03-14T01:40:23Z)
- ChartLlama: A Multimodal LLM for Chart Understanding and Generation [70.1393163657813]
We create a high-quality instruction-tuning dataset leveraging GPT-4.
Next, we introduce ChartLlama, a multimodal large language model trained on this dataset.
arXiv Detail & Related papers (2023-11-27T15:20:23Z)
- Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs [49.88461345825586]
This paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs.
We present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets.
We show that our model exhibits a 5.2% accuracy improvement over Qwen-VL and surpasses the accuracy of Kosmos-2 by 24.7%.
arXiv Detail & Related papers (2023-10-01T05:53:15Z)
- Fine-tuning Multimodal LLMs to Follow Zero-shot Demonstrative Instructions [126.3136109870403]
We introduce a generic and lightweight Visual Prompt Generator Complete module (VPG-C).
VPG-C infers and completes the missing details essential for comprehending demonstrative instructions.
We build DEMON, a comprehensive benchmark for demonstrative instruction understanding.
arXiv Detail & Related papers (2023-08-08T09:32:43Z)