ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension
- URL: http://arxiv.org/abs/2511.02415v1
- Date: Tue, 04 Nov 2025 09:45:34 GMT
- Title: ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension
- Authors: Duo Xu, Hao Cheng, Xin Lin, Zhen Xie, Hao Wang
- Abstract summary: This study proposes an automated multi-stage code-driven pipeline for generating visual reasoning datasets. We construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples.
- Score: 15.798942458550515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Complex chart understanding tasks demand advanced visual recognition and reasoning capabilities from multimodal large language models (MLLMs). However, current research provides limited coverage of complex chart scenarios and computation-intensive reasoning tasks prevalent in real-world applications. This study proposes an automated multi-stage code-driven pipeline for systematically generating visual reasoning datasets to address these limitations. The pipeline integrates retrieval-augmented generation (RAG) to retrieve professional chart templates and employs chain-of-thought (CoT) strategies to generate reasoning codes that simulate real data distributions, thereby driving chart rendering and question-related statistical computations. Through model-based evaluation, the pipeline enhances chart diversity and data quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and multi-step dataset containing 38K charts and 142K Q&A pairs for training, along with 2,871 high-quality evaluation samples to enable practical performance assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL) experiments demonstrate that our dataset significantly improves reasoning capabilities and cross-domain generalization performance, enabling smaller models to achieve performance comparable to larger-scale models in complex chart comprehension.
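The pipeline described above can be sketched in miniature. This is an illustrative toy, not the paper's released code: all names (`make_sample`, the bar-chart "template", the revenue question) are hypothetical stand-ins showing the key idea that generated code simulates a data distribution, renders the chart, and computes the ground-truth answer programmatically, so each Q&A pair is exact by construction.

```python
# Hypothetical sketch of one code-driven chart-QA generation step.
# Not the authors' implementation; function and variable names are invented.
import random

try:  # rendering is optional so the data/answer logic also runs headlessly
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, no display needed
    import matplotlib.pyplot as plt
    HAVE_MPL = True
except ImportError:
    HAVE_MPL = False

def make_sample(seed: int, out_path: str = "chart.png") -> dict:
    rng = random.Random(seed)
    # Simulate a plausible data distribution for a (hypothetical) bar-chart template.
    categories = ["Q1", "Q2", "Q3", "Q4"]
    values = [round(rng.uniform(10, 100), 1) for _ in categories]

    image = None
    if HAVE_MPL:
        # Stands in for rendering a retrieved professional chart template.
        fig, ax = plt.subplots()
        ax.bar(categories, values)
        ax.set_title("Quarterly revenue (synthetic)")
        fig.savefig(out_path)
        plt.close(fig)
        image = out_path

    # A multi-step, computation-intensive question whose answer is derived
    # by code rather than annotated by hand, so it is verifiably correct.
    question = ("Which quarter has the highest revenue, and by how much "
                "does it exceed the mean of all quarters?")
    best = max(range(len(values)), key=values.__getitem__)
    mean = sum(values) / len(values)
    answer = f"{categories[best]}, exceeding the mean by {values[best] - mean:.1f}"
    return {"image": image, "question": question, "answer": answer, "values": values}

sample = make_sample(seed=7)
print(sample["question"])
print(sample["answer"])
```

Because the answer is computed from the same simulated data that drives rendering, scaling this loop over many templates and seeds yields Q&A pairs whose ground truth never drifts from the chart, which is the property the paper's pipeline relies on.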
Related papers
- ShowTable: Unlocking Creative Table Visualization with Collaborative Reflection and Refinement [58.957050610762565]
ShowTable is a pipeline that synergizes MLLMs with diffusion models via a progressive self-correcting process. The MLLM acts as the central orchestrator, reasoning over the visual plan and judging visual errors. We introduce TableVisBench, a new benchmark with 800 challenging instances across 5 evaluation dimensions.
arXiv Detail & Related papers (2025-12-15T13:21:50Z) - Chart2Code-MoLA: Efficient Multi-Modal Code Generation via Adaptive Expert Routing [20.521717930460692]
C2C-MoLA is a framework that synergizes Mixture of Experts (MoE) with Low-Rank Adaptation (LoRA). LoRA enables parameter-efficient updates for resource-conscious tuning. Experiments on Chart2Code-160k show that the proposed model improves generation accuracy by up to 17%.
arXiv Detail & Related papers (2025-11-28T16:23:04Z) - Co-Training Vision Language Models for Remote Sensing Multi-task Learning [68.15604397741753]
Vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. We present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. We propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery.
arXiv Detail & Related papers (2025-11-26T10:55:07Z) - Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph [42.247964605609745]
Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and model assignments, and edges capture information flow. We propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update.
arXiv Detail & Related papers (2025-10-29T22:14:25Z) - PlotCraft: Pushing the Limits of LLMs for Complex and Interactive Data Visualization [82.96200364977737]
We introduce PlotCraft, a new benchmark featuring 1k challenging visualization tasks. PlotCraft is structured around seven high-level visualization tasks and encompasses 48 distinct chart types. It is the first to systematically evaluate both single-turn generation and multi-turn refinement across a diverse spectrum of task complexities.
arXiv Detail & Related papers (2025-10-15T10:14:39Z) - Jupiter: Enhancing LLM Data Analysis Capabilities via Notebook and Inference-Time Value-Guided Search [37.53003959273494]
We propose a scalable pipeline that extracts high-quality, tool-based data analysis tasks and their executable multi-step solutions from real-world Jupyter notebooks. Using this pipeline, we introduce NbQA, a large-scale dataset of standardized task-solution pairs. We also present Jupiter, a framework that formulates data analysis as a search problem and applies Monte Carlo Tree Search.
arXiv Detail & Related papers (2025-09-11T08:27:54Z) - Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation [12.822184232115333]
We propose Multimodal Structured Reinforcement Learning (MSRL) for chart-to-code generation. We construct the largest training corpus to date, containing 3 million chart-code pairs from real-world arXiv tables. MSRL significantly breaks the SFT plateau, improving high-level metrics by 6.2% and 9.9% on the ChartMimic and ReachQA benchmarks respectively.
arXiv Detail & Related papers (2025-08-19T07:40:18Z) - BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning [51.472854950300416]
We propose BigCharts, a dataset creation pipeline that generates visually diverse chart images. Unlike purely synthetic datasets, BigCharts incorporates real-world data, ensuring authenticity and visual diversity. By introducing novel reward signals specifically designed for chart reasoning, our approach enhances model robustness and generalization.
arXiv Detail & Related papers (2025-08-13T13:39:17Z) - ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering [12.285453136336507]
We propose a code-driven framework designed to enable precise, interpretable reasoning over charts. We first train a high-fidelity model to convert diverse chart images into structured ECharts code. Then, we design a general chart reasoning data synthesis pipeline. Finally, we train the final multimodal model using a combination of supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-06-11T18:55:36Z) - ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning [28.204261069650897]
We introduce ChartInstruct: a novel chart-specific vision-language instruction-following dataset comprising 191K instructions generated with 71K charts.
In experiments on four downstream tasks, we first show the effectiveness of our model, achieving a new set of state-of-the-art results.
arXiv Detail & Related papers (2024-03-14T01:40:23Z) - MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning [48.63002688222462]
A gap remains in the domain of chart image understanding due to the distinct abstract components in charts.
We introduce a large-scale MultiModal Chart Instruction dataset comprising 600k instances supporting diverse tasks and chart types.
We develop MultiModal Chart Assistant (MMC-A), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks.
arXiv Detail & Related papers (2023-11-15T23:36:42Z) - StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z) - Diffusion Model is an Effective Planner and Data Synthesizer for Multi-Task Reinforcement Learning [101.66860222415512]
Multi-Task Diffusion Model (MTDiff) is a diffusion-based method that incorporates Transformer backbones and prompt learning for generative planning and data synthesis.
For generative planning, we find MTDiff outperforms state-of-the-art algorithms across 50 tasks on Meta-World and 8 maps on Maze2D.
arXiv Detail & Related papers (2023-05-29T05:20:38Z)