Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
- URL: http://arxiv.org/abs/2405.07990v1
- Date: Mon, 13 May 2024 17:59:22 GMT
- Title: Plot2Code: A Comprehensive Benchmark for Evaluating Multi-modal Large Language Models in Code Generation from Scientific Plots
- Authors: Chengyue Wu, Yixiao Ge, Qiushan Guo, Jiahao Wang, Zhixuan Liang, Zeyu Lu, Ying Shan, Ping Luo,
- Abstract summary: We introduce Plot2Code, a comprehensive visual coding benchmark for Multi-modal Large Language Models.
We collect 132 manually selected high-quality matplotlib plots across six plot types from publicly available matplotlib galleries.
For each plot, we carefully offer its source code, and an descriptive instruction summarized by GPT-4.
- Score: 66.95139377783966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable progress of Multi-modal Large Language Models (MLLMs) has attracted significant attention due to their superior performance in visual contexts. However, their capabilities in turning visual figure to executable code, have not been evaluated thoroughly. To address this, we introduce Plot2Code, a comprehensive visual coding benchmark designed for a fair and in-depth assessment of MLLMs. We carefully collect 132 manually selected high-quality matplotlib plots across six plot types from publicly available matplotlib galleries. For each plot, we carefully offer its source code, and an descriptive instruction summarized by GPT-4. This approach enables Plot2Code to extensively evaluate MLLMs' code capabilities across various input modalities. Furthermore, we propose three automatic evaluation metrics, including code pass rate, text-match ratio, and GPT-4V overall rating, for a fine-grained assessment of the output code and rendered images. Instead of simply judging pass or fail, we employ GPT-4V to make an overall judgement between the generated and reference images, which has been shown to be consistent with human evaluation. The evaluation results, which include analyses of 14 MLLMs such as the proprietary GPT-4V, Gemini-Pro, and the open-sourced Mini-Gemini, highlight the substantial challenges presented by Plot2Code. With Plot2Code, we reveal that most existing MLLMs struggle with visual coding for text-dense plots, heavily relying on textual instruction. We hope that the evaluation results from Plot2Code on visual coding will guide the future development of MLLMs. All data involved with Plot2Code are available at https://huggingface.co/datasets/TencentARC/Plot2Code.
Related papers
- PlotGen: Multi-Agent LLM-based Scientific Data Visualization via Multimodal Feedback [47.79080056618323]
We propose PlotGen, a novel multi-agent framework aimed at the creation of precise scientific visualizations.
PlotGen orchestrates multiple.
retrieval agents, including a Query Planning Agent that breaks.
down complex user requests into executable code, and three.
retrieval feedback agents.
Experiments show that PlotGen outperforms strong baselines, achieving a 4-6 percent improvement on the MatBench dataset.
arXiv Detail & Related papers (2025-02-03T02:00:29Z) - ChartCoder: Advancing Multimodal Large Language Model for Chart-to-Code Generation [90.82566869965011]
textbfChartCoder is the first dedicated chart-to-code MLLM.
We introduce textbfChart2Code-160k, the first large-scale and diverse dataset for chart-to-code generation.
Experiments demonstrate that ChartCoder, with only 7B parameters, surpasses existing open-source MLLMs on chart-to-code benchmarks.
arXiv Detail & Related papers (2025-01-11T17:52:22Z) - Drawing Pandas: A Benchmark for LLMs in Generating Plotting Code [1.5999407512883512]
This paper introduces the human-curated PandasPlotBench dataset.
It is designed to evaluate language models' effectiveness as assistants in visual data exploration.
arXiv Detail & Related papers (2024-12-03T19:05:37Z) - Distill Visual Chart Reasoning Ability from LLMs to MLLMs [38.62832112530892]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs)
We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs.
We employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs.
arXiv Detail & Related papers (2024-10-24T14:50:42Z) - SEED-Bench-2-Plus: Benchmarking Multimodal Large Language Models with Text-Rich Visual Comprehension [62.40482764691584]
We introduce SEED-Bench-2-Plus, a benchmark specifically designed for evaluating textbftext-rich visual comprehension of MLLMs.
Our benchmark comprises 2.3K multiple-choice questions with precise human annotations, spanning three broad categories: Charts, Maps, and Webs.
We conduct a thorough evaluation involving 34 prominent MLLMs and emphasize the current limitations of MLLMs in text-rich visual comprehension.
arXiv Detail & Related papers (2024-04-25T17:39:35Z) - Q-Bench+: A Benchmark for Multi-modal Foundation Models on Low-level Vision from Single Images to Pairs [71.07108539262721]
We design benchmark settings to emulate human language responses related to low-level vision.
We extend the low-level perception-related question-answering and description evaluations of MLLMs from single images to image pairs.
We demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than humans.
arXiv Detail & Related papers (2024-02-11T06:44:11Z) - ChartBench: A Benchmark for Complex Visual Reasoning in Charts [36.492851648081405]
Multimodal Large Language Models (MLLMs) have shown impressive capabilities in image understanding and generation.
Current benchmarks fail to accurately evaluate the chart comprehension of MLLMs due to limited chart types and inappropriate metrics.
We propose ChartBench, a comprehensive benchmark designed to assess chart comprehension and data reliability through complex visual reasoning.
arXiv Detail & Related papers (2023-12-26T07:20:55Z) - SEED-Bench-2: Benchmarking Multimodal Large Language Models [67.28089415198338]
Multimodal large language models (MLLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs.
SEED-Bench-2 comprises 24K multiple-choice questions with accurate human annotations, which spans 27 dimensions.
We evaluate the performance of 23 prominent open-source MLLMs and summarize valuable observations.
arXiv Detail & Related papers (2023-11-28T05:53:55Z) - ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks.
To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.