METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling
- URL: http://arxiv.org/abs/2502.17651v3
- Date: Thu, 06 Mar 2025 00:45:00 GMT
- Title: METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling
- Authors: Bingxuan Li, Yiwei Wang, Jiuxiang Gu, Kai-Wei Chang, Nanyun Peng
- Abstract summary: We build a vision-language model (VLM) based multi-agent framework for effective automatic chart generation. We propose METAL, a multi-agent framework that decomposes the task of chart generation into the iterative collaboration among specialized agents.
- Score: 100.33658998796064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chart generation aims to generate code to produce charts satisfying the desired visual properties, e.g., texts, layout, color, and type. It has great potential to empower automatic professional report generation in financial analysis, research presentation, education, and healthcare. In this work, we build a vision-language model (VLM) based multi-agent framework for effective automatic chart generation. Generating high-quality charts requires both strong visual design skills and precise coding capabilities that embed the desired visual properties into code. Such a complex multi-modal reasoning process is difficult for direct prompting of VLMs. To resolve these challenges, we propose METAL, a multi-agent framework that decomposes the task of chart generation into iterative collaboration among specialized agents. METAL achieves a 5.2% improvement over the current best result in the chart generation task. The METAL framework exhibits the phenomenon of test-time scaling: its performance increases monotonically as the logarithmic computational budget grows from 512 to 8192 tokens. In addition, we find that separating different modalities during the critique process of METAL boosts the self-correction capability of VLMs in the multimodal context.
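To make the described decomposition concrete, the sketch below outlines one way such an iterative agent loop could be wired up in Python: a generation step drafts plotting code, a visual critique and a code critique run separately (one per modality), and a revision step folds their feedback back into the code. The `vlm_call` helper, the prompts, and the stopping rule are illustrative assumptions, not METAL's published implementation.

```python
# A minimal sketch (not METAL's actual implementation) of an iterative,
# modality-separated multi-agent chart-generation loop. `vlm_call`, the
# prompts, and the stopping rule are illustrative placeholders.
from dataclasses import dataclass


def vlm_call(prompt: str, images: list[str] | None = None) -> str:
    """Placeholder for a call to any vision-language model chat API."""
    raise NotImplementedError("plug in your VLM client here")


def render_chart(code: str, out_path: str = "chart.png") -> str:
    """Execute matplotlib plotting code in a scratch namespace.

    Assumes the generated code saves its figure to `out_path`.
    """
    exec(code, {})  # sketch only: sandbox properly in real use
    return out_path


@dataclass
class ChartTask:
    instruction: str       # desired visual properties: text, layout, color, type
    reference_image: str   # target chart to reproduce


def generate_chart(task: ChartTask, max_rounds: int = 4) -> str:
    # Generation agent: draft plotting code from the instruction.
    code = vlm_call(f"Write matplotlib code for: {task.instruction}")
    for _ in range(max_rounds):
        rendered = render_chart(code)
        # Critiques are kept modality-separate, mirroring the abstract's finding
        # that this improves self-correction in the multimodal setting.
        visual_critique = vlm_call(
            "Compare the rendered chart with the reference and list mismatches "
            "in text, layout, color, and chart type.",
            images=[rendered, task.reference_image],
        )
        code_critique = vlm_call(
            "Review this plotting code for bugs and deviations from the "
            f"instruction '{task.instruction}':\n{code}"
        )
        if "no issues" in (visual_critique + code_critique).lower():
            break
        # Revision agent: fold both critiques back into the code.
        code = vlm_call(
            "Revise the code.\n"
            f"Visual critique: {visual_critique}\n"
            f"Code critique: {code_critique}\n"
            f"Current code:\n{code}"
        )
    return code
```

In this framing, the number of critique/revision rounds (and the tokens they consume) plays the role of the test-time compute budget whose scaling the abstract reports from 512 to 8192 tokens.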
Related papers
- Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding [14.75820681491341]
Existing benchmarks reveal reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning.
We propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics representations.
Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance.
arXiv Detail & Related papers (2025-04-14T00:07:39Z)
- Enhancing Chart-to-Code Generation in Multimodal Large Language Models via Iterative Dual Preference Learning [16.22363384653305]
We introduce Chart2Code, a novel iterative dual preference learning framework for chart-to-code generation.
We find that Chart2Code consistently improves out-of-distribution chart-to-code generation quality.
Our framework paves the way for future advancements in chart comprehension.
arXiv Detail & Related papers (2025-04-03T07:51:20Z)
- Towards Understanding Graphical Perception in Large Multimodal Models [80.44471730672801]
We leverage the theory of graphical perception to develop an evaluation framework for analyzing gaps in LMMs' perception abilities in charts.
We apply our framework to evaluate and diagnose the perception capabilities of state-of-the-art LMMs at three levels (chart, visual element, and pixel).
arXiv Detail & Related papers (2025-03-13T20:13:39Z)
- Dual-level Mixup for Graph Few-shot Learning with Fewer Tasks [23.07584018576066]
We propose a SiMple yet effectIve approach for graph few-shot Learning with fEwer tasks, named SMILE.
We introduce a dual-level mixup strategy, encompassing both within-task and across-task mixup, to simultaneously enrich the available nodes and tasks in meta-learning.
Empirically, SMILE consistently outperforms other competitive models by a large margin across all evaluated datasets under both in-domain and cross-domain settings.
arXiv Detail & Related papers (2025-02-19T23:59:05Z)
- PlotGen: Multi-Agent LLM-based Scientific Data Visualization via Multimodal Feedback [47.79080056618323]
We propose PlotGen, a novel multi-agent framework aimed at the creation of precise scientific visualizations.
PlotGen orchestrates multiple agents, including a Query Planning Agent that breaks down complex user requests into executable code, and three retrieval feedback agents.
Experiments show that PlotGen outperforms strong baselines, achieving a 4-6 percent improvement on the MatPlotBench dataset.
arXiv Detail & Related papers (2025-02-03T02:00:29Z)
- Multimodal Graph Constrastive Learning and Prompt for ChartQA [11.828192162922436]
ChartQA presents significant challenges due to the complex distribution of chart elements and the implicit patterns embedded within the underlying data.
We have developed a joint multimodal scene graph for charts, explicitly representing the relationships between chart elements and their associated patterns.
arXiv Detail & Related papers (2025-01-08T06:27:07Z)
- Distill Visual Chart Reasoning Ability from LLMs to MLLMs [38.62832112530892]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs).
We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs.
We employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs.
arXiv Detail & Related papers (2024-10-24T14:50:42Z)
- On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts.
We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
- MuseGraph: Graph-oriented Instruction Tuning of Large Language Models for Generic Graph Mining [41.19687587548107]
Graph Neural Networks (GNNs) need to be re-trained each time they are applied to different graph tasks and datasets.
We propose a novel framework, MuseGraph, which seamlessly integrates the strengths of GNNs and Large Language Models (LLMs).
Our experimental results demonstrate significant improvements in different graph tasks.
arXiv Detail & Related papers (2024-03-02T09:27:32Z)
- ChartLlama: A Multimodal LLM for Chart Understanding and Generation [70.1393163657813]
We create a high-quality instruction-tuning dataset leveraging GPT-4.
Next, we introduce ChartLlama, a multi-modal large language model trained on this dataset.
arXiv Detail & Related papers (2023-11-27T15:20:23Z)
- MMC: Advancing Multimodal Chart Understanding with Large-scale Instruction Tuning [48.63002688222462]
A gap remains in the domain of chart image understanding due to the distinct abstract components in charts.
We introduce a large-scale MultiModal Chart Instruction dataset comprising 600k instances supporting diverse tasks and chart types.
We develop MultiModal Chart Assistant (MMC-A), an LMM that achieves state-of-the-art performance on existing chart QA benchmarks.
arXiv Detail & Related papers (2023-11-15T23:36:42Z)