Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text
- URL: http://arxiv.org/abs/2507.19969v1
- Date: Sat, 26 Jul 2025 14:59:04 GMT
- Title: Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text
- Authors: Mizanur Rahman, Md Tahmid Rahman Laskar, Shafiq Joty, Enamul Hoque
- Abstract summary: We introduce Text2Vis, a benchmark designed to assess text-to-visualization models. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. Benchmarking 11 open-source and closed-source models reveals significant performance gaps, highlighting key challenges and offering insights for future advancements.
- Score: 30.74255946385862
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automated data visualization plays a crucial role in simplifying data interpretation, enhancing decision-making, and improving efficiency. While large language models (LLMs) have shown promise in generating visualizations from natural language, the absence of comprehensive benchmarks limits the rigorous evaluation of their capabilities. We introduce Text2Vis, a benchmark designed to assess text-to-visualization models, covering 20+ chart types and diverse data science queries, including trend analysis, correlation, outlier detection, and predictive analytics. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. The queries involve complex reasoning, conversational turns, and dynamic data retrieval. We benchmark 11 open-source and closed-source models, revealing significant performance gaps, highlighting key challenges, and offering insights for future advancements. To close this gap, we propose the first cross-modal actor-critic agentic framework that jointly refines the textual answer and visualization code, increasing GPT-4o's pass rate from 26% to 42% over the direct approach and improving chart quality. We also introduce an automated LLM-based evaluation framework that enables scalable assessment across thousands of samples without human annotation, measuring answer correctness, code execution success, visualization readability, and chart accuracy. We release Text2Vis at https://github.com/vis-nlp/Text2Vis.
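The cross-modal actor-critic loop described in the abstract lends itself to a compact illustration. The sketch below is a minimal, hypothetical rendering of that idea, not the released Text2Vis code: the sample fields, the `ask_llm` helper, and the prompts are assumptions made for illustration. An actor LLM drafts a short answer plus plotting code from the table and query, the generated code is executed, and a critic LLM reviews both the text and the chart code before the draft is revised.

```python
# Minimal sketch of a cross-modal actor-critic refinement loop (illustrative;
# the sample schema and ask_llm() are assumptions, not the Text2Vis release).
import json
import subprocess
import sys
import tempfile

sample = {
    # A Text2Vis-style sample pairs a data table with a natural-language query.
    "table": {"year": [2020, 2021, 2022], "revenue": [1.2, 1.8, 2.6]},
    "query": "How did revenue change over time, and what was it in 2022?",
}

def ask_llm(prompt: str) -> str:
    """Placeholder for a chat-completion call (e.g. GPT-4o); returns raw text."""
    raise NotImplementedError("wire this to an LLM client")

def run_chart_code(code: str) -> tuple[bool, str]:
    """Execute generated plotting code in a subprocess; report success and stderr."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
    proc = subprocess.run([sys.executable, f.name],
                          capture_output=True, text=True, timeout=60)
    return proc.returncode == 0, proc.stderr

def actor_critic_refine(sample: dict, rounds: int = 3) -> dict:
    # Actor: draft a short textual answer and matplotlib code for the chart.
    draft = json.loads(ask_llm(
        'Reply as JSON {"answer": "...", "code": "..."} with a short answer and '
        f"matplotlib code.\nTable: {sample['table']}\nQuestion: {sample['query']}"))
    for _ in range(rounds):
        ok, err = run_chart_code(draft["code"])
        # Critic: judge the answer and the chart code jointly (cross-modal check).
        verdict = ask_llm(
            "Critique the answer and chart code for correctness and readability. "
            "Reply ACCEPT if both are fine, otherwise give feedback.\n"
            f"{json.dumps(draft)}\nExecution error: {err or 'none'}")
        if ok and verdict.strip().startswith("ACCEPT"):
            break  # both modalities accepted
        # Actor: revise the answer and code using the critic's feedback.
        draft = json.loads(ask_llm(
            'Revise. Reply as JSON {"answer": "...", "code": "..."}.\n'
            f"Previous: {json.dumps(draft)}\nFeedback: {verdict}"))
    return draft
```

In the paper's setup, an automated LLM-based judge then scores answer correctness, code execution success, visualization readability, and chart accuracy at scale; the loop above only illustrates the refinement half.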
Related papers
- PolyChartQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering [69.52231076699756]
PolyChartQA is the first large-scale multilingual chart question answering benchmark covering 22,606 charts and 26,151 question-answering pairs across 10 diverse languages.
We leverage state-of-the-art LLM-based translation and enforce rigorous quality control in the pipeline to ensure the linguistic and semantic consistency of the generated multilingual charts.
arXiv Detail & Related papers (2025-07-16T06:09:02Z)
- Text2Insight: Transform natural language text into insights seamlessly using multi-model architecture [0.0]
Text2Insight is an innovative solution that delivers customized data analysis and visualizations based on user-defined natural language requirements.
To enhance analysis capabilities, the system integrates a question-answering model and a predictive model using the BERT framework.
Performance evaluation of Text2Insight demonstrates its effectiveness, achieving high accuracy (99%), precision (100%), recall (99%), and F1-score (99%), with a BLEU score of 0.5.
arXiv Detail & Related papers (2024-12-27T16:17:22Z)
- Distill Visual Chart Reasoning Ability from LLMs to MLLMs [38.62832112530892]
Solving complex chart Q&A tasks requires advanced visual reasoning abilities in multimodal large language models (MLLMs).
We propose Code-as-Intermediary Translation (CIT), a cost-effective, efficient and easily scalable data synthesis method for distilling visual reasoning abilities from LLMs to MLLMs.
We employ text-based synthesizing techniques to construct chart-plotting code and produce ReachQA, a dataset containing 3k reasoning-intensive charts and 20k Q&A pairs.
arXiv Detail & Related papers (2024-10-24T14:50:42Z)
- ChartifyText: Automated Chart Generation from Data-Involved Texts via LLM [16.87320295911898]
Text documents involving numerical values are widely used in applications such as scientific research, economics, public health, and journalism.
This work aims to automatically generate charts that accurately convey the underlying data and ideas to readers.
We propose ChartifyText, a novel fully-automated approach that leverages Large Language Models (LLMs) to convert complex data-involved texts to expressive charts.
arXiv Detail & Related papers (2024-10-18T09:43:30Z)
- PUB: Plot Understanding Benchmark and Dataset for Evaluating Large Language Models on Synthetic Visual Data Interpretation [2.1184929769291294]
This paper presents a novel synthetic dataset designed to evaluate the proficiency of large language models in interpreting data visualizations.
Our dataset is generated using controlled parameters to ensure comprehensive coverage of potential real-world scenarios.
We employ multimodal text prompts with questions related to visual data in images to benchmark several state-of-the-art models.
arXiv Detail & Related papers (2024-09-04T11:19:17Z)
- VisEval: A Benchmark for Data Visualization in the Era of Large Language Models [12.077276008688065]
Recent advancements in pre-trained large language models (LLMs) are opening new avenues for generating visualizations from natural language.
In this paper, we propose a new NL2VIS benchmark called VisEval.
This dataset includes 2,524 representative queries covering 146 databases, paired with accurately labeled ground truths.
arXiv Detail & Related papers (2024-07-01T05:35:30Z)
- CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs [62.84082370758761]
CharXiv is a comprehensive evaluation suite involving 2,323 charts from arXiv papers.
To ensure quality, all charts and questions are handpicked, curated, and verified by human experts.
Results reveal a substantial, previously underestimated gap between the reasoning skills of the strongest proprietary model and the strongest open-source model.
arXiv Detail & Related papers (2024-06-26T17:50:11Z)
- TextSquare: Scaling up Text-Centric Visual Instruction Tuning [62.878378882175284]
We introduce a new approach for creating a massive, high-quality instruction-tuning dataset, Square-10M.
Our model, TextSquare, considerably surpasses the previous open-source state-of-the-art text-centric MLLMs.
It even outperforms top-tier models like GPT-4V and Gemini in 6 of 10 text-centric benchmarks.
arXiv Detail & Related papers (2024-04-19T11:38:08Z)
- ConTextual: Evaluating Context-Sensitive Text-Rich Visual Reasoning in Large Multimodal Models [92.60282074937305]
We introduce ConTextual, a novel dataset featuring human-crafted instructions that require context-sensitive reasoning for text-rich images.
We conduct experiments to assess the performance of 14 foundation models and establish a human performance baseline.
We observe a significant performance gap of 30.8% between GPT-4V and human performance.
arXiv Detail & Related papers (2024-01-24T09:07:11Z)
- Contrastive Transformer Learning with Proximity Data Generation for Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to the significant modality gap, fine-grained differences, and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)