POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering
- URL: http://arxiv.org/abs/2507.11939v1
- Date: Wed, 16 Jul 2025 06:09:02 GMT
- Title: POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering
- Authors: Yichen Xu, Liangyu Chen, Liang Zhang, Wenxuan Wang, Qin Jin
- Abstract summary: PolyChartQA is the first large-scale multilingual chart question answering benchmark covering 22,606 charts and 26,151 question-answering pairs across 10 diverse languages. We leverage state-of-the-art LLM-based translation and enforce rigorous quality control in the pipeline to ensure the linguistic and semantic consistency of the generated multilingual charts.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Charts are a universally adopted medium for interpreting and communicating data. However, existing chart understanding benchmarks are predominantly English-centric, limiting their accessibility and applicability to global audiences. In this paper, we present PolyChartQA, the first large-scale multilingual chart question answering benchmark covering 22,606 charts and 26,151 question-answering pairs across 10 diverse languages. PolyChartQA is built using a decoupled pipeline that separates chart data from rendering code, allowing multilingual charts to be flexibly generated by simply translating the data and reusing the code. We leverage state-of-the-art LLM-based translation and enforce rigorous quality control in the pipeline to ensure the linguistic and semantic consistency of the generated multilingual charts. PolyChartQA facilitates systematic evaluation of multilingual chart understanding. Experiments on both open- and closed-source large vision-language models reveal a significant performance gap between English and other languages, especially low-resource ones with non-Latin scripts. This benchmark lays a foundation for advancing globally inclusive vision-language models.
Related papers
- Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text [30.74255946385862]
We introduce Text2Vis, a benchmark designed to assess text-to-visualization models. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. It reveals significant performance gaps, highlights key challenges, and offers insights for future advancements.
arXiv Detail & Related papers (2025-07-26T14:59:04Z) - In-Depth and In-Breadth: Pre-training Multimodal Language Models Customized for Comprehensive Chart Understanding [113.17601814293722]
We introduce ChartScope, an LVLM optimized for in-depth chart comprehension across diverse chart types. We propose an efficient data generation pipeline that synthesizes paired data for a wide range of chart types. We also establish ChartDQA, a new benchmark for evaluating not only question-answering at different levels but also underlying data understanding.
arXiv Detail & Related papers (2025-07-18T18:15:09Z) - Florenz: Scaling Laws for Systematic Generalization in Vision-Language Models [17.444066202370397]
Cross-lingual transfer enables vision-language models to perform vision tasks in various languages with training data in only one language. Current approaches rely on large pre-trained multilingual language models. We propose Florenz, a monolingual encoder-decoder VLM with 0.4B to 11.2B parameters, combining the pre-trained VLM Florence-2 and the large language model Gemma-2.
arXiv Detail & Related papers (2025-03-12T14:41:10Z) - MultiChartQA: Benchmarking Vision-Language Models on Multi-Chart Problems [18.188725200923333]
Existing benchmarks for chart-related tasks fall short in capturing the complexity of real-world multi-chart scenarios. We introduce MultiChartQA, a benchmark that evaluates MLLMs' capabilities in four key areas: direct question answering, parallel question answering, comparative reasoning, and sequential reasoning. Our results highlight the challenges in multi-chart comprehension and the potential of MultiChartQA to drive advancements in this field.
arXiv Detail & Related papers (2024-10-18T05:15:50Z) - CHARTOM: A Visual Theory-of-Mind Benchmark for LLMs on Misleading Charts [26.477627174115806]
We introduce CHARTOM, a visual theory-of-mind benchmark designed to evaluate multimodal large language models' capability to understand and reason about misleading data visualizations through charts. CHARTOM consists of carefully designed charts and associated questions that require a language model not only to correctly comprehend the factual content in the chart (the FACT question) but also to judge whether the chart will be misleading to human readers (the MIND question). We detail the construction of our benchmark, including its calibration on human performance and the estimation of the MIND ground truth, called the Human Misleadingness Index.
arXiv Detail & Related papers (2024-08-26T17:04:23Z) - On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts.
We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z) - The Belebele Benchmark: a Parallel Reading Comprehension Dataset in 122 Language Variants [80.4837840962273]
We present Belebele, a dataset spanning 122 language variants.
This dataset enables the evaluation of text models in high-, medium-, and low-resource languages.
arXiv Detail & Related papers (2023-08-31T17:43:08Z) - Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations [53.89380284760555]
We introduce Babel-ImageNet, a massively multilingual benchmark that offers partial translations of ImageNet labels to 100 languages.
We evaluate 11 public multilingual CLIP models on our benchmark, demonstrating a significant gap between English ImageNet performance and that of high-resource languages.
We show that the performance of multilingual CLIP can be drastically improved for low-resource languages with parameter-efficient language-specific training.
arXiv Detail & Related papers (2023-06-14T17:53:06Z) - Chart-to-Text: A Large-Scale Benchmark for Chart Summarization [9.647079534077472]
We present Chart-to-text, a large-scale benchmark with two datasets and a total of 44,096 charts.
We explain the dataset construction process and analyze the datasets.
arXiv Detail & Related papers (2022-03-12T17:01:38Z) - Graph Neural Network Enhanced Language Models for Efficient Multilingual Text Classification [8.147244878591014]
We propose a multilingual disaster-related text classification system capable of working in monolingual, cross-lingual, and multilingual scenarios.
Our end-to-end trainable framework combines the versatility of graph neural networks by applying them over the corpus.
We evaluate our framework on a total of nine English, non-English, and monolingual datasets in monolingual, cross-lingual, and multilingual classification scenarios.
arXiv Detail & Related papers (2022-03-06T09:05:42Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
For more complex tasks, labels from the source language cannot simply be reused for the translated target-language text; to tackle this issue, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language (a minimal sketch of such a loss follows this entry).
arXiv Detail & Related papers (2020-09-10T22:42:15Z)