FuguReport

DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios

Authors: Jinxiang Meng, Shaoping Huang, Fangyu Lei, Jingyu Guo, Haoxiang Liu, Jiahao Su, Sihan Wang, Yao Wang, Enrui Wang, Ye Yang, Hongze Chai, Jinming Lv, Anbang Yu, Huangjing Zhang, Yitong Zhang, Yiming Huang, Zeyao Ma, Shizhu He, Jun Zhao, Kang Liu
Affiliations: Chinese Academy of Sciences / University of the Chinese Academy of Sciences / National University of Singapore / Renmin University of China
Categories: Evaluation / Benchmarking / Data visualization agent performance, Application / Data Visualization / Professional lifecycle tasks, Method / Agent / Cross-platform agent evaluation
License: CC BY 4.0

Abstract Overview

DV-World is a benchmark of 260 tasks designed to evaluate data-visualization agents across realistic professional workflows rather than isolated code-sandbox settings. It comprises three domains: DV-Sheet for native spreadsheet charting, diagnostic repair, and dashboard construction; DV-Evol for adapting reference visualizations to new data across five programming frameworks (Python, Apache ECharts, Vega-Lite, D3.js, Plotly.js); and DV-Inter for multi-turn clarification under ambiguous user intent using a dual-stage user simulator. The benchmark employs a hybrid evaluation framework combining Table-value Alignment for data fidelity with rubric-based MLLM-as-a-Judge for semantic and visual quality, plus an Interaction Success Rate for dialog tasks. The authors position these components as a unified suite testing native environmental grounding, cross-platform evolution, and proactive intent alignment.

Novelty

The paper introduces a benchmark that spans the full lifecycle of professional data-visualization work. It combines three elements in a single suite: native spreadsheet object-model manipulation, visualization evolution across five programming frameworks, and interactive intent clarification with a validated dual-stage user simulator. It also proposes a hybrid evaluation setup that mixes rule-based checks, table-alignment signals, rubric-guided MLLM judging, and interaction success metrics. The MLLM judging is validated against human judgments, with strong agreement for the primary judge (weighted κ = 0.821, ICC = 0.850).
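To make the agreement statistic concrete, the following is a minimal sketch of Cohen's weighted κ between two raters on an ordinal scale. It is not the authors' code: this summary does not state which weighting scheme the paper uses, so the quadratic default, the function name `weighted_kappa`, and the integer score encoding are all assumptions made for illustration.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, num_levels, power=2):
    """Cohen's weighted kappa for ordinal scores in 0..num_levels-1.

    power=1 gives linear weights, power=2 quadratic weights.
    The choice of power=2 as default is an assumption, not taken from the paper.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    if num_levels < 2:
        raise ValueError("need at least two rating levels")

    n = len(rater_a)
    joint = Counter(zip(rater_a, rater_b))  # observed (a, b) pair counts
    marg_a = Counter(rater_a)               # marginal counts for rater A
    marg_b = Counter(rater_b)               # marginal counts for rater B
    max_dist = (num_levels - 1) ** power

    observed = 0.0  # weighted observed disagreement
    expected = 0.0  # weighted disagreement expected under independence
    for i in range(num_levels):
        for j in range(num_levels):
            weight = abs(i - j) ** power / max_dist
            observed += weight * joint[(i, j)] / n
            expected += weight * (marg_a[i] / n) * (marg_b[j] / n)

    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one level")
    return 1.0 - observed / expected


# Hypothetical scores: an MLLM judge vs. a human expert on a 0-3 rubric scale.
judge = [0, 1, 2, 3, 2, 1]
human = [0, 1, 2, 3, 2, 1]
print(weighted_kappa(judge, human, num_levels=4))  # identical ratings -> 1.0
```

A value of 1 means perfect agreement, 0 means agreement no better than chance, and negative values mean systematic disagreement. Under quadratic weights, large rubric-score differences are penalized more heavily than near misses.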

Results

Experiments show that current state-of-the-art agents perform well below human baselines across all three domains: the best reported scores are 40.48% on DV-Sheet (Gemini-3-Pro), 51.44% on DV-Evol (Gemini-3-Pro), and 40.43% on DV-Inter (Grok-4), compared to human baselines of 80.81%, 85.23%, and 79.60% respectively. Analysis reveals recurring weaknesses in spreadsheet object-model handling, cross-paradigm semantic transfer (especially for verbose frameworks like D3.js), and effective clarification during interaction.
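The size of the gap can be worked out directly from the scores reported above. The snippet below is a small illustrative calculation using only those figures. The dictionary and function names are hypothetical, and comparing scores by simple difference and ratio is an assumption for illustration rather than an analysis from the paper.

```python
# Best reported agent scores and human baselines (percent), as quoted above.
BEST_AGENT = {"DV-Sheet": 40.48, "DV-Evol": 51.44, "DV-Inter": 40.43}
HUMAN = {"DV-Sheet": 80.81, "DV-Evol": 85.23, "DV-Inter": 79.60}


def absolute_gaps(agent, human):
    """Percentage-point gap between the human baseline and the best agent."""
    return {domain: round(human[domain] - agent[domain], 2) for domain in agent}


def relative_scores(agent, human):
    """Best agent score as a fraction of the human baseline."""
    return {domain: agent[domain] / human[domain] for domain in agent}


gaps = absolute_gaps(BEST_AGENT, HUMAN)
ratios = relative_scores(BEST_AGENT, HUMAN)
for domain in BEST_AGENT:
    print(f"{domain}: gap {gaps[domain]:.2f} points, "
          f"{ratios[domain]:.1%} of human baseline")
```

On these figures the gaps are 40.33, 33.79, and 39.17 percentage points for DV-Sheet, DV-Evol, and DV-Inter respectively, so the best agents reach roughly half to three-fifths of human performance.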

Key Points

  1. DV-World defines 260 benchmark tasks across three settings: native spreadsheet visualization (chart creation, repair, dashboards), cross-framework visualization evolution (Python, ECharts, Vega-Lite, D3.js, Plotly.js), and interactive ambiguity resolution with a dual-stage user simulator.
  2. The hybrid evaluation framework combines quantitative Table-value Alignment for data fidelity with rubric-based MLLM judging (validated at weighted κ = 0.821 against human experts) and an Interaction Success Rate metric for dialog tasks.
  3. Benchmark results show that even the leading agents score below 52% in every domain, while human baselines are above 79% in every domain, exposing critical deficits in native object-model mastery, cross-paradigm semantic preservation, and proactive intent alignment.
