DV-World: Benchmarking Data Visualization Agents in Real-World Scenarios
Abstract Overview
DV-World is a benchmark of 260 tasks designed to evaluate data-visualization agents across realistic professional workflows rather than isolated code-sandbox settings. It comprises three domains: DV-Sheet for native spreadsheet charting, diagnostic repair, and dashboard construction; DV-Evol for adapting reference visualizations to new data across five programming frameworks (Python, Apache ECharts, Vega-Lite, D3.js, Plotly.js); and DV-Inter for multi-turn clarification under ambiguous user intent using a dual-stage user simulator. The benchmark employs a hybrid evaluation framework combining Table-value Alignment for data fidelity with rubric-based MLLM-as-a-Judge for semantic and visual quality, plus an Interaction Success Rate for dialog tasks. The authors position these components as a unified suite testing native environmental grounding, cross-platform evolution, and proactive intent alignment.
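The summary names Table-value Alignment as the data-fidelity signal but does not give its formula. The following is a minimal sketch of one plausible reading, assuming the metric compares the values a chart actually plots against the values the task expects. The function name `table_value_alignment`, the cell-keyed dictionary layout, and the tolerance are all hypothetical and are not taken from the paper.

```python
import math


def table_value_alignment(expected, extracted, rel_tol=1e-6):
    """Return the fraction of expected cells reproduced in the extracted chart data.

    Hypothetical scoring rule (an assumption, not the paper's definition):
    each key is a (category, series) cell, and a cell counts as aligned when
    the extracted value is numerically close to the expected one.
    """
    if not expected:
        raise ValueError("expected table must contain at least one cell")
    matched = 0
    for cell, want in expected.items():
        got = extracted.get(cell)
        # A missing cell or a value outside tolerance counts as a mismatch.
        if got is not None and math.isclose(got, want, rel_tol=rel_tol, abs_tol=1e-9):
            matched += 1
    return matched / len(expected)


# Illustrative data only: one of three plotted values is wrong.
expected_cells = {("Q1", "revenue"): 120.0, ("Q2", "revenue"): 135.5, ("Q3", "revenue"): 150.0}
extracted_cells = {("Q1", "revenue"): 120.0, ("Q2", "revenue"): 135.5, ("Q3", "revenue"): 105.0}
alignment_score = table_value_alignment(expected_cells, extracted_cells)
```

A rule-based score of this kind only checks that the right numbers reached the chart, which is why the benchmark pairs it with rubric-based judging for semantic and visual quality.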
Novelty
The paper introduces a benchmark that spans the full lifecycle of professional data-visualization work, uniquely combining native spreadsheet object-model manipulation, cross-framework visualization evolution across five paradigms, and interactive intent clarification with a validated dual-stage user simulator. It also proposes a hybrid evaluation setup mixing rule-based checks, table-alignment signals, rubric-guided MLLM judging, and interaction success metrics, validated against human judgments with strong agreement (weighted κ = 0.821, ICC = 0.850 for the primary judge).
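For readers unfamiliar with the agreement statistic quoted above, here is a minimal sketch of Cohen's weighted kappa for two raters on an ordinal scale. The summary does not say whether the paper uses linear or quadratic weights, so quadratic weighting (`power=2`) is an assumption here, and the score lists are illustrative rather than data from the paper.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, num_levels, power=2):
    """Cohen's weighted kappa for ordinal scores in 0..num_levels-1.

    power=2 gives quadratic weights and power=1 gives linear weights.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty score lists")
    if num_levels < 2:
        raise ValueError("num_levels must be at least 2")
    n = len(rater_a)
    joint = Counter(zip(rater_a, rater_b))  # observed (a, b) score pairs
    count_a = Counter(rater_a)              # marginal counts for rater A
    count_b = Counter(rater_b)              # marginal counts for rater B
    max_dist = (num_levels - 1) ** power
    observed = 0.0
    expected = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = abs(i - j) ** power / max_dist  # disagreement penalty
            observed += weight * joint[(i, j)] / n
            expected += weight * (count_a[i] / n) * (count_b[j] / n)
    if expected == 0:
        return 1.0  # both raters gave the same constant score
    return 1.0 - observed / expected


# Illustrative scores on a 0-4 rubric scale (not from the paper).
human_scores = [0, 1, 2, 3, 4]
judge_scores = [0, 1, 2, 3, 3]
kappa = weighted_kappa(human_scores, judge_scores, num_levels=5)
```

Weighted kappa penalizes a judge that is off by several rubric levels more than one that is off by a single level, which suits ordinal rubric scores better than unweighted agreement.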
Results
Experiments show that current state-of-the-art agents perform well below human baselines across all three domains: the best reported scores are 40.48% on DV-Sheet (Gemini-3-Pro), 51.44% on DV-Evol (Gemini-3-Pro), and 40.43% on DV-Inter (Grok-4), compared to human baselines of 80.81%, 85.23%, and 79.60% respectively. Analysis reveals recurring weaknesses in spreadsheet object-model handling, cross-paradigm semantic transfer (especially for verbose frameworks like D3.js), and effective clarification during interaction.
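The size of the gap is easier to see when computed directly. The short sketch below uses only the scores quoted above; the variable and function names are illustrative.

```python
# Best reported agent score and human baseline per domain, in percent.
best_agent = {
    "DV-Sheet": ("Gemini-3-Pro", 40.48),
    "DV-Evol": ("Gemini-3-Pro", 51.44),
    "DV-Inter": ("Grok-4", 40.43),
}
human_baseline = {"DV-Sheet": 80.81, "DV-Evol": 85.23, "DV-Inter": 79.60}


def gap_table(agents, humans):
    """Return per-domain absolute gap (points) and agent-to-human score ratio."""
    rows = {}
    for domain, (model, score) in agents.items():
        rows[domain] = {
            "model": model,
            "gap_points": round(humans[domain] - score, 2),
            "ratio": round(score / humans[domain], 3),
        }
    return rows


gaps = gap_table(best_agent, human_baseline)
for domain, row in gaps.items():
    print(f"{domain}: {row['model']} trails humans by {row['gap_points']} points "
          f"(ratio {row['ratio']})")
```

The gaps come to roughly 40, 34, and 39 percentage points, so the best agent reaches only about half to three fifths of the human score in each domain.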
Key Points
- DV-World defines 260 benchmark tasks across three settings: native spreadsheet visualization (chart creation, repair, dashboards), cross-framework visualization evolution (Python, ECharts, Vega-Lite, D3.js, Plotly.js), and interactive ambiguity resolution with a dual-stage user simulator.
- The hybrid evaluation framework combines quantitative Table-value Alignment for data fidelity with rubric-based MLLM judging (validated at weighted κ = 0.821 against human experts) and an Interaction Success Rate metric for dialog tasks.
- Benchmark results show that no leading agent reaches 52% in any domain, while human baselines exceed 79% in all three, exposing critical deficits in native object-model mastery, cross-paradigm semantic preservation, and proactive intent alignment.
References
- arXiv: https://arxiv.org/abs/2604.25914v1
- Fugu-MT: https://fugumt.com/fugumt/paper_check/2604.25914v1
- Hugging Face Papers: https://huggingface.co/papers/2604.25914