Related papers: Do Large Language Models Understand Data Visualization Rules?

URL: http://arxiv.org/abs/2602.20137v1
Date: Mon, 23 Feb 2026 18:47:51 GMT
Title: Do Large Language Models Understand Data Visualization Rules?
Authors: Martin Sinnona, Valentin Bonas, Emmanuel Iarussi, Viviana Siless,
Abstract summary: Large language models (LLMs) can generate charts or flag misleading figures, but it remains unclear whether they can reason about and enforce visualization rules directly. We present the first systematic evaluation of LLMs against visualization rules using hard-verification ground truth derived from Answer Set Programming (ASP). Results show that frontier models achieve high adherence (Gemma 3 4B / 27B: 100%, GPT-oss 20B: 98%) and reliably detect common violations (F1 up to 0.82), yet performance drops for subtler perceptual rules (F1 < 0.15 for some categories) and for outputs generated from technical ASP formulations.
Score: 2.3332469289621787
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Data visualization rules, derived from decades of research in design and perception, ensure trustworthy chart communication. While prior work has shown that large language models (LLMs) can generate charts or flag misleading figures, it remains unclear whether they can reason about and enforce visualization rules directly. Constraint-based systems such as Draco encode these rules as logical constraints for precise automated checks, but maintaining symbolic encodings requires expert effort, motivating the use of LLMs as flexible rule validators. In this paper, we present the first systematic evaluation of LLMs against visualization rules using hard-verification ground truth derived from Answer Set Programming (ASP). We translated a subset of Draco's constraints into natural-language statements and generated a controlled dataset of 2,000 Vega-Lite specifications annotated with explicit rule violations. LLMs were evaluated on both accuracy in detecting violations and prompt adherence, which measures whether outputs follow the required structured format. Results show that frontier models achieve high adherence (Gemma 3 4B / 27B: 100%, GPT-oss 20B: 98%) and reliably detect common violations (F1 up to 0.82), yet performance drops for subtler perceptual rules (F1 < 0.15 for some categories) and for outputs generated from technical ASP formulations. Translating constraints into natural language improved performance by up to 150% for smaller models. These findings demonstrate the potential of LLMs as flexible, language-driven validators while highlighting their current limitations compared to symbolic solvers.
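The two metrics named in the abstract, prompt adherence and violation-detection F1, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the JSON output format with a "violations" key, the rule identifiers, the choice to micro-average F1 over all (chart, rule) pairs, and the choice to treat a malformed output as flagging no violations.

```python
import json
from typing import Dict, List, Optional, Set


def parse_output(raw: str, known_rules: Set[str]) -> Optional[Set[str]]:
    """Return the flagged rule ids, or None if the output breaks the assumed format."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    violations = data.get("violations")
    if not isinstance(violations, list):
        return None
    # Every entry must be a known rule id (hypothetical format requirement).
    if not all(isinstance(v, str) and v in known_rules for v in violations):
        return None
    return set(violations)


def evaluate(raw_outputs: List[str], gold: List[Set[str]],
             known_rules: Set[str]) -> Dict[str, float]:
    """Compute prompt adherence and micro-averaged precision, recall, and F1."""
    parsed = [parse_output(raw, known_rules) for raw in raw_outputs]
    # Adherence: fraction of outputs that follow the required structure.
    adherence = sum(p is not None for p in parsed) / len(parsed)

    tp = fp = fn = 0
    for p, g in zip(parsed, gold):
        pred = p if p is not None else set()  # assumption: malformed = nothing flagged
        tp += len(pred & g)
        fp += len(pred - g)
        fn += len(g - pred)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"adherence": adherence, "precision": precision,
            "recall": recall, "f1": f1}


if __name__ == "__main__":
    rules = {"bar_zero_baseline", "high_cardinality_color"}  # hypothetical rule ids
    outputs = ['{"violations": ["bar_zero_baseline"]}', "not json", '{"violations": []}']
    gold = [{"bar_zero_baseline"}, {"high_cardinality_color"}, set()]
    print(evaluate(outputs, gold, rules))
```

In this toy run one of the three outputs is not valid JSON, so it lowers adherence and, under the stated assumption, also counts as a missed violation. A per-category breakdown like the one the abstract reports would filter the gold and predicted sets by rule category before counting.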

Related papers

Do Large Language Models Understand Data Visualization Principles? [2.3332469289621787]
It remains unclear whether large language models (LLMs) and vision-language counterparts (VLMs) can reason about and enforce visualization principles directly. We evaluate both checking and fixing tasks, assessing how well models detect principle violations and correct flawed chart specifications. Our work highlights both the promise of large (vision-)language models as flexible validators and editors of visualization designs and the persistent gap with symbolic solvers on more nuanced aspects of visual perception.
arXiv Detail & Related papers (2026-02-23T17:51:06Z)
Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math [80.46254366870447]
We introduce Hard2Verify, a step-level verification benchmark produced with over 500 hours of human labor. We evaluate 29 generative critics and process reward models, demonstrating that, beyond a few standouts, open-source verifiers lag closed-source models.
arXiv Detail & Related papers (2025-10-15T16:50:54Z)
Do What? Teaching Vision-Language-Action Models to Reject the Impossible [53.40183895299108]
Vision-Language-Action (VLA) models have demonstrated strong performance on a range of robotic tasks. We propose Instruct-Verify-and-Act (IVA), a framework that detects when an instruction cannot be executed due to a false premise. Our experiments show that IVA improves false premise detection accuracy by 97.56% over baselines.
arXiv Detail & Related papers (2025-08-22T10:54:33Z)
Rule2Text: A Framework for Generating and Evaluating Natural Language Explanations of Knowledge Graph Rules [0.998900149624725]
Rule2Text is a framework that leverages large language models to generate natural language explanations for mined logical rules. Our results demonstrate significant improvements in explanation quality after fine-tuning, with particularly strong gains in the domain-specific dataset.
arXiv Detail & Related papers (2025-08-14T16:41:47Z)
LLM-based Satisfiability Checking of String Requirements by Consistent Data and Checker Generation [2.892899073587433]
Large language models (LLMs) have emerged as an alternative approach for formal reasoning tasks. In this paper, we introduce a hybrid approach that verifies the satisfiability of NL requirements over strings. LLMs effectively translate natural language into checkers, even achieving perfect testing accuracy for Python-based checkers.
arXiv Detail & Related papers (2025-06-19T22:41:43Z)
Fast Controlled Generation from Language Models with Adaptive Weighted Rejection Sampling [90.86991492288487]
Evaluating the constraint on every token can be prohibitively expensive. Locally constrained decoding (LCD) can distort the global distribution over strings, sampling tokens based only on local information. We show that our approach is superior to state-of-the-art baselines.
arXiv Detail & Related papers (2025-04-07T18:30:18Z)
Training Large Recommendation Models via Graph-Language Token Alignment [53.3142545812349]
We propose a novel framework to train Large Recommendation models via Graph-Language Token Alignment (GLTA). By aligning item and user nodes from the interaction graph with pretrained LLM tokens, GLTA effectively leverages the reasoning abilities of LLMs. Furthermore, we introduce Graph-Language Logits Matching (GLLM) to optimize token alignment for end-to-end item prediction.
arXiv Detail & Related papers (2025-02-26T02:19:10Z)
Model Generalization on Text Attribute Graphs: Principles with Large Language Models [14.657522068231138]
Large language models (LLMs) have been introduced to graph learning, aiming to extend their zero-shot generalization success to tasks where labeled graph data is scarce. We develop a framework for inference over text-attributed graphs (TAGs) using task-adaptive embeddings and a graph information aggregation mechanism. Evaluations on 11 real-world TAG benchmarks demonstrate that LLM-BP significantly outperforms existing approaches.
arXiv Detail & Related papers (2025-02-17T14:31:00Z)
RuleArena: A Benchmark for Rule-Guided Reasoning with LLMs in Real-World Scenarios [58.90106984375913]
RuleArena is a novel and challenging benchmark designed to evaluate the ability of large language models (LLMs) to follow complex, real-world rules in reasoning. Covering three practical domains (airline baggage fees, NBA transactions, and tax regulations), RuleArena assesses LLMs' proficiency in handling intricate natural language instructions.
arXiv Detail & Related papers (2024-12-12T06:08:46Z)
Exploring Iterative Controllable Summarization with Large Language Models [22.80433394369022]
Large language models (LLMs) have demonstrated remarkable performance in abstractive summarization tasks. Our findings show that LLMs struggle more with numerical attributes than with linguistic attributes. We propose a guide-to-explain framework (GTE) for controllable summarization.
arXiv Detail & Related papers (2024-11-19T12:36:02Z)
DECIDER: A Dual-System Rule-Controllable Decoding Framework for Language Generation [57.07295906718989]
Constrained decoding approaches aim to control the meaning or style of text generated by pre-trained large language models (LLMs, also PLMs) for various tasks at inference time. These methods often guide plausible continuations by greedily and explicitly selecting targets. Inspired by cognitive dual-process theory, we propose a novel decoding framework, DECIDER.
arXiv Detail & Related papers (2024-03-04T11:49:08Z)
ChatRule: Mining Logical Rules with Large Language Models for Knowledge Graph Reasoning [107.61997887260056]
We propose a novel framework, ChatRule, unleashing the power of large language models for mining logical rules over knowledge graphs. Specifically, the framework is initiated with an LLM-based rule generator, leveraging both the semantic and structural information of KGs. To refine the generated rules, a rule ranking module estimates the rule quality by incorporating facts from existing KGs.
arXiv Detail & Related papers (2023-09-04T11:38:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.