Do Text-to-Vis Benchmarks Test Real Use of Visualisations?
- URL: http://arxiv.org/abs/2407.19726v4
- Date: Tue, 8 Oct 2024 02:49:54 GMT
- Title: Do Text-to-Vis Benchmarks Test Real Use of Visualisations?
- Authors: Hy Nguyen, Xuefei He, Andrew Reeson, Cecile Paris, Josiah Poon, Jonathan K. Kummerfeld
- Abstract summary: This paper investigates whether benchmarks reflect real-world use through an empirical study comparing benchmark datasets with code from public repositories.
Our findings reveal a substantial gap, with evaluations not testing the same distribution of chart types, attributes, and actions as real-world examples.
One dataset is representative, but requires extensive modification to become a practical end-to-end benchmark.
This shows that new benchmarks are needed to support the development of systems that truly address users' visualisation needs.
- Score: 11.442971909006657
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models are able to generate code for visualisations in response to simple user requests. This is a useful application and an appealing one for NLP research because plots of data provide grounding for language. However, there are relatively few benchmarks, and those that exist may not be representative of what users do in practice. This paper investigates whether benchmarks reflect real-world use through an empirical study comparing benchmark datasets with code from public repositories. Our findings reveal a substantial gap, with evaluations not testing the same distribution of chart types, attributes, and actions as real-world examples. One dataset is representative, but requires extensive modification to become a practical end-to-end benchmark. This shows that new benchmarks are needed to support the development of systems that truly address users' visualisation needs. These observations will guide future data creation, highlighting which features hold genuine significance for users.
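The empirical comparison is only described at a high level in the abstract. As a rough illustration of that kind of analysis, the sketch below tallies which chart types appear in benchmark code versus code from public repositories, under the assumption that chart types can be approximated by counting matplotlib/pandas-style plotting calls. The call list, regex, and toy snippets are illustrative assumptions, not the authors' actual extraction pipeline.

```python
import re
from collections import Counter

# Minimal sketch: approximate chart types by the plotting calls used in code.
# The set of call names and the regex are assumptions for illustration only.
CHART_CALLS = ["plot", "bar", "scatter", "hist", "pie", "boxplot"]
CALL_PATTERN = re.compile(r"\.(" + "|".join(CHART_CALLS) + r")\s*\(")

def chart_type_distribution(snippets):
    """Count plotting calls across code snippets and normalise to a distribution."""
    counts = Counter()
    for code in snippets:
        counts.update(CALL_PATTERN.findall(code))
    total = sum(counts.values()) or 1
    return {call: counts[call] / total for call in CHART_CALLS}

# Toy examples standing in for a benchmark dataset and real-world repository code.
benchmark_snippets = ["df.plot()", "ax.bar(x, y)", "ax.bar(x, z)"]
repository_snippets = ["ax.scatter(x, y, c=labels)", "df.hist()", "ax.bar(x, y)"]

bench = chart_type_distribution(benchmark_snippets)
repo = chart_type_distribution(repository_snippets)

# Report the per-chart-type gap between the two distributions.
for call in CHART_CALLS:
    print(f"{call:10s} benchmark={bench[call]:.2f} repositories={repo[call]:.2f}")
```

The same counting approach could in principle be extended to attributes (e.g. titles, labels, colours) and actions (e.g. saving or annotating a figure), the other dimensions the paper compares.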
Related papers
- Benchmark Data Repositories for Better Benchmarking [26.15831504718431]
In machine learning research, it is common to evaluate algorithms via their performance on benchmark datasets.
We analyze the landscape of these benchmark data repositories and the role they can play in improving benchmarking.
arXiv Detail & Related papers (2024-10-31T16:30:08Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named as CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- A Review of Benchmarks for Visual Defect Detection in the Manufacturing Industry [63.52264764099532]
We propose a study of existing benchmarks to compare and expose their characteristics and their use-cases.
A study of industrial metrics requirements, as well as testing procedures, will be presented and applied to the studied benchmarks.
arXiv Detail & Related papers (2023-05-05T07:44:23Z)
- Dense Sparse Retrieval: Using Sparse Language Models for Inference Efficient Dense Retrieval [37.22592489907125]
We study how sparse language models can be used for dense retrieval to improve inference efficiency.
We find that sparse language models can be used as direct replacements with little to no drop in accuracy and up to 4.3x improved inference speeds.
arXiv Detail & Related papers (2023-03-31T20:21:32Z)
- GEMv2: Multilingual NLG Benchmarking in a Single Line of Code [161.1761414080574]
The Generation, Evaluation, and Metrics (GEM) benchmark introduces a modular infrastructure for dataset, model, and metric developers.
GEMv2 supports 40 documented datasets in 51 languages.
Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
arXiv Detail & Related papers (2022-06-22T17:52:30Z)
- A Closer Look at Debiased Temporal Sentence Grounding in Videos: Dataset, Metric, and Approach [53.727460222955266]
Temporal Sentence Grounding in Videos (TSGV) aims to ground a natural language sentence in an untrimmed video.
Recent studies have found that current benchmark datasets may have obvious moment annotation biases.
We introduce a new evaluation metric, "dR@n,IoU@m", which discounts the basic recall scores to alleviate the inflated evaluation results caused by biased datasets.
arXiv Detail & Related papers (2022-03-10T08:58:18Z)
- Towards Understanding Sample Variance in Visually Grounded Language Generation: Evaluations and Observations [67.4375210552593]
We design experiments to understand an important but often ignored problem in visually grounded language generation.
Given that humans have different utilities and visual attention, how will the sample variance in multi-reference datasets affect the models' performance?
We show that it is of paramount importance to report variance in experiments and that human-generated references can vary drastically across datasets and tasks, revealing the nature of each task.
arXiv Detail & Related papers (2020-10-07T20:45:14Z)
- Towards Learning a Universal Non-Semantic Representation of Speech [18.54874934311111]
This paper proposes a benchmark for comparing speech representations on non-semantic tasks, along with a representation based on an unsupervised triplet-loss objective.
The proposed representation outperforms other representations on the benchmark, and even exceeds state-of-the-art performance on a number of transfer learning tasks.
arXiv Detail & Related papers (2020-02-25T21:38:24Z)
- ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension [53.037401638264235]
We present an evaluation server, ORB, that reports performance on seven diverse reading comprehension datasets.
The evaluation server places no restrictions on how models are trained, so it is a suitable test bed for exploring training paradigms and representation learning.
arXiv Detail & Related papers (2019-12-29T07:27:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.