Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability
- URL: http://arxiv.org/abs/2506.01789v2
- Date: Tue, 03 Jun 2025 04:18:39 GMT
- Title: Datasheets Aren't Enough: DataRubrics for Automated Quality Metrics and Accountability
- Authors: Genta Indra Winata, David Anugraha, Emmy Liu, Alham Fikri Aji, Shou-Yi Hung, Aditya Parashar, Patrick Amadeus Irawan, Ruochen Zhang, Zheng-Xin Yong, Jan Christian Blaise Cruz, Niklas Muennighoff, Seungone Kim, Hanyang Zhao, Sudipta Kar, Kezia Erina Suryoraharjo, M. Farid Adilazuarda, En-Shiun Annie Lee, Ayu Purwarianti, Derry Tanti Wijaya, Monojit Choudhury
- Abstract summary: This position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process. We introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets.
- Score: 41.23032741638842
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: High-quality datasets are fundamental to training and evaluating machine learning models, yet their creation, especially with accurate human annotations, remains a significant challenge. Many dataset paper submissions lack originality, diversity, or rigorous quality control, and these shortcomings are often overlooked during peer review. Submissions also frequently omit essential details about dataset construction and properties. While existing tools such as datasheets aim to promote transparency, they are largely descriptive and do not provide standardized, measurable methods for evaluating data quality. Similarly, metadata requirements at conferences promote accountability but are inconsistently enforced. To address these limitations, this position paper advocates for the integration of systematic, rubric-based evaluation metrics into the dataset review process, particularly as submission volumes continue to grow. We also explore scalable, cost-effective methods for synthetic data generation, including dedicated tools and LLM-as-a-judge approaches, to support more efficient evaluation. As a call to action, we introduce DataRubrics, a structured framework for assessing the quality of both human- and model-generated datasets. Leveraging recent advances in LLM-based evaluation, DataRubrics offers a reproducible, scalable, and actionable solution for dataset quality assessment, enabling both authors and reviewers to uphold higher standards in data-centric research. We also release code to support reproducibility of LLM-based evaluations at https://github.com/datarubrics/datarubrics.
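The rubric-based, LLM-as-a-judge evaluation described in the abstract can be sketched as a small scoring loop. This is a minimal illustration, not the actual DataRubrics framework: the criteria names, the 1-5 scale, and the JSON reply format are all assumptions for the sake of the example, and the LLM call itself is left as a pluggable function.

```python
import json

# Hypothetical rubric in the spirit of DataRubrics; these criteria and
# descriptions are illustrative assumptions, not the paper's actual rubric.
RUBRIC = {
    "originality": "Does the dataset offer novel content or tasks?",
    "diversity": "Does it cover varied domains, styles, or demographics?",
    "quality_control": "Are annotation guidelines and validation documented?",
}

def build_judge_prompt(dataset_card: str, rubric: dict) -> str:
    """Assemble an LLM-as-a-judge prompt asking for a 1-5 score per criterion."""
    criteria = "\n".join(f"- {name}: {desc}" for name, desc in rubric.items())
    return (
        "Score the following dataset card on each criterion from 1 (poor) "
        "to 5 (excellent). Reply with a JSON object mapping criterion to "
        f"score.\n\nCriteria:\n{criteria}\n\nDataset card:\n{dataset_card}"
    )

def parse_judge_response(response: str, rubric: dict) -> dict:
    """Validate the judge's JSON reply: every criterion present, scores in 1-5."""
    scores = json.loads(response)
    for name in rubric:
        if not 1 <= scores.get(name, 0) <= 5:
            raise ValueError(f"invalid or missing score for {name!r}")
    return scores
```

In practice the prompt would be sent to a judge model and the reply run through `parse_judge_response`, making the evaluation reproducible: the same rubric, prompt template, and validation can be reapplied to any dataset card.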
Related papers
- Large Language Models and Synthetic Data for Monitoring Dataset Mentions in Research Papers [0.0]
This paper presents a machine learning framework that automates dataset mention detection across research domains.
We employ zero-shot extraction from research papers, an LLM-as-a-Judge for quality assessment, and a reasoning agent for refinement to generate a weakly supervised synthetic dataset.
At inference, a ModernBERT-based classifier efficiently filters dataset mentions, reducing computational overhead while maintaining high recall.
arXiv Detail & Related papers (2025-02-14T16:16:02Z)
- Star-Agents: Automatic Data Optimization with LLM Agents for Instruction Tuning [71.2981957820888]
We propose a novel Star-Agents framework, which automates the enhancement of data quality across datasets.
The framework initially generates diverse instruction data with multiple LLM agents through a bespoke sampling method.
The generated data undergo a rigorous evaluation using a dual-model method that assesses both difficulty and quality.
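The dual-model evaluation summarized above can be sketched as a filter that keeps only samples clearing both a difficulty and a quality threshold. The scoring functions and threshold values here are stand-ins for illustration, not the paper's actual models.

```python
def dual_filter(samples, difficulty_fn, quality_fn, d_min=0.5, q_min=0.7):
    """Keep samples whose difficulty AND quality scores both clear thresholds.

    difficulty_fn and quality_fn stand in for the two evaluation models;
    in the paper they would be learned scorers, not simple functions.
    """
    return [
        s for s in samples
        if difficulty_fn(s) >= d_min and quality_fn(s) >= q_min
    ]
```

Requiring both scores to pass means easy-but-clean and hard-but-noisy samples are dropped alike, which is the point of assessing difficulty and quality jointly rather than with a single score.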
arXiv Detail & Related papers (2024-11-21T02:30:53Z)
- Quality Matters: Evaluating Synthetic Data for Tool-Using LLMs [11.24476329991465]
Training large language models (LLMs) for external tool usage is a rapidly expanding field.
The absence of systematic data quality checks poses complications for properly training and testing models.
We propose two approaches for assessing the reliability of data for training LLMs to use external tools.
arXiv Detail & Related papers (2024-09-24T17:20:02Z)
- Automatic Dataset Construction (ADC): Sample Collection, Data Curation, and Beyond [38.89457061559469]
We propose an innovative methodology that automates dataset creation with negligible cost and high efficiency.
We provide open-source software that incorporates existing methods for label error detection, robust learning under noisy and biased data.
We design three benchmark datasets focused on label noise detection, label noise learning, and class-imbalanced learning.
arXiv Detail & Related papers (2024-08-21T04:45:12Z)
- DCA-Bench: A Benchmark for Dataset Curation Agents [9.60250892491588]
Data quality issues, such as incomplete documentation, inaccurate labels, ethical concerns, and outdated information, remain common in widely used datasets.
With the surging ability of large language models (LLMs), it is promising to streamline the discovery of hidden dataset issues with LLM agents.
In this work, we establish a benchmark to measure LLM agents' ability to tackle this challenge.
arXiv Detail & Related papers (2024-06-11T14:02:23Z)
- Automating Dataset Updates Towards Reliable and Timely Evaluation of Large Language Models [81.27391252152199]
Large language models (LLMs) have achieved impressive performance across various natural language benchmarks.
We propose to automate dataset updating and provide systematic analysis regarding its effectiveness.
There are two updating strategies: 1) a mimicking strategy that generates similar samples based on original data, and 2) an extending strategy that further expands existing samples.
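The two updating strategies above can be sketched as simple transformations over question-answer samples. In the paper the new samples are generated by LLMs; the string operations here are stand-ins that only show the shape of each strategy, not the actual method.

```python
def mimic(sample: dict, paraphrase) -> dict:
    """Mimicking strategy: create a similar sample by rephrasing the
    original question while keeping the answer fixed. `paraphrase` stands
    in for an LLM-based rewriter."""
    return {"question": paraphrase(sample["question"]),
            "answer": sample["answer"]}

def extend(sample: dict, extra_requirement: str) -> dict:
    """Extending strategy: expand an existing sample with an additional
    requirement (the answer may then need re-derivation; kept as-is here
    purely for illustration)."""
    return {"question": f'{sample["question"]} {extra_requirement}',
            "answer": sample["answer"]}
```

Mimicking keeps benchmark difficulty roughly constant while refreshing surface form, whereas extending deliberately grows the task, which is why the paper treats them as complementary.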
arXiv Detail & Related papers (2024-02-19T07:15:59Z)
- On the Evaluation and Refinement of Vision-Language Instruction Tuning Datasets [71.54954966652286]
We evaluate Vision-Language Instruction-Tuning (VLIT) datasets.
We build a new dataset, REVO-LION, by collecting samples with higher SQ from each dataset.
Remarkably, even with only half of the complete data, the model trained on REVO-LION achieves performance comparable to simply combining all VLIT datasets.
arXiv Detail & Related papers (2023-10-10T13:01:38Z)
- Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection [8.12993269922936]
We argue that data collection for AI should be performed in a responsible manner.
We propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics.
arXiv Detail & Related papers (2023-08-22T18:01:27Z)
- Analyzing Dataset Annotation Quality Management in the Wild [63.07224587146207]
Even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible amount of erroneous annotations, biases, or artifacts.
While practices and guidelines regarding dataset creation projects exist, large-scale analysis has yet to be performed on how quality management is conducted.
arXiv Detail & Related papers (2023-07-16T21:22:40Z)
- STAR: Boosting Low-Resource Information Extraction by Structure-to-Text Data Generation with Large Language Models [56.27786433792638]
STAR is a data generation method that leverages Large Language Models (LLMs) to synthesize data instances.
We design fine-grained step-by-step instructions to obtain the initial data instances.
Our experiments show that the data generated by STAR significantly improve the performance of low-resource event extraction and relation extraction tasks.
arXiv Detail & Related papers (2023-05-24T12:15:19Z)
- Evaluation of Synthetic Datasets for Conversational Recommender Systems [0.0]
The absence of robust evaluation frameworks has been a long-standing problem.
Since the quality of training data is critical for downstream applications, it is important to develop metrics that evaluate the quality holistically.
In this paper, we present a framework that takes a multi-faceted approach towards evaluating datasets produced by generative models.
arXiv Detail & Related papers (2022-12-12T18:53:10Z)
- DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.