Navigating Dataset Documentations in AI: A Large-Scale Analysis of
Dataset Cards on Hugging Face
- URL: http://arxiv.org/abs/2401.13822v1
- Date: Wed, 24 Jan 2024 21:47:13 GMT
- Authors: Xinyu Yang, Weixin Liang, James Zou
- Abstract summary: We analyze all 7,433 dataset documentation pages on Hugging Face.
Our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis.
- Score: 46.60562029098208
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Advances in machine learning are closely tied to the creation of datasets.
While data documentation is widely recognized as essential to the reliability,
reproducibility, and transparency of ML, we lack a systematic empirical
understanding of current dataset documentation practices. To shed light on this
question, here we take Hugging Face -- one of the largest platforms for sharing
and collaborating on ML models and datasets -- as a prominent case study. By
analyzing all 7,433 dataset documentation pages on Hugging Face, our investigation
provides an overview of the Hugging Face dataset ecosystem and insights into
dataset documentation practices, yielding 5 main findings: (1) The dataset card
completion rate shows marked heterogeneity correlated with dataset popularity.
(2) A granular examination of each section within the dataset card reveals that
practitioners seem to prioritize the Dataset Description and Dataset Structure
sections, while the Considerations for Using the Data section receives the
lowest proportion of content. (3) By analyzing the subsections within each
section and utilizing topic modeling to identify key topics, we uncover what is
discussed in each section, and underscore significant themes encompassing both
technical and social impacts, as well as limitations within the Considerations
for Using the Data section. (4) Our findings also highlight the need for
improved accessibility and reproducibility of datasets in the Usage sections.
(5) In addition, our human annotation evaluation emphasizes the pivotal role of
comprehensive dataset content in shaping individuals' perceptions of a dataset
card's overall quality. Overall, our study offers a unique perspective on
analyzing dataset documentation through large-scale data science analysis and
underlines the need for more thorough dataset documentation in machine learning
research.
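The per-section analysis behind findings (1) and (2) can be sketched in code. The section names below follow the standard Hugging Face dataset card template; the parsing logic and the example card are illustrative assumptions for this sketch, not the authors' actual analysis pipeline.

```python
import re

# Top-level sections of the Hugging Face dataset card template
# examined in the paper. The example card further down is made up.
SECTIONS = [
    "Dataset Description",
    "Dataset Structure",
    "Dataset Creation",
    "Considerations for Using the Data",
]

def section_lengths(card_markdown: str) -> dict:
    """Return the word count of each template section found in a card."""
    lengths = {name: 0 for name in SECTIONS}
    current = None
    for line in card_markdown.splitlines():
        heading = re.match(r"^#{1,3}\s+(.*)", line)
        if heading:
            title = heading.group(1).strip()
            # Only track content under recognized template sections.
            current = title if title in SECTIONS else None
            continue
        if current:
            lengths[current] += len(line.split())
    return lengths

def completion_rate(card_markdown: str) -> float:
    """Fraction of template sections that contain any content."""
    lengths = section_lengths(card_markdown)
    filled = sum(1 for n in lengths.values() if n > 0)
    return filled / len(SECTIONS)

example_card = """\
# Dataset Card for Example
## Dataset Description
A small toy corpus used for illustration.
## Dataset Structure
One split with two text fields.
## Considerations for Using the Data
"""

print(section_lengths(example_card))
print(completion_rate(example_card))  # 2 of 4 sections filled -> 0.5
```

Aggregating such per-card measures over all cards on the platform yields the completion-rate and section-emphasis statistics reported in the findings; an empty "Considerations for Using the Data" section, as in the example, is exactly the gap the paper highlights.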
Related papers
- Exploratory Visual Analysis for Increasing Data Readiness in Artificial Intelligence Projects [7.982715506261976]
We contribute a mapping between data readiness aspects and visual analysis techniques suitable for different data types.
In addition to the mapping, we extend the data readiness concept to better take aspects of the task and solution into account.
We report on our experiences in using the presented visual analysis techniques to aid future artificial intelligence projects in raising the data readiness level.
arXiv Detail & Related papers (2024-09-05T09:57:14Z)
- Diffusion Models as Data Mining Tools [87.77999285241219]
This paper demonstrates how to use generative models trained for image synthesis as tools for visual data mining.
We show that after finetuning conditional diffusion models to synthesize images from a specific dataset, we can use these models to define a typicality measure.
This measure assesses how typical visual elements are for different data labels, such as geographic location, time stamps, semantic labels, or even the presence of a disease.
arXiv Detail & Related papers (2024-07-20T17:14:31Z)
- On the Cross-Dataset Generalization of Machine Learning for Network Intrusion Detection [50.38534263407915]
Network Intrusion Detection Systems (NIDS) are a fundamental tool in cybersecurity.
Their ability to generalize across diverse networks is a critical factor in their effectiveness and a prerequisite for real-world applications.
In this study, we conduct a comprehensive analysis on the generalization of machine-learning-based NIDS through an extensive experimentation in a cross-dataset framework.
arXiv Detail & Related papers (2024-02-15T14:39:58Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Documenting Data Production Processes: A Participatory Approach for Data Work [4.811554861191618]
The opacity of machine learning data is a significant threat to ethical data work and intelligible systems.
Previous research has proposed standardized checklists to document datasets.
This paper proposes a shift of perspective: from documenting datasets toward documenting data production.
arXiv Detail & Related papers (2022-07-11T15:39:02Z)
- CrowdWorkSheets: Accounting for Individual and Collective Identities Underlying Crowdsourced Dataset Annotation [8.447159556925182]
We survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation.
We lay out the challenges in this space along two layers: (1) who the annotator is, and how the annotators' lived experiences can impact their annotations.
We introduce a novel framework, CrowdWorkSheets, for dataset developers to facilitate transparent documentation of key decision points at various stages of the data annotation pipeline.
arXiv Detail & Related papers (2022-06-09T23:31:17Z)
- Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding [137.3719377780593]
A new design (named Detection Hub) is dataset-aware and category-aligned.
It mitigates the dataset inconsistency and provides coherent guidance for the detector to learn across multiple datasets.
The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embeddings.
arXiv Detail & Related papers (2022-06-07T17:59:44Z) - Data Cards: Purposeful and Transparent Dataset Documentation for
Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z)
- Representation Matters: Assessing the Importance of Subgroup Allocations in Training Data [85.43008636875345]
We show that diverse representation in training data is key to increasing subgroup performances and achieving population level objectives.
Our analysis and experiments describe how dataset compositions influence performance and provide constructive results for using trends in existing data, alongside domain knowledge, to help guide intentional, objective-aware dataset design.
arXiv Detail & Related papers (2021-03-05T00:27:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.