Understanding Machine Learning Practitioners' Data Documentation
Perceptions, Needs, Challenges, and Desiderata
- URL: http://arxiv.org/abs/2206.02923v1
- Date: Mon, 6 Jun 2022 21:55:39 GMT
- Title: Understanding Machine Learning Practitioners' Data Documentation
Perceptions, Needs, Challenges, and Desiderata
- Authors: Amy Heger, Elizabeth B. Marquis, Mihaela Vorvoreanu, Hanna Wallach,
Jennifer Wortman Vaughan
- Abstract summary: Data is central to the development and evaluation of machine learning (ML) models.
To encourage responsible AI practice, researchers and practitioners have begun to advocate for increased data documentation.
There is little research on whether these data documentation frameworks meet the needs of ML practitioners.
- Score: 10.689661834716613
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data is central to the development and evaluation of machine learning (ML)
models. However, the use of problematic or inappropriate datasets can result in
harms when the resulting models are deployed. To encourage responsible AI
practice through more deliberate reflection on datasets and transparency around
the processes by which they are created, researchers and practitioners have
begun to advocate for increased data documentation and have proposed several
data documentation frameworks. However, there is little research on whether
these data documentation frameworks meet the needs of ML practitioners, who
both create and consume datasets. To address this gap, we set out to understand
ML practitioners' data documentation perceptions, needs, challenges, and
desiderata, with the goal of deriving design requirements that can inform
future data documentation frameworks. We conducted a series of semi-structured
interviews with 14 ML practitioners at a single large, international technology
company. We had them answer a list of questions taken from datasheets for
datasets (Gebru et al., 2021). Our findings show that current approaches to data
documentation are largely ad hoc and myopic in nature. Participants expressed
needs for data documentation frameworks to be adaptable to their contexts,
integrated into their existing tools and workflows, and automated wherever
possible. Despite the fact that data documentation frameworks are often
motivated from the perspective of responsible AI, participants did not make the
connection between the questions that they were asked to answer and their
responsible AI implications. In addition, participants often had difficulties
prioritizing the needs of dataset consumers and providing information that
someone unfamiliar with their datasets might need to know. Based on these
findings, we derive seven design requirements for future data documentation
frameworks.
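The design requirements above point toward documentation that is adaptable, integrated into existing tools and workflows, and automated wherever possible. As a rough, hypothetical illustration of that direction (not a framework proposed in the paper), the sketch below auto-fills a few datasheet-style fields from a pandas DataFrame and leaves the reflective questions for a human author; the field names and the `draft_datasheet` helper are invented for this example.

```python
# Illustrative sketch only: auto-populate a few datasheet-style fields from a
# pandas DataFrame, leaving reflective questions for a human author.
# Field names and this helper are hypothetical, not part of any published framework.
import json
import pandas as pd

def draft_datasheet(df: pd.DataFrame, name: str) -> dict:
    """Return a partial, machine-generated draft of dataset documentation."""
    return {
        "name": name,
        # Fields that tooling can fill automatically from the data itself.
        "num_instances": len(df),
        "columns": {col: str(dtype) for col, dtype in df.dtypes.items()},
        "missing_values": {col: int(n) for col, n in df.isna().sum().items()},
        # Fields that require human reflection (cf. datasheets for datasets).
        "motivation": "TODO: For what purpose was the dataset created?",
        "collection_process": "TODO: How was the data collected?",
        "recommended_uses": "TODO: What tasks is the dataset suitable for?",
    }

if __name__ == "__main__":
    df = pd.DataFrame({"text": ["a", "b", None], "label": [0, 1, 1]})
    print(json.dumps(draft_datasheet(df, "toy-dataset"), indent=2))
```

In this split, tooling handles the mechanical counts, while questions about motivation and collection, the kind of information participants found hardest to provide for consumers unfamiliar with their datasets, remain explicitly human-authored.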
Related papers
- Capturing and Anticipating User Intents in Data Analytics via Knowledge Graphs [0.061446808540639365]
This work explores the use of Knowledge Graphs (KGs) as a basic framework for capturing complex analytics workflows in a human-centered manner.
The data stored in the generated KG can then be exploited to provide assistance (e.g., recommendations) to the users interacting with these systems.
arXiv Detail & Related papers (2024-11-01T20:45:23Z)
- Synthetic Data Generation with Large Language Models for Personalized Community Question Answering [47.300506002171275]
We build Sy-SE-PQA based on an existing dataset, SE-PQA, which consists of questions and answers posted on the popular StackExchange communities.
Our findings suggest that LLMs have high potential in generating data tailored to users' needs.
The synthetic data can replace human-written training data, even if the generated data may contain incorrect information.
arXiv Detail & Related papers (2024-10-29T16:19:08Z)
- Data Formulator 2: Iteratively Creating Rich Visualizations with AI [65.48447317310442]
We present Data Formulator 2, an LLM-powered visualization system to address these challenges.
With Data Formulator 2, users describe their visualization intent with blended UI and natural language inputs, and data transformations are delegated to AI.
To support iteration, Data Formulator 2 lets users navigate their iteration history and reuse previous designs towards new ones so that they don't need to start from scratch every time.
arXiv Detail & Related papers (2024-08-28T20:12:17Z)
- A Standardized Machine-readable Dataset Documentation Format for Responsible AI [8.59437843168878]
Croissant-RAI is a machine-readable metadata format designed to enhance the discoverability, interoperability, and trustworthiness of AI datasets.
It is integrated into major data search engines, repositories, and machine learning frameworks; a rough, hypothetical sketch of a machine-readable record in this spirit appears after this list.
arXiv Detail & Related papers (2024-06-04T16:40:14Z)
- Machine Learning Data Practices through a Data Curation Lens: An Evaluation Framework [1.5993707490601146]
We evaluate data practices in machine learning as data curation practices.
We find that researchers in machine learning, a field that often emphasizes model development, struggle to apply standard data curation principles.
arXiv Detail & Related papers (2024-05-04T16:21:05Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face [46.60562029098208]
We analyze all 7,433 dataset cards on Hugging Face.
Our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis.
arXiv Detail & Related papers (2024-01-24T21:47:13Z)
- Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
- Documenting Data Production Processes: A Participatory Approach for Data Work [4.811554861191618]
The opacity of machine learning data is a significant threat to ethical data work and intelligible systems.
Previous research has proposed standardized checklists to document datasets.
This paper proposes a shift of perspective: from documenting datasets toward documenting data production.
arXiv Detail & Related papers (2022-07-11T15:39:02Z)
- Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z)
- REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to sustain the modeling of relationships among objects and grasps.
Our dataset is collected in both forms of 2D images and 3D point clouds.
Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
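Two of the entries above, Croissant-RAI and the analysis of Hugging Face dataset cards, concern machine-readable dataset documentation. As a loose, hypothetical sketch of what such a record can look like, the snippet below assembles a small JSON-LD-flavored metadata dict in Python; the property names (especially the rai:-prefixed ones) are illustrative stand-ins and should not be read as the official Croissant or Croissant-RAI vocabulary.

```python
# Illustrative only: a small JSON-LD-flavored metadata record for a dataset.
# Property names are stand-ins, NOT the official Croissant / Croissant-RAI vocabulary.
import json

record = {
    "@context": "https://schema.org/",
    "@type": "Dataset",
    "name": "toy-sentiment-corpus",  # hypothetical dataset
    "description": "Short product reviews labeled positive/negative.",
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "url": "https://example.org/datasets/toy-sentiment-corpus",
    # Responsible-AI-oriented fields in the spirit of Croissant-RAI (names invented here).
    "rai:dataCollection": "Scraped from a public review site between 2020 and 2021.",
    "rai:knownLimitations": "English only; domain limited to product reviews.",
    "rai:sensitiveAttributes": ["none identified"],
}

print(json.dumps(record, indent=2))
```

Keeping such records as structured data rather than free text is what allows search engines, repositories, and ML frameworks to index and validate them automatically, as the Croissant-RAI entry above describes.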