Understanding Machine Learning Practitioners' Data Documentation
Perceptions, Needs, Challenges, and Desiderata
- URL: http://arxiv.org/abs/2206.02923v1
- Date: Mon, 6 Jun 2022 21:55:39 GMT
- Title: Understanding Machine Learning Practitioners' Data Documentation
Perceptions, Needs, Challenges, and Desiderata
- Authors: Amy Heger, Elizabeth B. Marquis, Mihaela Vorvoreanu, Hanna Wallach,
Jennifer Wortman Vaughan
- Abstract summary: Data is central to the development and evaluation of machine learning (ML) models.
To encourage responsible AI practice, researchers and practitioners have begun to advocate for increased data documentation.
There is little research on whether these data documentation frameworks meet the needs of ML practitioners.
- Score: 10.689661834716613
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Data is central to the development and evaluation of machine learning (ML)
models. However, the use of problematic or inappropriate datasets can result in
harms when the resulting models are deployed. To encourage responsible AI
practice through more deliberate reflection on datasets and transparency around
the processes by which they are created, researchers and practitioners have
begun to advocate for increased data documentation and have proposed several
data documentation frameworks. However, there is little research on whether
these data documentation frameworks meet the needs of ML practitioners, who
both create and consume datasets. To address this gap, we set out to understand
ML practitioners' data documentation perceptions, needs, challenges, and
desiderata, with the goal of deriving design requirements that can inform
future data documentation frameworks. We conducted a series of semi-structured
interviews with 14 ML practitioners at a single large, international technology
company. We had them answer a list of questions taken from datasheets for
datasets (Gebru, 2021). Our findings show that current approaches to data
documentation are largely ad hoc and myopic in nature. Participants expressed
needs for data documentation frameworks to be adaptable to their contexts,
integrated into their existing tools and workflows, and automated wherever
possible. Despite the fact that data documentation frameworks are often
motivated from the perspective of responsible AI, participants did not make the
connection between the questions that they were asked to answer and their
responsible AI implications. In addition, participants often had difficulties
prioritizing the needs of dataset consumers and providing information that
someone unfamiliar with their datasets might need to know. Based on these
findings, we derive seven design requirements for future data documentation
frameworks.
Related papers
- Machine Learning Data Practices through a Data Curation Lens: An Evaluation Framework [1.5993707490601146]
We evaluate data practices in machine learning as data curation practices.
We find that researchers in machine learning, which often emphasizes model development, struggle to apply standard data curation principles.
arXiv Detail & Related papers (2024-05-04T16:21:05Z)
- The Frontier of Data Erasure: Machine Unlearning for Large Language Models [56.26002631481726]
Large Language Models (LLMs) are foundational to AI advancements.
LLMs pose risks by potentially memorizing and disseminating sensitive, biased, or copyrighted information.
Machine unlearning emerges as a cutting-edge solution to mitigate these concerns.
arXiv Detail & Related papers (2024-03-23T09:26:15Z)
- Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face [46.60562029098208]
We analyze all 7,433 dataset cards on Hugging Face.
Our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis.
arXiv Detail & Related papers (2024-01-24T21:47:13Z)
- Open Datasheets: Machine-readable Documentation for Open Datasets and Responsible AI Assessments [9.125552623625806]
This paper introduces a no-code, machine-readable documentation framework for open datasets.
The framework aims to improve the comprehensibility and usability of open datasets.
The framework is expected to enhance the quality and reliability of data used in research and decision-making.
arXiv Detail & Related papers (2023-12-11T06:41:14Z)
- Data Acquisition: A New Frontier in Data-centric AI [65.90972015426274]
We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets.
We then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers.
Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in Machine Learning.
arXiv Detail & Related papers (2023-11-22T22:15:17Z)
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [58.93352076927003]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- DAT: Data Architecture Modeling Tool for Data-Driven Applications [1.6037279419318131]
Data Architecture (DA) focuses on describing, collecting, storing, processing, and analyzing the data to meet business needs.
We present the DAT, a model-driven engineering tool enabling data architects, data engineers, and other stakeholders to describe how data flows through the system.
arXiv Detail & Related papers (2023-06-21T11:24:59Z)
- Towards Complex Document Understanding By Discrete Reasoning [77.91722463958743]
Document Visual Question Answering (VQA) aims to understand visually-rich documents to answer questions in natural language.
We introduce a new Document VQA dataset, named TAT-DQA, which consists of 3,067 document pages and 16,558 question-answer pairs.
We develop a novel model named MHST that takes into account the information in multi-modalities, including text, layout and visual image, to intelligently address different types of questions.
arXiv Detail & Related papers (2022-07-25T01:43:19Z)
- Documenting Data Production Processes: A Participatory Approach for Data Work [4.811554861191618]
The opacity of machine learning data is a significant threat to ethical data work and intelligible systems.
Previous research has proposed standardized checklists to document datasets.
This paper proposes a shift of perspective: from documenting datasets toward documenting data production.
arXiv Detail & Related papers (2022-07-11T15:39:02Z)
- Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z)
- REGRAD: A Large-Scale Relational Grasp Dataset for Safe and Object-Specific Robotic Grasping in Clutter [52.117388513480435]
We present a new dataset named REGRAD to support the modeling of relationships among objects and grasps.
Our dataset is collected as both 2D images and 3D point clouds.
Users are free to import their own object models to generate as much data as they want.
arXiv Detail & Related papers (2021-04-29T05:31:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.