Documenting Data Production Processes: A Participatory Approach for Data
Work
- URL: http://arxiv.org/abs/2207.04958v2
- Date: Tue, 9 Aug 2022 19:02:14 GMT
- Title: Documenting Data Production Processes: A Participatory Approach for Data
Work
- Authors: Milagros Miceli, Tianling Yang, Adriana Alvarado Garcia, Julian
Posada, Sonja Mei Wang, Marc Pohl, Alex Hanna
- Abstract summary: The opacity of machine learning data is a significant threat to ethical data work and intelligible systems.
Previous research has proposed standardized checklists to document datasets.
This paper proposes a shift of perspective: from documenting datasets toward documenting data production.
- Score: 4.811554861191618
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The opacity of machine learning data is a significant threat to ethical data
work and intelligible systems. Previous research has addressed this issue by
proposing standardized checklists to document datasets. This paper expands that
field of inquiry by proposing a shift of perspective: from documenting datasets
toward documenting data production. We draw on participatory design and
collaborate with data workers at two companies located in Bulgaria and
Argentina, where the collection and annotation of data for machine learning are
outsourced. Our investigation comprises 2.5 years of research, including 33
semi-structured interviews, five co-design workshops, the development of
prototypes, and several feedback instances with participants. We identify key
challenges and requirements related to the integration of documentation
practices in real-world data production scenarios. Our findings comprise
important design considerations and highlight the value of designing data
documentation based on the needs of data workers. We argue that a view of
documentation as a boundary object, i.e., an object that can be used
differently across organizations and teams but holds enough immutable content
to maintain integrity, can be useful when designing documentation to retrieve
heterogeneous, often distributed, contexts of data production.
Related papers
- Using Large Language Models to Generate Authentic Multi-agent Knowledge Work Datasets [5.465422605475246]
Current publicly available knowledge work data collections lack diversity, extensive annotations, and contextual information about the users and their documents.
This paper introduces our approach's design and vision and focuses on generating authentic knowledge work documents using Large Language Models.
Our study, in which human raters assessed 53% of the generated and 74% of the real documents as realistic, demonstrates the potential of our approach.
arXiv Detail & Related papers (2024-09-06T13:53:28Z)
- Navigating Dataset Documentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face [46.60562029098208]
We analyze all 7,433 dataset cards on Hugging Face.
Our study offers a unique perspective on analyzing dataset documentation through large-scale data science analysis.
arXiv Detail & Related papers (2024-01-24T21:47:13Z)
- Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z)
- On Task-personalized Multimodal Few-shot Learning for Visually-rich Document Entity Retrieval [59.25292920967197]
Visually-rich document entity retrieval (VDER) is an important topic in industrial NLP applications.
FewVEX is a new dataset to boost future research in the field of entity-level few-shot VDER.
We present a task-aware meta-learning based framework, with a central focus on achieving effective task personalization.
arXiv Detail & Related papers (2023-11-01T17:51:43Z)
- DocumentNet: Bridging the Data Gap in Document Pre-Training [78.01647768018485]
We propose a method to collect massive-scale and weakly labeled data from the web to benefit the training of VDER models.
The collected dataset, named DocumentNet, does not depend on specific document types or entity sets.
Experiments on a set of broadly adopted VDER tasks show significant improvements when DocumentNet is incorporated into the pre-training.
arXiv Detail & Related papers (2023-06-15T08:21:15Z)
- Doc2Bot: Accessing Heterogeneous Documents via Conversational Bots [103.54897676954091]
Doc2Bot is a dataset for building machines that help users seek information via conversations.
Our dataset contains over 100,000 turns based on Chinese documents from five domains.
arXiv Detail & Related papers (2022-10-20T07:33:05Z)
- Layout-Aware Information Extraction for Document-Grounded Dialogue: Dataset, Method and Demonstration [75.47708732473586]
We propose a layout-aware document-level Information Extraction dataset, LIE, to facilitate the study of extracting both structural and semantic knowledge from visually rich documents.
LIE contains 62k annotations of three extraction tasks from 4,061 pages in product and official documents.
Empirical results show that layout is critical for VRD-based extraction, and system demonstration also verifies that the extracted knowledge can help locate the answers that users care about.
arXiv Detail & Related papers (2022-07-14T07:59:45Z)
- Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata [10.689661834716613]
Data is central to the development and evaluation of machine learning (ML) models.
To encourage responsible AI practice, researchers and practitioners have begun to advocate for increased data documentation.
There is little research on whether these data documentation frameworks meet the needs of ML practitioners.
arXiv Detail & Related papers (2022-06-06T21:55:39Z)
- Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z)
- A Survey of Historical Document Image Datasets [2.8707038627097226]
This paper presents a systematic literature review of image datasets for document image analysis.
It focuses on historical documents, such as handwritten manuscripts and early prints.
Finding appropriate datasets for historical document analysis is a crucial prerequisite to facilitate research using different machine learning algorithms.
arXiv Detail & Related papers (2022-03-16T09:56:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.