The Human Labour of Data Work: Capturing Cultural Diversity through World Wide Dishes
- URL: http://arxiv.org/abs/2502.05961v1
- Date: Sun, 09 Feb 2025 17:09:46 GMT
- Title: The Human Labour of Data Work: Capturing Cultural Diversity through World Wide Dishes
- Authors: Siobhan Mackenzie Hall, Samantha Dalal, Raesetje Sefala, Foutse Yuehgoh, Aisha Alaagib, Imane Hamzaoui, Shu Ishida, Jabez Magomere, Lauren Crais, Aya Salama, Tejumade Afonja,
- Abstract summary: We provide a window into the process of constructing a dataset for machine learning (ML) applications by reflecting on the process of building World Wide Dishes (WWD)
WWD takes a participatory approach to dataset creation: community members guide the design of the research process and engage in crowdsourcing efforts to build the dataset.
We contribute empirical evidence of the invisible labour of participatory design work by analysing reflections from the research team behind WWD.
- Score: 3.770155074442168
- License:
- Abstract: We provide a window into the process of constructing a dataset for machine learning (ML) applications by reflecting on the process of building World Wide Dishes (WWD), an image and text dataset consisting of culinary dishes and their associated customs from around the world. WWD takes a participatory approach to dataset creation: community members guide the design of the research process and engage in crowdsourcing efforts to build the dataset. WWD responds to calls in ML to address the limitations of web-scraped Internet datasets with curated, high-quality data incorporating localised expertise and knowledge. Our approach supports decentralised contributions from communities that have not historically contributed to datasets as a result of a variety of systemic factors. We contribute empirical evidence of the invisible labour of participatory design work by analysing reflections from the research team behind WWD. In doing so, we extend computer-supported cooperative work (CSCW) literature that examines the post-hoc impacts of datasets when deployed in ML applications by providing a window into the dataset construction process. We surface four dimensions of invisible labour in participatory dataset construction: building trust with community members, making participation accessible, supporting data production, and understanding the relationship between data and culture. This paper builds upon the rich participatory design literature within CSCW to guide how future efforts to apply participatory design to dataset construction can be designed in a way that attends to the dynamic, collaborative, and fundamentally human processes of dataset creation.
Related papers
- Towards Human-Guided, Data-Centric LLM Co-Pilots [53.35493881390917]
CliMB-DC is a human-guided, data-centric framework for machine learning co-pilots.
It combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing.
We show how CliMB-DC can transform uncurated datasets into ML-ready formats.
arXiv Detail & Related papers (2025-01-17T17:51:22Z) - Deploying Large Language Models With Retrieval Augmented Generation [0.21485350418225244]
Retrieval Augmented Generation has emerged as a key approach for integrating knowledge from data sources outside of the large language model's training set.
We present insights from the development and field-testing of a pilot project that integrates LLMs with RAG for information retrieval.
arXiv Detail & Related papers (2024-11-07T22:11:51Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs)
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - The Dimensions of Data Labor: A Road Map for Researchers, Activists, and
Policymakers to Empower Data Producers [14.392208044851976]
Data producers have little say in what data is captured, how it is used, or who it benefits.
Organizations with the ability to access and process this data, e.g. OpenAI and Google, possess immense power in shaping the technology landscape.
By synthesizing related literature that reconceptualizes the production of data for computing as data labor'', we outline opportunities for researchers, policymakers, and activists to empower data producers.
arXiv Detail & Related papers (2023-05-22T17:11:22Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual
Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Documenting Data Production Processes: A Participatory Approach for Data
Work [4.811554861191618]
opacity of machine learning data is a significant threat to ethical data work and intelligible systems.
Previous research has proposed standardized checklists to document datasets.
This paper proposes a shift of perspective: from documenting datasets toward documenting data production.
arXiv Detail & Related papers (2022-07-11T15:39:02Z) - Understanding Machine Learning Practitioners' Data Documentation
Perceptions, Needs, Challenges, and Desiderata [10.689661834716613]
Data is central to the development and evaluation of machine learning (ML) models.
To encourage responsible AI practice, researchers and practitioners have begun to advocate for increased data documentation.
There is little research on whether these data documentation frameworks meet the needs of ML practitioners.
arXiv Detail & Related papers (2022-06-06T21:55:39Z) - Data Cards: Purposeful and Transparent Dataset Documentation for
Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z) - Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data
Programming [77.38174112525168]
We present Nemo, an end-to-end interactive Supervision system that improves overall productivity of WS learning pipeline by an average 20% (and up to 47% in one task) compared to the prevailing WS supervision approach.
arXiv Detail & Related papers (2022-03-02T19:57:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.