The Human Labour of Data Work: Capturing Cultural Diversity through World Wide Dishes
- URL: http://arxiv.org/abs/2502.05961v2
- Date: Mon, 05 May 2025 16:29:14 GMT
- Title: The Human Labour of Data Work: Capturing Cultural Diversity through World Wide Dishes
- Authors: Siobhan Mackenzie Hall, Samantha Dalal, Raesetje Sefala, Foutse Yuehgoh, Aisha Alaagib, Imane Hamzaoui, Shu Ishida, Jabez Magomere, Lauren Crais, Aya Salama, Tejumade Afonja,
- Abstract summary: We present an example of participatory dataset creation, where community members both guide the design of the research process and contribute to the crowdsourced dataset. We show that our approach can result in curated, high-quality data that supports decentralised contributions from communities. We surface three dimensions of labour performed by participatory mediators that are crucial for participatory dataset construction.
- Score: 3.770155074442168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper provides guidance for building and maintaining infrastructure for participatory AI efforts by sharing reflections on building World Wide Dishes (WWD), a bottom-up, community-led image and text dataset of culinary dishes and associated cultural customs. We present WWD as an example of participatory dataset creation, where community members both guide the design of the research process and contribute to the crowdsourced dataset. This approach incorporates localised expertise and knowledge to address the limitations of web-scraped Internet datasets acknowledged in the Participatory AI discourse. We show that our approach can result in curated, high-quality data that supports decentralised contributions from communities that do not typically contribute to datasets due to a variety of systemic factors. Our project demonstrates the importance of participatory mediators in supporting community engagement by identifying the kinds of labour they performed to make WWD possible. We surface three dimensions of labour performed by participatory mediators that are crucial for participatory dataset construction: building trust with community members, making participation accessible, and contextualising community values to support meaningful data collection. Drawing on our findings, we put forth five lessons for building infrastructure to support future participatory AI efforts.
Related papers
- Running a Data Integration Lab in the Context of the EHRI Project: Challenges, Lessons Learnt and Future Directions [0.0]
The EHRI project set out to build a trans-national network of archives, researchers, and digital practitioners to mitigate this problem. One of its main outcomes was the creation of the EHRI Portal, a "virtual observatory" that gathers in one centralised platform descriptions of Holocaust-related archival sources from around the world. In order to build the Portal a strong data identification and integration effort was required, culminating in the project's third phase with the creation of the EHRI-3 data integration lab.
arXiv Detail & Related papers (2025-05-05T08:39:18Z) - Amplify Initiative: Building A Localized Data Platform for Globalized AI [3.045104054104307]
Current AI models often fail to account for local context and language, given the predominance of English and Western internet content in their training data. Amplify Initiative, a data platform and methodology, leverages expert communities to collect diverse, high-quality data to address the limitations of these models. The platform is designed to enable co-creation of datasets, provide access to high-quality multilingual datasets, and offer recognition to data authors.
arXiv Detail & Related papers (2025-04-18T23:20:52Z) - From Community Network to Community Data: Towards Combining Data Pool and Data Cooperative for Data Justice in Rural Areas [0.0]
This study explores the shift from community networks (CNs) to community data in rural areas. It focuses on combining data pools and data cooperatives to achieve data justice and foster a just AI ecosystem.
arXiv Detail & Related papers (2025-03-07T21:41:01Z) - Towards Human-Guided, Data-Centric LLM Co-Pilots [53.35493881390917]
CliMB-DC is a human-guided, data-centric framework for machine learning co-pilots. It combines advanced data-centric tools with LLM-driven reasoning to enable robust, context-aware data processing. We show how CliMB-DC can transform uncurated datasets into ML-ready formats.
arXiv Detail & Related papers (2025-01-17T17:51:22Z) - Deploying Large Language Models With Retrieval Augmented Generation [0.21485350418225244]
Retrieval Augmented Generation has emerged as a key approach for integrating knowledge from data sources outside of the large language model's training set.
We present insights from the development and field-testing of a pilot project that integrates LLMs with RAG for information retrieval.
arXiv Detail & Related papers (2024-11-07T22:11:51Z) - Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z) - Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - Social Intelligence Data Infrastructure: Structuring the Present and Navigating the Future [59.78608958395464]
We build a Social AI Data Infrastructure, which consists of a comprehensive social AI taxonomy and a data library of 480 NLP datasets.
Our infrastructure allows us to analyze existing dataset efforts, and also evaluate language models' performance in different social intelligence aspects.
We show there is a need for multifaceted datasets, increased diversity in language and culture, more long-tailed social situations, and more interactive data in future social intelligence data efforts.
arXiv Detail & Related papers (2024-02-28T00:22:42Z) - Massively Multi-Cultural Knowledge Acquisition & LM Benchmarking [48.21982147529661]
This paper introduces a novel approach for massively multicultural knowledge acquisition.
Our method strategically navigates from densely informative Wikipedia documents on cultural topics to an extensive network of linked pages.
Our work marks an important step towards deeper understanding and bridging the gaps of cultural disparities in AI.
arXiv Detail & Related papers (2024-02-14T18:16:54Z) - Unveiling Diversity: Empowering OSS Project Leaders with Community
Diversity and Turnover Dashboards [51.67585198094836]
CommunityTapestry is a dynamic real-time community dashboard.
It presents key diversity and turnover signals that we identified from the literature.
It helped project leaders identify areas of improvement and gave them actionable information.
arXiv Detail & Related papers (2023-12-13T22:12:57Z) - CommunityAI: Towards Community-based Federated Learning [6.535815174238974]
We present a novel framework for Community-based Federated Learning called CommunityAI.
CommunityAI enables participants to be organized into communities based on their shared interests, expertise, or data characteristics.
We discuss the conceptual architecture, system requirements, processes, and future challenges that must be solved.
arXiv Detail & Related papers (2023-11-29T09:31:52Z) - The Dimensions of Data Labor: A Road Map for Researchers, Activists, and Policymakers to Empower Data Producers [14.392208044851976]
Data producers have little say in what data is captured, how it is used, or who it benefits.
Organizations with the ability to access and process this data, e.g. OpenAI and Google, possess immense power in shaping the technology landscape.
By synthesizing related literature that reconceptualizes the production of data for computing as "data labor", we outline opportunities for researchers, policymakers, and activists to empower data producers.
arXiv Detail & Related papers (2023-05-22T17:11:22Z) - Contributing to Accessibility Datasets: Reflections on Sharing Study Data by Blind People [14.625384963263327]
We present a pair of studies where 13 blind participants engage in data capturing activities.
We see how different factors influence blind participants' willingness to share study data as they assess risk-benefit tradeoffs.
The majority support sharing of their data to improve technology but also express concerns over commercial use, associated metadata, and the lack of transparency about the impact of their data.
arXiv Detail & Related papers (2023-03-09T00:42:18Z) - Data-centric AI: Perspectives and Challenges [51.70828802140165]
Data-centric AI (DCAI) advocates a fundamental shift from model advancements to ensuring data quality and reliability.
We bring together three general missions: training data development, inference data development, and data maintenance.
arXiv Detail & Related papers (2023-01-12T05:28:59Z) - TRoVE: Transforming Road Scene Datasets into Photorealistic Virtual Environments [84.6017003787244]
This work proposes a synthetic data generation pipeline to address the difficulties and domain-gaps present in simulated datasets.
We show that using annotations and visual cues from existing datasets, we can facilitate automated multi-modal data generation.
arXiv Detail & Related papers (2022-08-16T20:46:08Z) - DataPerf: Benchmarks for Data-Centric AI Development [81.03754002516862]
DataPerf is a community-led benchmark suite for evaluating ML datasets and data-centric algorithms.
We provide an open, online platform with multiple rounds of challenges to support this iterative development.
The benchmarks, online evaluation platform, and baseline implementations are open source.
arXiv Detail & Related papers (2022-07-20T17:47:54Z) - Documenting Data Production Processes: A Participatory Approach for Data Work [4.811554861191618]
The opacity of machine learning data is a significant threat to ethical data work and intelligible systems.
Previous research has proposed standardized checklists to document datasets.
This paper proposes a shift of perspective: from documenting datasets toward documenting data production.
arXiv Detail & Related papers (2022-07-11T15:39:02Z) - Understanding Machine Learning Practitioners' Data Documentation Perceptions, Needs, Challenges, and Desiderata [10.689661834716613]
Data is central to the development and evaluation of machine learning (ML) models.
To encourage responsible AI practice, researchers and practitioners have begun to advocate for increased data documentation.
There is little research on whether these data documentation frameworks meet the needs of ML practitioners.
arXiv Detail & Related papers (2022-06-06T21:55:39Z) - Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI [0.0]
We propose Data Cards for fostering transparent, purposeful and human-centered documentation of datasets.
Data Cards are structured summaries of essential facts about various aspects of ML datasets needed by stakeholders.
We present frameworks that ground Data Cards in real-world utility and human-centricity.
arXiv Detail & Related papers (2022-04-03T13:49:36Z) - Nemo: Guiding and Contextualizing Weak Supervision for Interactive Data Programming [77.38174112525168]
We present Nemo, an end-to-end interactive weak supervision (WS) system that improves the overall productivity of the WS learning pipeline by an average of 20% (and up to 47% in one task) compared to the prevailing WS approach.
arXiv Detail & Related papers (2022-03-02T19:57:32Z) - Empowering Local Communities Using Artificial Intelligence [70.17085406202368]
It has become an important topic to explore the impact of AI on society from a people-centered perspective.
Previous works in citizen science have identified methods of using AI to engage the public in research.
This article discusses the challenges of applying AI in Community Citizen Science.
arXiv Detail & Related papers (2021-10-05T12:51:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.