Understanding the Dataset Practitioners Behind Large Language Model Development
- URL: http://arxiv.org/abs/2402.16611v2
- Date: Mon, 1 Apr 2024 19:58:43 GMT
- Title: Understanding the Dataset Practitioners Behind Large Language Model Development
- Authors: Crystal Qian, Emily Reif, Minsuk Kahng
- Abstract summary: We define the role of "dataset practitioners" at a technology company, Google.
We conduct semi-structured interviews with a cross-section of these practitioners.
We find that although data quality is a top priority, there is little consensus around what data quality is and how to evaluate it.
- Score: 5.48392160519422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) become more advanced and impactful, it is increasingly important to scrutinize the data that they rely upon and produce. What is it to be a dataset practitioner doing this work? We approach this in two parts: first, we define the role of "dataset practitioners" by performing a retrospective analysis on the responsibilities of teams contributing to LLM development at a technology company, Google. Then, we conduct semi-structured interviews with a cross-section of these practitioners (N=10). We find that although data quality is a top priority, there is little consensus around what data quality is and how to evaluate it. Consequently, practitioners either rely on their own intuition or write custom code to evaluate their data. We discuss potential reasons for this phenomenon and opportunities for alignment.
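The "custom code" the abstract refers to is ad hoc and practitioner-specific; the paper does not prescribe any implementation. Purely as an illustrative sketch of the kind of one-off quality check practitioners describe writing (the JSONL format, the "text" field name, and the thresholds below are assumptions, not from the paper), such a script might look like this:

```python
# Illustrative only: a minimal ad hoc data-quality check of the sort the
# abstract describes practitioners writing for themselves. The file format,
# the "text" field, and the thresholds are assumptions, not from the paper.
import json
from collections import Counter

def quality_report(path, min_chars=20, max_chars=20000):
    n_total = n_empty = n_too_short = n_too_long = 0
    seen = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)          # assumes one JSON object per line
            text = (record.get("text") or "").strip()
            n_total += 1
            if not text:
                n_empty += 1
            elif len(text) < min_chars:
                n_too_short += 1
            elif len(text) > max_chars:
                n_too_long += 1
            seen[text] += 1
    n_dupes = sum(c - 1 for c in seen.values() if c > 1)
    return {
        "total": n_total,
        "empty": n_empty,
        "too_short": n_too_short,
        "too_long": n_too_long,
        "exact_duplicates": n_dupes,
    }

if __name__ == "__main__":
    print(quality_report("train.jsonl"))       # hypothetical file name
```

The paper's point is precisely that such checks are idiosyncratic: each team picks its own heuristics and thresholds, with little shared definition of what "quality" means.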
Related papers
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z) - An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z) - TOFU: A Task of Fictitious Unlearning for LLMs [99.92305790945507]
Large language models trained on massive corpora of data from the web can reproduce sensitive or private data, raising both legal and ethical concerns.
Unlearning, or tuning models to forget information present in their training data, provides us with a way to protect private data after training.
We present TOFU, a benchmark aimed at helping deepen our understanding of unlearning.
arXiv Detail & Related papers (2024-01-11T18:57:12Z) - Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z) - On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z) - A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv Detail & Related papers (2023-05-22T15:57:53Z) - Whose AI Dream? In search of the aspiration in data annotation [12.454034525520497]
This paper investigates the work practices concerning data annotation as performed in the industry in India.
Previous investigations have largely focused on annotator subjectivity, bias and efficiency.
Our results show that the work of annotators is dictated by the interests, priorities and values of others above their station.
arXiv Detail & Related papers (2022-03-21T06:28:54Z) - Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation [7.480972965984986]
We survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation.
We lay out the challenges in this space along two layers: who the annotator is, and how the annotators' lived experiences can impact their annotations.
We put forth a concrete set of recommendations and considerations for dataset developers at various stages of the ML data pipeline.
arXiv Detail & Related papers (2021-12-08T19:56:56Z) - Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power? [0.0]
We argue that reducing societal problems to "bias" misses the context-based nature of data.
We highlight the corporate forces and market imperatives involved in the labor of data workers that subsequently shape ML datasets.
arXiv Detail & Related papers (2021-09-16T17:38:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.