Understanding the Dataset Practitioners Behind Large Language Model Development
- URL: http://arxiv.org/abs/2402.16611v2
- Date: Mon, 1 Apr 2024 19:58:43 GMT
- Title: Understanding the Dataset Practitioners Behind Large Language Model Development
- Authors: Crystal Qian, Emily Reif, Minsuk Kahng
- Abstract summary: We define the role of "dataset practitioners" at a technology company, Google.
We conduct semi-structured interviews with a cross-section of these practitioners.
We find that although data quality is a top priority, there is little consensus around what data quality is and how to evaluate it.
- Score: 5.48392160519422
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) become more advanced and impactful, it is increasingly important to scrutinize the data that they rely upon and produce. What is it to be a dataset practitioner doing this work? We approach this in two parts: first, we define the role of "dataset practitioners" by performing a retrospective analysis on the responsibilities of teams contributing to LLM development at a technology company, Google. Then, we conduct semi-structured interviews with a cross-section of these practitioners (N=10). We find that although data quality is a top priority, there is little consensus around what data quality is and how to evaluate it. Consequently, practitioners either rely on their own intuition or write custom code to evaluate their data. We discuss potential reasons for this phenomenon and opportunities for alignment.
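Since the interviewees report writing custom code to evaluate their data, a concrete illustration may help. The following is a minimal sketch of the kind of ad-hoc quality audit a practitioner might hand-roll; every metric, threshold, and name here is a hypothetical assumption, not a method from the paper.

```python
from collections import Counter

def audit_text_dataset(examples: list[str]) -> dict:
    """Ad-hoc quality heuristics of the kind a practitioner might hand-roll.

    All metrics and thresholds are illustrative assumptions, not the
    paper's methodology.
    """
    n = len(examples)
    counts = Counter(examples)
    duplicates = sum(c - 1 for c in counts.values())  # extra copies beyond the first
    empty = sum(1 for e in examples if not e.strip())
    lengths = [len(e.split()) for e in examples]
    return {
        "num_examples": n,
        "duplicate_rate": duplicates / n if n else 0.0,
        "empty_rate": empty / n if n else 0.0,
        "mean_length_tokens": sum(lengths) / n if n else 0.0,
    }

if __name__ == "__main__":
    sample = ["The cat sat.", "The cat sat.", "", "LLMs need scrutiny of their data."]
    print(audit_text_dataset(sample))
```

Because each team improvises its own version of such checks, results are hard to compare across teams, which is precisely the lack of consensus the paper describes.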
Related papers
- Context is Key: A Benchmark for Forecasting with Essential Textual Information [87.3175915185287]
"Context is Key" (CiK) is a forecasting benchmark that pairs numerical data with diverse types of carefully crafted textual context.
We evaluate a range of approaches, including statistical models, time series foundation models, and LLM-based forecasters.
We propose a simple yet effective LLM prompting method that outperforms all other tested methods on our benchmark.
arXiv Detail & Related papers (2024-10-24T17:56:08Z)
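The CiK entry above claims a simple yet effective LLM prompting method. As a hedged illustration only, assuming the core idea is to place the textual context next to the numerical history in a single prompt, a sketch might look like this; the template and function name are invented, not taken from the paper.

```python
def build_context_aware_prompt(history: list[float], context: str, horizon: int) -> str:
    """Assemble a forecasting prompt that pairs numbers with textual context.

    Hypothetical template; CiK's actual prompting method may differ.
    """
    series = ", ".join(f"{x:.2f}" for x in history)
    return (
        f"Context: {context}\n"
        f"Observed values: {series}\n"
        f"Forecast the next {horizon} values, taking the context into account. "
        f"Return them as a comma-separated list."
    )

print(build_context_aware_prompt(
    [10.1, 10.4, 9.8], "A maintenance shutdown is scheduled next week.", 3))
```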
- MMSci: A Dataset for Graduate-Level Multi-Discipline Multimodal Scientific Understanding [59.41495657570397]
We present a comprehensive dataset compiled from Nature Communications articles covering 72 scientific fields.
We evaluated 19 proprietary and open-source models on two benchmark tasks, figure captioning and multiple-choice question answering, and conducted human expert annotation.
Fine-tuning Qwen2-VL-7B with our task-specific data achieved better performance than GPT-4o and even human experts in multiple-choice evaluations.
arXiv Detail & Related papers (2024-07-06T00:40:53Z)
- Data-Centric AI in the Age of Large Language Models [51.20451986068925]
This position paper proposes a data-centric viewpoint of AI research, focusing on large language models (LLMs).
We make the key observation that data is instrumental in the developmental (e.g., pretraining and fine-tuning) and inferential stages (e.g., in-context learning) of LLMs.
We identify four specific scenarios centered around data, covering data-centric benchmarks and data curation, data attribution, knowledge transfer, and inference contextualization.
arXiv Detail & Related papers (2024-06-20T16:34:07Z)
- An Integrated Data Processing Framework for Pretraining Foundation Models [57.47845148721817]
Researchers and practitioners often have to manually curate datasets from different sources.
We propose a data processing framework that integrates a Processing Module and an Analyzing Module.
The proposed framework is easy to use and highly flexible.
arXiv Detail & Related papers (2024-02-26T07:22:51Z)
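The framework entry above names a Processing Module and an Analyzing Module but gives no interface. A minimal sketch of how such a two-module pipeline could be wired, with all class and method names assumed for illustration:

```python
from dataclasses import dataclass

@dataclass
class ProcessingModule:
    """Hypothetical stand-in for the framework's Processing Module."""
    min_chars: int = 10

    def run(self, docs: list[str]) -> list[str]:
        # Deduplicate and drop near-empty documents.
        seen, kept = set(), []
        for d in docs:
            if len(d) >= self.min_chars and d not in seen:
                seen.add(d)
                kept.append(d)
        return kept

class AnalyzingModule:
    """Hypothetical stand-in for the framework's Analyzing Module."""
    def run(self, docs: list[str]) -> dict:
        return {"num_docs": len(docs),
                "avg_chars": sum(map(len, docs)) / len(docs) if docs else 0.0}

docs = ["a short doc about pretraining data", "a short doc about pretraining data", "tiny"]
processed = ProcessingModule().run(docs)
print(AnalyzingModule().run(processed))
```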
- Capture the Flag: Uncovering Data Insights with Large Language Models [90.47038584812925]
This study explores the potential of using Large Language Models (LLMs) to automate the discovery of insights in data.
We propose a new evaluation methodology based on a "capture the flag" principle, measuring the ability of such models to recognize meaningful and pertinent information (flags) in a dataset.
arXiv Detail & Related papers (2023-12-21T14:20:06Z)
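The "capture the flag" entry above measures whether a model surfaces planted, meaningful facts (flags) in a dataset. Under that reading, a toy scorer could look like the following; the substring-matching rule is an assumption, not the paper's protocol.

```python
def capture_rate(model_insights: str, flags: list[str]) -> float:
    """Fraction of planted flags mentioned in the model's insight text.

    Case-insensitive substring matching is an assumption; the paper's
    evaluation may use a more sophisticated matching procedure.
    """
    text = model_insights.lower()
    captured = sum(1 for flag in flags if flag.lower() in text)
    return captured / len(flags) if flags else 0.0

flags = ["sales dropped 40% in March", "region B has missing entries"]
insights = "The most striking anomaly is that sales dropped 40% in March."
print(capture_rate(insights, flags))  # 0.5: one of two flags captured
```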
- On Responsible Machine Learning Datasets with Fairness, Privacy, and Regulatory Norms [56.119374302685934]
There have been severe concerns over the trustworthiness of AI technologies.
Machine and deep learning algorithms depend heavily on the data used during their development.
We propose a framework to evaluate the datasets through a responsible rubric.
arXiv Detail & Related papers (2023-10-24T14:01:53Z)
- Whose AI Dream? In search of the aspiration in data annotation [12.454034525520497]
This paper investigates work practices concerning data annotation as performed in industry in India.
Previous investigations have largely focused on annotator subjectivity, bias and efficiency.
Our results show that the work of annotators is dictated by the interests, priorities and values of others above their station.
arXiv Detail & Related papers (2022-03-21T06:28:54Z)
- Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation [7.480972965984986]
We survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation.
We lay out the challenges in this space along two layers: who the annotator is, and how the annotators' lived experiences can impact their annotations.
We put forth a concrete set of recommendations and considerations for dataset developers at various stages of the ML data pipeline.
arXiv Detail & Related papers (2021-12-08T19:56:56Z)
- Studying Up Machine Learning Data: Why Talk About Bias When We Mean Power? [0.0]
We argue that reducing societal problems to "bias" misses the context-based nature of data.
We highlight the corporate forces and market imperatives involved in the labor of data workers that subsequently shape ML datasets.
arXiv Detail & Related papers (2021-09-16T17:38:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.