Whose AI Dream? In search of the aspiration in data annotation
- URL: http://arxiv.org/abs/2203.10748v1
- Date: Mon, 21 Mar 2022 06:28:54 GMT
- Title: Whose AI Dream? In search of the aspiration in data annotation
- Authors: Ding Wang, Shantanu Prabhat, Nithya Sambasivan
- Abstract summary: This paper investigates the work practices concerning data annotation as performed in industry in India.
Previous investigations have largely focused on annotator subjectivity, bias and efficiency.
Our results show that the work of annotators is dictated by the interests, priorities and values of others above their station.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents the practice of data annotation from the perspective
of the annotators. Data is fundamental to ML models. This paper investigates the
work practices concerning data annotation as performed in industry in India.
Previous investigations have largely focused on annotator subjectivity, bias and
efficiency. We present a wider perspective on data annotation: following a
grounded approach, we conducted three sets of interviews with 25 annotators, 10
industry experts and 12 ML practitioners. Our results show that the work of
annotators is dictated by the interests, priorities and values of others above
their station. We contend that, more than a technical task, data annotation is a
systematic exercise of power through organizational structure and practice. We
propose a set of implications for how we can cultivate and encourage better
practice to balance the tension between the need for high-quality data at low
cost and the annotators' aspirations for well-being, career prospects, and
active participation in building the AI dream.
Related papers
- Self-Training with Pseudo-Label Scorer for Aspect Sentiment Quad Prediction [54.23208041792073]
Aspect Sentiment Quad Prediction (ASQP) aims to predict all quads (aspect term, aspect category, opinion term, sentiment polarity) for a given review.
A key challenge in the ASQP task is the scarcity of labeled data, which limits the performance of existing methods.
We propose a self-training framework with a pseudo-label scorer, wherein a scorer assesses the match between reviews and their pseudo-labels.
arXiv Detail & Related papers (2024-06-26T05:30:21Z)
- A Dataset for the Validation of Truth Inference Algorithms Suitable for Online Deployment [76.04306818209753]
We introduce a substantial crowdsourcing annotation dataset collected from a real-world crowdsourcing platform.
This dataset comprises approximately two thousand workers, one million tasks, and six million annotations.
We evaluate the effectiveness of several representative truth inference algorithms on this dataset.
arXiv Detail & Related papers (2024-03-10T16:00:41Z)
- ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous Driving [96.92499034935466]
End-to-end differentiable learning for autonomous driving has recently become a prominent paradigm.
One main bottleneck lies in its voracious appetite for high-quality labeled data.
We propose a planning-oriented active learning method which progressively annotates part of collected raw data.
arXiv Detail & Related papers (2024-03-05T11:39:07Z)
- Understanding the Dataset Practitioners Behind Large Language Model Development [5.48392160519422]
We define the role of "dataset practitioners" at a technology company, Google.
We conduct semi-structured interviews with a cross-section of these practitioners.
We find that although data quality is a top priority, there is little consensus around what data quality is and how to evaluate it.
arXiv Detail & Related papers (2024-02-21T23:50:37Z)
- Exploring Practitioner Perspectives On Training Data Attribution Explanations [20.45528493625083]
We interviewed 10 practitioners to understand the possible usability of training data attribution explanations.
We found that training data quality is often the most important factor for high model performance in practice.
We urge the community to focus on the utility of TDA techniques from the human-machine collaboration perspective.
arXiv Detail & Related papers (2023-10-31T14:10:30Z)
- Data-centric Artificial Intelligence: A Survey [47.24049907785989]
Recently, the role of data in AI has been significantly magnified, giving rise to the emerging concept of data-centric AI.
In this survey, we discuss the necessity of data-centric AI, followed by a holistic view of three general data-centric goals.
We believe this is the first comprehensive survey that provides a global view of a spectrum of tasks across various stages of the data lifecycle.
arXiv Detail & Related papers (2023-03-17T17:44:56Z)
- Whose Ground Truth? Accounting for Individual and Collective Identities Underlying Dataset Annotation [7.480972965984986]
We survey an array of literature that provides insights into ethical considerations around crowdsourced dataset annotation.
We lay out the challenges in this space along two layers: who the annotator is, and how the annotators' lived experiences can impact their annotations.
We put forth a concrete set of recommendations and considerations for dataset developers at various stages of the ML data pipeline.
arXiv Detail & Related papers (2021-12-08T19:56:56Z)
- Interpreting Deep Knowledge Tracing Model on EdNet Dataset [67.81797777936868]
In this work, we perform similar tasks but on a large and newly available dataset, called EdNet.
The preliminary experiment results show the effectiveness of the interpreting techniques.
arXiv Detail & Related papers (2021-10-31T07:18:59Z)
- Between Subjectivity and Imposition: Power Dynamics in Data Annotation for Computer Vision [1.933681537640272]
This paper investigates practices of image data annotation as performed in industrial contexts.
We define data annotation as a sense-making practice, where annotators assign meaning to data through the use of labels.
arXiv Detail & Related papers (2020-07-29T15:02:56Z)
- How Useful is Self-Supervised Pretraining for Visual Tasks? [133.1984299177874]
We evaluate various self-supervised algorithms across a comprehensive array of synthetic datasets and downstream tasks.
Our experiments offer insights into how the utility of self-supervision changes as the number of available labels grows.
arXiv Detail & Related papers (2020-03-31T16:03:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.