"Garbage In, Garbage Out" Revisited: What Do Machine Learning
Application Papers Report About Human-Labeled Training Data?
- URL: http://arxiv.org/abs/2107.02278v1
- Date: Mon, 5 Jul 2021 21:24:02 GMT
- Title: "Garbage In, Garbage Out" Revisited: What Do Machine Learning
Application Papers Report About Human-Labeled Training Data?
- Authors: R. Stuart Geiger, Dominique Cope, Jamie Ip, Marsha Lotosh, Aayush
Shah, Jenny Weng, Rebekah Tang
- Abstract summary: Supervised machine learning, in which models are automatically derived from labeled training data, is only as good as the quality of that data.
This study builds on prior work that investigated to what extent 'best practices' around labeling training data were followed in applied ML publications.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Supervised machine learning, in which models are automatically derived from
labeled training data, is only as good as the quality of that data. This study
builds on prior work that investigated to what extent 'best practices' around
labeling training data were followed in applied ML publications within a single
domain (social media platforms). In this paper, we expand by studying
publications that apply supervised ML in a far broader spectrum of disciplines,
focusing on human-labeled data. We report to what extent a random sample of ML
application papers across disciplines give specific details about whether best
practices were followed, while acknowledging that a greater range of
application fields necessarily produces greater diversity of labeling and
annotation methods. Because much of machine learning research and education
only focuses on what is done once a "ground truth" or "gold standard" of
training data is available, it is especially relevant to discuss issues around
the equally-important aspect of whether such data is reliable in the first
place. This determination becomes increasingly complex when applied to a
variety of specialized fields, as labeling can range from a task requiring
little-to-no background knowledge to one that must be performed by someone with
career expertise.
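Because the abstract turns on whether a "gold standard" is reliable in the first place, the usual first check is inter-annotator agreement. Below is a minimal, self-contained Python sketch (illustrative, not from the paper) of Cohen's kappa for two annotators labeling the same items; unlike raw percent agreement, kappa corrects for agreement expected by chance.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's marginal label rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys())
    return (p_o - p_e) / (1 - p_e)

# Two annotators, ten items: 8/10 raw agreement, but kappa is only ~0.58
# because both annotators label "pos" often enough to agree by chance.
a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "pos"]
b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "pos", "pos"]
print(f"kappa = {cohens_kappa(a, b):.2f}")
```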
Related papers
- Analyzing Dataset Annotation Quality Management in the Wild [63.07224587146207]
Even popular datasets used to train and evaluate state-of-the-art models contain a non-negligible number of erroneous annotations, biases, or artifacts.
While practices and guidelines regarding dataset creation projects exist, large-scale analysis has yet to be performed on how quality management is conducted.
arXiv Detail & Related papers (2023-07-16T21:22:40Z)
- Revisit Few-shot Intent Classification with PLMs: Direct Fine-tuning vs. Continual Pre-training [20.98770732015944]
Few-shot intent detection involves training a deep learning model to classify utterances based on their underlying intents using only a small amount of labeled data.
We show that continual pre-training may not be essential, since the overfitting problem of PLMs on this task may not be as serious as expected.
To maximize the utilization of the limited available data, we propose a context augmentation method and leverage sequential self-distillation to boost performance.
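The "sequential self-distillation" named here is, in the common formulation, repeated application of the standard distillation objective, with each generation's frozen model teaching the next on the same small labeled set. A hedged PyTorch sketch of that objective (the paper's exact formulation may differ):

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with KL divergence toward the previous
    generation's temperature-softened predictions."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients after temperature softening
    return alpha * hard + (1 - alpha) * soft
```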
arXiv Detail & Related papers (2023-06-08T15:26:52Z)
- Learning Instructions with Unlabeled Data for Zero-Shot Cross-Task Generalization [68.91386402390403]
We propose Unlabeled Data Augmented Instruction Tuning (UDIT) to take better advantage of the instructions during instruction learning.
We conduct extensive experiments to show UDIT's effectiveness in various scenarios of tasks and datasets.
arXiv Detail & Related papers (2022-10-17T15:25:24Z)
- Is margin all you need? An extensive empirical study of active learning on tabular data [66.18464006872345]
We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including the current state of the art.
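Margin sampling itself takes only a few lines, which makes the finding all the more striking. A sketch assuming a scikit-learn-style classifier (`clf` and `X_unlabeled` are hypothetical names):

```python
import numpy as np

def margin_sample(proba, k):
    """Pick the k unlabeled points whose top two class probabilities are
    closest, i.e. where the classifier is least decisive."""
    ordered = np.sort(proba, axis=1)          # ascending per row
    margin = ordered[:, -1] - ordered[:, -2]  # best minus second best
    return np.argsort(margin)[:k]             # smallest margins first

# Typical loop: score the pool, query labels for the chosen rows, retrain.
# idx = margin_sample(clf.predict_proba(X_unlabeled), k=100)
```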
arXiv Detail & Related papers (2022-10-07T21:18:24Z)
- The Word is Mightier than the Label: Learning without Pointillistic Labels using Data Programming [11.536162323162099]
Most advanced supervised Machine Learning (ML) models rely on vast amounts of point-by-point labelled training examples.
Hand-labelling vast amounts of data may be tedious, expensive, and error-prone.
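Data programming, named in the title, replaces per-example hand labels with cheap programmatic labeling functions whose noisy votes are aggregated. A toy sketch with hypothetical heuristics; a simple majority vote stands in for the learned label model that systems such as Snorkel actually fit:

```python
import numpy as np

ABSTAIN = -1

def lf_mentions_refund(text):  # weak signal for the "complaint" class (1)
    return 1 if "refund" in text.lower() else ABSTAIN

def lf_mentions_thanks(text):  # weak signal for the "praise" class (0)
    return 0 if "thanks" in text.lower() else ABSTAIN

def majority_vote(texts, lfs):
    """Aggregate labeling-function votes, ignoring abstentions."""
    votes = np.array([[lf(t) for lf in lfs] for t in texts])
    labels = []
    for row in votes:
        valid = row[row != ABSTAIN]
        labels.append(int(np.bincount(valid).argmax()) if valid.size else ABSTAIN)
    return labels

print(majority_vote(["Thanks a lot!", "I want a refund."],
                    [lf_mentions_refund, lf_mentions_thanks]))  # -> [0, 1]
```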
arXiv Detail & Related papers (2021-08-24T19:11:28Z)
- Streaming Self-Training via Domain-Agnostic Unlabeled Images [62.57647373581592]
We present streaming self-training (SST) that aims to democratize the process of learning visual recognition models.
Key to SST are two crucial observations: (1) domain-agnostic unlabeled images enable us to learn better models with a few labeled examples without any additional knowledge or supervision; and (2) learning is a continuous process and can be done by constructing a schedule of learning updates.
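Observation (2) describes the generic self-training schedule; a minimal sketch on plain feature vectors (this is the textbook loop, not SST's actual vision pipeline, and the classifier and confidence threshold are placeholder choices):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train_stream(X_lab, y_lab, unlabeled_batches, threshold=0.95):
    """After each incoming unlabeled batch, adopt confident predictions
    as pseudo-labels and refit: a schedule of learning updates."""
    model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    for X_u in unlabeled_batches:  # the "stream"
        proba = model.predict_proba(X_u)
        keep = proba.max(axis=1) >= threshold
        if keep.any():
            X_lab = np.vstack([X_lab, X_u[keep]])
            y_lab = np.concatenate([y_lab, proba[keep].argmax(axis=1)])
            model = LogisticRegression(max_iter=1000).fit(X_lab, y_lab)
    return model
```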
arXiv Detail & Related papers (2021-04-07T17:58:39Z)
- A Survey on Deep Learning with Noisy Labels: How to train your model when you cannot trust on the annotations? [21.562089974755125]
Several approaches have been proposed to improve the training of deep learning models in the presence of noisy labels.
This paper presents a survey of the main techniques in the literature, in which we classify the algorithms into the following groups: robust losses, sample weighting, sample selection, meta-learning, and combined approaches.
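As a concrete instance of the first group (robust losses), here is a sketch of generalized cross-entropy, one widely used choice; the survey covers many alternatives:

```python
import torch

def generalized_cross_entropy(logits, labels, q=0.7):
    """GCE interpolates between cross-entropy (q -> 0) and MAE (q = 1),
    the latter being far less sensitive to mislabeled examples."""
    p = torch.softmax(logits, dim=-1)
    p_true = p.gather(1, labels.unsqueeze(1)).squeeze(1)  # prob of given label
    return ((1.0 - p_true.clamp_min(1e-7) ** q) / q).mean()
```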
arXiv Detail & Related papers (2020-12-05T15:45:20Z)
- DAGA: Data Augmentation with a Generation Approach for Low-resource Tagging Tasks [88.62288327934499]
We propose a novel augmentation method with language models trained on the linearized labeled sentences.
Our method is applicable to both supervised and semi-supervised settings.
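The linearization this rests on is easy to illustrate: tags are interleaved with words so that an ordinary language model can be trained on the result, then sampled and de-linearized into new tagged sentences. A minimal sketch of the encoding step:

```python
def linearize(tokens, tags):
    """Prepend each non-O tag to its word, producing a flat token sequence
    a language model can learn to imitate."""
    out = []
    for tok, tag in zip(tokens, tags):
        if tag != "O":
            out.append(tag)
        out.append(tok)
    return " ".join(out)

print(linearize(["John", "lives", "in", "Paris"],
                ["B-PER", "O", "O", "B-LOC"]))
# -> "B-PER John lives in B-LOC Paris"
```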
arXiv Detail & Related papers (2020-11-03T07:49:15Z)
- Principles and Practice of Explainable Machine Learning [12.47276164048813]
This report focuses on data-driven methods -- machine learning (ML) and pattern recognition models in particular.
With the increasing prevalence and complexity of these methods, business stakeholders, at the very least, have a growing number of concerns about the drawbacks of models.
We have undertaken a survey to help industry practitioners understand the field of explainable machine learning better.
arXiv Detail & Related papers (2020-09-18T14:50:27Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.