The Word is Mightier than the Label: Learning without Pointillistic
Labels using Data Programming
- URL: http://arxiv.org/abs/2108.10921v2
- Date: Thu, 26 Aug 2021 00:31:10 GMT
- Authors: Chufan Gao and Mononito Goswami
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most advanced supervised Machine Learning (ML) models rely on vast amounts of
point-by-point labelled training examples. Hand-labelling vast amounts of data
may be tedious, expensive, and error-prone. Recently, some studies have
explored the use of diverse sources of weak supervision to produce competitive
end model classifiers. In this paper, we survey recent work on weak
supervision, and in particular, we investigate the Data Programming (DP)
framework. Taking a set of potentially noisy heuristics as input, DP assigns
denoised probabilistic labels to each data point in a dataset using a
probabilistic graphical model of heuristics. We analyze the mathematical fundamentals
behind DP and demonstrate its power by applying it to two real-world text
classification tasks. Furthermore, we compare DP with pointillistic active and
semi-supervised learning techniques traditionally applied in data-sparse
settings.
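The label-denoising idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three labeling functions, the spam/ham task, and the accuracy values below are hypothetical, and the accuracies are fixed by hand, whereas DP estimates them from the agreements and disagreements among heuristics without ground-truth labels. The sketch assumes binary labels, heuristics that are conditionally independent given the true label, and a uniform class prior, in which case the posterior reduces to a sigmoid over accuracy-weighted votes.

```python
import math

# Vote encoding (an assumption of this sketch): +1 = spam, -1 = ham, 0 = abstain.
ABSTAIN, HAM, SPAM = 0, -1, 1


# Hypothetical noisy heuristics ("labeling functions") for a toy spam task.
def lf_contains_free(text):
    return SPAM if "free" in text.lower() else ABSTAIN


def lf_has_greeting(text):
    return HAM if text.lower().startswith(("hi", "hello")) else ABSTAIN


def lf_many_exclamations(text):
    return SPAM if text.count("!") >= 2 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_free, lf_has_greeting, lf_many_exclamations]

# Assumed accuracies, hand-picked for illustration only.
# DP would learn these from unlabeled data instead of taking them as given.
ASSUMED_ACCURACIES = [0.8, 0.7, 0.6]


def probabilistic_label(text, lfs=LABELING_FUNCTIONS, accuracies=ASSUMED_ACCURACIES):
    """Return P(label = SPAM | votes) under the independence assumption."""
    log_odds = 0.0  # a uniform prior contributes zero log-odds
    for lf, acc in zip(lfs, accuracies):
        vote = lf(text)  # +1, -1, or 0; an abstention adds nothing
        log_odds += vote * math.log(acc / (1.0 - acc))
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    for message in ["Free prize!!", "Hello, lunch tomorrow?", "Hello, free lunch!"]:
        print(f"{message!r}: P(spam) = {probabilistic_label(message):.3f}")
```

Each output is a soft label in [0, 1] rather than a hard class, which is what lets a downstream end model be trained with a noise-aware loss. When every heuristic abstains, the sketch falls back to the prior of 0.5.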
Related papers
- Minimally Supervised Learning using Topological Projections in Self-Organizing Maps [55.31182147885694]
We introduce a semi-supervised learning approach based on topological projections in self-organizing maps (SOMs).
Our proposed method first trains SOMs on unlabeled data, then assigns a minimal number of available labeled data points to key best matching units (BMUs).
Our results indicate that the proposed minimally supervised model significantly outperforms traditional regression techniques.
arXiv Detail & Related papers (2024-01-12T22:51:48Z)
- Deep Partial Multi-Label Learning with Graph Disambiguation [27.908565535292723]
We propose a novel deep Partial multi-Label model with grAph-disambIguatioN (PLAIN).
Specifically, we introduce the instance-level and label-level similarities to recover label confidences.
At each training epoch, labels are propagated on the instance and label graphs to produce relatively accurate pseudo-labels.
arXiv Detail & Related papers (2023-05-10T04:02:08Z)
- A Benchmark Generative Probabilistic Model for Weak Supervised Learning [2.0257616108612373]
Weak Supervised Learning approaches have been developed to alleviate the annotation burden.
We show that probabilistic latent variable models (PLVMs) achieve state-of-the-art performance across four datasets.
arXiv Detail & Related papers (2023-03-31T07:06:24Z)
- Learned Label Aggregation for Weak Supervision [8.819582879892762]
We propose a data programming approach that aggregates weak supervision signals to generate labeled data easily.
The quality of the generated labels depends on a label aggregation model that aggregates all noisy labels from all LFs to infer the ground-truth labels.
We show the model can be trained using synthetically generated data and design an effective architecture for the model.
arXiv Detail & Related papers (2022-07-27T14:36:35Z)
- Data Consistency for Weakly Supervised Learning [15.365232702938677]
Training machine learning models involves using large amounts of human-annotated data.
We propose a novel weak supervision algorithm that processes noisy labels, i.e., weak signals.
We show that it significantly outperforms state-of-the-art weak supervision methods on both text and image classification tasks.
arXiv Detail & Related papers (2022-02-08T16:48:19Z)
- AutoGeoLabel: Automated Label Generation for Geospatial Machine Learning [69.47585818994959]
We evaluate a big data processing pipeline to auto-generate labels for remote sensing data.
We utilize the big geo-data platform IBM PAIRS to dynamically generate such labels in dense urban areas.
arXiv Detail & Related papers (2022-01-31T20:02:22Z)
- Improving Contrastive Learning on Imbalanced Seed Data via Open-World Sampling [96.8742582581744]
We present an open-world unlabeled data sampling framework called Model-Aware K-center (MAK).
MAK follows three simple principles: tailness, proximity, and diversity.
We demonstrate that MAK can consistently improve both the overall representation quality and the class balancedness of the learned features.
arXiv Detail & Related papers (2021-11-01T15:09:41Z)
- Few-shot Learning via Dependency Maximization and Instance Discriminant Analysis [21.8311401851523]
We study the few-shot learning problem, where a model learns to recognize new objects with extremely few labeled data per category.
We propose a simple approach to exploit unlabeled data accompanying the few-shot task for improving few-shot performance.
arXiv Detail & Related papers (2021-09-07T02:19:01Z)
- Combining Feature and Instance Attribution to Detect Artifacts [62.63504976810927]
We propose methods to facilitate identification of training data artifacts.
We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data.
We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice.
arXiv Detail & Related papers (2021-07-01T09:26:13Z)
- Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We create new state-of-the-art results on both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z)
- Learning to Count in the Crowd from Limited Labeled Data [109.2954525909007]
We focus on reducing the annotation effort by learning to count in the crowd from a limited number of labeled samples.
Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data.
arXiv Detail & Related papers (2020-07-07T04:17:01Z)