Training Subset Selection for Weak Supervision
- URL: http://arxiv.org/abs/2206.02914v1
- Date: Mon, 6 Jun 2022 21:31:32 GMT
- Title: Training Subset Selection for Weak Supervision
- Authors: Hunter Lang, Aravindan Vijayaraghavan, David Sontag
- Abstract summary: We show that there is a tradeoff between the amount of weakly-labeled data used and the precision of the weak labels.
We combine pretrained data representations with the cut statistic to select high-quality subsets of the weakly-labeled training data.
Using less weakly-labeled data improves the accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark tasks.
- Score: 17.03788288165262
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing weak supervision approaches use all the data covered by weak signals
to train a classifier. We show both theoretically and empirically that this is
not always optimal. Intuitively, there is a tradeoff between the amount of
weakly-labeled data and the precision of the weak labels. We explore this
tradeoff by combining pretrained data representations with the cut statistic
(Muhlenbach et al., 2004) to select (hopefully) high-quality subsets of the
weakly-labeled training data. Subset selection applies to any label model and
classifier and is very simple to plug in to existing weak supervision
pipelines, requiring just a few lines of code. We show our subset selection
method improves the performance of weak supervision for a wide range of label
models, classifiers, and datasets. Using less weakly-labeled data improves the
accuracy of weak supervision pipelines by up to 19% (absolute) on benchmark
tasks.
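Below is a minimal sketch of what such subset selection could look like when plugged into an existing pipeline, assuming pretrained embeddings and weak labels are already in hand. The neighbor-disagreement score is a simplified stand-in for the cut statistic, and names such as `select_subset` are illustrative rather than the authors' API.

```python
# Simplified sketch (not the authors' code): score each weakly-labeled example
# by how often its nearest neighbors in a pretrained embedding space carry a
# different weak label, then keep the most "consistent" fraction for training.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def select_subset(embeddings, weak_labels, k=10, keep_frac=0.5):
    embeddings = np.asarray(embeddings)
    weak_labels = np.asarray(weak_labels)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)            # idx[:, 0] is the point itself
    neighbor_labels = weak_labels[idx[:, 1:]]     # weak labels of the k neighbors
    disagreement = (neighbor_labels != weak_labels[:, None]).mean(axis=1)
    n_keep = int(keep_frac * len(weak_labels))
    return np.argsort(disagreement)[:n_keep]      # indices of the selected subset

# Hypothetical usage: train the end classifier only on the selected subset.
# keep = select_subset(embeddings, weak_labels)
# classifier.fit(features[keep], weak_labels[keep])
```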
Related papers
- Boosting Semi-Supervised Learning by bridging high and low-confidence predictions [4.18804572788063]
Pseudo-labeling is a crucial technique in semi-supervised learning (SSL).
We propose a new method called ReFixMatch, which aims to utilize all of the unlabeled data during training.
arXiv Detail & Related papers (2023-08-15T00:27:18Z)
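For context on the ReFixMatch entry above, here is a generic confidence-thresholded pseudo-labeling step (a FixMatch-style baseline, not ReFixMatch itself); per its abstract, ReFixMatch additionally tries to make use of the low-confidence predictions that this baseline discards.

```python
# Generic pseudo-labeling baseline (illustrative, not ReFixMatch): keep only
# unlabeled examples whose predicted class probability exceeds a threshold.
import numpy as np

def pseudo_label(probs, threshold=0.95):
    """probs: (n, num_classes) model predictions on unlabeled data."""
    confidence = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    mask = confidence >= threshold           # ReFixMatch, per its abstract, also
    return labels[mask], np.where(mask)[0]   # aims to exploit the examples this drops
```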
- Losses over Labels: Weakly Supervised Learning via Direct Loss Construction [71.11337906077483]
Programmable weak supervision is a growing paradigm within machine learning.
We propose Losses over Labels (LoL), which creates losses directly from the labeling functions (LFs) without going through the intermediate step of a label.
We show that LoL improves upon existing weak supervision methods on several benchmark text and image classification tasks.
arXiv Detail & Related papers (2022-12-13T22:29:14Z)
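One plausible reading of the direct-loss idea in the Losses over Labels entry above, offered as a hedged sketch rather than the paper's method: accumulate a per-labeling-function loss over the examples each function covers, instead of first collapsing the votes into a single pseudo-label.

```python
# Hedged sketch: build a training loss directly from labeling-function votes.
# lf_votes[i, j] is LF j's vote for example i, or -1 if LF j abstains.
import numpy as np

def direct_lf_loss(probs, lf_votes):
    """probs: (n, num_classes) model outputs; lf_votes: (n, num_lfs) ints."""
    total, count = 0.0, 0
    for j in range(lf_votes.shape[1]):
        covered = lf_votes[:, j] >= 0                   # examples LF j labels
        if covered.any():
            p = probs[covered, lf_votes[covered, j]]    # model prob of LF j's vote
            total += -np.log(np.clip(p, 1e-12, 1.0)).sum()
            count += covered.sum()
    return total / max(count, 1)
```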
- Improved Adaptive Algorithm for Scalable Active Learning with Weak Labeler [89.27610526884496]
Weak Labeler Active Cover (WL-AC) is able to robustly leverage the lower quality weak labelers to reduce the query complexity while retaining the desired level of accuracy.
We show its effectiveness on the corrupted-MNIST dataset by significantly reducing the number of labels while keeping the same accuracy as in passive learning.
arXiv Detail & Related papers (2022-11-04T02:52:54Z)
- Learned Label Aggregation for Weak Supervision [8.819582879892762]
We propose a data programming approach that aggregates weak supervision signals to generate labeled data easily.
The quality of the generated labels depends on a label aggregation model that aggregates all noisy labels from all LFs to infer the ground-truth labels.
We show the model can be trained using synthetically generated data and design an effective architecture for the model.
arXiv Detail & Related papers (2022-07-27T14:36:35Z)
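As a point of reference for the label-aggregation entry above, the sketch below shows a plain majority-vote aggregator over labeling-function outputs; the paper instead learns the aggregation model (reportedly trainable on synthetically generated data), so this is only an illustrative baseline.

```python
# Baseline label aggregation (illustrative): majority vote over LF outputs,
# with -1 meaning the LF abstains on that example.
import numpy as np

def majority_vote(lf_votes, num_classes):
    """lf_votes: (n, num_lfs) integer votes in {-1, 0, ..., num_classes - 1}."""
    counts = np.zeros((lf_votes.shape[0], num_classes))
    for c in range(num_classes):
        counts[:, c] = (lf_votes == c).sum(axis=1)
    return counts.argmax(axis=1)   # inferred label per example (ties -> lowest class)
```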
- Label Noise-Resistant Mean Teaching for Weakly Supervised Fake News Detection [93.6222609806278]
We propose a novel label noise-resistant mean teaching approach (LNMT) for weakly supervised fake news detection.
LNMT leverages unlabeled news and users' feedback comments to enlarge the amount of training data.
LNMT establishes a mean teacher framework equipped with label propagation and label reliability estimation.
arXiv Detail & Related papers (2022-06-10T16:01:58Z)
- Data Consistency for Weakly Supervised Learning [15.365232702938677]
Training machine learning models involves using large amounts of human-annotated data.
We propose a novel weak supervision algorithm that processes noisy labels, i.e., weak signals.
We show that it significantly outperforms state-of-the-art weak supervision methods on both text and image classification tasks.
arXiv Detail & Related papers (2022-02-08T16:48:19Z)
- Dash: Semi-Supervised Learning with Dynamic Thresholding [72.74339790209531]
We propose a semi-supervised learning (SSL) approach that selectively uses unlabeled examples to train models.
Our proposed approach, Dash, enjoys its adaptivity in terms of unlabeled data selection.
arXiv Detail & Related papers (2021-09-01T23:52:29Z)
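A simplified reading of the dynamic-thresholding idea in the Dash entry above, with an assumed decay schedule rather than the paper's exact one: retain only the unlabeled examples whose pseudo-label loss falls below a threshold that shrinks as training proceeds.

```python
# Hedged sketch of Dash-style selection: a loss threshold that decays each
# epoch, so fewer (and presumably cleaner) pseudo-labeled examples are kept.
import numpy as np

def select_by_dynamic_threshold(losses, epoch, rho0=2.0, gamma=1.3):
    """losses: per-example pseudo-label losses on the unlabeled data."""
    threshold = rho0 * gamma ** (-epoch)      # assumed decay schedule
    return np.where(losses < threshold)[0]    # indices of retained examples
```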
- Disentangling Sampling and Labeling Bias for Learning in Large-Output Spaces [64.23172847182109]
We show that different negative sampling schemes implicitly trade-off performance on dominant versus rare labels.
We provide a unified means to explicitly tackle both sampling bias, arising from working with a subset of all labels, and labeling bias, which is inherent to the data due to label imbalance.
arXiv Detail & Related papers (2021-05-12T15:40:13Z)
- Are Fewer Labels Possible for Few-shot Learning? [81.89996465197392]
Few-shot learning is challenging due to its very limited data and labels.
Recent studies in big transfer (BiT) show that few-shot learning can greatly benefit from pretraining on a large-scale labeled dataset in a different domain.
We propose eigen-finetuning to enable few-shot learning with fewer labels by leveraging the co-evolution of clustering and eigen-samples during finetuning.
arXiv Detail & Related papers (2020-12-10T18:59:29Z)
- Meta-Learning for Neural Relation Classification with Distant Supervision [38.755055486296435]
We propose a meta-learning based approach, which learns to reweight noisy training data under the guidance of reference data.
Experiments on several datasets demonstrate that the reference data can effectively guide the selection of training data.
arXiv Detail & Related papers (2020-10-26T12:52:28Z)
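The meta-learning entry above reweights noisy training data under the guidance of clean reference data; a common realization of that idea, sketched here under assumptions (not necessarily the paper's exact procedure), weights each noisy example by how well its loss gradient aligns with the average gradient on the reference set.

```python
# Hedged sketch: gradient-alignment reweighting for logistic regression.
# Noisy examples whose gradients point the same way as the clean reference
# gradient get higher weight; conflicting examples are down-weighted to zero.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reference_guided_weights(w, X_noisy, y_noisy, X_ref, y_ref):
    g_noisy = (sigmoid(X_noisy @ w) - y_noisy)[:, None] * X_noisy   # per-example grads
    g_ref = ((sigmoid(X_ref @ w) - y_ref)[:, None] * X_ref).mean(axis=0)
    weights = np.maximum(0.0, g_noisy @ g_ref)    # keep only aligned examples
    s = weights.sum()
    return weights / s if s > 0 else weights      # normalized per-example weights
```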
- Constrained Labeling for Weakly Supervised Learning [15.365232702938677]
We propose a simple data-free approach for combining weak supervision signals.
Our method is efficient and stable, converging after a few iterations of descent.
We show experimentally that our method outperforms other weak supervision methods on various text- and image-classification tasks.
arXiv Detail & Related papers (2020-09-15T21:30:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.