Learning to Robustly Aggregate Labeling Functions for Semi-supervised
Data Programming
- URL: http://arxiv.org/abs/2109.11410v1
- Date: Thu, 23 Sep 2021 14:42:46 GMT
- Title: Learning to Robustly Aggregate Labeling Functions for Semi-supervised
Data Programming
- Authors: Ayush Maheshwari, Krishnateja Killamsetty, Ganesh Ramakrishnan,
Rishabh Iyer, Marina Danilevsky and Lucian Popa
- Abstract summary: A critical bottleneck in supervised machine learning is the need for large amounts of labeled data.
In this work, we propose an LF-based reweighting framework to address two critical limitations of prior approaches: labeled data left unused during model training, and noisy LFs whose naive aggregation degrades performance.
Our algorithm learns a joint model on the (same) labeled dataset used for LF induction along with any unlabeled data in a semi-supervised manner.
- Score: 14.639568384768042
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A critical bottleneck in supervised machine learning is the need for large
amounts of labeled data, which is expensive and time-consuming to obtain.
However, it has been shown that a small amount of labeled data, while
insufficient to re-train a model, can be effectively used to generate
human-interpretable labeling functions (LFs). These LFs, in turn, have been
used to generate a large amount of additional noisy labeled data, in a paradigm
that is now commonly referred to as data programming. However, previous
approaches to automatically generate LFs make no attempt to further use the
given labeled data for model training, thus giving up opportunities for
improved performance. Moreover, since the LFs are generated from a relatively
small labeled dataset, they are prone to being noisy, and naively aggregating
these LFs can lead to very poor performance in practice. In this work, we
propose an LF-based reweighting framework to address these two
critical limitations. Our algorithm learns a joint model on the (same) labeled
dataset used for LF induction along with any unlabeled data in a
semi-supervised manner, and more critically, reweighs each LF according to its
goodness, influencing its contribution to the semi-supervised loss using a
robust bi-level optimization algorithm. We show that our algorithm
significantly outperforms prior approaches on several text classification
datasets.
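To make the reweighting idea concrete, below is a minimal PyTorch sketch of LF reweighting via a one-step-unrolled bi-level update. This is not the paper's algorithm or its actual loss terms: the linear model, the softmax LF weighting, the pseudo-label loss, and all function and tensor names (lf_weighted_soft_labels, bilevel_step, lf_votes_unlab, etc.) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def lf_weighted_soft_labels(lf_votes, lf_weights, num_classes):
    """Aggregate LF votes of shape (n, m), entries in {-1 = abstain, 0..K-1},
    into soft labels of shape (n, K) using softmax-normalized per-LF weights."""
    w = torch.softmax(lf_weights, dim=0)                            # (m,)
    fired = (lf_votes >= 0).float().unsqueeze(-1)                   # (n, m, 1)
    onehot = F.one_hot(lf_votes.clamp(min=0), num_classes).float()  # (n, m, K)
    scores = (fired * onehot * w.view(1, -1, 1)).sum(dim=1)         # (n, K)
    return torch.softmax(scores, dim=1)

def bilevel_step(W, lf_weights, X_lab, y_lab, X_unlab, lf_votes_unlab,
                 num_classes, inner_lr=0.1, outer_lr=0.05):
    """One alternating step: inner = fit a linear model on a semi-supervised
    loss; outer = adjust LF weights so that the one-step-updated model does
    well on the clean labeled set (a crude stand-in for full bi-level optimization)."""
    # Inner objective: supervised loss on labeled data + LF-weighted
    # pseudo-label loss on unlabeled data.
    soft_y = lf_weighted_soft_labels(lf_votes_unlab, lf_weights, num_classes)
    sup = F.cross_entropy(X_lab @ W, y_lab)
    unsup = -(soft_y * F.log_softmax(X_unlab @ W, dim=1)).sum(dim=1).mean()
    inner_loss = sup + unsup

    # One-step unrolled model update, keeping the graph so the outer gradient
    # can flow back into the LF weights.
    grad_W, = torch.autograd.grad(inner_loss, W, create_graph=True)
    W_lookahead = W - inner_lr * grad_W

    # Outer objective: loss of the updated model on the clean labeled set.
    outer_loss = F.cross_entropy(X_lab @ W_lookahead, y_lab)
    grad_w, = torch.autograd.grad(outer_loss, lf_weights)

    with torch.no_grad():
        lf_weights -= outer_lr * grad_w   # down-weight unhelpful LFs
        W -= inner_lr * grad_W            # commit the model update

# Hypothetical toy setup: d features, K classes, m labeling functions.
d, K, m = 20, 3, 5
W = torch.zeros(d, K, requires_grad=True)
lf_weights = torch.zeros(m, requires_grad=True)
X_lab, y_lab = torch.randn(30, d), torch.randint(0, K, (30,))
X_unlab = torch.randn(200, d)
lf_votes_unlab = torch.randint(-1, K, (200, m))   # -1 = abstain
for _ in range(100):
    bilevel_step(W, lf_weights, X_lab, y_lab, X_unlab, lf_votes_unlab, K)
```

In this sketch the outer gradient reaches the LF weights only through the unrolled inner update, which is what allows poorly performing LFs to be down-weighted; the paper's robust bi-level formulation is more involved than this one-step approximation.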
Related papers
- Learning with Less: Knowledge Distillation from Large Language Models via Unlabeled Data [54.934578742209716]
In real-world NLP applications, Large Language Models (LLMs) offer promising solutions due to their extensive training on vast datasets.
LLKD is an adaptive sample selection method that incorporates signals from both the teacher and student.
Our comprehensive experiments show that LLKD achieves superior performance across various datasets with higher data efficiency.
arXiv Detail & Related papers (2024-11-12T18:57:59Z)
- Aligning Large Language Models with Self-generated Preference Data [72.99676237703099]
We propose a new framework that boosts the alignment of large language models (LLMs) with human preferences.
Our key idea is leveraging the human prior knowledge within the small (seed) data.
We introduce a noise-aware preference learning algorithm to mitigate the risk of low quality within generated preference data.
arXiv Detail & Related papers (2024-06-06T18:01:02Z)
- Self-Supervised Dataset Distillation for Transfer Learning [77.4714995131992]
We propose a novel problem of distilling an unlabeled dataset into a set of small synthetic samples for efficient self-supervised learning (SSL).
We first prove that the gradient of the synthetic samples with respect to an SSL objective in naive bilevel optimization is biased due to randomness originating from data augmentations or masking.
We empirically validate the effectiveness of our method on various applications involving transfer learning.
arXiv Detail & Related papers (2023-10-10T10:48:52Z)
- A Benchmark Generative Probabilistic Model for Weak Supervised Learning [2.0257616108612373]
Weak Supervised Learning approaches have been developed to alleviate the annotation burden.
We show that probabilistic latent variable models (PLVMs) achieve state-of-the-art performance across four datasets.
arXiv Detail & Related papers (2023-03-31T07:06:24Z)
- Learned Label Aggregation for Weak Supervision [8.819582879892762]
We propose a data programming approach that aggregates weak supervision signals to generate labeled data easily.
The quality of the generated labels depends on a label aggregation model that aggregates all noisy labels from all LFs to infer the ground-truth labels.
We show the model can be trained using synthetically generated data and design an effective architecture for the model; a minimal illustrative sketch of this idea appears after this list.
arXiv Detail & Related papers (2022-07-27T14:36:35Z)
- ULF: Unsupervised Labeling Function Correction using Cross-Validation for Weak Supervision [5.566060402907773]
Weak supervision (WS) is a cost-effective alternative to manual data labeling.
We introduce a new algorithm ULF for Unsupervised Labeling Function correction.
ULF refines the allocation of LFs to classes by re-estimating this assignment on highly reliable cross-validated samples.
arXiv Detail & Related papers (2022-04-14T10:29:01Z)
- Label Augmentation with Reinforced Labeling for Weak Supervision [0.1529342790344802]
This paper proposes a new approach called reinforced labeling (RL).
RL extends the LFs' outputs to samples not covered by any LF, based on similarities among samples.
Experiments on several domains (classification of YouTube comments, wine quality, and weather prediction) show considerable gains.
arXiv Detail & Related papers (2022-04-13T14:54:02Z)
- Relieving the Plateau: Active Semi-Supervised Learning for a Better Landscape [2.3046646540823916]
Semi-supervised learning (SSL) leverages unlabeled data that are more accessible than their labeled counterparts.
Active learning (AL) selects unlabeled instances to be annotated by a human-in-the-loop in hopes of better performance with less labeled data.
We propose convergence rate control (CRC), an AL algorithm that selects unlabeled data to improve the problem conditioning upon inclusion in the labeled set.
arXiv Detail & Related papers (2021-04-08T06:03:59Z)
- Self-Tuning for Data-Efficient Deep Learning [75.34320911480008]
Self-Tuning is a novel approach to enable data-efficient deep learning.
It unifies the exploration of labeled and unlabeled data and the transfer of a pre-trained model.
It outperforms its SSL and TL counterparts on five tasks by sharp margins.
arXiv Detail & Related papers (2021-02-25T14:56:19Z)
- Semi-Supervised Learning with Meta-Gradient [123.26748223837802]
We propose a simple yet effective meta-learning algorithm in semi-supervised learning.
We find that the proposed algorithm performs favorably against state-of-the-art methods.
arXiv Detail & Related papers (2020-07-08T08:48:56Z)
- Learning to Count in the Crowd from Limited Labeled Data [109.2954525909007]
We focus on reducing the annotation effort by learning to count in the crowd from a limited number of labeled samples.
Specifically, we propose a Gaussian Process-based iterative learning mechanism that involves estimation of pseudo-ground truth for the unlabeled data.
arXiv Detail & Related papers (2020-07-07T04:17:01Z)
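As a rough illustration of the learned-label-aggregation idea mentioned above (Learned Label Aggregation for Weak Supervision), the sketch below synthesizes (LF votes, true label) pairs and trains a small classifier to map vote vectors to labels. The data generator, the architecture, and all names and hyperparameters are assumptions for illustration, not that paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def synthesize(n, m, num_classes, acc=0.7, coverage=0.6, seed=0):
    """Generate synthetic LF votes: each LF fires with probability `coverage`
    and, when it fires, is correct with probability `acc` (else a random class)."""
    g = torch.Generator().manual_seed(seed)
    y = torch.randint(0, num_classes, (n,), generator=g)
    votes = torch.full((n, m), -1, dtype=torch.long)        # -1 = abstain
    for j in range(m):
        fires = torch.rand(n, generator=g) < coverage
        correct = torch.rand(n, generator=g) < acc
        noisy = torch.randint(0, num_classes, (n,), generator=g)
        votes[:, j] = torch.where(correct, y, noisy)
        votes[~fires, j] = -1
    return votes, y

def encode(votes, num_classes):
    """One-hot encode each LF's vote (abstain -> all zeros), flatten to (n, m*K)."""
    fired = (votes >= 0).float().unsqueeze(-1)
    onehot = F.one_hot(votes.clamp(min=0), num_classes).float()
    return (fired * onehot).flatten(1)

m, K = 8, 3
votes, y = synthesize(n=5000, m=m, num_classes=K)
aggregator = nn.Sequential(nn.Linear(m * K, 64), nn.ReLU(), nn.Linear(64, K))
opt = torch.optim.Adam(aggregator.parameters(), lr=1e-2)
X = encode(votes, K)
for _ in range(200):
    opt.zero_grad()
    F.cross_entropy(aggregator(X), y).backward()
    opt.step()
# The trained aggregator can then be applied to real LF vote vectors to infer labels.
```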
This list is automatically generated from the titles and abstracts of the papers in this site.