skweak: Weak Supervision Made Easy for NLP
- URL: http://arxiv.org/abs/2104.09683v1
- Date: Mon, 19 Apr 2021 23:26:51 GMT
- Title: skweak: Weak Supervision Made Easy for NLP
- Authors: Pierre Lison and Jeremy Barnes and Aliaksandr Hubin
- Abstract summary: We present skweak, a Python-based software toolkit enabling NLP developers to apply weak supervision to a wide range of NLP tasks.
We use labelling functions derived from domain knowledge to automatically obtain annotations for a given dataset.
The resulting labels are then aggregated with a generative model that estimates the accuracy (and possible confusions) of each labelling function.
- Score: 13.37847225239485
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present skweak, a versatile, Python-based software toolkit enabling NLP
developers to apply weak supervision to a wide range of NLP tasks. Weak
supervision is an emerging machine learning paradigm based on a simple idea:
instead of labelling data points by hand, we use labelling functions derived
from domain knowledge to automatically obtain annotations for a given dataset.
The resulting labels are then aggregated with a generative model that estimates
the accuracy (and possible confusions) of each labelling function. The skweak
toolkit makes it easy to implement a large spectrum of labelling functions
(such as heuristics, gazetteers, neural models or linguistic constraints) on
text data, apply them on a corpus, and aggregate their results in a fully
unsupervised fashion. skweak is especially designed to facilitate the use of
weak supervision for NLP tasks such as text classification and sequence
labelling. We illustrate the use of skweak for NER and sentiment analysis.
skweak is released under an open-source license and is available at:
https://github.com/NorskRegnesentral/skweak
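To make this workflow concrete, here is a minimal sketch in the spirit of the examples in the skweak repository: two labelling functions (a heuristic and a gazetteer) are applied to a spaCy document, and their annotations are aggregated with a hidden Markov model. The class names follow the skweak documentation, but the exact API may differ across versions.

```python
import re
import spacy
from skweak import heuristics, gazetteers, aggregation

# Labelling function 1: a heuristic marking four-digit years as DATE
def year_detector(doc):
    for tok in doc:
        if re.fullmatch(r"(19|20)\d{2}", tok.text):
            yield tok.i, tok.i + 1, "DATE"

lf1 = heuristics.FunctionAnnotator("years", year_detector)

# Labelling function 2: a small gazetteer of person names
trie = gazetteers.Trie([("Barack", "Obama"), ("Joe", "Biden")])
lf2 = gazetteers.GazetteerAnnotator("presidents", {"PERSON": trie})

nlp = spacy.load("en_core_web_sm")
doc = nlp("Barack Obama was elected in 2008.")

# Apply the labelling functions, then aggregate their (possibly
# conflicting) annotations with a generative model (an HMM)
doc = lf2(lf1(doc))
hmm = aggregation.HMM("hmm", ["PERSON", "DATE"])
hmm.fit_and_aggregate([doc])

# Recent skweak versions store the aggregated spans under the
# aggregator's name
print(doc.spans["hmm"])
```

The aggregated labels can then be used to train any standard sequence labelling model, which is the usual final step of a weak supervision pipeline.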
Related papers
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights to the distantly supervised labels based on the training dynamics of the classifiers.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines (a sketch of this weighting idea follows the related-papers list below).
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that the applied results of unsupervised analysis allow a computer to predict either negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z)
- AutoWS: Automated Weak Supervision Framework for Text Classification [1.748907524043535]
We propose a novel framework for increasing the efficiency of the weak supervision process while decreasing the dependency on domain experts.
Our method requires a small set of labeled examples per label class and automatically creates a set of labeling functions to assign noisy labels to numerous unlabeled data.
arXiv Detail & Related papers (2023-02-07T07:12:05Z)
- SciAnnotate: A Tool for Integrating Weak Labeling Sources for Sequence Labeling [55.71459234749639]
SciAnnotate (short for scientific annotation tool) is a web-based tool for text annotation.
Our tool provides users with multiple user-friendly interfaces for creating weak labels.
In this study, we take multi-source weak label denoising as an example and use a Bertifying Conditional Hidden Markov Model to denoise the weak labels generated by our tool.
arXiv Detail & Related papers (2022-08-07T19:18:13Z)
- Binary Classification with Positive Labeling Sources [71.37692084951355]
We propose WEAPO, a simple yet competitive weak supervision (WS) method for producing training labels without negative labeling sources.
We show that WEAPO achieves the highest averaged performance on 10 benchmark datasets.
arXiv Detail & Related papers (2022-08-02T19:32:08Z)
- Trustable Co-label Learning from Multiple Noisy Annotators [68.59187658490804]
Supervised deep learning depends on massive accurately annotated examples.
A typical alternative is learning from multiple noisy annotators.
This paper proposes a data-efficient approach called Trustable Co-label Learning (TCL).
arXiv Detail & Related papers (2022-03-08T16:57:00Z)
- Automatic Synthesis of Diverse Weak Supervision Sources for Behavior Analysis [37.077883083886114]
AutoSWAP is a framework for automatically synthesizing data-efficient task-level labeling functions.
We show that AutoSWAP is an effective way to automatically generate labeling functions that can significantly reduce expert effort for behavior analysis.
arXiv Detail & Related papers (2021-11-30T07:51:12Z)
- TagRuler: Interactive Tool for Span-Level Data Programming by Demonstration [1.4050836886292872]
Until now, data programming has been accessible only to users who know how to program.
We build a novel tool, TagRuler, that makes it easy for annotators to build span-level labeling functions without programming.
arXiv Detail & Related papers (2021-06-24T04:49:42Z)
- Denoising Multi-Source Weak Supervision for Neural Text Classification [9.099703420721701]
We study the problem of learning neural text classifiers without using any labeled data, but only easy-to-provide rules as multiple weak supervision sources.
This problem is challenging because rule-induced weak labels are often noisy and incomplete.
We design a label denoiser, which estimates the source reliability using a conditional soft attention mechanism and then reduces label noise by aggregating rule-annotated weak labels (see the attention sketch after this list).
arXiv Detail & Related papers (2020-10-09T13:57:52Z)
- Adaptive Self-training for Few-shot Neural Sequence Labeling [55.43109437200101]
We develop techniques to address the label scarcity challenge for neural sequence labeling models.
Self-training serves as an effective mechanism to learn from large amounts of unlabeled data, while meta-learning helps with adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels.
arXiv Detail & Related papers (2020-10-07T22:29:05Z)
- Reliable Label Bootstrapping for Semi-Supervised Learning [19.841733658911767]
ReLaB is an unsupervised preprocessing algorithm that improves the performance of semi-supervised algorithms in extremely low supervision settings.
We show that the selection of the network architecture and the self-supervised algorithm are important factors to achieve successful label propagation.
We reach average error rates of $\boldsymbol{22.34}$ with 1 random labeled sample per class on CIFAR-10 and lower this error to $\boldsymbol{8.46}$ when the labeled sample in each class is highly representative.
arXiv Detail & Related papers (2020-07-23T08:51:37Z)
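As promised above, here is a minimal sketch of the importance-weighting idea from the co-training paper: rather than dropping distantly supervised examples whose predicted confidence falls below a hard threshold, every example contributes to the loss in proportion to a weight. The confidence-based weights below are a hypothetical stand-in for the paper's training-dynamics-based weights.

```python
import torch
import torch.nn.functional as F

def weighted_distant_loss(logits, distant_labels, weights):
    """Cross-entropy over distantly supervised labels, where each example
    contributes in proportion to its importance weight instead of being
    kept or dropped by a hard confidence threshold."""
    per_example = F.cross_entropy(logits, distant_labels, reduction="none")
    return (weights * per_example).sum() / weights.sum()

# Hypothetical weights: the classifier's own confidence in the distant
# label (the paper instead derives its weights from training dynamics).
logits = torch.randn(8, 3)                  # 8 examples, 3 classes
labels = torch.randint(0, 3, (8,))          # distantly supervised labels
probs = F.softmax(logits, dim=-1)
weights = probs.gather(1, labels.unsqueeze(1)).squeeze(1).detach()
loss = weighted_distant_loss(logits, labels, weights)
```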
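Similarly, the conditional soft attention mechanism from the multi-source denoising paper can be read as follows: each weak source receives an instance-conditioned reliability score, and the final soft label is the reliability-weighted average of the sources' votes. The module below is an illustrative simplification under that reading, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class SoftAttentionDenoiser(nn.Module):
    """Aggregates one-hot weak labels from k sources into one soft label,
    weighting each source by an instance-conditioned attention score
    (an illustrative simplification of a conditional soft attention
    denoiser)."""

    def __init__(self, feature_dim: int, num_sources: int):
        super().__init__()
        self.scorer = nn.Linear(feature_dim, num_sources)

    def forward(self, features, weak_labels):
        # features: (batch, feature_dim)
        # weak_labels: (batch, num_sources, num_classes), one-hot votes
        attn = torch.softmax(self.scorer(features), dim=-1)   # (batch, k)
        return torch.einsum("bk,bkc->bc", attn, weak_labels)  # soft labels

denoiser = SoftAttentionDenoiser(feature_dim=16, num_sources=4)
feats = torch.randn(2, 16)
votes = torch.eye(3)[torch.randint(0, 3, (2, 4))]  # random one-hot votes
soft_labels = denoiser(feats, votes)               # (2, 3), rows sum to 1
```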