PyTAIL: Interactive and Incremental Learning of NLP Models with Human in
the Loop for Online Data
- URL: http://arxiv.org/abs/2211.13786v1
- Date: Thu, 24 Nov 2022 20:08:15 GMT
- Title: PyTAIL: Interactive and Incremental Learning of NLP Models with Human in
the Loop for Online Data
- Authors: Shubhanshu Mishra, Jana Diesner
- Abstract summary: PyTAIL is a Python library that enables a human-in-the-loop approach to actively training NLP models.
We simulate the performance of PyTAIL on existing social media benchmark datasets for text classification.
- Score: 1.576409420083207
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Online data streams make training machine learning models hard because of
distribution shift and new patterns emerging over time. For natural language
processing (NLP) tasks that utilize a collection of features based on lexicons
and rules, it is important to adapt these features to the changing data. To
address this challenge we introduce PyTAIL, a Python library that enables a
human-in-the-loop approach to actively training NLP models. PyTAIL enhances
generic active learning, which only suggests new instances to label, by also
suggesting new features, such as rules and lexicons, to label. Furthermore,
PyTAIL is flexible enough for users to accept, reject, or update rules and
lexicons as the model is being trained. We simulate the performance of PyTAIL
on existing social media benchmark datasets for text classification and compare
various active learning strategies on these benchmarks. The model closes the
gap with as little as 10% of the training data. Finally, we highlight the
importance of tracking evaluation metrics on the remaining data (data that
active learning has not yet merged into the training set) alongside the test
dataset. This demonstrates the effectiveness of the model in accurately
annotating the remaining dataset, which is especially suitable for batch
processing of large unlabelled corpora. PyTAIL will be available at
https://github.com/socialmediaie/pytail.
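The abstract does not show PyTAIL's actual API, so the following is a minimal, hypothetical sketch of the workflow it describes: uncertainty sampling to suggest new instances, high-weight features surfaced as candidate lexicon entries for human review, and the evaluation metric tracked on the remaining (not yet merged) pool. The helper names, toy data, and scikit-learn model are all assumptions, not PyTAIL code.

```python
# Hypothetical sketch of the PyTAIL-style loop described in the abstract;
# not the library's real API.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-in for a social media benchmark (1 = positive sentiment).
texts = ["love this", "great stuff", "awful day", "terrible service",
         "really nice", "so bad", "happy now", "worst ever"]
labels = np.array([1, 1, 0, 0, 1, 0, 1, 0])

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
labeled = [0, 2]                                  # small seed set
pool = [i for i in range(len(texts)) if i not in labeled]

def suggest_instances(model, X, pool, k=2):
    """Uncertainty sampling: the k pool items with the smallest margin."""
    probs = model.predict_proba(X[pool])
    top2 = np.sort(probs, axis=1)[:, -2:]
    margin = top2[:, 1] - top2[:, 0]
    return [pool[i] for i in np.argsort(margin)[:k]]

def suggest_lexicon_terms(model, vec, k=3):
    """Surface the strongest features as candidate lexicon entries."""
    terms = vec.get_feature_names_out()
    order = np.argsort(np.abs(model.coef_[0]))[::-1]
    return [terms[i] for i in order[:k]]

for rnd in range(3):
    model = LogisticRegression().fit(X[labeled], labels[labeled])
    picked = suggest_instances(model, X, pool)
    print(f"round {rnd}, label these:", [texts[i] for i in picked])
    print("  candidate lexicon terms:", suggest_lexicon_terms(model, vec))
    # Simulated human: accept the gold labels for the picked instances.
    labeled += picked
    pool = [i for i in pool if i not in picked]
    if pool:  # track the metric on the remaining pool, per the abstract
        print("  accuracy on remaining pool:", model.score(X[pool], labels[pool]))
```

In a real session the `labeled += picked` step would instead route through a user who accepts, rejects, or updates each suggested instance and lexicon term, which is the flexibility the abstract emphasizes.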
Related papers
- Training on the Benchmark Is Not All You Need [52.01920740114261]
We propose a simple and effective data leakage detection method based on the contents of multiple-choice options.
Our method works under black-box conditions, without access to model training data or weights.
We evaluate the degree of data leakage of 31 mainstream open-source LLMs on four benchmark datasets.
arXiv Detail & Related papers (2024-09-03T11:09:44Z)
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns importance weights to the distantly supervised labels based on the training dynamics of the classifiers (a generic sketch of this weighting idea appears after this list).
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- Self-Training for Sample-Efficient Active Learning for Text Classification with Pre-Trained Language Models [3.546617486894182]
We introduce HAST, a new and effective self-training strategy, which is evaluated on four text classification benchmarks.
Results show that it outperforms the reproduced self-training approaches and reaches classification results comparable to previous experiments for three out of four datasets.
arXiv Detail & Related papers (2024-06-13T15:06:11Z)
- Towards Efficient Active Learning in NLP via Pretrained Representations [1.90365714903665]
Fine-tuning Large Language Models (LLMs) is now a common approach for text classification in a wide range of applications.
We drastically expedite this process by using pretrained representations of LLMs within the active learning loop (a sketch of this frozen-representation recipe also appears after this list).
Our strategy yields similar performance to fine-tuning all the way through the active learning loop but is orders of magnitude less computationally expensive.
arXiv Detail & Related papers (2024-02-23T21:28:59Z)
- BaSAL: Size-Balanced Warm Start Active Learning for LiDAR Semantic Segmentation [2.9290232815049926]
Existing active learning methods overlook the severe class imbalance inherent in LiDAR semantic segmentation datasets.
We propose BaSAL, a size-balanced warm start active learning model, based on the observation that each object class has a characteristic size.
Results show that we are able to improve the performance of the initial model by a large margin.
arXiv Detail & Related papers (2023-10-12T05:03:19Z)
- Stubborn Lexical Bias in Data and Models [50.79738900885665]
We use a new statistical method to examine whether spurious patterns in data appear in models trained on the data.
We apply an optimization approach to *reweight* the training data, reducing thousands of spurious correlations.
Surprisingly, though this method can successfully reduce lexical biases in the training data, we still find strong evidence of corresponding bias in the trained models.
arXiv Detail & Related papers (2023-06-03T20:12:27Z)
- Iterative Loop Learning Combining Self-Training and Active Learning for Domain Adaptive Semantic Segmentation [1.827510863075184]
Self-training and active learning have been proposed to alleviate the annotation burden in domain adaptive semantic segmentation.
This paper proposes an iterative loop learning method combining Self-Training and Active Learning.
arXiv Detail & Related papers (2023-01-31T01:31:43Z)
- On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the model to achieve exceptional downstream performance.
We study which specific traits of the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z)
- Comparing Test Sets with Item Response Theory [53.755064720563]
We evaluate 29 datasets using predictions from 18 pretrained Transformer models on individual test examples.
We find that Quoref, HellaSwag, and MC-TACO are best suited for distinguishing among state-of-the-art models.
We also observe that the span selection task format, which is used for QA datasets like QAMR or SQuAD2.0, is effective in differentiating between strong and weak models.
arXiv Detail & Related papers (2021-06-01T22:33:53Z)
- Bayesian Active Learning with Pretrained Language Models [9.161353418331245]
Active Learning (AL) is a method to iteratively select data for annotation from a pool of unlabeled data.
Previous AL approaches have been limited to task-specific models that are trained from scratch at each iteration.
We introduce BALM, Bayesian Active Learning with pretrained language models.
arXiv Detail & Related papers (2021-04-16T19:07:31Z)
- Multi-layer Optimizations for End-to-End Data Analytics [71.05611866288196]
We introduce Iterative Functional Aggregate Queries (IFAQ), a framework that realizes an alternative approach.
IFAQ treats the feature extraction query and the learning task as one program given in IFAQ's domain-specific language.
We show that a Scala implementation of IFAQ can outperform mlpack, Scikit, and specialization by several orders of magnitude for linear regression and regression tree models over several relational datasets.
arXiv Detail & Related papers (2020-01-10T16:14:44Z)
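The co-training entry above weights distantly supervised labels by classifier training dynamics. As a loose illustration of that general idea (not the paper's actual method; the weighting formula here is a hypothetical stand-in), one can score each noisy label by how confidently and stably a model agreed with it across epochs:

```python
# Illustrative importance weighting of distantly supervised labels by
# training dynamics; the exact formula is an assumption for illustration.
import numpy as np

def dynamics_weights(prob_history, noisy_labels):
    """prob_history: (epochs, n) predicted probability of class 1 per epoch.
    noisy_labels: (n,) 0/1 labels from distant supervision."""
    # Probability the model assigned to the *given* label at each epoch.
    p_label = np.where(noisy_labels == 1, prob_history, 1.0 - prob_history)
    confidence = p_label.mean(axis=0)   # high if the model agrees on average
    variability = p_label.std(axis=0)   # high if agreement is unstable
    w = confidence * (1.0 - variability)
    return w / w.max()                  # normalize to [0, 1]

# Three epochs of predictions on four distantly labeled examples.
history = np.array([[0.90, 0.40, 0.80, 0.20],
                    [0.95, 0.50, 0.70, 0.30],
                    [0.90, 0.60, 0.90, 0.25]])
noisy = np.array([1, 1, 1, 0])
print(dynamics_weights(history, noisy))  # noisiest example (index 1) weighted lowest
```

These weights would then be passed to the classifier's loss (e.g. as `sample_weight`), so every automatically labeled example contributes something rather than being dropped by a hard confidence threshold.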
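The pretrained-representations entry above avoids fine-tuning the LLM at every active learning round. A minimal sketch of that general recipe, with random vectors standing in for sentence embeddings from a frozen encoder (everything here is illustrative, not that paper's implementation):

```python
# Active learning over frozen pretrained representations: embed once,
# then retrain only a cheap head each round. Illustrative sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for embeddings from a frozen encoder, computed ONCE up front.
emb = rng.normal(size=(200, 32))
gold = (emb[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(int)

# Seed set with both classes present; the rest is the unlabeled pool.
labeled = list(np.where(gold == 0)[0][:4]) + list(np.where(gold == 1)[0][:4])
pool = [i for i in range(200) if i not in labeled]

for rnd in range(5):
    # Retraining this small head is cheap; the encoder is never touched.
    head = LogisticRegression().fit(emb[labeled], gold[labeled])
    probs = head.predict_proba(emb[pool])[:, 1]
    uncertain = np.argsort(np.abs(probs - 0.5))[:8]   # closest to 0.5
    picked = [pool[i] for i in uncertain]
    labeled += picked
    pool = [i for i in pool if i not in picked]
    print(f"round {rnd}: pool accuracy {head.score(emb[pool], gold[pool]):.2f}")
```

Because the expensive forward passes happen once, each query round costs only a small-model fit, which is where the orders-of-magnitude savings mentioned in that entry come from.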