Active learning for reducing labeling effort in text classification
tasks
- URL: http://arxiv.org/abs/2109.04847v1
- Date: Fri, 10 Sep 2021 13:00:36 GMT
- Title: Active learning for reducing labeling effort in text classification
tasks
- Authors: Pieter Floris Jacobs, Gideon Maillette de Buy Wenniger, Marco Wiering,
Lambert Schomaker
- Abstract summary: Active learning (AL) is a paradigm that aims to reduce labeling effort by only using the data which the used model deems most informative.
We present an empirical study that compares different uncertainty-based algorithms with BERT$_{base}$ as the used classifier.
Our results show that using uncertainty-based AL with BERT$_{base}$ outperforms random sampling of data.
- Score: 3.8424737607413153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Labeling data can be an expensive task as it is usually performed manually by
domain experts. This is cumbersome for deep learning, as it is dependent on
large labeled datasets. Active learning (AL) is a paradigm that aims to reduce
labeling effort by only using the data which the used model deems most
informative. Little research has been done on AL in a text classification
setting and next to none has involved the more recent, state-of-the-art NLP
models. Here, we present an empirical study that compares different
uncertainty-based algorithms with BERT$_{base}$ as the used classifier. We
evaluate the algorithms on two NLP classification datasets: Stanford Sentiment
Treebank and KvK-Frontpages. Additionally, we explore heuristics that aim to
solve presupposed problems of uncertainty-based AL; namely, that it is
unscalable and that it is prone to selecting outliers. Furthermore, we explore
the influence of the query-pool size on the performance of AL. Although the
proposed heuristics for AL were found not to improve its performance, our
results show that using uncertainty-based AL with BERT$_{base}$ outperforms
random sampling of data. This difference in performance can decrease as the
query-pool size gets larger.
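The query step described above can be made concrete with a short sketch: the current classifier scores every sample in the unlabeled pool, and only the samples it is most uncertain about are sent for labeling. The snippet below is an illustration of that idea, not the paper's exact implementation; the entropy acquisition function, the helper name, and the `query_pool_size` argument are assumptions made for this sketch.

```python
# Minimal sketch of one uncertainty-based AL query step (illustration only).
import numpy as np

def query_most_uncertain(probs: np.ndarray, query_pool_size: int) -> np.ndarray:
    """Pick the `query_pool_size` unlabeled samples the classifier is least sure about.

    probs: (n_unlabeled, n_classes) softmax outputs of the current classifier,
    e.g. a fine-tuned BERT_base model applied to the unlabeled pool.
    Predictive entropy is used as the uncertainty score here; least-confidence
    and margin scores are common alternatives.
    """
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)  # higher = more uncertain
    return np.argsort(-entropy)[:query_pool_size]             # most uncertain first

# Toy example: 4 unlabeled samples, 2 classes; request the 2 most uncertain.
pool_probs = np.array([[0.99, 0.01],
                       [0.55, 0.45],
                       [0.70, 0.30],
                       [0.51, 0.49]])
print(query_most_uncertain(pool_probs, query_pool_size=2))  # -> [3 1]
```

The selected indices would then be labeled, added to the training set, and the classifier retrained before the next round; the query-pool size whose influence the paper studies corresponds to how many samples are requested per round.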
Related papers
- MyriadAL: Active Few Shot Learning for Histopathology [10.652626309100889]
We introduce an active few-shot learning framework, Myriad Active Learning (MAL).
MAL includes a contrastive-learning encoder, pseudo-label generation, and novel query sample selection in the loop.
Experiments on two public histopathology datasets show that MAL has superior test accuracy, macro F1-score, and label efficiency compared to prior works.
arXiv Detail & Related papers (2023-10-24T20:08:15Z)
- XAL: EXplainable Active Learning Makes Classifiers Better Low-resource Learners [71.8257151788923]
We propose a novel Explainable Active Learning framework (XAL) for low-resource text classification.
XAL encourages classifiers to justify their inferences and delve into unlabeled data for which they cannot provide reasonable explanations.
Experiments on six datasets show that XAL achieves consistent improvement over 9 strong baselines.
arXiv Detail & Related papers (2023-10-09T08:07:04Z)
- MoBYv2AL: Self-supervised Active Learning for Image Classification [57.4372176671293]
We present MoBYv2AL, a novel self-supervised active learning framework for image classification.
Our contribution lies in lifting MoBY, one of the most successful self-supervised learning algorithms, to the AL pipeline.
We achieve state-of-the-art results when compared to recent AL methods.
arXiv Detail & Related papers (2023-01-04T10:52:02Z)
- Is margin all you need? An extensive empirical study of active learning on tabular data [66.18464006872345]
We analyze the performance of a variety of active learning algorithms on 69 real-world datasets from the OpenML-CC18 benchmark.
Surprisingly, we find that the classical margin sampling technique matches or outperforms all others, including the current state of the art (a minimal sketch of margin sampling follows this list).
arXiv Detail & Related papers (2022-10-07T21:18:24Z)
- Pareto Optimization for Active Learning under Out-of-Distribution Data Scenarios [79.02009938011447]
We propose a sampling scheme, which selects optimal subsets of unlabeled samples with fixed batch size from the unlabeled data pool.
Experimental results show its effectiveness on both classical Machine Learning (ML) and Deep Learning (DL) tasks.
arXiv Detail & Related papers (2022-07-04T04:11:44Z)
- Active Learning by Feature Mixing [52.16150629234465]
We propose a novel method for batch active learning called ALFA-Mix.
We identify unlabelled instances with sufficiently-distinct features by seeking inconsistencies in predictions.
We show that inconsistencies in these predictions help discover features that the model is unable to recognise in the unlabelled instances.
arXiv Detail & Related papers (2022-03-14T12:20:54Z)
- Effective Evaluation of Deep Active Learning on Image Classification Tasks [10.27095298129151]
We present a unified re-implementation of state-of-the-art active learning algorithms in the context of image classification.
On the positive side, we show that AL techniques are 2x to 4x more label-efficient than random sampling (RS) when data augmentation is used.
arXiv Detail & Related papers (2021-06-16T23:29:39Z)
- Can Active Learning Preemptively Mitigate Fairness Issues? [66.84854430781097]
Dataset bias is one of the prevailing causes of unfairness in machine learning.
We study whether models trained with uncertainty-based AL are fairer in their decisions with respect to a protected class.
We also explore how algorithmic fairness methods such as gradient reversal (GRAD) interact with BALD.
arXiv Detail & Related papers (2021-04-14T14:20:22Z)
- MetAL: Active Semi-Supervised Learning on Graphs via Meta Learning [2.903711704663904]
We propose MetAL, an AL approach that selects unlabeled instances that directly improve the future performance of a classification model.
We demonstrate that MetAL efficiently outperforms existing state-of-the-art AL algorithms.
arXiv Detail & Related papers (2020-07-22T06:59:49Z)
- Towards Robust and Reproducible Active Learning Using Neural Networks [15.696979318409392]
Active learning (AL) is a promising ML paradigm that has the potential to parse through large unlabeled datasets.
Recently proposed neural network based AL methods help reduce annotation cost in domains where labeling data can be prohibitive.
In this study, we demonstrate that different types of AL algorithms produce inconsistent gains over the random sampling baseline.
arXiv Detail & Related papers (2020-02-21T22:01:47Z)
- Fase-AL -- Adaptation of Fast Adaptive Stacking of Ensembles for Supporting Active Learning [0.0]
This work presents the FASE-AL algorithm, which induces classification models from unlabeled instances using Active Learning.
The algorithm achieves promising results in terms of the percentage of correctly classified instances.
arXiv Detail & Related papers (2020-01-30T17:25:47Z)
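As referenced in the "Is margin all you need?" entry above, the classical margin sampling technique ranks unlabeled samples by the gap between their two highest predicted class probabilities and queries the smallest gaps first. The following is a generic textbook-style sketch of that score, assuming softmax probabilities are available; it is not code from either paper.

```python
# Minimal sketch of classical margin sampling (illustration only).
import numpy as np

def margin_query(probs: np.ndarray, n_queries: int) -> np.ndarray:
    """Return indices of the `n_queries` samples with the smallest margin,
    i.e. the smallest gap between the top two predicted class probabilities."""
    top2 = -np.sort(-probs, axis=1)[:, :2]   # two largest probabilities per sample
    margins = top2[:, 0] - top2[:, 1]        # small margin = high uncertainty
    return np.argsort(margins)[:n_queries]

# Toy example: 3 unlabeled samples, 3 classes; query the single most ambiguous one.
pool_probs = np.array([[0.80, 0.15, 0.05],
                       [0.40, 0.35, 0.25],
                       [0.34, 0.33, 0.33]])
print(margin_query(pool_probs, n_queries=1))  # -> [2]
```

For binary classification this ranking coincides with least-confidence sampling; differences between such acquisition functions only appear with three or more classes.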