Improving Probabilistic Models in Text Classification via Active
Learning
- URL: http://arxiv.org/abs/2202.02629v1
- Date: Sat, 5 Feb 2022 20:09:26 GMT
- Title: Improving Probabilistic Models in Text Classification via Active
Learning
- Authors: Mitchell Bosley, Saki Kuzushima, Ted Enamorado, Yuki Shiraito
- Abstract summary: We propose a fast new model for text classification that combines information from both labeled and unlabeled data with an active learning component.
We show that by introducing information about the structure of unlabeled data and iteratively labeling uncertain documents, our model improves performance.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: When using text data, social scientists often classify documents in order to
use the resulting document labels as an outcome or predictor. Since it is
prohibitively costly to label a large number of documents manually, automated
text classification has become a standard tool. However, current approaches for
text classification do not take advantage of all the data at one's disposal. We
propose a fast new model for text classification that combines information from
both labeled and unlabeled data with an active learning component, where a
human iteratively labels documents that the algorithm is least certain about.
Using text data from Wikipedia discussion pages, BBC News articles, historical
US Supreme Court opinions, and human rights abuse allegations, we show that by
introducing information about the structure of unlabeled data and iteratively
labeling uncertain documents, our model improves performance relative to
classifiers that (a) only use information from labeled data and (b) randomly
decide which documents to label, at the cost of manually labeling a small
number of documents.
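A minimal sketch of the active learning loop the abstract describes: a classifier is refit on the labeled pool each round, and a (simulated) human is asked to label the unlabeled document whose predictive distribution has the highest entropy. A plain multinomial naive Bayes stands in here for the paper's semi-supervised mixture model (which is estimated with EM over labeled and unlabeled documents together), and the toy data and variable names are illustrative, not taken from the paper.
```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["budget vote in senate", "court rules on appeal",
        "team wins final match", "election results announced",
        "striker scores twice", "judge issues injunction"]
labels = {0: 0, 2: 1}              # doc index -> class (0 = politics/law, 1 = sports)
oracle = {1: 0, 3: 0, 4: 1, 5: 0}  # stands in for the human annotator

X = TfidfVectorizer().fit_transform(docs)
unlabeled = [i for i in range(len(docs)) if i not in labels]

for _ in range(3):                 # three labeling rounds
    idx = sorted(labels)
    clf = MultinomialNB().fit(X[idx], [labels[i] for i in idx])
    probs = clf.predict_proba(X[unlabeled])
    # Shannon entropy of the predictive distribution measures uncertainty.
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    query = unlabeled[int(entropy.argmax())]
    print(f"query human about doc {query}: {docs[query]!r}")
    labels[query] = oracle[query]  # the human supplies the label
    unlabeled.remove(query)
```
Swapping the entropy criterion for random selection reproduces baseline (b) from the abstract, which is what the paper's comparison turns on.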
Related papers
- Open-world Multi-label Text Classification with Extremely Weak Supervision [30.85235057480158]
We study open-world multi-label text classification under extremely weak supervision (XWS).
We first utilize the user description to prompt a large language model (LLM) for dominant keyphrases of a subset of raw documents, and then construct a label space via clustering.
We then apply a zero-shot multi-label classifier to locate the documents with small top predicted scores, so we can revisit their dominant keyphrases for more long-tail labels.
X-MLClass exhibits a remarkable increase in ground-truth label space coverage on various datasets.
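As a rough illustration of the label-space construction step above, the sketch below clusters keyphrases with k-means over TF-IDF vectors; in the paper the keyphrases come from prompting an LLM with the user description, whereas here they are hardcoded stand-ins, and all names are illustrative.
```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-ins for keyphrases an LLM would return when prompted with the
# user's description of the corpus.
keyphrases = ["stock market rally", "stock market crash", "stock market index",
              "gene editing therapy", "gene editing tools", "gene editing ethics"]
X = TfidfVectorizer().fit_transform(keyphrases)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for c in range(2):
    members = [k for k, lab in zip(keyphrases, km.labels_) if lab == c]
    print(f"candidate label {c}: {members}")  # each cluster becomes one label
```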
arXiv Detail & Related papers (2024-07-08T04:52:49Z)
- The Benefits of Label-Description Training for Zero-Shot Text Classification [35.27224341685012]
Pretrained language models have improved zero-shot text classification.
We propose a simple way to further improve zero-shot accuracies with minimal effort.
arXiv Detail & Related papers (2023-05-03T16:19:31Z)
- Label Semantic Aware Pre-training for Few-shot Text Classification [53.80908620663974]
We propose Label Semantic Aware Pre-training (LSAP) to improve the generalization and data efficiency of text classification systems.
LSAP incorporates label semantics into pre-trained generative models (T5 in our case) by performing secondary pre-training on labeled sentences from a variety of domains.
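A hedged sketch of what the secondary pre-training data might look like: labeled sentences serialized so that a generative model (T5 in the paper) learns to emit the label text itself. The template below is an assumption made for illustration, not the authors' format.
```python
# Hypothetical template: serialize (sentence, label) pairs so a
# text-to-text model is trained to generate the label string itself.
def to_t5_example(sentence: str, label: str) -> dict:
    return {"input": f"classify: {sentence}", "target": label.replace("_", " ")}

pairs = [("book a table for two tonight", "restaurant_reservation"),
         ("play some jazz for me", "play_music")]
examples = [to_t5_example(s, l) for s, l in pairs]
print(examples[0])  # {'input': 'classify: book a table ...', 'target': 'restaurant reservation'}
```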
arXiv Detail & Related papers (2022-04-14T17:33:34Z)
- Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
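One simple way to operationalize "how confidently each document belongs to one of the target categories" is to threshold the maximum predicted probability, as sketched below; the paper's actual framework relies on target-category names as weak supervision rather than a trained classifier, so this is only an illustration of the scoring idea, with made-up numbers.
```python
import numpy as np

# Rows: predicted probabilities over three target categories.
probs = np.array([[0.90, 0.05, 0.05],   # confidently in-category
                  [0.40, 0.35, 0.25]])  # confident about nothing
threshold = 0.5                          # illustrative cutoff
out_of_category = probs.max(axis=1) < threshold
print(out_of_category)                   # [False  True]
```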
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- MotifClass: Weakly Supervised Text Classification with Higher-order Metadata Information [47.44278057062421]
We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only.
To be specific, we model the relationships between documents and metadata via a heterogeneous information network.
We propose a novel framework, named MotifClass, which selects category-indicative motif instances, then retrieves and generates pseudo-labeled training samples based on the category names and those motif instances.
arXiv Detail & Related papers (2021-11-07T07:39:10Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
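A common way to regularize a child label's output probability by its parent is to penalize the child outscoring the parent; the sketch below illustrates that idea with made-up probabilities and is not MATCH's exact loss.
```python
# Predicted probabilities for one parent label and one of its children.
p_parent, p_child = 0.3, 0.7
# Penalize the child outscoring its parent; zero when the hierarchy is respected.
hierarchy_penalty = max(0.0, p_child - p_parent) ** 2
print(hierarchy_penalty)  # 0.16, added to the classification loss during training
```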
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
- Unsupervised Label Refinement Improves Dataless Text Classification [48.031421660674745]
Dataless text classification is capable of classifying documents into previously unseen labels by assigning a score to any document paired with a label description.
While promising, it crucially relies on accurate descriptions of the label set for each downstream task.
This reliance causes dataless classifiers to be highly sensitive to the choice of label descriptions and hinders the broader application of dataless classification in practice.
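The document-label-description pairing can be sketched as a similarity score in a shared vector space; the toy example below uses TF-IDF and cosine similarity, which also makes the sensitivity to label wording easy to see, since rephrasing a description changes every score. All data is illustrative.
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

label_descs = ["sports team wins the final match",
               "finance markets stocks and money"]
doc = ["the team won the final match"]

vec = TfidfVectorizer().fit(label_descs + doc)
scores = cosine_similarity(vec.transform(doc), vec.transform(label_descs))
print(int(scores.argmax()))  # 0 -> the sports description scores highest
```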
arXiv Detail & Related papers (2020-12-08T03:37:50Z)
- Text Classification Using Label Names Only: A Language Model Self-Training Approach [80.63885282358204]
Current text classification methods typically require a good number of human-labeled documents as training data.
We show that our model achieves around 90% accuracy on four benchmark datasets including topic and sentiment classification.
arXiv Detail & Related papers (2020-10-14T17:06:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.