Unsupervised Label Refinement Improves Dataless Text Classification
- URL: http://arxiv.org/abs/2012.04194v1
- Date: Tue, 8 Dec 2020 03:37:50 GMT
- Title: Unsupervised Label Refinement Improves Dataless Text Classification
- Authors: Zewei Chu, Karl Stratos, Kevin Gimpel
- Abstract summary: Dataless text classification is capable of classifying documents into previously unseen labels by assigning a score to any document paired with a label description.
While promising, it crucially relies on accurate descriptions of the label set for each downstream task.
This reliance causes dataless classifiers to be highly sensitive to the choice of label descriptions and hinders the broader application of dataless classification in practice.
- Score: 48.031421660674745
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dataless text classification is capable of classifying documents into
previously unseen labels by assigning a score to any document paired with a
label description. While promising, it crucially relies on accurate
descriptions of the label set for each downstream task. This reliance causes
dataless classifiers to be highly sensitive to the choice of label descriptions
and hinders the broader application of dataless classification in practice. In
this paper, we ask the following question: how can we improve dataless text
classification using the inputs of the downstream task dataset? Our primary
solution is a clustering based approach. Given a dataless classifier, our
approach refines its set of predictions using k-means clustering. We
demonstrate the broad applicability of our approach by improving the
performance of two widely used classifier architectures, one that encodes
text-category pairs with two independent encoders and one with a single joint
encoder. Experiments show that our approach consistently improves dataless
classification across different datasets and makes the classifier more robust
to the choice of label descriptions.
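The refinement step described in the abstract can be illustrated with a minimal sketch. The assumption here (not confirmed by the abstract) is that cluster-to-label mapping is done by majority vote over the classifier's original argmax predictions; the paper's exact procedure may differ. `refine_with_kmeans` is a hypothetical helper name.

```python
import numpy as np
from sklearn.cluster import KMeans

def refine_with_kmeans(scores: np.ndarray, seed: int = 0) -> np.ndarray:
    """Refine dataless-classifier predictions by clustering documents.

    scores: (n_docs, n_labels) matrix of document-label compatibility
    scores from a dataless classifier. Documents are clustered in score
    space with k = n_labels; each cluster is then relabeled by majority
    vote over the classifier's original argmax predictions inside it.
    (Illustrative sketch; the paper's cluster-to-label mapping may differ.)
    """
    n_docs, n_labels = scores.shape
    base_preds = scores.argmax(axis=1)           # classifier's raw predictions
    km = KMeans(n_clusters=n_labels, n_init=10, random_state=seed)
    cluster_ids = km.fit_predict(scores)
    refined = np.empty(n_docs, dtype=int)
    for c in range(n_labels):
        members = cluster_ids == c
        # Majority label among the classifier's predictions in this cluster
        # overrides individual noisy predictions.
        refined[members] = np.bincount(base_preds[members],
                                       minlength=n_labels).argmax()
    return refined
```

The intuition is that documents with similar score profiles should share a label, so an ambiguous document near a confident cluster inherits the cluster's dominant label even when its own argmax disagrees.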
Related papers
- Posterior Label Smoothing for Node Classification [2.737276507021477]
We propose a simple yet effective label smoothing for the transductive node classification task.
We design the soft label to encapsulate the local context of the target node through the neighborhood label distribution.
In the following analysis, we find that incorporating global label statistics in posterior computation is the key to the success of label smoothing.
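The neighborhood-based soft label described above can be sketched as a simple mixture of each node's one-hot label with its neighbors' label distribution. This is a simplified illustration; the paper's posterior additionally incorporates global label statistics, which this sketch omits, and `neighborhood_smoothed_labels` and `alpha` are hypothetical names.

```python
import numpy as np

def neighborhood_smoothed_labels(labels: np.ndarray, adj: np.ndarray,
                                 n_classes: int, alpha: float = 0.3) -> np.ndarray:
    """Mix each node's one-hot label with its neighborhood label distribution.

    Simplified sketch only: the paper's posterior computation also uses
    global label statistics, which are not modeled here.
    """
    onehot = np.eye(n_classes)[labels]            # (n, C) hard labels
    neigh = adj @ onehot                          # neighbor label counts
    row_sum = neigh.sum(axis=1, keepdims=True)
    # Normalize to a distribution; isolated nodes keep a zero neighbor term.
    neigh = np.divide(neigh, row_sum, out=np.zeros_like(neigh),
                      where=row_sum > 0)
    return (1 - alpha) * onehot + alpha * neigh
```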
arXiv Detail & Related papers (2024-06-01T11:59:49Z)
- Generalized Category Discovery with Clustering Assignment Consistency [56.92546133591019]
Generalized category discovery (GCD) is a recently proposed open-world task.
We propose a co-training-based framework that encourages clustering consistency.
Our method achieves state-of-the-art performance on three generic benchmarks and three fine-grained visual recognition datasets.
arXiv Detail & Related papers (2023-10-30T00:32:47Z)
- Label Semantic Aware Pre-training for Few-shot Text Classification [53.80908620663974]
We propose Label Semantic Aware Pre-training (LSAP) to improve the generalization and data efficiency of text classification systems.
LSAP incorporates label semantics into pre-trained generative models (T5 in our case) by performing secondary pre-training on labeled sentences from a variety of domains.
arXiv Detail & Related papers (2022-04-14T17:33:34Z)
- Improving Probabilistic Models in Text Classification via Active Learning [0.0]
We propose a fast new model for text classification that combines information from both labeled and unlabeled data with an active learning component.
We show that by introducing information about the structure of unlabeled data and iteratively labeling uncertain documents, our model improves performance.
arXiv Detail & Related papers (2022-02-05T20:09:26Z)
- Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- Information-theoretic Classification Accuracy: A Criterion that Guides Data-driven Combination of Ambiguous Outcome Labels in Multi-class Classification [3.9533511130413137]
Outcome labeling ambiguity and subjectivity are ubiquitous in real-world datasets.
We propose the information-theoretic classification accuracy (ITCA) to guide practitioners on how to combine ambiguous outcome labels.
We demonstrate the effectiveness of ITCA in diverse applications including medical prognosis, cancer survival prediction, user demographics prediction, and cell type classification.
arXiv Detail & Related papers (2021-09-01T19:20:28Z)
- DocSCAN: Unsupervised Text Classification via Learning from Neighbors [2.2082422928825145]
We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN).
For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels.
Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels.
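The pair-mining step behind this weak learning signal can be sketched as follows. Only the neighbor mining is shown; the clustering head and the SCAN training loss are not reproduced, and `mine_neighbor_pairs` is a hypothetical helper name.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def mine_neighbor_pairs(embeddings: np.ndarray, k: int = 2) -> list:
    """Pair each document with its k nearest neighbors in embedding space.

    SCAN-style training then rewards assigning both members of each pair
    to the same cluster. (Sketch of the pair-mining step only.)
    """
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)   # column 0 is each point itself
    return [(i, int(j)) for i, row in enumerate(idx) for j in row[1:]]
```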
arXiv Detail & Related papers (2021-05-09T21:20:31Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
- Predictive K-means with local models [0.028675177318965035]
Predictive clustering seeks to obtain the best of both worlds, combining the descriptive structure found by clustering with the accuracy of supervised prediction.
We present two new algorithms using this technique and show on a variety of data sets that they are competitive for prediction performance.
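The general idea of predictive k-means with local models can be sketched as a two-stage pipeline: partition the inputs with k-means, then fit one local model per cluster and route test points through their cluster's model. This is an illustrative reconstruction, not the paper's exact algorithms; the function names and the choice of linear regression as the local model are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def fit_predictive_kmeans(X: np.ndarray, y: np.ndarray, k: int = 2, seed: int = 0):
    """Cluster the inputs, then fit one local linear model per cluster.

    Illustrative sketch of predictive clustering; the paper's algorithms
    differ in how clustering and prediction interact.
    """
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
    models = [LinearRegression().fit(X[km.labels_ == c], y[km.labels_ == c])
              for c in range(k)]
    return km, models

def predict_local(km: KMeans, models: list, X: np.ndarray) -> np.ndarray:
    """Route each point to the local model of its nearest cluster."""
    assignments = km.predict(X)
    return np.array([models[c].predict(x[None, :])[0]
                     for c, x in zip(assignments, X)])
```

On data that is piecewise linear across well-separated regions, each local model can fit its region exactly even though no single global linear model could.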
arXiv Detail & Related papers (2020-12-16T10:49:36Z)
- Interaction Matching for Long-Tail Multi-Label Classification [57.262792333593644]
We present an elegant and effective approach for addressing limitations in existing multi-label classification models.
By performing soft n-gram interaction matching, we match labels with natural language descriptions.
arXiv Detail & Related papers (2020-05-18T15:27:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.