Out-of-Category Document Identification Using Target-Category Names as
Weak Supervision
- URL: http://arxiv.org/abs/2111.12796v1
- Date: Wed, 24 Nov 2021 21:01:25 GMT
- Title: Out-of-Category Document Identification Using Target-Category Names as
Weak Supervision
- Authors: Dongha Lee, Dongmin Hyun, Jiawei Han, Hwanjo Yu
- Abstract summary: Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
- Score: 64.671654559798
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Identifying outlier documents, whose content is different from the majority
of the documents in a corpus, has played an important role in managing large
text collections. However, due to the absence of explicit information about the
inlier (or target) distribution, existing unsupervised outlier detectors are
likely to produce unreliable results depending on the density or diversity of the
outliers in the corpus. To address this challenge, we introduce a new task
referred to as out-of-category detection, which aims to distinguish the
documents according to their semantic relevance to the inlier (or target)
categories by using the category names as weak supervision. In practice, this
task is widely applicable in that it can flexibly designate the scope of
target categories according to users' interests while requiring only the
target-category names as minimal guidance. In this paper, we present an
out-of-category detection framework, which effectively measures how confidently
each document belongs to one of the target categories based on its
category-specific relevance score. Our framework adopts a two-step approach:
(i) it first generates pseudo-category labels for all unlabeled documents by
exploiting the word-document similarity encoded in a text embedding space, then
(ii) it trains a neural classifier on the pseudo-labels and computes a
confidence score from its target-category prediction. Experiments on
real-world datasets demonstrate that our framework outperforms all baseline
methods in various scenarios specifying different target categories.
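The two-step procedure lends itself to a compact sketch. The code below is a minimal illustration under assumptions not taken from the paper, not the authors' implementation: a sentence encoder from the sentence-transformers library stands in for the paper's text embedding space (the paper exploits word-document similarity), logistic regression stands in for the neural classifier, and the classifier's maximum softmax probability serves as the per-document confidence.

```python
# Minimal sketch of the two-step idea, NOT the authors' implementation.
# Assumptions for illustration: sentence-transformers embeddings, logistic
# regression as the classifier, max softmax probability as the confidence.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

target_categories = ["sports", "politics", "technology"]      # hypothetical target-category names
documents = [
    "The striker scored twice in the final.",
    "Parliament passed the new budget bill.",
    "The chip maker unveiled a 3nm processor.",
    "A recipe for sourdough bread with wild yeast.",           # likely out-of-category
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = encoder.encode(documents, normalize_embeddings=True)
cat_vecs = encoder.encode(target_categories, normalize_embeddings=True)

# Step (i): pseudo-label every unlabeled document with its most similar category name.
similarity = doc_vecs @ cat_vecs.T            # cosine similarity (embeddings are normalized)
pseudo_labels = similarity.argmax(axis=1)

# Step (ii): train a classifier on the pseudo-labels and use its prediction confidence.
clf = LogisticRegression(max_iter=1000).fit(doc_vecs, pseudo_labels)
confidence = clf.predict_proba(doc_vecs).max(axis=1)

threshold = 0.5                               # illustrative; how to set it is not covered here
for doc, conf in zip(documents, confidence):
    status = "OUT-OF-CATEGORY" if conf < threshold else "in-category"
    print(f"{conf:.2f}  {status}  {doc}")
```

The sketch only mirrors the overall control flow of the two steps; the paper's pseudo-labeling and classifier are more elaborate.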
Related papers
- FastClass: A Time-Efficient Approach to Weakly-Supervised Text Classification [14.918600168973564]
This paper proposes FastClass, an efficient weakly-supervised classification approach.
It uses dense text representations to retrieve class-relevant documents from an external unlabeled corpus.
Experiments show that the proposed approach frequently outperforms keyword-driven models in terms of classification accuracy and often enjoys orders-of-magnitude faster training speed.
arXiv Detail & Related papers (2022-12-11T13:43:22Z)
- Cluster-to-adapt: Few Shot Domain Adaptation for Semantic Segmentation across Disjoint Labels [80.05697343811893]
Cluster-to-Adapt (C2A) is a computationally efficient clustering-based approach for domain adaptation across segmentation datasets.
We show that a clustering objective enforced in a transformed feature space serves to automatically select categories across the source and target domains.
arXiv Detail & Related papers (2022-08-04T17:57:52Z)
- MotifClass: Weakly Supervised Text Classification with Higher-order Metadata Information [47.44278057062421]
We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only.
To be specific, we model the relationships between documents and metadata via a heterogeneous information network.
We propose a novel framework, named MotifClass, which selects category-indicative motif instances and then retrieves and generates pseudo-labeled training samples based on the category names and those motif instances.
arXiv Detail & Related papers (2021-11-07T07:39:10Z)
- DocSCAN: Unsupervised Text Classification via Learning from Neighbors [2.2082422928825145]
We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN).
For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels.
Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels.
arXiv Detail & Related papers (2021-05-09T21:20:31Z)
- Unsupervised Label Refinement Improves Dataless Text Classification [48.031421660674745]
Dataless text classification is capable of classifying documents into previously unseen labels by assigning a score to any document paired with a label description.
While promising, it crucially relies on accurate descriptions of the label set for each downstream task.
This reliance causes dataless classifiers to be highly sensitive to the choice of label descriptions and hinders the broader application of dataless classification in practice; a minimal sketch of this document-label scoring step appears after this list.
arXiv Detail & Related papers (2020-12-08T03:37:50Z)
- Text Classification Using Label Names Only: A Language Model Self-Training Approach [80.63885282358204]
Current text classification methods typically require a large number of human-labeled documents as training data.
We show that our model achieves around 90% accuracy on four benchmark datasets including topic and sentiment classification.
arXiv Detail & Related papers (2020-10-14T17:06:41Z)
- Few-shot Learning for Multi-label Intent Detection [59.66787898744991]
State-of-the-art work estimates label-instance relevance scores and uses a threshold to select multiple associated intent labels.
Experiments on two datasets show that the proposed model significantly outperforms strong baselines in both one-shot and five-shot settings.
arXiv Detail & Related papers (2020-10-11T14:42:18Z)
- Minimally Supervised Categorization of Text with Metadata [40.13841133991089]
We propose MetaCat, a minimally supervised framework to categorize text with metadata.
We develop a generative process describing the relationships between words, documents, labels, and metadata.
Based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity.
arXiv Detail & Related papers (2020-05-01T21:42:32Z)
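Several entries above, the dataless classification paper in particular, rest on scoring a document against a textual label description in a shared representation space. The sketch referenced in that entry follows; the encoder, the example label descriptions, and cosine similarity as the relevance score are assumptions for illustration, not details taken from any of the papers.

```python
# Minimal sketch of the document-vs-label-description scoring used by dataless
# classification, and of its sensitivity to how the label is described.
# The encoder, descriptions, and cosine-similarity score are assumptions.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")

document = "The central bank raised interest rates by half a point."

# Two hypothetical descriptions of the same target label, "economy".
descriptions = {
    "terse": "economy",
    "detailed": "monetary policy, inflation, interest rates, and financial markets",
}

doc_vec = encoder.encode([document], normalize_embeddings=True)[0]
for name, text in descriptions.items():
    label_vec = encoder.encode([text], normalize_embeddings=True)[0]
    score = float(doc_vec @ label_vec)        # cosine similarity (embeddings are normalized)
    print(f"{name:>8} description -> relevance score {score:.2f}")

# The gap between the two printed scores is the description sensitivity noted
# above; unsupervised label refinement aims to make such scores reliable
# without any labeled documents.
```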