DocSCAN: Unsupervised Text Classification via Learning from Neighbors
- URL: http://arxiv.org/abs/2105.04024v2
- Date: Tue, 11 May 2021 12:32:04 GMT
- Title: DocSCAN: Unsupervised Text Classification via Learning from Neighbors
- Authors: Dominik Stammbach, Elliott Ash
- Abstract summary: We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN).
For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels.
Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels.
- Score: 2.2082422928825145
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce DocSCAN, a completely unsupervised text classification approach
using Semantic Clustering by Adopting Nearest-Neighbors (SCAN). For each
document, we obtain semantically informative vectors from a large pre-trained
language model. Similar documents have proximate vectors, so neighbors in the
representation space tend to share topic labels. Our learnable clustering
approach uses pairs of neighboring datapoints as a weak learning signal. The
proposed approach learns to assign classes to the whole dataset without
provided ground-truth labels. On five topic classification benchmarks, we
improve on various unsupervised baselines by a large margin. In datasets with
relatively few and balanced outcome classes, DocSCAN approaches the performance
of supervised classification. The method fails for other types of
classification, such as sentiment analysis, pointing to important conceptual
and practical differences between classifying images and texts.
Related papers
- Lidar Panoptic Segmentation in an Open World [50.094491113541046]
Lidar Panoptic Segmentation (LPS) is crucial for the safe deployment of autonomous vehicles.
LPS aims to recognize and segment lidar points w.r.t. a pre-defined vocabulary of semantic classes.
We propose a class-agnostic point clustering and over-segment the input cloud in a hierarchical fashion, followed by binary point segment classification.
arXiv Detail & Related papers (2024-09-22T00:10:20Z)
- Leveraging Semantic Segmentation Masks with Embeddings for Fine-Grained Form Classification [0.0]
Efficient categorization of historical documents is crucial for fields such as genealogy, legal research and historical scholarship.
We propose a representational learning strategy that integrates deep learning models such as ResNet, masked Image Transformer (Di), and embedding segmentation.
arXiv Detail & Related papers (2024-05-23T04:28:50Z)
- CLIP-GCD: Simple Language Guided Generalized Category Discovery [21.778676607030253]
Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data.
Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods.
We propose to leverage multi-modal (vision and language) models, in two complementary ways.
arXiv Detail & Related papers (2023-05-17T17:55:33Z)
- Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
- Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification [78.83284164605473]
Funnelling (Fun) is a recently proposed method for cross-lingual text classification.
We describe Generalized Funnelling (gFun) as a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
arXiv Detail & Related papers (2021-09-17T23:33:04Z)
- Unsupervised Label Refinement Improves Dataless Text Classification [48.031421660674745]
Dataless text classification is capable of classifying documents into previously unseen labels by assigning a score to any document paired with a label description.
While promising, it crucially relies on accurate descriptions of the label set for each downstream task.
This reliance causes dataless classifiers to be highly sensitive to the choice of label descriptions and hinders the broader application of dataless classification in practice.
arXiv Detail & Related papers (2020-12-08T03:37:50Z)
- X-Class: Text Classification with Extremely Weak Supervision [39.25777650619999]
In this paper, we explore text classification with extremely weak supervision.
We propose a novel framework, X-Class, to realize adaptive representations.
X-Class can rival and even outperform seed-driven weakly supervised methods on 7 benchmark datasets.
arXiv Detail & Related papers (2020-10-24T06:09:51Z)
- Text Classification Using Label Names Only: A Language Model Self-Training Approach [80.63885282358204]
Current text classification methods typically require a good number of human-labeled documents as training data.
We show that our model achieves around 90% accuracy on four benchmark datasets including topic and sentiment classification.
arXiv Detail & Related papers (2020-10-14T17:06:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.