Towards Open-Domain Topic Classification
- URL: http://arxiv.org/abs/2306.17290v1
- Date: Thu, 29 Jun 2023 20:25:28 GMT
- Title: Towards Open-Domain Topic Classification
- Authors: Hantian Ding, Jinrui Yang, Yuqian Deng, Hongming Zhang, Dan Roth
- Abstract summary: We introduce an open-domain topic classification system that accepts a user-defined taxonomy in real time.
Users can classify a text snippet with respect to any candidate labels they want and get an instant response from our web interface.
- Score: 69.21234350688098
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce an open-domain topic classification system that accepts
a user-defined taxonomy in real time. Users can classify a text snippet with
respect to any candidate labels they want and get an instant response from our
web interface. To obtain such flexibility, we build the
backend model in a zero-shot way. By training on a new dataset constructed from
Wikipedia, our label-aware text classifier can effectively utilize implicit
knowledge in the pretrained language model to handle labels it has never seen
before. We evaluate our model across four datasets from various domains with
different label sets. Experiments show that the model significantly improves
over existing zero-shot baselines in open-domain scenarios, and performs
competitively with weakly-supervised models trained on in-domain data.
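For illustration, the label-aware zero-shot setup described above can be approximated with an off-the-shelf entailment-based classifier. The sketch below uses the Hugging Face `transformers` zero-shot pipeline with a public NLI checkpoint as a stand-in backend; it is not the authors' Wikipedia-trained model, and the example snippet and labels are invented.

```python
# Minimal sketch of label-aware zero-shot topic classification via NLI
# entailment. "facebook/bart-large-mnli" is a common public checkpoint,
# used here only as a stand-in for the paper's backend model.
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

snippet = "The central bank raised interest rates to curb inflation."
# User-defined taxonomy supplied at query time, as in the paper's web interface.
labels = ["economics", "sports", "politics", "technology"]

# multi_label=True scores each label independently, so labels need not
# be mutually exclusive.
result = classifier(snippet, candidate_labels=labels, multi_label=True)
for label, score in zip(result["labels"], result["scores"]):
    print(f"{label}: {score:.3f}")
```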
Related papers
- From Categories to Classifiers: Name-Only Continual Learning by Exploring the Web [118.67589717634281]
Continual learning often relies on the availability of extensive annotated datasets, an assumption that is unrealistic in practice because annotation is time-consuming and costly.
We explore a novel paradigm termed name-only continual learning where time and cost constraints prohibit manual annotation.
Our proposed solution leverages the expansive and ever-evolving internet to query and download uncurated webly-supervised data for image classification.
arXiv Detail & Related papers (2023-11-19T10:43:43Z)
- Using Psuedolabels for training Sentiment Classifiers makes the model generalize better across datasets [0.0]
For a public sentiment classification API, how can we build a classifier that works well on different types of data when our ability to annotate data from across domains is limited?
We show that given a large amount of unannotated data from different domains, together with pseudolabels on that data, we can train a sentiment classifier that generalizes better across datasets (see the self-training sketch after this list).
arXiv Detail & Related papers (2021-10-05T17:47:15Z)
- Zero-Shot Federated Learning with New Classes for Audio Classification [0.7106986689736827]
Federated learning is an effective way of extracting insights from different user devices.
New classes with completely unseen data distributions can stream across any device in a federated learning setting.
We propose a unified zero-shot framework to handle these challenges during federated learning.
arXiv Detail & Related papers (2021-06-18T09:32:19Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- Text Classification Using Label Names Only: A Language Model Self-Training Approach [80.63885282358204]
Current text classification methods typically require a large number of human-labeled documents as training data.
We show that our model achieves around 90% accuracy on four benchmark datasets, including topic and sentiment classification.
arXiv Detail & Related papers (2020-10-14T17:06:41Z)
- Automatic Discovery of Novel Intents & Domains from Text Utterances [18.39942131996558]
We propose a novel framework, ADVIN, to automatically discover novel domains and intents from large volumes of unlabeled data.
ADVIN significantly outperforms baselines on three benchmark datasets, and real user utterances from a commercial voice-powered agent.
arXiv Detail & Related papers (2020-05-22T00:47:10Z)
- Named Entity Recognition without Labelled Data: A Weak Supervision Approach [23.05371427663683]
This paper presents a simple but powerful approach to learning NER models in the absence of labelled data through weak supervision.
The approach relies on a broad spectrum of labelling functions to automatically annotate texts from the target domain.
A sequence labelling model can then be trained on this unified annotation (a toy labelling-function pipeline is sketched after this list).
arXiv Detail & Related papers (2020-04-30T12:29:55Z)
- Unsupervised Domain Clusters in Pretrained Language Models [61.832234606157286]
We show that massive pre-trained language models implicitly learn sentence representations that cluster by domains without supervision.
We propose domain data selection methods based on such models.
We evaluate our data selection methods for neural machine translation across five diverse domains.
arXiv Detail & Related papers (2020-04-05T06:22:16Z)
- Selecting Relevant Features from a Multi-domain Representation for Few-shot Classification [91.67977602992657]
We propose a new strategy based on feature selection, which is both simpler and more effective than previous feature adaptation approaches.
We show that a simple non-parametric classifier built on top of such features produces high accuracy and generalizes to domains never seen during training.
arXiv Detail & Related papers (2020-03-20T15:44:17Z)
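As referenced in the pseudolabel entry above, the core loop is plain self-training: label a large unannotated pool with a seed model, keep only confident predictions, and retrain on the union. The sketch below is a minimal toy version with an assumed TF-IDF plus logistic regression pipeline and an arbitrary confidence threshold; the data, threshold, and model are illustrative assumptions, not the paper's actual setup.

```python
# Toy self-training loop: pseudolabel an unannotated cross-domain pool
# with a seed classifier, then retrain on seed + confident pseudolabels.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

seed_texts = ["great movie", "terrible service", "loved it", "awful experience"]
seed_labels = [1, 0, 1, 0]  # tiny annotated seed set (invented)

# Unannotated texts from other domains (invented).
pool_texts = ["fantastic phone", "worst hotel ever", "really enjoyable", "broken on arrival"]

vec = TfidfVectorizer()
X_seed = vec.fit_transform(seed_texts)
clf = LogisticRegression().fit(X_seed, seed_labels)

# Pseudolabel the pool, keeping only confident predictions.
X_pool = vec.transform(pool_texts)
proba = clf.predict_proba(X_pool)
keep = proba.max(axis=1) >= 0.55  # arbitrary illustrative threshold

X_aug = np.vstack([X_seed.toarray(), X_pool.toarray()[keep]])
y_aug = np.concatenate([seed_labels, proba.argmax(axis=1)[keep]])

# Retrain on seed + pseudolabeled data; the paper's claim is that this
# improves generalization across datasets.
clf_final = LogisticRegression().fit(X_aug, y_aug)
print(clf_final.predict(vec.transform(["what a delightful surprise"])))
```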
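Similarly, for the weak-supervision NER entry, the sketch below shows the general shape of a labelling-function pipeline: several noisy heuristics vote on each token, and the aggregated labels become training data for a sequence model. The gazetteer, heuristics, and majority-vote aggregation are invented illustrations, not the paper's actual aggregation model.

```python
# Toy labelling-function pipeline for weakly supervised NER: each function
# emits a noisy per-token label, and a simple majority vote aggregates them.
from collections import Counter

GAZETTEER = {"paris": "LOC", "google": "ORG"}  # hypothetical lookup list

def lf_gazetteer(token):
    return GAZETTEER.get(token.lower(), "O")

def lf_capitalized(token):
    # Crude heuristic: capitalized tokens may be entities of unknown type.
    return "MISC" if token[:1].isupper() else "O"

def lf_suffix(token):
    return "ORG" if token.lower().endswith("corp") else "O"

def aggregate(token):
    # Majority vote over non-"O" labels; abstain ("O") if no function fires.
    votes = [lf(token) for lf in (lf_gazetteer, lf_capitalized, lf_suffix)]
    non_o = [v for v in votes if v != "O"]
    return Counter(non_o).most_common(1)[0][0] if non_o else "O"

sentence = "Google opened an office in Paris".split()
print([(tok, aggregate(tok)) for tok in sentence])
# The resulting (token, label) pairs would then train a sequence labelling model.
```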