MotifClass: Weakly Supervised Text Classification with Higher-order
Metadata Information
- URL: http://arxiv.org/abs/2111.04022v1
- Date: Sun, 7 Nov 2021 07:39:10 GMT
- Title: MotifClass: Weakly Supervised Text Classification with Higher-order
Metadata Information
- Authors: Yu Zhang, Shweta Garg, Yu Meng, Xiusi Chen, Jiawei Han
- Abstract summary: We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only.
To be specific, we model the relationships between documents and metadata via a heterogeneous information network.
We propose a novel framework, named MotifClass, which selects category-indicative motif instances, retrieves and generates pseudo-labeled training samples based on category names and indicative motif instances.
- Score: 47.44278057062421
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of weakly supervised text classification, which aims to
classify text documents into a set of pre-defined categories with category
surface names only and without any annotated training document provided. Most
existing approaches leverage textual information in each document. However, in
many domains, documents are accompanied by various types of metadata (e.g.,
authors, venue, and year of a research paper). These metadata and their
combinations may serve as strong category indicators in addition to textual
contents. In this paper, we explore the potential of using metadata to help
weakly supervised text classification. To be specific, we model the
relationships between documents and metadata via a heterogeneous information
network. To effectively capture higher-order structures in the network, we use
motifs to describe metadata combinations. We propose a novel framework, named
MotifClass, which (1) selects category-indicative motif instances, (2)
retrieves and generates pseudo-labeled training samples based on category names
and indicative motif instances, and (3) trains a text classifier using the
pseudo training data. Extensive experiments on real-world datasets demonstrate
the superior performance of MotifClass to existing weakly supervised text
classification approaches. Further analysis shows the benefit of considering
higher-order metadata information in our framework.
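
The three steps listed in the abstract can be illustrated with a small toy sketch. The abstract does not say how motif instances are scored or which classifier is trained, so everything concrete here is an assumption made for illustration and is not the authors' method: the toy corpus, the support and purity thresholds, the majority vote, and the naive Bayes classifier are all hypothetical. Motif instances are simplified to single metadata values (author, venue) and one higher-order combination (author-venue).

```python
import math
from collections import Counter, defaultdict

CATEGORIES = ["database", "vision"]  # category surface names (toy example)

# Hypothetical corpus: each document has text plus metadata.
DOCS = [
    {"id": 0, "text": "query optimization in a database engine", "authors": ["alice"], "venue": "VLDB"},
    {"id": 1, "text": "indexing for a distributed database", "authors": ["alice", "bob"], "venue": "VLDB"},
    {"id": 2, "text": "object detection for computer vision", "authors": ["carol"], "venue": "CVPR"},
    {"id": 3, "text": "vision transformers for image recognition", "authors": ["carol", "dave"], "venue": "CVPR"},
    {"id": 4, "text": "fast joins over compressed columns", "authors": ["alice"], "venue": "VLDB"},
    {"id": 5, "text": "segmentation of medical images", "authors": ["carol"], "venue": "CVPR"},
]

def motif_instances(doc):
    """Enumerate first-order (author, venue) and higher-order (author-venue) instances."""
    instances = {("venue", doc["venue"])}
    for author in doc["authors"]:
        instances.add(("author", author))
        instances.add(("author-venue", author, doc["venue"]))
    return instances

def seed_labels(docs):
    """Label a document only if exactly one category name occurs in its text."""
    seeds = {}
    for doc in docs:
        hits = [c for c in CATEGORIES if c in doc["text"].split()]
        if len(hits) == 1:
            seeds[doc["id"]] = hits[0]
    return seeds

def select_indicative(docs, seeds, min_support=2, min_purity=0.9):
    """Step 1 (assumed heuristic): keep instances that are frequent and pure among seeds."""
    counts = defaultdict(Counter)
    for doc in docs:
        if doc["id"] in seeds:
            for inst in motif_instances(doc):
                counts[inst][seeds[doc["id"]]] += 1
    selected = {}
    for inst, by_label in counts.items():
        label, n = by_label.most_common(1)[0]
        if n >= min_support and n / sum(by_label.values()) >= min_purity:
            selected[inst] = label
    return selected

def pseudo_label(docs, seeds, selected):
    """Step 2 (assumed heuristic): unlabeled documents take the majority vote of their instances."""
    labels = dict(seeds)
    for doc in docs:
        if doc["id"] not in labels:
            votes = Counter(selected[m] for m in motif_instances(doc) if m in selected)
            if votes:
                labels[doc["id"]] = votes.most_common(1)[0][0]
    return labels

def train_naive_bayes(docs, labels):
    """Step 3 (stand-in classifier): multinomial naive Bayes with add-one smoothing."""
    word_counts, class_counts = defaultdict(Counter), Counter()
    for doc in docs:
        if doc["id"] in labels:
            y = labels[doc["id"]]
            class_counts[y] += 1
            word_counts[y].update(doc["text"].split())
    vocab = {w for counter in word_counts.values() for w in counter}

    def predict(text):
        def score(y):
            total = sum(word_counts[y].values()) + len(vocab)
            prior = math.log(class_counts[y] / sum(class_counts.values()))
            return prior + sum(math.log((word_counts[y][w] + 1) / total) for w in text.split())
        return max(class_counts, key=score)
    return predict

seeds = seed_labels(DOCS)
selected = select_indicative(DOCS, seeds)
pseudo = pseudo_label(DOCS, seeds, selected)
predict = train_naive_bayes(DOCS, pseudo)
```

In this toy corpus, documents 4 and 5 mention no category name, so only their metadata can supply a pseudo label. The sketch is meant to show why metadata combinations can help when text alone gives no signal, not to reproduce the framework's actual scoring.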
Related papers
- Many-Class Text Classification with Matching [65.74328417321738]
We formulate Text Classification as a Matching problem between the text and the labels, and propose a simple yet effective framework named TCM.
Compared with previous text classification approaches, TCM takes advantage of the fine-grained semantic information of the classification labels.
arXiv Detail & Related papers (2022-05-23T15:51:19Z)
- Out-of-Category Document Identification Using Target-Category Names as
Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on
Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
- Hierarchical Metadata-Aware Document Categorization under Weak
Supervision [32.80303008934164]
We develop HiMeCat, an embedding-based generative framework for our task.
We propose a novel joint representation learning module that allows simultaneous modeling of category dependencies.
We introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set.
arXiv Detail & Related papers (2020-10-26T13:07:56Z)
- Text Classification Using Label Names Only: A Language Model
Self-Training Approach [80.63885282358204]
Current text classification methods typically require a good number of human-labeled documents as training data.
We show that our model achieves around 90% accuracy on four benchmark datasets including topic and sentiment classification.
arXiv Detail & Related papers (2020-10-14T17:06:41Z)
- Minimally Supervised Categorization of Text with Metadata [40.13841133991089]
We propose MetaCat, a minimally supervised framework to categorize text with metadata.
We develop a generative process describing the relationships between words, documents, labels, and metadata.
Based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity.
arXiv Detail & Related papers (2020-05-01T21:42:32Z)
- Description Based Text Classification with Reinforcement Learning [34.18824470728299]
We propose a new framework for text classification, in which each category label is associated with a category description.
We observe significant performance boosts over strong baselines on a wide range of text classification tasks.
arXiv Detail & Related papers (2020-02-08T02:14:28Z)