Minimally Supervised Categorization of Text with Metadata
- URL: http://arxiv.org/abs/2005.00624v3
- Date: Sat, 13 Nov 2021 05:48:47 GMT
- Title: Minimally Supervised Categorization of Text with Metadata
- Authors: Yu Zhang, Yu Meng, Jiaxin Huang, Frank F. Xu, Xuan Wang, Jiawei Han
- Abstract summary: We propose MetaCat, a minimally supervised framework to categorize text with metadata.
We develop a generative process describing the relationships between words, documents, labels, and metadata.
Based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity.
- Score: 40.13841133991089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Document categorization, which aims to assign a topic label to each document,
plays a fundamental role in a wide variety of applications. Despite the success
of existing studies in conventional supervised document classification, they
are less concerned with two real problems: (1) the presence of metadata: in
many domains, text is accompanied by various additional information such as
authors and tags. Such metadata serve as compelling topic indicators and should
be leveraged into the categorization framework; (2) label scarcity: labeled
training samples are expensive to obtain in some cases, where categorization
needs to be performed using only a small set of annotated data. In recognition
of these two challenges, we propose MetaCat, a minimally supervised framework
to categorize text with metadata. Specifically, we develop a generative process
describing the relationships between words, documents, labels, and metadata.
Guided by the generative model, we embed text and metadata into the same
semantic space to encode heterogeneous signals. Then, based on the same
generative process, we synthesize training samples to address the bottleneck of
label scarcity. We conduct a thorough evaluation on a wide range of datasets.
Experimental results prove the effectiveness of MetaCat over many competitive
baselines.
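The abstract's idea of synthesizing training samples from a generative process over a shared embedding space can be illustrated with a toy sketch. This is not the authors' implementation: the 2-d embeddings, the vocabulary, the softmax-over-dot-products form of p(word | label), the temperature value, and the function names `word_distribution` and `synthesize_document` are all assumptions made for illustration.

```python
import math
import random

# Hypothetical 2-d embeddings. In a real system these would be learned
# jointly from text and metadata; the values here are made up.
WORD_VECS = {
    "goal": (0.9, 0.1),
    "match": (0.8, 0.2),
    "election": (0.1, 0.9),
    "senate": (0.2, 0.8),
}
LABEL_VECS = {"sports": (1.0, 0.0), "politics": (0.0, 1.0)}


def word_distribution(label, temperature=0.1):
    """Return p(word | label) as a softmax over embedding dot products."""
    lx, ly = LABEL_VECS[label]
    scores = {
        w: (x * lx + y * ly) / temperature for w, (x, y) in WORD_VECS.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {w: math.exp(s - top) for w, s in scores.items()}
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


def synthesize_document(label, length=8, seed=None):
    """Sample a pseudo-document (a list of words) for the given label."""
    rng = random.Random(seed)
    dist = word_distribution(label)
    words = list(dist)
    weights = [dist[w] for w in words]
    return rng.choices(words, weights=weights, k=length)


if __name__ == "__main__":
    # Each (label, pseudo-document) pair could serve as an extra training
    # sample when only a few annotated documents are available.
    for label in LABEL_VECS:
        print(label, synthesize_document(label, seed=0))
```

Under these assumptions, words whose embeddings lie close to a label's embedding are sampled more often, so the generated pseudo-documents are skewed toward label-indicative vocabulary.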
Related papers
- Metadata-Induced Contrastive Learning for Zero-Shot Multi-Label Text Classification [27.33039900612395]
We propose a novel metadata-induced contrastive learning (MICoL) method for large-scale multi-label text classification.
MICoL exploits document metadata, which are widely available on the Web, to derive similar document-document pairs.
We show that MICoL significantly outperforms strong zero-shot text classification and contrastive learning baselines.
(2022-02-11T23:22:17Z)
- Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
(2021-11-24T21:01:25Z)
- MotifClass: Weakly Supervised Text Classification with Higher-order Metadata Information [47.44278057062421]
We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only.
To be specific, we model the relationships between documents and metadata via a heterogeneous information network.
We propose a novel framework, named MotifClass, which selects category-indicative motif instances, retrieves and generates pseudo-labeled training samples based on category names and indicative motif instances.
(2021-11-07T07:39:10Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
(2021-02-23T04:14:34Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
(2021-02-15T05:23:08Z)
- Unsupervised Label Refinement Improves Dataless Text Classification [48.031421660674745]
Dataless text classification is capable of classifying documents into previously unseen labels by assigning a score to any document paired with a label description.
While promising, it crucially relies on accurate descriptions of the label set for each downstream task.
This reliance causes dataless classifiers to be highly sensitive to the choice of label descriptions and hinders the broader application of dataless classification in practice.
(2020-12-08T03:37:50Z)
- Hierarchical Metadata-Aware Document Categorization under Weak Supervision [32.80303008934164]
We develop HiMeCat, an embedding-based generative framework for our task.
We propose a novel joint representation learning module that allows simultaneous modeling of category dependencies.
We introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set.
(2020-10-26T13:07:56Z)
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-agnostic manner.
Our solution also incorporates metadata explicitly rather than just augmenting them with text.
(2020-10-23T21:52:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.