Minimally-Supervised Structure-Rich Text Categorization via Learning on
Text-Rich Networks
- URL: http://arxiv.org/abs/2102.11479v1
- Date: Tue, 23 Feb 2021 04:14:34 GMT
- Title: Minimally-Supervised Structure-Rich Text Categorization via Learning on
Text-Rich Networks
- Authors: Xinyang Zhang, Chenwei Zhang, Xin Luna Dong, Jingbo Shang, Jiawei Han
- Abstract summary: We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
- Score: 61.23408995934415
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Text categorization is an essential task in Web content analysis. Considering
the ever-evolving Web data and new emerging categories, instead of the
laborious supervised setting, in this paper, we focus on the
minimally-supervised setting that aims to categorize documents effectively,
with a couple of seed documents annotated per category. We recognize that texts
collected from the Web are often structure-rich, i.e., accompanied by various
metadata. One can easily organize the corpus into a text-rich network, joining
raw text documents with document attributes, high-quality phrases, label
surface names as nodes, and their associations as edges. Such a network
provides a holistic view of the corpus' heterogeneous data sources and enables
a joint optimization for network-based analysis and deep textual model
training. We therefore propose a novel framework for minimally supervised
categorization by learning from the text-rich network. Specifically, we jointly
train two modules with different inductive biases -- a text analysis module for
text understanding and a network learning module for class-discriminative,
scalable network learning. Each module generates pseudo training labels from
the unlabeled document set, and both modules mutually enhance each other by
co-training using pooled pseudo labels. We test our model on two real-world
datasets. On the challenging e-commerce product categorization dataset with 683
categories, our experiments show that given only three seed documents per
category, our framework can achieve an accuracy of about 92%, significantly
outperforming all compared methods; our accuracy is within 2% of a supervised
BERT model trained on about 50K labeled documents.
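
The text-rich network described in the abstract above can be pictured as a small heterogeneous graph. Below is a minimal Python sketch of that construction; the document schema (`id`, `text`, `attributes`, `phrases`), node types, and surface-name matching rule are illustrative assumptions, not the authors' data model.

```python
from collections import defaultdict

def build_text_rich_network(docs, label_names):
    """Join raw documents with document attributes, quality phrases,
    and label surface names as typed nodes; associations become edges.
    The schema here is a hypothetical illustration."""
    edges = defaultdict(set)
    for doc in docs:
        d = ("doc", doc["id"])
        for attr in doc["attributes"]:        # metadata, e.g. brand
            edges[d].add(("attr", attr))
        for phrase in doc["phrases"]:         # mined quality phrases
            edges[d].add(("phrase", phrase))
        for name in label_names:              # label surface names
            if name.lower() in doc["text"].lower():
                edges[d].add(("label", name))
    return edges
```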
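The joint training of the two modules amounts to a co-training loop over pooled pseudo labels. A hedged sketch follows; `text_module` and `net_module` are hypothetical stand-ins for the paper's two classifiers, assumed only to expose `fit` and `predict`, and the confidence threshold is likewise an assumption.

```python
def co_train(text_module, net_module, labeled, unlabeled,
             rounds=5, threshold=0.9):
    """Each module pseudo-labels the unlabeled documents; confident,
    pooled pseudo labels then retrain both modules each round.
    Interfaces and the threshold value are assumptions."""
    for _ in range(rounds):
        pseudo = {}
        for doc in unlabeled:
            for module in (text_module, net_module):
                label, confidence = module.predict(doc)
                if confidence >= threshold:   # keep confident votes only
                    pseudo[doc["id"]] = label
        pool = labeled + [(doc, pseudo[doc["id"]])
                          for doc in unlabeled if doc["id"] in pseudo]
        text_module.fit(pool)                 # mutual enhancement via
        net_module.fit(pool)                  # shared pooled labels
    return text_module, net_module
```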
Related papers
- Weakly Supervised Multi-Label Classification of Full-Text Scientific Papers [29.295941972777978]
We propose FUTEX, a framework that uses the cross-paper network structure and the in-paper hierarchy structure to classify full-text scientific papers under weak supervision.
A network-aware contrastive fine-tuning module and a hierarchy-aware aggregation module are designed to leverage the two types of structural signals.
arXiv Detail & Related papers (2023-06-24T15:27:55Z)
- Patton: Language Model Pretraining on Text-Rich Networks [33.914163727649466]
We propose Patton (PretrAining on TexT-Rich NetwOrk), a pretraining framework for text-rich networks.
Patton includes two pretraining strategies: network-contextualized masked language modeling and masked node prediction.
We conduct experiments on four downstream tasks in five datasets from both academic and e-commerce domains.
arXiv Detail & Related papers (2023-05-20T19:17:10Z)
- TRIE++: Towards End-to-End Information Extraction from Visually Rich Documents [51.744527199305445]
This paper proposes a unified end-to-end information extraction framework from visually rich documents.
Text reading and information extraction can reinforce each other via a well-designed multi-modal context block.
The framework can be trained end-to-end, achieving global optimization.
arXiv Detail & Related papers (2022-07-14T08:52:07Z)
- TeKo: Text-Rich Graph Neural Networks with External Knowledge [75.91477450060808]
We propose a novel text-rich graph neural network with external knowledge (TeKo).
We first present a flexible heterogeneous semantic network that incorporates high-quality entities.
We then introduce two types of external knowledge: structured triplets and unstructured entity descriptions.
arXiv Detail & Related papers (2022-06-15T02:33:10Z)
- MotifClass: Weakly Supervised Text Classification with Higher-order Metadata Information [47.44278057062421]
We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only.
To be specific, we model the relationships between documents and metadata via a heterogeneous information network.
We propose a novel framework, named MotifClass, which selects category-indicative motif instances, then retrieves and generates pseudo-labeled training samples based on the category names and selected motif instances.
arXiv Detail & Related papers (2021-11-07T07:39:10Z)
- Pre-training Language Model Incorporating Domain-specific Heterogeneous Knowledge into A Unified Representation [49.89831914386982]
We propose a unified pre-trained language model (PLM) for all forms of text, including unstructured text, semi-structured text, and well-structured text.
Our approach outperforms pre-training on plain text while using only 1/4 of the data.
arXiv Detail & Related papers (2021-09-02T16:05:24Z)
- Hierarchical Metadata-Aware Document Categorization under Weak Supervision [32.80303008934164]
We develop HiMeCat, an embedding-based generative framework for our task.
We propose a novel joint representation learning module that allows simultaneous modeling of category dependencies.
We introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set.
arXiv Detail & Related papers (2020-10-26T13:07:56Z)
- Minimally Supervised Categorization of Text with Metadata [40.13841133991089]
We propose MetaCat, a minimally supervised framework to categorize text with metadata.
We develop a generative process describing the relationships between words, documents, labels, and metadata.
Based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity.
arXiv Detail & Related papers (2020-05-01T21:42:32Z)
- Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation [50.01708049531156]
We focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer.
Specifically, the input is a set of structured records and a reference text describing another recordset.
The output is a summary that accurately describes the partial content of the source recordset, written in the same style as the reference.
arXiv Detail & Related papers (2020-02-24T12:52:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.