Weakly Supervised Multi-Label Classification of Full-Text Scientific
Papers
- URL: http://arxiv.org/abs/2306.14003v1
- Date: Sat, 24 Jun 2023 15:27:55 GMT
- Title: Weakly Supervised Multi-Label Classification of Full-Text Scientific
Papers
- Authors: Yu Zhang, Bowen Jin, Xiusi Chen, Yanzhen Shen, Yunyi Zhang, Yu Meng,
Jiawei Han
- Abstract summary: We propose FUTEX, a framework that uses the cross-paper network structure and the in-paper hierarchy structure to classify full-text scientific papers under weak supervision.
A network-aware contrastive fine-tuning module and a hierarchy-aware aggregation module are designed to leverage the two types of structural signals.
- Score: 29.295941972777978
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instead of relying on human-annotated training samples to build a classifier,
weakly supervised scientific paper classification aims to classify papers only
using category descriptions (e.g., category names, category-indicative
keywords). Existing studies on weakly supervised paper classification are less
concerned with two challenges: (1) Papers should be classified into not only
coarse-grained research topics but also fine-grained themes, and potentially
into multiple themes, given a large and fine-grained label space; and (2) full
text should be utilized to complement the paper title and abstract for
classification. Moreover, instead of viewing the entire paper as a long linear
sequence, one should exploit the structural information such as citation links
across papers and the hierarchy of sections and paragraphs in each paper. To
tackle these challenges, in this study, we propose FUTEX, a framework that uses
the cross-paper network structure and the in-paper hierarchy structure to
classify full-text scientific papers under weak supervision. A network-aware
contrastive fine-tuning module and a hierarchy-aware aggregation module are
designed to leverage the two types of structural signals, respectively.
Experiments on two benchmark datasets demonstrate that FUTEX significantly
outperforms competitive baselines and is on par with fully supervised
classifiers that use 1,000 to 60,000 ground-truth training samples.
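The weak-supervision setting described in the abstract (labels assigned from category descriptions alone, possibly several per paper, using full text) can be illustrated with a toy sketch. This is not FUTEX: bag-of-words cosine similarity stands in for a learned encoder, taking the maximum over paragraphs is only a crude stand-in for hierarchy-aware aggregation, and the category descriptions, sample paragraphs, function names, and the 0.2 threshold are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercase, split on whitespace, and count tokens."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def classify(paragraphs, category_descriptions, threshold=0.2):
    """Return every category whose best paragraph-level similarity
    reaches the threshold (multi-label: zero, one, or many labels)."""
    labels = []
    for name, description in category_descriptions.items():
        desc_vec = bag_of_words(description)
        # Aggregate over the paper's paragraphs by taking the best match.
        score = max(
            (cosine(bag_of_words(p), desc_vec) for p in paragraphs),
            default=0.0,
        )
        if score >= threshold:
            labels.append(name)
    return labels


# Hypothetical category descriptions (names plus indicative keywords).
CATEGORIES = {
    "text classification": "text classification document categorization labels",
    "graph mining": "graph network citation links nodes",
}

# Hypothetical full text, split into paragraphs.
PAPER = [
    "we study text classification of documents with many labels",
    "citation links form a network over papers",
    "experiments use two benchmark datasets",
]

predicted = classify(PAPER, CATEGORIES)
print(predicted)
```

No labeled training papers are used: the only supervision is the text of each category description, and a paper can receive more than one label when several descriptions match different paragraphs.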
Related papers
- Hierarchical Multi-Label Classification of Scientific Documents [47.293189105900524]
We introduce a new dataset for hierarchical multi-label text classification of scientific papers called SciHTC.
This dataset contains 186,160 papers and 1,233 categories from the ACM CCS tree.
Our best model achieves a Macro-F1 score of 34.57%, which shows that this dataset provides significant research opportunities.
arXiv Detail & Related papers (2022-11-05T04:12:57Z)
- TaxoCom: Topic Taxonomy Completion with Hierarchical Discovery of Novel Topic Clusters [57.59286394188025]
We propose a novel framework for topic taxonomy completion, named TaxoCom.
TaxoCom discovers novel sub-topic clusters of terms and documents.
Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom generates a high-quality topic taxonomy in terms of term coherency and topic coverage.
arXiv Detail & Related papers (2022-01-18T07:07:38Z)
- Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- MotifClass: Weakly Supervised Text Classification with Higher-order Metadata Information [47.44278057062421]
We study the problem of weakly supervised text classification, which aims to classify text documents into a set of pre-defined categories with category surface names only.
To be specific, we model the relationships between documents and metadata via a heterogeneous information network.
We propose a novel framework, named MotifClass, which selects category-indicative motif instances, retrieves and generates pseudo-labeled training samples based on category names and indicative motif instances.
arXiv Detail & Related papers (2021-11-07T07:39:10Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
- Hierarchical Metadata-Aware Document Categorization under Weak Supervision [32.80303008934164]
We develop HiMeCat, an embedding-based generative framework for our task.
We propose a novel joint representation learning module that allows simultaneous modeling of category dependencies.
We introduce a data augmentation module that hierarchically synthesizes training documents to complement the original, small-scale training set.
arXiv Detail & Related papers (2020-10-26T13:07:56Z)
- Description Based Text Classification with Reinforcement Learning [34.18824470728299]
We propose a new framework for text classification, in which each category label is associated with a category description.
We observe significant performance boosts over strong baselines on a wide range of text classification tasks.
arXiv Detail & Related papers (2020-02-08T02:14:28Z)