Hierarchical Multi-Label Classification of Scientific Documents
- URL: http://arxiv.org/abs/2211.02810v1
- Date: Sat, 5 Nov 2022 04:12:57 GMT
- Title: Hierarchical Multi-Label Classification of Scientific Documents
- Authors: Mobashir Sadat, Cornelia Caragea
- Abstract summary: We introduce a new dataset for hierarchical multi-label text classification of scientific papers called SciHTC.
This dataset contains 186,160 papers and 1,233 categories from the ACM CCS tree.
Our best model achieves a Macro-F1 score of 34.57% which shows that this dataset provides significant research opportunities.
- Score: 47.293189105900524
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic topic classification has been studied extensively to assist
managing and indexing scientific documents in a digital collection. With the
large number of topics being available in recent years, it has become necessary
to arrange them in a hierarchy. Therefore, the automatic classification systems
need to be able to classify the documents hierarchically. In addition, each
paper is often assigned to more than one relevant topic. For example, a paper
can be assigned to several topics in a hierarchy tree. In this paper, we
introduce a new dataset for hierarchical multi-label text classification
(HMLTC) of scientific papers called SciHTC, which contains 186,160 papers and
1,233 categories from the ACM CCS tree. We establish strong baselines for HMLTC
and propose a multi-task learning approach for topic classification with
keyword labeling as an auxiliary task. Our best model achieves a Macro-F1 score
of 34.57% which shows that this dataset provides significant research
opportunities on hierarchical scientific topic classification. We make our
dataset and code available on Github.
Related papers
- Are Large Language Models Good Classifiers? A Study on Edit Intent Classification in Scientific Document Revisions [62.12545440385489]
Large language models (LLMs) have brought substantial advancements in text generation, but their potential for enhancing classification tasks remains underexplored.
We propose a framework for thoroughly investigating fine-tuning LLMs for classification, including both generation- and encoding-based approaches.
We instantiate this framework in edit intent classification (EIC), a challenging and underexplored classification task.
arXiv Detail & Related papers (2024-10-02T20:48:28Z) - Learning Section Weights for Multi-Label Document Classification [4.74495279742457]
Multi-label document classification is a traditional task in NLP.
We propose a new method called Learning Section Weights (LSW)
LSW learns to assign weights to each section of, and incorporate the weights in the prediction.
arXiv Detail & Related papers (2023-11-26T19:56:19Z) - Recent Advances in Hierarchical Multi-label Text Classification: A
Survey [11.709847202580505]
Hierarchical multi-label text classification aims to classify the input text into multiple labels, among which the labels are structured and hierarchical.
It is a vital task in many real world applications, e.g. scientific literature archiving.
arXiv Detail & Related papers (2023-07-30T16:13:00Z) - Weakly Supervised Multi-Label Classification of Full-Text Scientific
Papers [29.295941972777978]
We proposeEX, a framework that uses the cross-paper network structure and the in-paper hierarchy structure to classify full-text scientific papers under weak supervision.
A network-aware contrastive fine-tuning module and a hierarchy-aware aggregation module are designed to leverage the two types of structural signals.
arXiv Detail & Related papers (2023-06-24T15:27:55Z) - Seeded Hierarchical Clustering for Expert-Crafted Taxonomies [48.10324642720299]
We propose HierSeed, a weakly supervised algorithm for fitting unlabeled corpora.
It is both data and efficient.
It outperforms both unsupervised and supervised baselines for the SHC task on three real-world datasets.
arXiv Detail & Related papers (2022-05-23T19:58:06Z) - Many-Class Text Classification with Matching [65.74328417321738]
We formulate textbfText textbfClassification as a textbfMatching problem between the text and the labels, and propose a simple yet effective framework named TCM.
Compared with previous text classification approaches, TCM takes advantage of the fine-grained semantic information of the classification labels.
arXiv Detail & Related papers (2022-05-23T15:51:19Z) - TaxoCom: Topic Taxonomy Completion with Hierarchical Discovery of Novel
Topic Clusters [57.59286394188025]
We propose a novel framework for topic taxonomy completion, named TaxoCom.
TaxoCom discovers novel sub-topic clusters of terms and documents.
Our comprehensive experiments on two real-world datasets demonstrate that TaxoCom not only generates the high-quality topic taxonomy in terms of term coherency and topic coverage.
arXiv Detail & Related papers (2022-01-18T07:07:38Z) - Label Hierarchy Transition: Delving into Class Hierarchies to Enhance
Deep Classifiers [40.993137740456014]
We propose a unified probabilistic framework based on deep learning to address the challenges of hierarchical classification.
The proposed framework can be readily adapted to any existing deep network with only minor modifications.
We extend our proposed LHT framework to the skin lesion diagnosis task and validate its great potential in computer-aided diagnosis.
arXiv Detail & Related papers (2021-12-04T14:58:36Z) - MATCH: Metadata-Aware Text Classification in A Large Hierarchy [60.59183151617578]
MATCH is an end-to-end framework that leverages both metadata and hierarchy information.
We propose different ways to regularize the parameters and output probability of each child label by its parents.
Experiments on two massive text datasets with large-scale label hierarchies demonstrate the effectiveness of MATCH.
arXiv Detail & Related papers (2021-02-15T05:23:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.