AutoMSC: Automatic Assignment of Mathematics Subject Classification
Labels
- URL: http://arxiv.org/abs/2005.12099v2
- Date: Mon, 9 Nov 2020 07:12:34 GMT
- Title: AutoMSC: Automatic Assignment of Mathematics Subject Classification
Labels
- Authors: Moritz Schubotz and Philipp Scharpf and Olaf Teschke and Andreas
Kuehnemund and Corinna Breitinger and Bela Gipp
- Abstract summary: We investigate the feasibility of automatically assigning a coarse-grained primary classification using the Mathematics Subject Classification scheme.
We find that our method achieves an $F_1$-score of over 77%, which is remarkably close to the agreement of zbMATH and MR.
- Score: 4.001125251113153
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Authors of research papers in mathematics and other math-heavy
disciplines commonly employ the Mathematics Subject Classification (MSC) scheme
to search for relevant literature. The MSC is a hierarchical alphanumerical
classification scheme that allows librarians to specify one or multiple codes
for publications. Digital libraries in mathematics, as well as reviewing
services such as zbMATH and Mathematical Reviews (MR), rely on these MSC labels
in their workflows to organize the abstracting and reviewing process.
In particular, the coarse-grained classification determines the subject editor
who is responsible for the actual reviewing process.
In this paper, we investigate the feasibility of automatically assigning a
coarse-grained primary classification using the MSC scheme, by regarding the
problem as a multi-class classification machine learning task. We find that our
method achieves an $F_1$-score of over 77%, which is remarkably close to the
agreement of zbMATH and MR ($F_1$-score of 81%). Moreover, we find that the
method's confidence score allows for reducing the effort by 86% compared to the
manual coarse-grained classification effort while maintaining a precision of
81% for automatically classified articles.
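The trade-off between automation and precision described in the abstract can be illustrated with a minimal sketch. It assumes that each article receives one predicted coarse-grained MSC class together with a confidence value, and that only predictions at or above a cutoff are accepted automatically while the rest go to manual classification. The names `Prediction` and `selective_stats`, the cutoff value, and the toy data are hypothetical and are not taken from the paper; they do not reproduce its method or its numbers.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Prediction(NamedTuple):
    predicted: str     # predicted coarse-grained MSC class, e.g. "68"
    confidence: float  # classifier confidence in [0, 1]
    actual: str        # class assigned by a human editor


def selective_stats(
    predictions: Sequence[Prediction], threshold: float
) -> Tuple[float, Optional[float]]:
    """Return (automated_fraction, precision) for a confidence cutoff.

    automated_fraction: share of articles accepted without manual work.
    precision: share of accepted articles whose predicted class is correct,
    or None if no article reaches the cutoff.
    """
    accepted = [p for p in predictions if p.confidence >= threshold]
    if not accepted:
        return 0.0, None
    automated_fraction = len(accepted) / len(predictions)
    correct = sum(1 for p in accepted if p.predicted == p.actual)
    return automated_fraction, correct / len(accepted)


# Toy data, invented for illustration only.
toy = [
    Prediction("68", 0.95, "68"),
    Prediction("35", 0.90, "35"),
    Prediction("11", 0.80, "11"),
    Prediction("60", 0.70, "62"),  # a wrong prediction with lower confidence
    Prediction("05", 0.40, "05"),
]

fraction, precision = selective_stats(toy, threshold=0.75)
print(f"automated: {fraction:.0%}, precision: {precision:.0%}")
```

Lowering the cutoff accepts more articles automatically but admits more low-confidence errors; in the toy data, a cutoff of 0.5 raises the automated share from 60% to 80% while precision drops from 100% to 75%.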
Related papers
- Prompt Tuned Embedding Classification for Multi-Label Industry Sector Allocation [2.024620791810963]
This study benchmarks the performance of Prompt Tuning and baselines for multi-label text classification.
It is applied to classifying companies into an investment firm's proprietary industry taxonomy.
We confirm that the model's performance is consistent across both well-known and less-known companies.
arXiv Detail & Related papers (2023-09-21T13:45:32Z)
- Pearson-Matthews correlation coefficients for binary and multinary classification and hypothesis testing [6.974999794070285]
Multinary classification is the main focus of this paper.
We show that both $\text{R}_{\text{K}}$ and the MPC metrics suffer from the problem of not decisively indicating poor classification results when they should.
We also present an additional new metric for multinary classification which can be viewed as a direct extension of MCC.
arXiv Detail & Related papers (2023-05-10T08:32:36Z)
- A Gold Standard Dataset for the Reviewer Assignment Problem [117.59690218507565]
"Similarity score" is a numerical estimate of the expertise of a reviewer in reviewing a paper.
Our dataset consists of 477 self-reported expertise scores provided by 58 researchers.
For the task of ordering two papers in terms of their relevance for a reviewer, the error rates range from 12%-30% in easy cases to 36%-43% in hard cases.
arXiv Detail & Related papers (2023-03-23T16:15:03Z)
- Document Provenance and Authentication through Authorship Classification [5.2545206693029884]
We propose an ensemble-based text-processing framework for the classification of single and multi-authored documents.
The proposed framework incorporates several state-of-the-art text classification algorithms.
The framework is evaluated on a large-scale benchmark dataset.
arXiv Detail & Related papers (2023-03-02T12:26:03Z)
- Many-Class Text Classification with Matching [65.74328417321738]
We formulate Text Classification as a Matching problem between the text and the labels, and propose a simple yet effective framework named TCM.
Compared with previous text classification approaches, TCM takes advantage of the fine-grained semantic information of the classification labels.
arXiv Detail & Related papers (2022-05-23T15:51:19Z)
- Rank4Class: A Ranking Formulation for Multiclass Classification [26.47229268790206]
Multiclass classification (MCC) is a fundamental machine learning problem.
We show that it is easy to boost MCC performance with a novel formulation through the lens of ranking.
arXiv Detail & Related papers (2021-12-17T19:22:37Z)
- CLICKER: A Computational LInguistics Classification Scheme for Educational Resources [47.48935730905393]
A classification scheme of a scientific subject gives an overview of its body of knowledge.
A comprehensive classification system like CCS or Mathematics Subject Classification (MSC) does not exist for Computational Linguistics (CL) and Natural Language Processing (NLP).
We propose a classification scheme -- CLICKER for CL/NLP based on the analysis of online lectures from 77 university courses on this subject.
arXiv Detail & Related papers (2021-12-16T02:40:43Z)
- Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations [0.456877715768796]
We investigate mathematical content representations suitable for the automated classification of and the similarity search in STEM documents.
The methods are evaluated on a subset of arXiv.org papers with the Mathematics Subject Classification (MSC) as a reference classification.
arXiv Detail & Related papers (2021-10-08T11:27:40Z)
- CoPHE: A Count-Preserving Hierarchical Evaluation Metric in Large-Scale Multi-Label Text Classification [70.554573538777]
We argue for hierarchical evaluation of the predictions of neural LMTC models.
We describe a structural issue in the representation of the structured label space in prior art.
We propose a set of metrics for hierarchical evaluation using the depth-based representation.
arXiv Detail & Related papers (2021-09-10T13:09:12Z)
- Binary Classification from Multiple Unlabeled Datasets via Surrogate Set Classification [94.55805516167369]
We propose a new approach for binary classification from $m$ U-sets for $m \ge 2$.
Our key idea is to consider an auxiliary classification task called surrogate set classification (SSC).
arXiv Detail & Related papers (2021-02-01T07:36:38Z)
- Multitask Learning for Class-Imbalanced Discourse Classification [74.41900374452472]
We show that a multitask approach can improve the Micro F1-score by 7% over current state-of-the-art benchmarks.
We also offer a comparative review of additional techniques proposed to address resource-poor problems in NLP.
arXiv Detail & Related papers (2021-01-02T07:13:41Z)