Classification and Clustering of arXiv Documents, Sections, and
Abstracts, Comparing Encodings of Natural and Mathematical Language
- URL: http://arxiv.org/abs/2005.11021v1
- Date: Fri, 22 May 2020 06:16:32 GMT
- Title: Classification and Clustering of arXiv Documents, Sections, and
Abstracts, Comparing Encodings of Natural and Mathematical Language
- Authors: Philipp Scharpf, Moritz Schubotz, Abdou Youssef, Felix Hamborg, Norman
Meuschke, Bela Gipp
- Abstract summary: We show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content.
Our encodings achieve classification accuracies up to $82.8\%$ and cluster purities up to $69.4\%$.
We show that the computer outperforms a human expert when classifying documents.
- Score: 8.522576207528017
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In this paper, we show how selecting and combining encodings of natural and
mathematical language affect classification and clustering of documents with
mathematical content. We demonstrate this by using sets of documents, sections,
and abstracts from the arXiv preprint server that are labeled by their subject
class (mathematics, computer science, physics, etc.) to compare different
encodings of text and formulae and evaluate the performance and runtimes of
selected classification and clustering algorithms. Our encodings achieve
classification accuracies up to $82.8\%$ and cluster purities up to $69.4\%$
(number of clusters equals number of classes) and $99.9\%$ (unspecified number
of clusters), respectively. We observe a relatively low correlation between text
and math similarity, which indicates the independence of text and formulae and
motivates treating them as separate features of a document. The classification
and clustering can be employed, e.g., for document search and recommendation.
Furthermore, we show that the computer outperforms a human expert when
classifying documents. Finally, we evaluate and discuss multi-label
classification and formula semantification.
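The abstract reports cluster purities, a standard metric for evaluating clustering against known labels: each cluster is assigned its majority class, and purity is the fraction of documents matching their cluster's majority class. A minimal sketch, using hypothetical toy labels (the actual arXiv subject classes and clusterings are not reproduced here):

```python
from collections import Counter

def cluster_purity(cluster_labels, class_labels):
    """Purity: each cluster 'votes' for its majority class;
    purity = majority-class documents / total documents."""
    clusters = {}
    for c, y in zip(cluster_labels, class_labels):
        clusters.setdefault(c, []).append(y)
    majority_total = sum(Counter(members).most_common(1)[0][1]
                         for members in clusters.values())
    return majority_total / len(class_labels)

# Toy example: 6 documents, 2 clusters, hypothetical subject classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["math", "math", "cs", "cs", "cs", "physics"]
print(cluster_purity(clusters, classes))  # 4/6 ≈ 0.667
```

Note that purity trivially increases with the number of clusters (one document per cluster yields 100%), which is why the paper reports both a fixed-cluster-count figure (69.4%) and an unconstrained one (99.9%).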
Related papers
- CLIP-GCD: Simple Language Guided Generalized Category Discovery [21.778676607030253]
Generalized Category Discovery (GCD) requires a model to both classify known categories and cluster unknown categories in unlabeled data.
Prior methods leveraged self-supervised pre-training combined with supervised fine-tuning on the labeled data, followed by simple clustering methods.
We propose to leverage multi-modal (vision and language) models, in two complementary ways.
arXiv Detail & Related papers (2023-05-17T17:55:33Z)
- Hierarchical Multi-Label Classification of Scientific Documents [47.293189105900524]
We introduce a new dataset for hierarchical multi-label text classification of scientific papers called SciHTC.
This dataset contains 186,160 papers and 1,233 categories from the ACM CCS tree.
Our best model achieves a Macro-F1 score of 34.57% which shows that this dataset provides significant research opportunities.
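The Macro-F1 score reported above averages the per-class F1 scores with equal weight, so rare categories count as much as frequent ones. A minimal sketch with hypothetical toy labels (not data from SciHTC):

```python
def macro_f1(y_true, y_pred):
    """Macro-F1: compute F1 per class, then average unweighted."""
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for lbl in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == lbl and p == lbl)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != lbl and p == lbl)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == lbl and p != lbl)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Toy example: 4 documents, 3 classes.
print(macro_f1(["a", "a", "b", "c"], ["a", "b", "b", "c"]))  # 7/9 ≈ 0.778
```

With 1,233 categories, many of them sparsely populated, the unweighted average explains why a Macro-F1 of 34.57% can still be a strong baseline.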
arXiv Detail & Related papers (2022-11-05T04:12:57Z)
- Association Graph Learning for Multi-Task Classification with Category Shifts [68.58829338426712]
We focus on multi-task classification, where related classification tasks share the same label space and are learned simultaneously.
We learn an association graph to transfer knowledge among tasks for missing classes.
Our method consistently performs better than representative baselines.
arXiv Detail & Related papers (2022-10-10T12:37:41Z)
- Out-of-Category Document Identification Using Target-Category Names as Weak Supervision [64.671654559798]
Out-of-category detection aims to distinguish documents according to their semantic relevance to the inlier (or target) categories.
We present an out-of-category detection framework, which effectively measures how confidently each document belongs to one of the target categories.
arXiv Detail & Related papers (2021-11-24T21:01:25Z)
- LeQua@CLEF2022: Learning to Quantify [76.22817970624875]
LeQua 2022 is a new lab for the evaluation of methods for "learning to quantify" in textual datasets.
The goal of this lab is to provide a setting for the comparative evaluation of methods for learning to quantify, both in the binary setting and in the single-label multiclass setting.
arXiv Detail & Related papers (2021-11-22T14:54:20Z)
- Conical Classification For Computationally Efficient One-Class Topic Determination [0.0]
We propose a Conical classification approach to identify documents that relate to a particular topic.
We show in our analysis that our approach has higher predictive power on our datasets, and is also faster to compute.
arXiv Detail & Related papers (2021-10-31T01:27:12Z)
- Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations [0.456877715768796]
We investigate mathematical content representations suitable for the automated classification of and the similarity search in STEM documents.
The methods are evaluated on a subset of arXiv.org papers with the Mathematics Subject Classification (MSC) as a reference classification.
arXiv Detail & Related papers (2021-10-08T11:27:40Z)
- Generalized Funnelling: Ensemble Learning and Heterogeneous Document Embeddings for Cross-Lingual Text Classification [78.83284164605473]
Funnelling (Fun) is a recently proposed method for cross-lingual text classification.
We describe Generalized Funnelling (gFun) as a generalization of Fun.
We show that gFun substantially improves over Fun and over state-of-the-art baselines.
arXiv Detail & Related papers (2021-09-17T23:33:04Z)
- DocSCAN: Unsupervised Text Classification via Learning from Neighbors [2.2082422928825145]
We introduce DocSCAN, a completely unsupervised text classification approach using Semantic Clustering by Adopting Nearest-Neighbors (SCAN).
For each document, we obtain semantically informative vectors from a large pre-trained language model. Similar documents have proximate vectors, so neighbors in the representation space tend to share topic labels.
Our learnable clustering approach uses pairs of neighboring datapoints as a weak learning signal. The proposed approach learns to assign classes to the whole dataset without provided ground-truth labels.
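The weak signal described above rests on finding each document's nearest neighbors in the embedding space, typically by cosine similarity. A minimal sketch with hypothetical 2-D toy "embeddings" (real DocSCAN uses high-dimensional vectors from a pre-trained language model):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest_neighbor(vectors, i):
    """Index of the most similar other vector; DocSCAN pairs each
    document with such neighbors as a weak learning signal."""
    return max((j for j in range(len(vectors)) if j != i),
               key=lambda j: cosine(vectors[i], vectors[j]))

# Toy embeddings: docs 0 and 1 point the same way, doc 2 does not.
vecs = [(1.0, 0.1), (0.9, 0.2), (0.1, 1.0)]
print(nearest_neighbor(vecs, 0))  # 1
```

The clustering objective then encourages each document and its neighbors to receive the same class assignment, which is how topic labels emerge without ground truth.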
arXiv Detail & Related papers (2021-05-09T21:20:31Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- X-Class: Text Classification with Extremely Weak Supervision [39.25777650619999]
In this paper, we explore text classification with extremely weak supervision.
We propose a novel framework, X-Class, to realize adaptive representations.
X-Class can rival and even outperform seed-driven weakly supervised methods on 7 benchmark datasets.
arXiv Detail & Related papers (2020-10-24T06:09:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.