Towards Math-Aware Automated Classification and Similarity Search of
Scientific Publications: Methods of Mathematical Content Representations
- URL: http://arxiv.org/abs/2110.04040v1
- Date: Fri, 8 Oct 2021 11:27:40 GMT
- Title: Towards Math-Aware Automated Classification and Similarity Search of
Scientific Publications: Methods of Mathematical Content Representations
- Authors: Michal Růžička, Petr Sojka
- Abstract summary: We investigate mathematical content representations suitable for the automated classification of and the similarity search in STEM documents.
The methods are evaluated on a subset of arXiv.org papers with the Mathematics Subject Classification (MSC) as a reference classification.
- Score: 0.456877715768796
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we investigate mathematical content representations suitable
for the automated classification of and the similarity search in STEM documents
using standard machine learning algorithms: Latent Dirichlet Allocation
(LDA) and Latent Semantic Indexing (LSI). The methods are evaluated on a
subset of arXiv.org papers with the Mathematics Subject Classification (MSC) as
a reference classification and using the standard precision/recall/F1-measure
metrics. The results give insight into how different math representations may
influence the performance of the classification and similarity search tasks in
STEM repositories. Unsurprisingly, machine learning methods are able to capture
distributional semantics from textual tokens. A proper selection of weighted
tokens representing math can slightly improve the quality of the results. A
structured math representation that applies successful text-processing
techniques to math is shown to yield better results than flat TeX tokens.
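To make the pipeline concrete, here is a minimal sketch of an LSI-based similarity search over bag-of-token documents using gensim. The toy corpus, the math token convention, and the TF-IDF weighting are illustrative assumptions, not the paper's exact setup; swapping in LdaModel follows the same pattern.

```python
# A minimal sketch, assuming gensim and a toy corpus. How math is tokenized
# (flat TeX tokens vs. structured tokens) is the experimental variable the
# paper studies, so the "math$...$" tokens below are a placeholder convention.
from gensim import corpora, models, similarities

# Each document is a bag of text tokens plus math tokens (here: flat TeX tokens).
docs = [
    ["group", "homomorphism", "math$\\phi$", "math$\\circ$"],
    ["topological", "space", "math$\\tau$", "compact"],
    ["group", "action", "math$\\phi$", "orbit"],
]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]

# TF-IDF weighting, then LSI; models.LdaModel could be substituted here.
tfidf = models.TfidfModel(corpus)
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=2)

# Similarity search: rank the corpus against a query document.
index = similarities.MatrixSimilarity(lsi[tfidf[corpus]])
query = dictionary.doc2bow(["group", "math$\\phi$"])
print(sorted(enumerate(index[lsi[tfidf[query]]]), key=lambda x: -x[1]))
```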
Related papers
- STEM-POM: Evaluating Language Models Math-Symbol Reasoning in Document Parsing [2.2315518704035595]
We introduce STEM-PoM, a benchmark dataset to evaluate large language models' reasoning abilities on math symbols.
The dataset contains over 2K math symbols classified by their main attributes: variables, constants, operators, and unit descriptors.
Our experiments show that state-of-the-art LLMs achieve an average of 20-60% accuracy under in-context learning and 50-60% accuracy with fine-tuning.
arXiv Detail & Related papers (2024-11-01T06:25:06Z)
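As an illustration of the STEM-PoM task above, here is a hedged sketch of scoring a model on symbol-attribute classification. `call_llm` is a stand-in for any chat-completion client, and the prompt format is an assumption, not the dataset's actual protocol.

```python
# Classify a math symbol, given its document context, into one of the four
# attribute types STEM-PoM uses, then measure accuracy over a labeled set.
ATTRIBUTES = ["variable", "constant", "operator", "unit descriptor"]

def classify_symbol(symbol: str, context: str, call_llm) -> str:
    prompt = (
        f"In the passage below, classify the math symbol '{symbol}' as one of "
        f"{', '.join(ATTRIBUTES)}. Answer with the category only.\n\n{context}"
    )
    answer = call_llm(prompt).strip().lower()
    return answer if answer in ATTRIBUTES else "unknown"

def accuracy(dataset, call_llm) -> float:
    # dataset: iterable of (symbol, context, gold_label) triples.
    hits = sum(
        classify_symbol(s, ctx, call_llm) == gold for s, ctx, gold in dataset
    )
    return hits / len(dataset)
```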
- Semantic Graph Representation Learning for Handwritten Mathematical Expression Recognition [57.60390958736775]
We propose a simple but efficient method to enhance semantic interaction learning (SIL).
We first construct a semantic graph based on the statistical symbol co-occurrence probabilities.
Then we design a semantic aware module (SAM), which projects the visual and classification feature into semantic space.
Our method achieves better recognition performance than prior arts on both CROHME and HME100K datasets.
arXiv Detail & Related papers (2023-08-21T06:23:41Z)
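The statistical step of the summary above, building a semantic graph from symbol co-occurrence probabilities, can be sketched as follows. The label sequences are toy data; the paper's semantic-aware module and visual features are out of scope here.

```python
# Estimate symbol co-occurrence probabilities from label sequences and use
# them as edge weights of a semantic graph.
from collections import Counter
from itertools import combinations

label_sequences = [["x", "+", "y"], ["x", "^", "2"], ["y", "+", "2"]]

pair_counts = Counter()
symbol_counts = Counter()
for seq in label_sequences:
    symbol_counts.update(set(seq))
    for a, b in combinations(sorted(set(seq)), 2):
        pair_counts[(a, b)] += 1

# Conditional co-occurrence probabilities as directed edge weights.
edges = {}
for (a, b), n in pair_counts.items():
    edges[(a, b)] = n / symbol_counts[a]  # P(see b | see a)
    edges[(b, a)] = n / symbol_counts[b]  # P(see a | see b)
print(edges)
```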
- Provably Efficient Representation Learning with Tractable Planning in Low-Rank POMDP [81.00800920928621]
We study representation learning in partially observable Markov decision processes (POMDPs).
We first present an algorithm for decodable POMDPs that combines maximum likelihood estimation (MLE) and optimism in the face of uncertainty (OFU).
We then show how to adapt this algorithm to also work in the broader class of $\gamma$-observable POMDPs.
arXiv Detail & Related papers (2023-06-21T16:04:03Z)
- Learning Context-aware Classifier for Semantic Segmentation [88.88198210948426]
In this paper, contextual hints are exploited by learning a context-aware classifier.
Our method is model-agnostic and can be easily applied to generic segmentation models.
With only negligible additional parameters and +2% inference time, a decent performance gain is achieved on both small and large models.
arXiv Detail & Related papers (2023-03-21T07:00:35Z)
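One plausible reading of a context-aware classifier, as in the paper above, is to generate per-image classifier weights from pooled features and add them to a static head. The PyTorch sketch below shows that idea; the shapes and the weight generator are assumptions, not the authors' architecture.

```python
# Generate per-image classifier weights from pooled context features instead
# of relying only on one static weight matrix shared across all images.
import torch
import torch.nn as nn

class ContextAwareClassifier(nn.Module):
    def __init__(self, channels: int, num_classes: int):
        super().__init__()
        self.static = nn.Conv2d(channels, num_classes, kernel_size=1)
        # Maps pooled context to a residual, per-image weight matrix.
        self.generator = nn.Linear(channels, num_classes * channels)
        self.num_classes, self.channels = num_classes, channels

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        b, c, h, w = feats.shape
        context = feats.mean(dim=(2, 3))            # (B, C) pooled context
        dyn_w = self.generator(context).view(b, self.num_classes, c)
        static_logits = self.static(feats)          # (B, K, H, W)
        dyn_logits = torch.einsum("bkc,bchw->bkhw", dyn_w, feats)
        return static_logits + dyn_logits

logits = ContextAwareClassifier(16, 4)(torch.randn(2, 16, 8, 8))
print(logits.shape)  # torch.Size([2, 4, 8, 8])
```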
- Unified Functional Hashing in Automatic Machine Learning [58.77232199682271]
We show that large efficiency gains can be obtained by employing a fast unified functional hash.
Our hash is "functional" in that it identifies equivalent candidates even if they were represented or coded differently.
We show dramatic improvements on multiple AutoML domains, including neural architecture search and algorithm discovery.
arXiv Detail & Related papers (2023-02-10T18:50:37Z)
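The functional-hash idea above can be sketched in a few lines: hash a candidate by its outputs on a fixed probe set, so that syntactically different but behaviorally equivalent candidates collide. The probe inputs and rounding are illustrative assumptions.

```python
# Two candidate programs hash the same iff they produce the same outputs on
# a fixed probe set, deduplicating equivalent candidates in a search.
import hashlib

PROBES = [0.0, 1.0, -1.5, 3.14159, 42.0]

def functional_hash(fn) -> str:
    outputs = []
    for x in PROBES:
        try:
            outputs.append(round(fn(x), 9))
        except Exception:
            outputs.append("error")  # failures are part of the signature
    return hashlib.sha256(repr(outputs).encode()).hexdigest()

# Differently coded but equivalent candidates collide, as intended:
f = lambda x: 2 * x + 2
g = lambda x: 2 * (x + 1)
assert functional_hash(f) == functional_hash(g)
print(functional_hash(f)[:16])
```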
- Self-Supervised Pretraining of Graph Neural Network for the Retrieval of Related Mathematical Expressions in Scientific Articles [8.942112181408156]
We propose a new approach for retrieval of mathematical expressions based on machine learning.
We design an unsupervised representation learning task that combines embedding learning with self-supervised learning.
We collect a huge dataset with over 29 million mathematical expressions from over 900,000 publications on arXiv.org.
arXiv Detail & Related papers (2022-08-22T12:11:30Z)
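A common instantiation of combining embedding learning with self-supervised learning, as the summary above describes, is a contrastive (InfoNCE) objective over two augmented views of each expression. The sketch below uses a toy MLP encoder and Gaussian-noise augmentation as stand-ins for the paper's graph neural network over expression trees.

```python
# Pull two augmented views of the same expression together in embedding
# space and push other expressions apart (InfoNCE loss).
import torch
import torch.nn.functional as F

def info_nce(z1: torch.Tensor, z2: torch.Tensor, tau: float = 0.1):
    # z1[i] and z2[i] are embeddings of two views of expression i.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau          # (N, N) similarity matrix
    targets = torch.arange(z1.size(0))  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

encoder = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(),
                              torch.nn.Linear(64, 16))
x = torch.randn(8, 32)                  # 8 toy expression feature vectors
loss = info_nce(encoder(x + 0.1 * torch.randn_like(x)),
                encoder(x + 0.1 * torch.randn_like(x)))
loss.backward()                         # gradients flow to the encoder
print(float(loss))
```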
- Syntax-Aware Network for Handwritten Mathematical Expression Recognition [53.130826547287626]
Handwritten mathematical expression recognition (HMER) is a challenging task that has many potential applications.
Recent methods for HMER have achieved outstanding performance with an encoder-decoder architecture.
We propose a simple and efficient method for HMER, which is the first to incorporate syntax information into an encoder-decoder network.
arXiv Detail & Related papers (2022-03-03T09:57:19Z)
- How Fine-Tuning Allows for Effective Meta-Learning [50.17896588738377]
We present a theoretical framework for analyzing representations derived from a MAML-like algorithm.
We provide risk bounds on the best predictor found by fine-tuning via gradient descent, demonstrating that the algorithm can provably leverage the shared structure.
The resulting separation underscores the benefit of fine-tuning-based methods, such as MAML, over methods with "frozen representation" objectives in few-shot learning.
arXiv Detail & Related papers (2021-05-05T17:56:00Z)
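The analyzed setting above can be sketched as fine-tuning a predictor on top of a shared representation by gradient descent. The toy example below adapts both the representation and the head, which is what distinguishes MAML-style fine-tuning from a "frozen representation" baseline that would update the head only; dimensions and data are assumptions.

```python
# Fine-tune a shared (meta-learned) representation plus a task-specific head
# on few-shot data by plain gradient descent.
import torch

torch.manual_seed(0)
rep = torch.nn.Linear(10, 5)   # shared representation (meta-learned)
head = torch.nn.Linear(5, 1)   # task-specific predictor

x, y = torch.randn(20, 10), torch.randn(20, 1)  # few-shot task data
opt = torch.optim.SGD(list(rep.parameters()) + list(head.parameters()),
                      lr=0.05)

for _ in range(100):
    opt.zero_grad()
    # Fine-tuning updates rep AND head; the weaker "frozen representation"
    # baseline would exclude rep.parameters() from the optimizer above.
    loss = torch.nn.functional.mse_loss(head(rep(x)), y)
    loss.backward()
    opt.step()
print(float(loss))
```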
- AutoMSC: Automatic Assignment of Mathematics Subject Classification Labels [4.001125251113153]
We investigate the feasibility of automatically assigning a coarse-grained primary classification using the Mathematics Subject Classification scheme.
We find that our method achieves an $F_1$-score of over 77%, which is remarkably close to the agreement between zbMATH and MR.
arXiv Detail & Related papers (2020-05-25T13:26:45Z)
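Coarse-grained primary MSC assignment, as in the paper above, can be prototyped as plain text classification. The scikit-learn pipeline below on toy abstracts is a sketch under that assumption, not the authors' system, which works with zbMATH and MR data.

```python
# Predict a coarse top-level MSC code from abstract text with a
# TF-IDF + logistic regression pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

abstracts = [
    "finite groups and their representations",
    "partial differential equations with boundary conditions",
    "group cohomology and representation theory",
    "existence of solutions to elliptic equations",
]
primary_msc = ["20", "35", "20", "35"]  # coarse top-level MSC codes

clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(abstracts, primary_msc)
print(clf.predict(["cohomology of finite groups"]))  # expected: ['20']
```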
- Classification and Clustering of arXiv Documents, Sections, and Abstracts, Comparing Encodings of Natural and Mathematical Language [8.522576207528017]
We show how selecting and combining encodings of natural and mathematical language affect classification and clustering of documents with mathematical content.
Our encodings achieve classification accuracies up to 82.8% and cluster purities up to 69.4%.
We show that the computer outperforms a human expert when classifying documents.
arXiv Detail & Related papers (2020-05-22T06:16:32Z)
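One simple way to combine encodings of natural and mathematical language, as studied in the paper above, is to encode the two streams separately and concatenate the feature vectors. The sketch below uses TF-IDF stand-ins for the paper's learned encodings.

```python
# Encode the natural-language part and the math part of each document
# separately, then concatenate the vectors before classification/clustering.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

text_parts = ["we study compact operators", "we bound the mixing time"]
math_parts = ["\\|T\\| \\leq C", "t_{mix} = O(n \\log n)"]

text_vec = TfidfVectorizer().fit(text_parts)
math_vec = TfidfVectorizer(token_pattern=r"\S+").fit(math_parts)  # keep TeX tokens

combined = np.hstack([
    text_vec.transform(text_parts).toarray(),
    math_vec.transform(math_parts).toarray(),
])
print(combined.shape)  # one row per document, text + math features
```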
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.