Self-Supervised Pretraining of Graph Neural Network for the Retrieval of
Related Mathematical Expressions in Scientific Articles
- URL: http://arxiv.org/abs/2209.00446v1
- Date: Mon, 22 Aug 2022 12:11:30 GMT
- Title: Self-Supervised Pretraining of Graph Neural Network for the Retrieval of
Related Mathematical Expressions in Scientific Articles
- Authors: Lukas Pfahler, Katharina Morik
- Abstract summary: We propose a new approach for retrieval of mathematical expressions based on machine learning.
We design an unsupervised representation learning task that combines embedding learning with self-supervised learning.
We collect a large dataset of over 29 million mathematical expressions from over 900,000 publications on arXiv.org.
- Score: 8.942112181408156
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Given the growing number of publications, searching for relevant
papers becomes tedious. In particular, search across disciplines or schools of
thought is not well supported. This is mainly due to retrieval with keyword
queries: technical terms differ across sciences and over time. Relevant
articles might better be identified by their mathematical problem
descriptions. Just looking at the equations in a paper already hints at
whether the paper is relevant. Hence, we propose a new approach for the
retrieval of mathematical expressions based on machine learning. We design an
unsupervised representation learning task that combines embedding learning
with self-supervised learning. Using graph convolutional neural networks, we
embed mathematical expressions into low-dimensional vector spaces that allow
efficient nearest-neighbor queries. To train our models, we collect a large
dataset of over 29 million mathematical expressions from over 900,000
publications on arXiv.org. The math is converted into an XML format, which we
view as graph data. Our empirical evaluations, involving a new dataset of
manually annotated search queries, show the benefits of using embedding models
for mathematical retrieval.
This work was originally published at KDD 2020.
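As a rough illustration of the pipeline described above, the sketch below parses a MathML expression into a graph, embeds it with a small mean-aggregation graph convolution, and compares expressions by cosine similarity for nearest-neighbor retrieval. This is a minimal sketch under stated assumptions, not the authors' implementation: the feature hashing, layer sizes, and untrained random weights are placeholders for the model that the paper pretrains with its self-supervised objective.

```python
# Minimal sketch, not the authors' released code: parse a MathML expression
# into a graph, embed it with an untrained mean-aggregation graph convolution,
# and compare expressions by cosine similarity. Feature hashing, dimensions,
# and random weights are illustrative assumptions only.
import xml.etree.ElementTree as ET
import numpy as np


def mathml_to_graph(mathml: str):
    """Return hashed node features and an undirected edge list for the XML tree."""
    root = ET.fromstring(mathml)
    nodes, edges = [], []

    def visit(elem, parent):
        idx = len(nodes)
        # Crude node feature: hash of tag name and text content into 1000 buckets.
        nodes.append(hash((elem.tag, (elem.text or "").strip())) % 1000)
        if parent is not None:
            edges.extend([(parent, idx), (idx, parent)])
        for child in elem:
            visit(child, idx)

    visit(root, None)
    return np.array(nodes), np.array(edges, dtype=int).reshape(-1, 2)


def embed(nodes, edges, dim=64, layers=2, seed=0):
    """Graph-level embedding: lookup table, mean-aggregation convolutions, mean pooling."""
    rng = np.random.default_rng(seed)          # same seed -> same projections for all expressions
    table = rng.standard_normal((1000, dim))   # random embedding table for hashed node features
    weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(layers)]
    x = table[nodes]
    for w in weights:
        agg = x.copy()                         # include the node itself
        if len(edges):
            np.add.at(agg, edges[:, 1], x[edges[:, 0]])                       # sum neighbor features
            agg /= (np.bincount(edges[:, 1], minlength=len(x)) + 1)[:, None]  # mean aggregation
        x = np.tanh(agg @ w)
    vec = x.mean(axis=0)                       # pool node states into one graph vector
    return vec / np.linalg.norm(vec)


# Usage: embed two expressions and compare them with a cosine similarity score.
a = embed(*mathml_to_graph("<mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow>"))
b = embed(*mathml_to_graph("<mrow><mi>y</mi><mo>+</mo><mn>2</mn></mrow>"))
print("cosine similarity:", float(a @ b))
```

In the published approach, the graph encoder is trained on the 29-million-expression corpus so that related expressions map to nearby vectors; the random projections above only stand in for those learned weights.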
Related papers
- Automated conjecturing in mathematics with TxGraffiti [0.0]
TxGraffiti is a data-driven computer program developed to automate the process of generating conjectures.
We present the design and core principles of TxGraffiti, including its roots in the original Graffiti program.
arXiv Detail & Related papers (2024-09-28T15:06:31Z)
- Discovering symbolic expressions with parallelized tree search [59.92040079807524]
Symbolic regression plays a crucial role in scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data.
For over a decade, existing algorithms have faced a critical bottleneck in accuracy and efficiency when handling complex problems.
We introduce a parallelized tree search (PTS) model to efficiently distill generic mathematical expressions from limited data.
arXiv Detail & Related papers (2024-07-05T10:41:15Z)
- Artificial intelligence and machine learning generated conjectures with TxGraffiti [0.0]
We outline the machine learning techniques implemented by TxGraffiti.
We also announce a new online version of the program available for anyone curious to explore conjectures in graph theory.
arXiv Detail & Related papers (2024-07-03T01:03:09Z)
- OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text [32.15651290548974]
We introduce OpenWebMath, an open dataset of 14.7B tokens of mathematical web text extracted from Common Crawl.
We run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data.
arXiv Detail & Related papers (2023-10-10T16:57:28Z)
- Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general purpose knowledge graphs (KGs) with millions of entities and thousands of relation types.
We focus on models which are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z)
- Scientific Paper Extractive Summarization Enhanced by Citation Graphs [50.19266650000948]
We focus on leveraging citation graphs to improve scientific paper extractive summarization under different settings.
Preliminary results demonstrate that the citation graph is helpful even in a simple unsupervised framework.
Motivated by this, we propose a Graph-based Supervised Summarization model (GSS) to achieve more accurate results on the task when large-scale labeled data are available.
arXiv Detail & Related papers (2022-12-08T11:53:12Z)
- Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations [0.456877715768796]
We investigate mathematical content representations suitable for the automated classification of and the similarity search in STEM documents.
The methods are evaluated on a subset of arXiv.org papers with the Mathematics Subject Classification (MSC) as a reference classification.
arXiv Detail & Related papers (2021-10-08T11:27:40Z)
- Temporal Graph Network Embedding with Causal Anonymous Walks Representations [54.05212871508062]
We propose a novel approach for dynamic network representation learning based on Temporal Graph Network.
We also provide a benchmark pipeline for the evaluation of temporal network embeddings.
We show the applicability and superior performance of our model in the real-world downstream graph machine learning task provided by one of the top European banks.
arXiv Detail & Related papers (2021-08-19T15:39:52Z)
- Learning to Match Mathematical Statements with Proofs [37.38969121408295]
The task is designed to improve the processing of research-level mathematical texts.
We release a dataset for the task, consisting of over 180k statement-proof pairs.
We show that considering the assignment problem globally and using weighted bipartite matching algorithms helps a lot in tackling the task.
arXiv Detail & Related papers (2021-02-03T15:38:54Z)
- Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated promising results on this canonical text classification task.
Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interaction between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)
- Machine Number Sense: A Dataset of Visual Arithmetic Problems for Abstract and Relational Reasoning [95.18337034090648]
We propose a dataset, Machine Number Sense (MNS), consisting of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG).
These visual arithmetic problems are in the form of geometric figures.
We benchmark the MNS dataset using four predominant neural network models as baselines in this visual reasoning task.
arXiv Detail & Related papers (2020-04-25T17:14:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.