Self-Supervised Pretraining of Graph Neural Network for the Retrieval of
Related Mathematical Expressions in Scientific Articles
- URL: http://arxiv.org/abs/2209.00446v1
- Date: Mon, 22 Aug 2022 12:11:30 GMT
- Title: Self-Supervised Pretraining of Graph Neural Network for the Retrieval of
Related Mathematical Expressions in Scientific Articles
- Authors: Lukas Pfahler, Katharina Morik
- Abstract summary: We propose a new approach for retrieval of mathematical expressions based on machine learning.
We design an unsupervised representation learning task that combines embedding learning with self-supervised learning.
We collect a large dataset of over 29 million mathematical expressions from over 900,000 publications on arXiv.org.
- Score: 8.942112181408156
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Given the growing number of publications, searching for relevant
papers becomes tedious. In particular, search across disciplines or schools of
thought is not well supported. This is mainly due to retrieval with keyword
queries: technical terms differ across sciences and over time. Relevant
articles might better be identified by their mathematical problem
descriptions. Just looking at the equations in a paper already hints at
whether the paper is relevant. Hence, we propose a new approach for the
retrieval of mathematical expressions based on machine learning. We design an
unsupervised representation learning task that combines embedding learning
with self-supervised learning. Using graph convolutional neural networks, we
embed mathematical expressions into low-dimensional vector spaces that allow
efficient nearest-neighbor queries. To train our models, we collect a large
dataset of over 29 million mathematical expressions from over 900,000
publications on arXiv.org. The math is converted into an XML format, which we
view as graph data. Our empirical evaluations, involving a new dataset of
manually annotated search queries, show the benefits of using embedding models
for mathematical retrieval.
This work was originally published at KDD 2020.
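As a rough illustration of the pipeline described above, the sketch below parses a MathML expression into a graph, embeds it with a small mean-aggregation graph convolution, and compares expressions by cosine similarity for nearest-neighbor retrieval. This is a minimal sketch under stated assumptions, not the authors' implementation: the feature hashing, layer sizes, and untrained random weights are placeholders for the model that the paper pretrains with its self-supervised objective.

```python
# Minimal sketch, not the authors' released code: parse a MathML expression
# into a graph, embed it with an untrained mean-aggregation graph convolution,
# and compare expressions by cosine similarity. Feature hashing, dimensions,
# and random weights are illustrative assumptions only.
import xml.etree.ElementTree as ET
import numpy as np


def mathml_to_graph(mathml: str):
    """Return hashed node features and an undirected edge list for the XML tree."""
    root = ET.fromstring(mathml)
    nodes, edges = [], []

    def visit(elem, parent):
        idx = len(nodes)
        # Crude node feature: hash of tag name and text content into 1000 buckets.
        nodes.append(hash((elem.tag, (elem.text or "").strip())) % 1000)
        if parent is not None:
            edges.extend([(parent, idx), (idx, parent)])
        for child in elem:
            visit(child, idx)

    visit(root, None)
    return np.array(nodes), np.array(edges, dtype=int).reshape(-1, 2)


def embed(nodes, edges, dim=64, layers=2, seed=0):
    """Graph-level embedding: lookup table, mean-aggregation convolutions, mean pooling."""
    rng = np.random.default_rng(seed)          # same seed -> same projections for all expressions
    table = rng.standard_normal((1000, dim))   # random embedding table for hashed node features
    weights = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(layers)]
    x = table[nodes]
    for w in weights:
        agg = x.copy()                         # include the node itself
        if len(edges):
            np.add.at(agg, edges[:, 1], x[edges[:, 0]])                       # sum neighbor features
            agg /= (np.bincount(edges[:, 1], minlength=len(x)) + 1)[:, None]  # mean aggregation
        x = np.tanh(agg @ w)
    vec = x.mean(axis=0)                       # pool node states into one graph vector
    return vec / np.linalg.norm(vec)


# Usage: embed two expressions and compare them with a cosine similarity score.
a = embed(*mathml_to_graph("<mrow><mi>x</mi><mo>+</mo><mn>1</mn></mrow>"))
b = embed(*mathml_to_graph("<mrow><mi>y</mi><mo>+</mo><mn>2</mn></mrow>"))
print("cosine similarity:", float(a @ b))
```

In the published approach, the graph encoder is trained on the 29-million-expression corpus so that related expressions map to nearby vectors; the random projections above only stand in for those learned weights.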
Related papers
- Automated conjecturing in mathematics with TxGraffiti [0.0]
TxGraffiti is a data-driven computer program developed to automate the process of generating conjectures.
We present the design and core principles of TxGraffiti, including its roots in the original Graffiti program.
arXiv Detail & Related papers (2024-09-28T15:06:31Z)
- Discovering symbolic expressions with parallelized tree search [59.92040079807524]
Symbolic regression plays a crucial role in scientific research thanks to its capability of discovering concise and interpretable mathematical expressions from data.
For over a decade, existing algorithms have faced a critical bottleneck in accuracy and efficiency when handling complex problems.
We introduce a parallelized tree search (PTS) model to efficiently distill generic mathematical expressions from limited data.
arXiv Detail & Related papers (2024-07-05T10:41:15Z)
- Artificial intelligence and machine learning generated conjectures with TxGraffiti [0.0]
We outline the machine learning techniques implemented by TxGraffiti.
We also announce a new online version of the program available for anyone curious to explore conjectures in graph theory.
arXiv Detail & Related papers (2024-07-03T01:03:09Z)
- OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text [32.15651290548974]
We introduce OpenWebMath, an open dataset of 14.7B tokens of mathematical web text extracted from Common Crawl.
We run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data.
arXiv Detail & Related papers (2023-10-10T16:57:28Z)
- Conversational Semantic Parsing using Dynamic Context Graphs [68.72121830563906]
We consider the task of conversational semantic parsing over general purpose knowledge graphs (KGs) with millions of entities and thousands of relation types.
We focus on models which are capable of interactively mapping user utterances into executable logical forms.
arXiv Detail & Related papers (2023-05-04T16:04:41Z)
- Scientific Paper Extractive Summarization Enhanced by Citation Graphs [50.19266650000948]
We focus on leveraging citation graphs to improve scientific paper extractive summarization under different settings.
Preliminary results demonstrate that the citation graph is helpful even in a simple unsupervised framework.
Motivated by this, we propose a Graph-based Supervised Summarization model (GSS) to achieve more accurate results on the task when large-scale labeled data are available.
arXiv Detail & Related papers (2022-12-08T11:53:12Z)
- Towards Math-Aware Automated Classification and Similarity Search of Scientific Publications: Methods of Mathematical Content Representations [0.456877715768796]
We investigate mathematical content representations suitable for the automated classification of and the similarity search in STEM documents.
The methods are evaluated on a subset of arXiv.org papers with the Mathematics Subject Classification (MSC) as a reference classification.
arXiv Detail & Related papers (2021-10-08T11:27:40Z)
- Temporal Graph Network Embedding with Causal Anonymous Walks Representations [54.05212871508062]
We propose a novel approach for dynamic network representation learning based on Temporal Graph Network.
We also provide a benchmark pipeline for the evaluation of temporal network embeddings.
We show the applicability and superior performance of our model in the real-world downstream graph machine learning task provided by one of the top European banks.
arXiv Detail & Related papers (2021-08-19T15:39:52Z)
- Learning to Match Mathematical Statements with Proofs [37.38969121408295]
The task is designed to improve the processing of research-level mathematical texts.
We release a dataset for the task, consisting of over 180k statement-proof pairs.
We show that considering the assignment problem globally and using weighted bipartite matching algorithms helps a lot in tackling the task.
arXiv Detail & Related papers (2021-02-03T15:38:54Z)
- Be More with Less: Hypergraph Attention Networks for Inductive Text Classification [56.98218530073927]
Graph neural networks (GNNs) have received increasing attention in the research community and demonstrated promising results on this canonical text classification task.
Despite the success, their performance could be largely jeopardized in practice since they are unable to capture high-order interaction between words.
We propose a principled model -- hypergraph attention networks (HyperGAT) which can obtain more expressive power with less computational consumption for text representation learning.
arXiv Detail & Related papers (2020-11-01T00:21:59Z)
- Machine Number Sense: A Dataset of Visual Arithmetic Problems for Abstract and Relational Reasoning [95.18337034090648]
We propose a dataset, Machine Number Sense (MNS), consisting of visual arithmetic problems automatically generated using a grammar model, the And-Or Graph (AOG).
These visual arithmetic problems are in the form of geometric figures.
We benchmark the MNS dataset using four predominant neural network models as baselines in this visual reasoning task.
arXiv Detail & Related papers (2020-04-25T17:14:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.