Disambiguating Symbolic Expressions in Informal Documents
- URL: http://arxiv.org/abs/2101.11716v1
- Date: Mon, 25 Jan 2021 10:14:37 GMT
- Title: Disambiguating Symbolic Expressions in Informal Documents
- Authors: Dennis Müller and Cezary Kaliszyk
- Abstract summary: We present a dataset with roughly 33,000 entries.
We describe a methodology using a transformer language model pre-trained on sources obtained from arxiv.org.
We evaluate our model using a plurality of dedicated techniques, taking the syntax and semantics of symbolic expressions into account.
- Score: 2.423990103106667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose the task of disambiguating symbolic expressions in informal STEM
documents in the form of LaTeX files - that is, determining their precise
semantics and abstract syntax tree - as a neural machine translation task. We
discuss the distinct challenges involved and present a dataset with roughly
33,000 entries. We evaluated several baseline models on this dataset, which
failed to yield even syntactically valid LaTeX before overfitting.
Consequently, we describe a methodology using a transformer language model
pre-trained on sources obtained from arxiv.org, which yields promising results
despite the small size of the dataset. We evaluate our model using a plurality
of dedicated techniques, taking the syntax and semantics of symbolic
expressions into account.
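As a minimal sketch of the task formulation (not the authors' actual setup), the snippet below casts disambiguation as sequence-to-sequence translation with a pre-trained transformer; "t5-small" is a stand-in for the paper's arxiv.org-pre-trained model, and the annotated target notation is illustrative.

```python
# Minimal sketch: disambiguation as neural machine translation from
# ambiguous LaTeX to a semantically annotated form. "t5-small" is a
# stand-in; the paper's model is a transformer pre-trained on arxiv.org
# sources, and the target notation shown is illustrative.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

# An ambiguous expression: "(a,b)" may denote a pair, an open interval,
# or a greatest common divisor, depending on context.
src = r"Let $d = (a,b)$ be the greatest common divisor."
inputs = tokenizer("disambiguate: " + src, return_tensors="pt")

# After fine-tuning on (LaTeX, annotated-expression) pairs, generate()
# would emit something like \gcd{a}{b}; the untuned model's output here
# is meaningless and only demonstrates the interface.
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```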
Related papers
- Detection and Measurement of Syntactic Templates in Generated Text [58.111650675717414]
We offer an analysis of syntactic features to characterize general repetition in models.
We find that models produce templated text in downstream tasks at a higher rate than is found in human-reference texts (a minimal sketch of template detection appears after this list).
arXiv Detail & Related papers (2024-06-28T19:34:23Z)
- To token or not to token: A Comparative Study of Text Representations for Cross-Lingual Transfer [23.777874316083984]
We propose a Language Quotient scoring metric that provides a weighted combination of zero-shot and few-shot evaluation.
Our analysis reveals that image-based models excel in cross-lingual transfer when languages are closely related and share visually similar scripts.
In dependency parsing tasks, where word relationships play a crucial role, models with a character-level focus outperform others.
arXiv Detail & Related papers (2023-10-12T06:59:10Z)
- Syntax-Aware Network for Handwritten Mathematical Expression Recognition [53.130826547287626]
Handwritten mathematical expression recognition (HMER) is a challenging task that has many potential applications.
Recent methods for HMER have achieved outstanding performance with an encoder-decoder architecture.
We propose a simple and efficient method for HMER, which is the first to incorporate syntax information into an encoder-decoder network.
arXiv Detail & Related papers (2022-03-03T09:57:19Z)
- Compositionality as Lexical Symmetry [42.37422271002712]
In tasks like semantic parsing, instruction following, and question answering, standard deep networks fail to generalize compositionally from small datasets.
We present a domain-general and model-agnostic formulation of compositionality as a constraint on symmetries of data distributions rather than models.
We describe a procedure called LEXSYM that discovers these transformations automatically, then applies them to training data for ordinary neural sequence models (a toy augmentation sketch appears after this list).
arXiv Detail & Related papers (2022-01-30T21:44:46Z)
- Unsupervised Training Data Generation of Handwritten Formulas using Generative Adversarial Networks with Self-Attention [3.785514121306353]
We introduce a system that creates a large set of synthesized training examples of mathematical expressions which are derived from documents.
For this purpose, we propose a novel attention-based generative adversarial network to translate rendered equations to handwritten formulas.
The datasets generated by this approach contain hundreds of thousands of formulas, making them well suited for pretraining or for designing more complex models.
arXiv Detail & Related papers (2021-06-17T12:27:18Z)
- Robust Document Representations using Latent Topics and Metadata [17.306088038339336]
We propose a novel approach to fine-tuning a pre-trained neural language model for document classification problems.
We generate document representations that capture both text and metadata artifacts in a task-specific manner.
Our solution also incorporates metadata explicitly rather than merely appending it to the text.
arXiv Detail & Related papers (2020-10-23T21:52:38Z)
- Exemplar-Controllable Paraphrasing and Translation using Bitext [57.92051459102902]
We adapt models from prior work to learn solely from bilingual text (bitext).
Our single proposed model can perform four tasks: controlled paraphrase generation in both languages and controlled machine translation in both language directions.
arXiv Detail & Related papers (2020-10-12T17:02:50Z)
- ToTTo: A Controlled Table-To-Text Generation Dataset [61.83159452483026]
ToTTo is an open-domain English table-to-text dataset with over 120,000 training examples.
We introduce a dataset construction process where annotators directly revise existing candidate sentences from Wikipedia.
While usually fluent, existing methods often hallucinate phrases that are not supported by the table.
arXiv Detail & Related papers (2020-04-29T17:53:45Z)
- Extractive Summarization as Text Matching [123.09816729675838]
This paper creates a paradigm shift with regard to the way we build neural extractive summarization systems.
We formulate the extractive summarization task as a semantic text matching problem.
We have driven the state-of-the-art extractive result on CNN/DailyMail to a new level (44.41 in ROUGE-1); a minimal matching sketch appears after this list.
arXiv Detail & Related papers (2020-04-19T08:27:57Z)
- Pre-training for Abstractive Document Summarization by Reinstating Source Text [105.77348528847337]
This paper presents three pre-training objectives which allow us to pre-train a Seq2Seq-based abstractive summarization model on unlabeled text.
Experiments on two benchmark summarization datasets show that all three objectives improve performance over baselines (one objective is sketched after this list).
arXiv Detail & Related papers (2020-04-04T05:06:26Z)
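On "Detection and Measurement of Syntactic Templates in Generated Text": the paper characterizes templated text via part-of-speech n-grams; the dependency-free sketch below substitutes a coarse word-class abstraction for POS tags, and the word list, n-gram length, and examples are illustrative assumptions.

```python
# Simplified template detection: keep function words, abstract content
# words to a "_" slot, and count recurring n-grams across model outputs.
# The paper uses POS n-grams; this word-class proxy is an assumption.
from collections import Counter

FUNCTION_WORDS = {"the", "a", "an", "and", "was", "is", "of", "in", "to"}

def template_ngrams(text, n=4):
    tokens = [t.lower().strip(".,") for t in text.split()]
    shape = [t if t in FUNCTION_WORDS else "_" for t in tokens]
    return [tuple(shape[i:i + n]) for i in range(len(shape) - n + 1)]

outputs = [
    "The movie was great and the plot was strong.",
    "The service was quick and the staff was friendly.",
    "The hotel was clean and the view was stunning.",
]

counts = Counter(g for text in outputs for g in template_ngrams(text))
# Patterns recurring across (nearly) all outputs are candidate templates.
for gram, c in counts.most_common(3):
    print(c, " ".join(gram))  # e.g. "the _ was _"
```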
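On "Compositionality as Lexical Symmetry": the core idea is that consistently swapping aligned lexical items in an (input, output) pair yields another valid training pair. The toy lexicon and SCAN-style example below are illustrative, not taken from the paper.

```python
# Toy LEXSYM-style augmentation: exchanging two aligned lexicon entries
# in both the input and the output produces a new valid training pair.
def swap(pair, a, b):
    """Exchange entries a and b (word and aligned symbol) in both views."""
    inp, out = pair
    table = {a[0]: b[0], b[0]: a[0], a[1]: b[1], b[1]: a[1]}
    return (" ".join(table.get(t, t) for t in inp.split()),
            " ".join(table.get(t, t) for t in out.split()))

pair = ("jump twice", "JUMP JUMP")
# Aligned lexicon entries: natural-language token -> program symbol.
jump, walk = ("jump", "JUMP"), ("walk", "WALK")

print(swap(pair, jump, walk))  # ('walk twice', 'WALK WALK')
```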
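On "Extractive Summarization as Text Matching": candidate summaries are scored as wholes against the document in a shared semantic space, rather than sentence by sentence. In this sketch, TF-IDF cosine similarity stands in for the paper's learned matching model, and the toy document is invented.

```python
# Summary-level matching: enumerate candidate extracts and pick the one
# closest to the whole document. TF-IDF cosine is a stand-in for the
# paper's learned semantic matching model.
from itertools import combinations
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

document = ("The storm closed the airport on Monday. Hundreds of flights "
            "were cancelled. Officials expect normal service by Friday. "
            "Local hotels reported a surge in bookings.")
sentences = [s.strip() + "." for s in document.split(".") if s.strip()]

# Candidate summaries: all 2-sentence extracts from the document.
candidates = [" ".join(c) for c in combinations(sentences, 2)]

vec = TfidfVectorizer().fit([document] + candidates)
scores = cosine_similarity(vec.transform([document]),
                           vec.transform(candidates))[0]
print(candidates[scores.argmax()])  # extract that best matches the document
```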
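On "Pre-training for Abstractive Document Summarization by Reinstating Source Text": the sketch below illustrates one plausible self-supervised objective of this kind, sentence reordering, where the model must restore the original sentence order; the period-based splitting and toy document are simplifications.

```python
# One reinstatement-style pre-training pair: shuffled sentences as input,
# the original document as the generation target.
import random

def sentence_reordering_pair(document, seed=0):
    """Build a (shuffled_input, original_target) pre-training pair."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    shuffled = sentences[:]
    random.Random(seed).shuffle(shuffled)
    return " . ".join(shuffled), " . ".join(sentences)

doc = "The rover landed safely. It sent back images. Scientists celebrated."
src, tgt = sentence_reordering_pair(doc)
print(src)  # shuffled sentences -> model input
print(tgt)  # original order    -> generation target
```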