Machine Translation of Mathematical Text
- URL: http://arxiv.org/abs/2010.05229v1
- Date: Sun, 11 Oct 2020 11:59:40 GMT
- Title: Machine Translation of Mathematical Text
- Authors: Aditya Ohri and Tanya Schmah
- Abstract summary: We have implemented a machine translation system, the PolyMath Translator, for documents containing mathematical text.
The current implementation translates English to French, attaining a BLEU score of 53.5 on a held-out test corpus of mathematical sentences.
It produces documents that can be compiled to PDF without further editing.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We have implemented a machine translation system, the PolyMath Translator,
for LaTeX documents containing mathematical text. The current implementation
translates English LaTeX to French LaTeX, attaining a BLEU score of 53.5 on a
held-out test corpus of mathematical sentences. It produces LaTeX documents
that can be compiled to PDF without further editing. The system first converts
the body of an input LaTeX document into English sentences containing math
tokens, using the pandoc universal document converter to parse LaTeX input. We
have trained a Transformer-based translator model, using OpenNMT, on a combined
corpus containing a small proportion of domain-specific sentences. Our full
system uses both this Transformer model and Google Translate, the latter being
used as a backup to better handle linguistic features that do not appear in our
training dataset. If the Transformer model does not have confidence in its
translation, as determined by a high perplexity score, then we use Google
Translate with a custom glossary. This backup was used 26% of the time on our
test corpus of mathematical sentences. The PolyMath Translator is available as
a web service at www.polymathtrans.ai.
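
The pipeline described in the abstract (replace math with tokens, translate, fall back to a backup translator when perplexity is high) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the regex-based `mask_math` stands in for the pandoc-based parsing, `primary` and `backup` are hypothetical callables standing in for the OpenNMT Transformer model and Google Translate with a glossary, and the `PERPLEXITY_THRESHOLD` value is an assumption, since the abstract does not state the cutoff.

```python
import math
import re
from typing import Callable, List, Sequence, Tuple

# Assumed cutoff for illustration only; the abstract gives no value.
PERPLEXITY_THRESHOLD = 10.0

# Simplified stand-in for real LaTeX parsing: matches inline $...$ math only.
MATH_PATTERN = re.compile(r"\$[^$]+\$")
TOKEN_PATTERN = re.compile(r"MATH(\d+)")


def mask_math(sentence: str) -> Tuple[str, List[str]]:
    """Replace each inline formula with a placeholder token MATH0, MATH1, ..."""
    formulas: List[str] = []

    def _store(match: "re.Match[str]") -> str:
        formulas.append(match.group(0))
        return f"MATH{len(formulas) - 1}"

    return MATH_PATTERN.sub(_store, sentence), formulas


def unmask_math(sentence: str, formulas: Sequence[str]) -> str:
    """Put the original formulas back; unknown tokens are left untouched."""

    def _restore(match: "re.Match[str]") -> str:
        index = int(match.group(1))
        return formulas[index] if index < len(formulas) else match.group(0)

    return TOKEN_PATTERN.sub(_restore, sentence)


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Perplexity as exp of the negative mean per-token log-probability."""
    if not token_log_probs:
        return math.inf  # no evidence of confidence, so force the backup
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def translate(
    sentence: str,
    primary: Callable[[str], Tuple[str, Sequence[float]]],
    backup: Callable[[str], str],
    threshold: float = PERPLEXITY_THRESHOLD,
) -> Tuple[str, bool]:
    """Translate one sentence; return (translation, whether backup was used)."""
    masked, formulas = mask_math(sentence)
    text, log_probs = primary(masked)
    used_backup = perplexity(log_probs) > threshold
    if used_backup:
        text = backup(masked)
    return unmask_math(text, formulas), used_backup
```

Here `primary` is assumed to return the translated text together with per-token log-probabilities, which is one common way to obtain a perplexity score from a sequence model. Counting how often `used_backup` is true over a test corpus would give a figure comparable to the 26% backup rate reported above.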
Related papers
- MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition [2.325171167252542]
We present an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one.
Second, we introduce the real-world dataset realFormula, with MEs extracted from papers.
Third, we developed a MER model, MathNet, based on a convolutional vision transformer, with superior results on all four test sets.
arXiv Detail & Related papers (2024-04-21T14:03:34Z)
- Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math [52.66190891388847]
We introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens.
Our meticulous data collection and processing efforts included a complex suite of preprocessing.
We hope that MathPile can help to enhance the mathematical reasoning abilities of language models.
arXiv Detail & Related papers (2023-12-28T16:55:40Z)
- English to Arabic machine translation of mathematical documents [0.0]
This paper focuses on translating English LaTeX mathematical documents into Arabic LaTeX.
The proposed system leverages a Transformer model as the core of the translation system.
The integration of RyDArab, an Arabic mathematical TeX extension, along with a rule-based translator for Arabic mathematical expressions, contributes to the precise rendering of complex mathematical symbols and equations in the translated output.
arXiv Detail & Related papers (2023-12-02T21:02:07Z)
- Neural Machine Translation for Mathematical Formulae [8.608288231153304]
We tackle the problem of neural machine translation of mathematical formulae between ambiguous presentation languages and unambiguous content languages.
We find that convolutional sequence-to-sequence networks achieve 95.1% and 90.7% exact matches, respectively.
arXiv Detail & Related papers (2023-05-25T19:15:06Z)
- A Bilingual Parallel Corpus with Discourse Annotations [82.07304301996562]
This paper describes BWB, a large parallel corpus first introduced in Jiang et al. (2022), along with an annotated test set.
The BWB corpus consists of Chinese novels translated by experts into English, and the annotated test set is designed to probe the ability of machine translation systems to model various discourse phenomena.
arXiv Detail & Related papers (2022-10-26T12:33:53Z)
- BitextEdit: Automatic Bitext Editing for Improved Low-Resource Machine Translation [53.55009917938002]
We propose to refine the mined bitexts via automatic editing.
Experiments demonstrate that our approach successfully improves the quality of CCMatrix mined bitext for 5 low-resource language-pairs and 10 translation directions by up to 8 BLEU points.
arXiv Detail & Related papers (2021-11-12T16:00:39Z)
- WeTS: A Benchmark for Translation Suggestion [32.10692757420455]
We create a benchmark data set for Translation Suggestion (TS) called WeTS.
We also propose several novel methods to generate synthetic corpus which can substantially improve the performance of TS.
Our model achieves State-Of-The-Art (SOTA) results on all four translation directions, including English-to-German, German-to-English, Chinese-to-English and English-to-Chinese.
arXiv Detail & Related papers (2021-10-11T10:52:17Z)
- XLM-T: Scaling up Multilingual Machine Translation with Pretrained Cross-lingual Transformer Encoders [89.0059978016914]
We present XLM-T, which initializes the model with an off-the-shelf pretrained cross-lingual Transformer and fine-tunes it with multilingual parallel data.
This simple method achieves significant improvements on a WMT dataset with 10 language pairs and the OPUS-100 corpus with 94 pairs.
arXiv Detail & Related papers (2020-12-31T11:16:51Z)
- Reproducible Science with LaTeX [4.09920839425892]
This paper proposes a procedure to execute external source codes from a document.
It includes the calculation outputs in the resulting Portable Document Format (pdf) file automatically.
arXiv Detail & Related papers (2020-10-04T04:04:07Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- Bootstrapping a Crosslingual Semantic Parser [74.99223099702157]
We adapt a semantic parser trained on a single language, such as English, to new languages and multiple domains with minimal annotation.
We query if machine translation is an adequate substitute for training data, and extend this to investigate bootstrapping using joint training with English, paraphrasing, and multilingual pre-trained models.
arXiv Detail & Related papers (2020-04-06T12:05:02Z)