MathBridge: A Large Corpus Dataset for Translating Spoken Mathematical Expressions into $LaTeX$ Formulas for Improved Readability
- URL: http://arxiv.org/abs/2408.07081v3
- Date: Fri, 16 Aug 2024 09:54:23 GMT
- Title: MathBridge: A Large Corpus Dataset for Translating Spoken Mathematical Expressions into $LaTeX$ Formulas for Improved Readability
- Authors: Kyudan Jung, Sieun Hyeon, Jeong Youn Kwon, Nam-Joon Kim, Hyun Gon Ryu, Hyuk-Jae Lee, Jaeyoung Do
- Abstract summary: We introduce MathBridge, the first extensive dataset for translating mathematical spoken sentences into formulas.
MathBridge significantly enhances the capabilities of pretrained language models for converting mathematical spoken sentences into formulas.
- Score: 10.757551947236879
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Improving the readability of mathematical expressions in text-based documents, such as the subtitles of mathematical videos, is a significant task. To achieve this, mathematical expressions should be converted into compiled formulas. For instance, the spoken expression ``x equals minus b plus or minus the square root of b squared minus four a c, all over two a'' from automatic speech recognition is more readily comprehensible when displayed as the compiled formula $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$. Converting mathematical spoken sentences into compiled formulas requires two processes: spoken sentences are converted into LaTeX formulas, and LaTeX formulas are converted into compiled formulas. The latter can be managed by existing LaTeX engines, but there has been no effective way to do the former. Even if we try to solve this with language models, there is no paired data of spoken sentences and LaTeX formulas with which to train them. In this paper, we introduce MathBridge, the first extensive dataset for translating mathematical spoken sentences into LaTeX formulas. MathBridge comprises approximately 23 million LaTeX formulas paired with the corresponding mathematical spoken sentences. Through comprehensive evaluations, including fine-tuning with the proposed data, we found that MathBridge significantly enhances the capabilities of pretrained language models for converting mathematical spoken sentences into LaTeX formulas. Specifically, for the T5-large model, the sacreBLEU score increased from 4.77 to 46.8, demonstrating a substantial improvement.
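Only the first of these two stages needs to be learned. Below is a minimal sketch of what inference and evaluation could look like, assuming a seq2seq model fine-tuned on MathBridge; the `t5-large` checkpoint is a plain pretrained stand-in, not a released MathBridge model, and the sketch is not the authors' code.

```python
# Sketch of the spoken-sentence -> LaTeX stage plus a sacreBLEU-style
# evaluation. MODEL_NAME is a placeholder: a checkpoint actually
# fine-tuned on MathBridge would be substituted here.
from transformers import T5ForConditionalGeneration, T5Tokenizer
import sacrebleu

MODEL_NAME = "t5-large"  # placeholder; not a MathBridge-tuned checkpoint

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

spoken = ("x equals minus b plus or minus the square root of "
          "b squared minus four a c, all over two a")

inputs = tokenizer(spoken, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
prediction = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Score the prediction against the reference LaTeX, as the paper does.
reference = r"x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}"
score = sacrebleu.corpus_bleu([prediction], [[reference]])
print(prediction)
print(f"sacreBLEU: {score.score:.2f}")
```

The second stage, rendering the predicted LaTeX string into a compiled formula, is handled by an off-the-shelf LaTeX engine and requires no training.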
Related papers
- ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning [54.70811660561151]
Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by either using the final answer or the intermediate reasoning steps derived from static examples.
We seek to use symbolic programs as a means of automatically evaluating whether a model can consistently produce correct final answers across various inputs to the program.
We observe significant accuracy drops using the proposed evaluation compared with the original static examples, suggesting the fragility of mathematical reasoning in state-of-the-art LLMs.
arXiv Detail & Related papers (2024-10-24T18:02:37Z)
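To make that protocol concrete, here is a minimal sketch in the same spirit (not the authors' code; the question template, the stubbed ask_model function, and all names are hypothetical): a problem is backed by an executable program, the inputs are resampled, and the model must match the program's output every time.

```python
# Sketch of symbolic-program-based evaluation in the spirit of ReasonAgain
# (illustrative only; in a real evaluation ask_model would query an LLM).
import random

def ground_truth(a: int, b: int) -> int:
    """Symbolic program backing the question template below."""
    return a * b

def ask_model(question: str) -> int:
    """Stand-in for a call to the LLM under test; here it always answers correctly."""
    a, b = (int(tok) for tok in question.split() if tok.isdigit())
    return a * b

def evaluate(trials: int = 5) -> float:
    """Resample the program's inputs and require a correct answer on every variant."""
    correct = 0
    for _ in range(trials):
        a, b = random.randint(2, 20), random.randint(2, 20)
        question = f"A crate holds {a} apples and there are {b} crates. How many apples in total?"
        correct += ask_model(question) == ground_truth(a, b)
    return correct / trials

print(f"accuracy across resampled inputs: {evaluate():.0%}")
```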
- LATTE: Improving Latex Recognition for Tables and Formulae with Iterative Refinement [11.931911831112357]
This paper proposes LATTE, the first iterative refinement framework for LaTeX recognition.
LATTE improves the source extraction accuracy of both formulae and tables, outperforming existing techniques as well as GPT-4V.
arXiv Detail & Related papers (2024-09-21T17:18:49Z)
- TeXBLEU: Automatic Metric for Evaluate LaTeX Format [4.337656290539519]
We propose TeXBLEU, a metric for evaluating mathematical expressions in LaTeX format, built on the n-gram-based BLEU metric.
The proposed TeXBLEU consists of a tokenizer trained on an arXiv paper dataset and a fine-tuned embedding model with positional encoding.
arXiv Detail & Related papers (2024-09-10T16:54:32Z)
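The BLEU core that TeXBLEU builds on is modified n-gram precision; the toy function below computes it over whitespace-split LaTeX tokens purely for illustration (this is not the TeXBLEU implementation, which swaps in the trained tokenizer and embedding model described above).

```python
# Clipped n-gram precision, the building block of BLEU-style metrics
# (illustration only; not the TeXBLEU implementation).
from collections import Counter

def modified_ngram_precision(hyp: list[str], ref: list[str], n: int) -> float:
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(count, ref_ngrams[gram]) for gram, count in hyp_ngrams.items())
    return overlap / max(sum(hyp_ngrams.values()), 1)

# Token-level comparison of a prediction against a reference formula;
# the prediction writes '+' where the reference has '\pm'.
ref = r"x = \frac { - b \pm \sqrt { b ^ 2 - 4 a c } } { 2 a }".split()
hyp = r"x = \frac { - b + \sqrt { b ^ 2 - 4 a c } } { 2 a }".split()
for n in (1, 2, 3):
    print(f"{n}-gram precision: {modified_ngram_precision(hyp, ref, n):.2f}")
```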
- Towards Semantic Markup of Mathematical Documents via User Interaction [0.0]
We present an approach to semantic markup of formulas by (semi-)automatically generating grammars from existing sTeX macro definitions and parsing formulas with them.
We also present a GUI-based tool for the disambiguation of parse results and showcase its potential using a grammar for parsing untyped $\lambda$-terms.
arXiv Detail & Related papers (2024-08-05T12:36:40Z)
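For concreteness, here is a toy grammar for untyped λ-terms together with a hand-written recursive-descent parser (illustrative only; the paper generates its grammars from sTeX macro definitions rather than hand-writing a parser).

```python
# Toy recursive-descent parser for untyped lambda-terms.
# Grammar: term ::= '\' var '.' term | app ; app ::= atom atom* ; atom ::= var | '(' term ')'
import re

def tokenize(src: str) -> list[str]:
    return re.findall(r"[\\().]|[a-z]\w*", src)

def parse_term(toks):
    if toks and toks[0] == "\\":
        var, body = toks[1], toks[3:]          # skip '\', the variable, and '.'
        term, rest = parse_term(body)
        return ("lam", var, term), rest
    return parse_app(toks)

def parse_app(toks):
    left, toks = parse_atom(toks)
    while toks and toks[0] != ")":             # left-associative application
        right, toks = parse_atom(toks)
        left = ("app", left, right)
    return left, toks

def parse_atom(toks):
    if toks[0] == "(":
        term, rest = parse_term(toks[1:])
        return term, rest[1:]                  # drop the closing ')'
    return ("var", toks[0]), toks[1:]

ast, leftover = parse_term(tokenize(r"\x. \y. x (y x)"))
print(ast)
# ('lam', 'x', ('lam', 'y', ('app', ('var', 'x'), ('app', ('var', 'y'), ('var', 'x')))))
```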
- MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition [2.325171167252542]
First, we present an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one.
Second, we introduce the real-world dataset realFormula, with mathematical expressions (MEs) extracted from papers.
Third, we develop MathNet, a mathematical expression recognition (MER) model based on a convolutional vision transformer, which achieves superior results on all four test sets.
arXiv Detail & Related papers (2024-04-21T14:03:34Z)
- MathScale: Scaling Instruction Tuning for Mathematical Reasoning [70.89605383298331]
Large language models (LLMs) have demonstrated remarkable capabilities in problem-solving.
However, their proficiency in solving mathematical problems remains inadequate.
We propose MathScale, a simple and scalable method to create high-quality mathematical reasoning data.
arXiv Detail & Related papers (2024-03-05T11:42:59Z)
- MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning [52.97768001837269]
We present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations.
We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions.
This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems.
arXiv Detail & Related papers (2023-10-05T17:52:09Z)
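For illustration, a code-based solution of the kind such models emit might look like the following (a hypothetical example, not actual MathCoder output): the model writes a short symbolic-math program instead of reasoning in prose.

```python
# Hypothetical code-based solution of the kind MathCoder-style models generate:
# solve the quadratic equation x^2 - 5x + 6 = 0 symbolically with SymPy.
import sympy as sp

x = sp.symbols("x")
roots = sp.solve(sp.Eq(x**2 - 5 * x + 6, 0), x)
print(roots)  # [2, 3]
```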
- WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct [128.89645483139236]
We present WizardMath, which enhances the mathematical reasoning abilities of Llama-2, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math.
Our model even surpasses ChatGPT-3.5, Claude Instant-1, PaLM-2, and Minerva on GSM8k, and simultaneously surpasses Text-davinci, PaLM-1, and GPT-3 on MATH.
arXiv Detail & Related papers (2023-08-18T14:23:21Z)
- Tree-Based Representation and Generation of Natural and Mathematical Language [77.34726150561087]
Mathematical language in scientific communications and educational scenarios is important yet relatively understudied.
Recent works on mathematical language focus either on representing stand-alone mathematical expressions, or mathematical reasoning in pre-trained natural language models.
We propose a series of modifications to existing language models to jointly represent and generate text and math.
arXiv Detail & Related papers (2023-02-15T22:38:34Z)
- JiuZhang: A Chinese Pre-trained Language Model for Mathematical Problem Understanding [74.12405417718054]
This paper aims to advance the mathematical intelligence of machines by presenting the first Chinese mathematical pre-trained language model (PLM).
Unlike other standard NLP tasks, mathematical texts are difficult to understand, since they involve mathematical terminology, symbols and formulas in the problem statement.
We design a novel curriculum pre-training approach for improving the learning of mathematical PLMs, consisting of both basic and advanced courses.
arXiv Detail & Related papers (2022-06-13T17:03:52Z)
- Machine Translation of Mathematical Text [0.0]
We have implemented a machine translation system, the PolyMath Translator, for documents containing mathematical text.
The current implementation translates English to French, attaining a BLEU score of 53.5 on a held-out test corpus of mathematical sentences.
It produces documents that can be compiled to PDF without further editing.
arXiv Detail & Related papers (2020-10-11T11:59:40Z)