MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula
- URL: http://arxiv.org/abs/2412.15655v2
- Date: Sun, 19 Jan 2025 07:03:17 GMT
- Title: MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula
- Authors: Sieun Hyeon, Kyudan Jung, Jaehee Won, Nam-Joon Kim, Hyun Gon Ryu, Hyuk-Jae Lee, Jaeyoung Do
- Abstract summary: MathSpeech is a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions.
MathSpeech demonstrates $\LaTeX{}$ generation capabilities comparable to leading commercial Large Language Models (LLMs).
In CER, BLEU, and ROUGE scores for $\LaTeX{}$ translation, it significantly outperformed GPT-4o.
- Abstract: In various academic and professional settings, such as mathematics lectures or research presentations, it is often necessary to convey mathematical expressions orally. However, reading mathematical expressions aloud without accompanying visuals can significantly hinder comprehension, especially for those who are hearing-impaired or rely on subtitles due to language barriers. For instance, when a presenter reads Euler's Formula, current Automatic Speech Recognition (ASR) models often produce a verbose and error-prone textual description (e.g., e to the power of i x equals cosine of x plus i $\textit{side}$ of x), instead of the concise $\LaTeX{}$ format (i.e., $ e^{ix} = \cos(x) + i\sin(x) $), which hampers clear understanding and communication. To address this issue, we introduce MathSpeech, a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions and accurately convert spoken expressions into structured $\LaTeX{}$ representations. Evaluated on a new dataset derived from lecture recordings, MathSpeech demonstrates $\LaTeX{}$ generation capabilities comparable to leading commercial Large Language Models (LLMs), while leveraging fine-tuned small language models of only 120M parameters. Specifically, in terms of CER, BLEU, and ROUGE scores for $\LaTeX{}$ translation, MathSpeech demonstrated significantly superior capabilities compared to GPT-4o. We observed a decrease in CER from 0.390 to 0.298, and higher ROUGE/BLEU scores compared to GPT-4o.
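To make the pipeline concrete, the following is a minimal sketch of the two-stage idea from the abstract, assuming an off-the-shelf ASR model and a small seq2seq corrector via Hugging Face transformers; the model IDs and the prompt string are illustrative placeholders, not the authors' released components.

```python
# Sketch of a MathSpeech-style pipeline: (1) transcribe speech, then
# (2) have a small seq2seq LM rewrite error-prone spoken math as LaTeX.
# "openai/whisper-base" and "t5-small" are stand-ins, not the paper's models.
from transformers import pipeline, AutoTokenizer, AutoModelForSeq2SeqLM

asr = pipeline("automatic-speech-recognition", model="openai/whisper-base")
spoken = asr("lecture_clip.wav")["text"]
# e.g. "e to the power of i x equals cosine of x plus i side of x"

tok = AutoTokenizer.from_pretrained("t5-small")  # stand-in for the 120M sLM
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
ids = tok("translate spoken math to LaTeX: " + spoken, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
# desired output: e^{ix} = \cos(x) + i\sin(x)
```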
Related papers
- MathReader: Text-to-Speech for Mathematical Documents [2.8522108187031834]
We propose MathReader, which effectively integrates OCR, a fine-tuned T5 model, and TTS.
MathReader reduced the WER from 0.510 to 0.281 compared to Microsoft Edge, and from 0.617 to 0.281 compared to Adobe Acrobat.
MathReader significantly alleviates the inconvenience faced by users who want to listen to documents, especially those who are visually impaired.
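As a reference point for the WER figures above, here is a hedged sketch of how Word Error Rate is typically computed with the `jiwer` package; the sentences are illustrative, not taken from the paper's test set.

```python
# WER = (word substitutions + deletions + insertions) / reference length.
# Requires: pip install jiwer
import jiwer

reference  = "e to the power of i x equals cosine of x plus i sine of x"
hypothesis = "e to the power of i x equals cosine of x plus i side of x"

print(jiwer.wer(reference, hypothesis))  # 1 substitution / 16 words = 0.0625
```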
arXiv Detail & Related papers (2025-01-13T06:47:05Z)
- Greek2MathTex: A Greek Speech-to-Text Framework for LaTeX Equations Generation [1.7660225024861564]
We present a novel speech-to-$\LaTeX{}$ equations system specifically designed for the Greek language.
We propose an end-to-end system that harnesses the power of Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) techniques.
arXiv Detail & Related papers (2024-12-11T22:29:44Z)
- From Language Models over Tokens to Language Models over Characters [54.123846188068384]
Modern language models are internally -- and mathematically -- distributions over token strings rather than $\textit{character}$ strings.
This paper presents algorithms for converting token-level language models to character-level ones.
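The core difficulty here is that one character string corresponds to many token strings, so character-level probabilities require marginalizing over tokenizations. A toy sketch under a strong simplifying assumption (a unigram token LM; the paper handles autoregressive LMs):

```python
# P(character string) = sum over all segmentations into vocabulary tokens,
# each segmentation weighted by its token probabilities, ending in <eos>.
unigram = {"a": 0.3, "ab": 0.2, "b": 0.4, "<eos>": 0.1}

def char_string_prob(s: str) -> float:
    if s == "":
        return unigram["<eos>"]
    return sum(
        p * char_string_prob(s[len(tok):])
        for tok, p in unigram.items()
        if tok != "<eos>" and s.startswith(tok)
    )

# "ab" has two tokenizations, ["ab"] and ["a", "b"]:
# 0.2 * 0.1 + 0.3 * 0.4 * 0.1 = 0.032
print(char_string_prob("ab"))
```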
arXiv Detail & Related papers (2024-12-04T21:19:20Z)
- TeXBLEU: Automatic Metric for Evaluating LaTeX Format [4.337656290539519]
We propose TeXBLEU, a metric for evaluating mathematical expressions in $\LaTeX{}$ format, built on the n-gram-based BLEU metric.
TeXBLEU consists of a tokenizer trained on an arXiv paper dataset and a fine-tuned embedding model with positional encoding.
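TeXBLEU's trained tokenizer and embedding model are not reproduced here, but the n-gram backbone it builds on can be sketched with a naive LaTeX tokenizer; both the regex and the example are assumptions for illustration.

```python
# Clipped bigram precision over LaTeX tokens: the BLEU ingredient that
# TeXBLEU refines with a trained tokenizer and embedding similarity.
import re
from collections import Counter

def latex_tokens(s: str):
    # Naive tokenizer: LaTeX commands, braces/scripts, then single characters.
    return re.findall(r"\\[A-Za-z]+|[{}^_]|[^\s{}^_\\]", s)

def bigram_precision(ref: str, hyp: str) -> float:
    r, h = latex_tokens(ref), latex_tokens(hyp)
    ref_counts = Counter(zip(r, r[1:]))
    hyp_bigrams = list(zip(h, h[1:]))
    clipped = sum(min(c, ref_counts[g]) for g, c in Counter(hyp_bigrams).items())
    return clipped / max(1, len(hyp_bigrams))

print(bigram_precision(r"e^{ix} = \cos(x)", r"e^{ix} = \cos(x)"))  # 1.0
```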
arXiv Detail & Related papers (2024-09-10T16:54:32Z)
- MathBridge: A Large Corpus Dataset for Translating Spoken Mathematical Expressions into $\LaTeX{}$ Formulas for Improved Readability [10.757551947236879]
We introduce MathBridge, the first extensive dataset for translating mathematical spoken sentences into formulas.
MathBridge significantly enhances the ability of pretrained language models to convert mathematical spoken sentences into formulas.
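A minimal sketch of the fine-tuning such a corpus enables, assuming spoken-text/LaTeX pairs in MathBridge's spirit; the example pair and the "t5-small" checkpoint are illustrative, not taken from the dataset or the paper.

```python
# Fine-tune a pretrained seq2seq LM on spoken-math -> LaTeX pairs.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

pairs = [("the square root of b squared minus four a c", r"\sqrt{b^2 - 4ac}")]

tok = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

model.train()
for spoken, latex in pairs:
    inputs = tok(spoken, return_tensors="pt")
    labels = tok(latex, return_tensors="pt").input_ids
    loss = model(**inputs, labels=labels).loss  # teacher-forced cross-entropy
    loss.backward()
    opt.step()
    opt.zero_grad()
```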
arXiv Detail & Related papers (2024-08-07T18:07:15Z)
- RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis [84.57932472551889]
RALL-E is a robust language modeling method for text-to-speech synthesis.
RALL-E improves the WER of zero-shot TTS from 5.6% (without reranking) to 2.5%, and to 1.0% with reranking.
arXiv Detail & Related papers (2024-04-04T05:15:07Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- Evaluating Language Models for Mathematics through Interactions [116.67206980096513]
We introduce CheckMate, a prototype platform for humans to interact with and evaluate large language models (LLMs).
We conduct a study with CheckMate to evaluate three language models (InstructGPT, ChatGPT, and GPT-4) as assistants in proving undergraduate-level mathematics.
We derive a taxonomy of human behaviours and uncover that despite a generally positive correlation, there are notable instances of divergence between correctness and perceived helpfulness.
arXiv Detail & Related papers (2023-06-02T17:12:25Z)
- Zemi: Learning Zero-Shot Semi-Parametric Language Models from Multiple Tasks [77.90900650816046]
We introduce Zemi, a zero-shot semi-parametric language model.
We train Zemi with a novel semi-parametric multitask prompted training paradigm.
Specifically, we augment the multitask training and zero-shot evaluation with retrieval from a large-scale task-agnostic unlabeled corpus.
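A simplified stand-in for that retrieval step, assuming TF-IDF similarity over a tiny in-memory corpus; Zemi itself retrieves from a large-scale corpus with a trained augmentation module, so this only illustrates the prompt-augmentation shape.

```python
# Retrieve the most similar unlabeled passages and prepend them to the prompt.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The derivative of sin(x) is cos(x).",
    "Paris is the capital of France.",
    "Euler's formula links exponentials and trigonometric functions.",
]
vectorizer = TfidfVectorizer().fit(corpus)
doc_matrix = vectorizer.transform(corpus)

def augment(prompt: str, k: int = 2) -> str:
    sims = cosine_similarity(vectorizer.transform([prompt]), doc_matrix)[0]
    top = sims.argsort()[::-1][:k]
    context = "\n".join(corpus[i] for i in top)
    return f"{context}\n\nTask: {prompt}"

print(augment("What does Euler's formula relate?"))
```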
arXiv Detail & Related papers (2022-10-01T04:08:50Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM built on linguistic units, including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
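A minimal PyTorch sketch of such a unit-level LM, with arbitrary sizes; this mirrors the general architecture (embedding, LSTM, softmax head), not the paper's exact configuration or auxiliary objectives.

```python
# Generative LM over linguistic units (phonemes or syllables).
import torch
import torch.nn as nn

class UnitLM(nn.Module):
    def __init__(self, n_units: int, d: int = 128):
        super().__init__()
        self.emb = nn.Embedding(n_units, d)
        self.lstm = nn.LSTM(d, d, batch_first=True)
        self.head = nn.Linear(d, n_units)  # next-unit logits

    def forward(self, x):                  # x: (batch, time) unit IDs
        h, _ = self.lstm(self.emb(x))
        return self.head(h)

lm = UnitLM(n_units=50)                    # roughly an English phoneme inventory
logits = lm(torch.randint(0, 50, (1, 10)))
print(logits.shape)                        # torch.Size([1, 10, 50])
```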
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- A Transformer-based Math Language Model for Handwritten Math Expression Recognition [7.202733269706245]
Many math symbols are very similar in writing style, such as the dot and comma, or 0, O, and o.
This paper presents a Transformer-based Math Language Model (TMLM).
TMLM achieved a perplexity of 4.42, outperforming previous math language models.
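For reference, perplexity is the exponential of the mean negative log-likelihood the model assigns to the ground-truth next symbol; a short sketch with random tensors standing in for model outputs:

```python
import math
import torch
import torch.nn.functional as F

logits = torch.randn(10, 100)              # (sequence length, symbol vocabulary)
targets = torch.randint(0, 100, (10,))     # ground-truth next symbols

nll = F.cross_entropy(logits, targets)     # mean negative log-likelihood
print(math.exp(nll.item()))                # perplexity
```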
arXiv Detail & Related papers (2021-08-11T03:03:48Z)