MathReader : Text-to-Speech for Mathematical Documents
- URL: http://arxiv.org/abs/2501.07088v2
- Date: Sun, 19 Jan 2025 06:27:48 GMT
- Title: MathReader : Text-to-Speech for Mathematical Documents
- Authors: Sieun Hyeon, Kyudan Jung, Nam-Joon Kim, Hyun Gon Ryu, Jaeyoung Do
- Abstract summary: We propose MathReader, which effectively integrates OCR, a fine-tuned T5 model, and TTS.
MathReader reduced the WER from 0.510 to 0.281 compared to Microsoft Edge, and from 0.617 to 0.281 compared to Adobe Acrobat.
This will significantly contribute to alleviating the inconvenience faced by users who want to listen to documents, especially those who are visually impaired.
- Score: 2.8522108187031834
- Abstract: TTS (Text-to-Speech) document readers from Microsoft, Adobe, Apple, and OpenAI are in service worldwide. They provide relatively good TTS results for general plain text, but sometimes skip content or produce unsatisfactory results for mathematical expressions. This is because most modern academic papers are written in LaTeX, and when LaTeX formulas are compiled, they are rendered as distinctive text forms within the document. Traditional TTS document readers, however, output only the text as it is recognized, without considering the mathematical meaning of the formulas. To address this issue, we propose MathReader, which effectively integrates OCR, a fine-tuned T5 model, and TTS. MathReader demonstrated a lower Word Error Rate (WER) than existing TTS document readers, such as Microsoft Edge and Adobe Acrobat, when processing documents containing mathematical formulas. MathReader reduced the WER from 0.510 to 0.281 compared to Microsoft Edge, and from 0.617 to 0.281 compared to Adobe Acrobat. This will significantly help alleviate the inconvenience faced by users who want to listen to documents, especially those who are visually impaired. The code is available at https://github.com/hyeonsieun/MathReader.
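The WER figures quoted above follow the standard definition: word-level Levenshtein distance divided by the reference length. A minimal sketch of that computation (the spoken-math sentences below are illustrative examples, not taken from the paper's evaluation set):

```python
# Word Error Rate (WER): edit distance over word tokens, normalized
# by the number of reference words.

def wer(reference: str, hypothesis: str) -> float:
    ref = reference.split()
    hyp = hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

print(wer("x squared plus one", "x squared plus one"))  # 0.0
print(wer("x squared plus one", "x two plus one"))      # 0.25
```

A TTS reader that skips or mispronounces a formula inflates this metric quickly, which is why the gap between 0.510/0.617 and 0.281 is meaningful on formula-heavy documents.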
Related papers
- MathSpeech: Leveraging Small LMs for Accurate Conversion in Mathematical Speech-to-Formula [10.757551947236879]
MathSpeech is a novel pipeline that integrates ASR models with small Language Models (sLMs) to correct errors in mathematical expressions.
MathSpeech demonstrates LaTeX generation capabilities comparable to those of leading commercial Large Language Models (LLMs).
MathSpeech demonstrated significantly superior capabilities compared to GPT-4o.
arXiv Detail & Related papers (2024-12-20T08:13:05Z)
- Hard-Synth: Synthesizing Diverse Hard Samples for ASR using Zero-Shot TTS and LLM [48.71951982716363]
Text-to-speech (TTS) models have been widely adopted to enhance automatic speech recognition (ASR) systems.
We propose Hard-Synth, a novel ASR data augmentation method that leverages large language models (LLMs) and advanced zero-shot TTS.
Our approach employs LLMs to generate diverse in-domain text through rewriting, without relying on additional text data.
arXiv Detail & Related papers (2024-11-20T09:49:37Z)
- LATTE: Improving Latex Recognition for Tables and Formulae with Iterative Refinement [11.931911831112357]
The LaTeX source and the rendered PDF images look drastically different, especially for formulae and tables.
Prior work generates sources in a single iteration and struggles with complex formulae.
This paper proposes LATTE, the first iterative refinement framework for LaTeX recognition.
arXiv Detail & Related papers (2024-09-21T17:18:49Z)
- AceParse: A Comprehensive Dataset with Diverse Structured Texts for Academic Literature Parsing [82.33075210051129]
We introduce AceParse, the first comprehensive dataset designed to support the parsing of structured texts.
Based on AceParse, we fine-tuned a multimodal model, named Ace, which accurately parses various structured texts.
This model outperforms the previous state-of-the-art by 4.1% in terms of F1 score and by 5% in Jaccard Similarity.
arXiv Detail & Related papers (2024-09-16T06:06:34Z)
- TeXBLEU: Automatic Metric for Evaluate LaTeX Format [4.337656290539519]
We propose TeXBLEU, a metric for evaluating mathematical expressions in LaTeX format, built on the n-gram-based BLEU metric.
The proposed TeXBLEU consists of a tokenizer trained on an arXiv paper dataset and a fine-tuned embedding model with positional encoding.
arXiv Detail & Related papers (2024-09-10T16:54:32Z)
- MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition [2.325171167252542]
We present an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one.
Second, we introduce the real-world dataset realFormula, with MEs extracted from papers.
Third, we developed a MER model, MathNet, based on a convolutional vision transformer, with superior results on all four test sets.
arXiv Detail & Related papers (2024-04-21T14:03:34Z)
- LOCR: Location-Guided Transformer for Optical Character Recognition [55.195165959662795]
We propose LOCR, a model that integrates location guiding into the transformer architecture during autoregression.
We train the model on a dataset comprising over 77M text-location pairs from 125K academic document pages, including bounding boxes for words, tables and mathematical symbols.
It outperforms all existing methods in our test set constructed from arXiv, as measured by edit distance, BLEU, METEOR and F-measure.
arXiv Detail & Related papers (2024-03-04T15:34:12Z)
- Learning Multiplex Representations on Text-Attributed Graphs with One Language Model Encoder [55.24276913049635]
We propose METAG, a new framework for learning Multiplex rEpresentations on Text-Attributed Graphs.
In contrast to existing methods, METAG uses one text encoder to model the shared knowledge across relations.
We conduct experiments on nine downstream tasks in five graphs from both academic and e-commerce domains.
arXiv Detail & Related papers (2023-10-10T14:59:22Z)
- Dynamic Prompt Learning via Policy Gradient for Semi-structured Mathematical Reasoning [150.17907456113537]
We present Tabular Math Word Problems (TabMWP), a new dataset containing 38,431 grade-level problems that require mathematical reasoning.
We evaluate different pre-trained models on TabMWP, including the GPT-3 model in a few-shot setting.
We propose a novel approach, PromptPG, which utilizes policy gradient to learn to select in-context examples from a small amount of training data.
arXiv Detail & Related papers (2022-09-29T08:01:04Z)
- TexSmart: A Text Understanding System for Fine-Grained NER and Enhanced Semantic Analysis [61.28407236720969]
This technical report introduces TexSmart, a text understanding system that supports fine-grained named entity recognition (NER) and enhanced semantic analysis functionalities.
TexSmart holds some unique features. First, the NER function of TexSmart supports over 1,000 entity types, while most other public tools typically support several to (at most) dozens of entity types.
Second, TexSmart introduces new semantic analysis functions, such as semantic expansion and deep semantic representation, that are absent in most previous systems.
arXiv Detail & Related papers (2020-12-31T14:58:01Z)
- Machine Translation of Mathematical Text [0.0]
We have implemented a machine translation system, the PolyMath Translator, for documents containing mathematical text.
The current implementation translates English to French, attaining a BLEU score of 53.5 on a held-out test corpus of mathematical sentences.
It produces documents that can be compiled to PDF without further editing.
arXiv Detail & Related papers (2020-10-11T11:59:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.