Optimized Tokenization for Transcribed Error Correction
- URL: http://arxiv.org/abs/2310.10704v1
- Date: Mon, 16 Oct 2023 12:14:21 GMT
- Title: Optimized Tokenization for Transcribed Error Correction
- Authors: Tomer Wullach, Shlomo E. Chazan
- Abstract summary: We show that the performance of correction models can be significantly increased by training solely using synthetic data.
Specifically, we show that synthetic data generated using the error distribution derived from a set of transcribed data outperforms the common approach of applying random perturbations.
- Score: 10.297878672883973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The challenges facing speech recognition systems, such as variations in
pronunciations, adverse audio conditions, and the scarcity of labeled data,
emphasize the necessity for a post-processing step that corrects recurring
errors. Previous research has shown the advantages of employing dedicated error
correction models, yet training such models requires large amounts of labeled
data which is not easily obtained. To overcome this limitation, synthetic
transcribed-like data is often utilized; however, bridging the distribution gap
between transcribed errors and synthetic noise is not trivial. In this paper,
we demonstrate that the performance of correction models can be significantly
increased by training solely using synthetic data. Specifically, we empirically
show that: (1) synthetic data generated using the error distribution derived
from a set of transcribed data outperforms the common approach of applying
random perturbations; (2) applying language-specific adjustments to the
vocabulary of a BPE tokenizer strikes a balance between adapting to unseen
distributions and retaining knowledge of transcribed errors. We showcase the
benefits of these key observations, and evaluate our approach using multiple
languages, speech recognition systems and prominent speech recognition
datasets.
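The contrast in point (1) of the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`collect_substitutions`, `corrupt_empirical`, `corrupt_random`), the word-level alignment via `difflib`, and the toy sentence pairs are all assumptions made for illustration. The idea shown is only that errors sampled from substitutions observed in real transcripts resemble ASR mistakes more closely than uniformly random character noise does.

```python
import random
from collections import Counter, defaultdict
from difflib import SequenceMatcher


def collect_substitutions(pairs):
    """Count word substitutions (reference word -> transcribed word).

    `pairs` is an iterable of (reference, transcript) strings. Only 1-to-1
    replacements are kept, which is a simplifying assumption.
    """
    table = defaultdict(Counter)
    for reference, transcript in pairs:
        ref, hyp = reference.split(), transcript.split()
        matcher = SequenceMatcher(a=ref, b=hyp, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag == "replace" and (i2 - i1) == (j2 - j1):
                for r, h in zip(ref[i1:i2], hyp[j1:j2]):
                    table[r][h] += 1
    return table


def corrupt_empirical(sentence, table, rate, rng):
    """Replace words using the observed error distribution."""
    out = []
    for word in sentence.split():
        options = table.get(word)
        if options and rng.random() < rate:
            candidates, weights = zip(*options.items())
            out.append(rng.choices(candidates, weights=weights, k=1)[0])
        else:
            out.append(word)
    return " ".join(out)


def corrupt_random(sentence, rate, rng, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """Baseline: replace non-space characters uniformly at random."""
    chars = list(sentence)
    for i, c in enumerate(chars):
        if c != " " and rng.random() < rate:
            chars[i] = rng.choice(alphabet)
    return "".join(chars)


# Toy (reference, transcript) pairs, invented for this example.
pairs = [
    ("their going to the store", "there going to the store"),
    ("i see their car", "i see there car"),
]
table = collect_substitutions(pairs)
rng = random.Random(0)
print(corrupt_empirical("their car is red", table, rate=1.0, rng=rng))
print(corrupt_random("their car is red", rate=0.2, rng=rng))
```

In this sketch the empirical corruptor can only produce confusions that were actually seen in the transcribed pairs, while the random baseline produces arbitrary character noise; a real pipeline would likely need sub-word alignment and handling of insertions and deletions.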
Related papers
- Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model [0.0]
OpenAI's Whisper Automated Speech Recognition model excels in generalizing across diverse datasets and domains.
We propose a method to enhance transcription accuracy without explicit fine-tuning or altering model parameters.
arXiv Detail & Related papers (2024-10-24T01:58:11Z)
- Improving Grammatical Error Correction via Contextual Data Augmentation [49.746484518527716]
We propose a synthetic data construction method based on contextual augmentation.
Specifically, we combine rule-based substitution with model-based generation.
We also propose a relabeling-based data cleaning method to mitigate the effects of noisy labels in synthetic data.
arXiv Detail & Related papers (2024-06-25T10:49:56Z)
- Parameter-tuning-free data entry error unlearning with adaptive selective synaptic dampening [51.34904967046097]
We introduce an extension to the selective synaptic dampening unlearning method that removes the need for parameter tuning.
We demonstrate the performance of this extension, adaptive selective synaptic dampening (ASSD), on various ResNet18 and Vision Transformer unlearning tasks.
The application of this approach is particularly compelling in industrial settings, such as supply chain management.
arXiv Detail & Related papers (2024-02-06T14:04:31Z)
- Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization [57.38123229553157]
This paper presents an effective transfer learning framework for language adaptation in text-to-speech systems.
We focus on achieving language adaptation using minimal labeled and unlabeled data.
Experimental results show that our framework is able to synthesize intelligible speech in unseen languages with only 4 utterances of labeled data and 15 minutes of unlabeled data.
arXiv Detail & Related papers (2024-01-23T21:55:34Z)
- Generative error correction for code-switching speech recognition using large language models [49.06203730433107]
Code-switching (CS) speech refers to the phenomenon of mixing two or more languages within the same sentence.
We propose to leverage large language models (LLMs) and lists of hypotheses generated by an ASR to address the CS problem.
arXiv Detail & Related papers (2023-10-17T14:49:48Z)
- HyPoradise: An Open Baseline for Generative Speech Recognition with Large Language Models [81.56455625624041]
We introduce the first open-source benchmark to utilize external large language models (LLMs) for ASR error correction.
The proposed benchmark contains a novel dataset, HyPoradise (HP), encompassing more than 334,000 pairs of N-best hypotheses.
With a reasonable prompt, LLMs can use their generative capability to correct even those tokens that are missing from the N-best list.
arXiv Detail & Related papers (2023-09-27T14:44:10Z)
- Detecting Label Errors using Pre-Trained Language Models [37.82128817976385]
We show that large pre-trained language models are extremely capable of identifying label errors in datasets.
We contribute a novel method to produce highly realistic, human-originated label noise from crowdsourced data, and demonstrate the effectiveness of this method on TweetNLP.
arXiv Detail & Related papers (2022-05-25T11:59:39Z)
- Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition [18.924716098922683]
Machine learning with synthetic data is not trivial due to the gap between the synthetic and the real data distributions.
We propose two novel techniques during training to mitigate the problems due to the distribution gap.
We show that these methods significantly improve the training of speech recognition models using synthetic data.
arXiv Detail & Related papers (2021-10-21T21:11:42Z)
- Empirical Error Modeling Improves Robustness of Noisy Neural Sequence Labeling [26.27504889360246]
We propose an empirical error generation approach that employs a sequence-to-sequence model trained to perform translation from error-free to erroneous text.
To overcome the data sparsity problem, which is exacerbated in the case of imperfect textual input, we learned noisy language model-based embeddings.
Our approach outperformed the baseline noise generation and error correction techniques on the erroneous sequence labeling data sets.
arXiv Detail & Related papers (2021-05-25T12:15:45Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.