Correcting diacritics and typos with ByT5 transformer model
- URL: http://arxiv.org/abs/2201.13242v1
- Date: Mon, 31 Jan 2022 13:52:51 GMT
- Title: Correcting diacritics and typos with ByT5 transformer model
- Authors: Lukas Stankevičius, Mantas Lukoševičius, Jurgita Kapočiūtė-Dzikienė, Monika Briedienė, Tomas Krilavičius
- Abstract summary: People tend to forgo using diacritics and make typographical errors (typos) when typing.
In this work, we tackle both problems at once by employing newly-developed ByT5 byte-level transformer models.
Our simultaneous diacritics restoration and typos correction approach demonstrates near state-of-the-art performance in 13 languages.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the fast pace of life and online communications, the prevalence of English, and the QWERTY keyboard, people tend to forgo using diacritics and make typographical errors (typos) when typing. Restoring diacritics and correcting
spelling is important for proper language use and disambiguation of texts for
both humans and downstream algorithms. However, both of these problems are
typically addressed separately, i.e., state-of-the-art diacritics restoration
methods do not tolerate other typos. In this work, we tackle both problems at
once by employing newly-developed ByT5 byte-level transformer models. Our
simultaneous diacritics restoration and typos correction approach demonstrates
near state-of-the-art performance in 13 languages, reaching alpha-word accuracies above 96%. We also perform diacritics restoration alone on 12 benchmark datasets, plus an additional one for the Lithuanian language. The experimental investigation shows that our approach achieves results (>98%) comparable to those previously reported, despite being trained on less data. Our approach also restores diacritics in words not seen during training with >76% accuracy, and the accuracies improve further with longer training. All this demonstrates the strong potential of the suggested methods for real-world application to more data, languages, and error classes.
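The method is a byte-level sequence-to-sequence mapping: text with diacritics dropped and typos injected goes in, clean text comes out. As a hedged sketch only, the snippet below wires up the public google/byt5-small checkpoint through the Hugging Face transformers API to show the shape of that interface; these are not the authors' fine-tuned weights, so the generated output is illustrative, and the Lithuanian example pair is mine.

```python
# Minimal sketch of the byte-level seq2seq setup the abstract describes,
# using the public google/byt5-small checkpoint via Hugging Face
# transformers. These are NOT the authors' fine-tuned weights, so the
# generated text only illustrates the interface, not the paper's results.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

# A fine-tuned model would map noisy bytes to clean bytes, e.g. Lithuanian
# text typed without diacritics and with a typo ("dzanai" for "daznai"):
noisy = "Zmones dzanai raso be lietuvisku rasmenu"
clean = "Žmonės dažnai rašo be lietuviškų rašmenų"  # training target

inputs = tokenizer(noisy, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Working at the byte level is what lets one model serve 13 languages and arbitrary typos: there is no subword vocabulary to fall out of when a diacritic or a typo changes the spelling.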
Related papers
- Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) are widely used, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'.
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- A transformer-based spelling error correction framework for Bangla and resource scarce Indic languages [2.5874041837241304]
Spelling error correction is the task of identifying and rectifying misspelled words in texts.
Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods.
We propose a novel detector-purificator-corrector framework (DPC) based on denoising transformers, addressing the issues of previous approaches.
arXiv Detail & Related papers (2022-11-07T17:59:05Z)
- VSEC: Transformer-based Model for Vietnamese Spelling Correction [0.19116784879310028]
We propose a novel method to correct Vietnamese spelling errors.
We tackle the problems of mistyped errors and misspelled errors by using a deep learning model.
The experimental results show that our method achieves encouraging performance with 86.8% errors detected and 81.5% errors corrected.
arXiv Detail & Related papers (2021-11-01T00:55:32Z)
- Diacritics Restoration using BERT with Analysis on Czech language [3.2729625923640278]
We propose a new architecture for diacritics restoration based on contextualized embeddings, namely BERT.
We conduct a detailed error analysis on Czech, a morphologically rich language with a high level of diacritization (the diacritics-stripping step used to build training data for such models is sketched after this entry).
arXiv Detail & Related papers (2021-05-24T16:58:27Z)
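Both this entry and the ByT5 paper above rely on the same data trick: aligned training pairs for diacritics restoration are obtained by stripping the marks from clean text. A minimal standard-library sketch of that step (the function name is mine):

```python
# Sketch of the standard data-preparation step for diacritics restoration:
# strip combining marks from clean text to obtain (noisy, clean) training
# pairs. Uses only Python's standard library.
import unicodedata

def strip_diacritics(text: str) -> str:
    # NFD decomposition splits "č" into "c" + U+030C (combining caron);
    # dropping combining marks (category Mn) leaves the base letters.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed
                       if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)

clean = "Příliš žluťoučký kůň úpěl ďábelské ódy"  # Czech pangram
noisy = strip_diacritics(clean)
print(noisy)  # "Prilis zlutoucky kun upel dabelske ody"
pair = (noisy, clean)  # the model learns noisy -> clean
```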
- Unsupervised Speech Recognition [55.864459085947345]
wav2vec-U, short for wav2vec Unsupervised, is a method to train speech recognition models without any labeled data.
We leverage self-supervised speech representations to segment unlabeled audio and learn a mapping from these representations to phonemes via adversarial training.
On the larger English Librispeech benchmark, wav2vec-U achieves a word error rate of 5.9 on test-other, rivaling some of the best published systems trained on 960 hours of labeled data from only two years ago.
arXiv Detail & Related papers (2021-05-24T04:10:47Z)
- Spelling Correction with Denoising Transformer [0.0]
We present a novel method of performing spelling correction on short input strings, such as search queries or individual words.
At its core lies a procedure for generating artificial typos which closely follow the error patterns manifested by humans.
This procedure is used to train the production spelling correction model based on a transformer architecture (a toy typo generator in this spirit is sketched after this entry).
arXiv Detail & Related papers (2021-05-12T21:35:18Z)
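The paper's actual typo-generation procedure is not spelled out in this summary. As a hedged illustration only, a toy generator in the same spirit, with keyboard-adjacency substitutions, deletions, and transpositions at made-up rates, might look like this:

```python
# Toy typo generator in the spirit of the entry above: keyboard-adjacency
# substitutions, deletions, and transpositions. The rates and the adjacency
# table are illustrative assumptions, not the paper's actual procedure.
import random

QWERTY_NEIGHBORS = {  # tiny excerpt of a full adjacency table
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp",
    "n": "bhjm", "s": "awedxz", "t": "rfgy",
}

def add_typo(word: str, rng: random.Random) -> str:
    if len(word) < 2:
        return word
    i = rng.randrange(len(word))
    op = rng.choice(["substitute", "delete", "transpose"])
    if op == "substitute" and word[i] in QWERTY_NEIGHBORS:
        return word[:i] + rng.choice(QWERTY_NEIGHBORS[word[i]]) + word[i + 1:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "transpose" and i < len(word) - 1:
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word  # no applicable edit at this position

rng = random.Random(0)
print([add_typo("spelling", rng) for _ in range(3)])
```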
- NeuSpell: A Neural Spelling Correction Toolkit [88.79419580807519]
NeuSpell is an open-source toolkit for spelling correction in English.
It comprises ten different models and benchmarks them on misspellings from multiple sources.
We train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings (see the sketch after this entry).
arXiv Detail & Related papers (2020-10-21T15:53:29Z)
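A minimal sketch of the data idea in the NeuSpell summary: inject isolated misspellings into clean sentences to get in-context training pairs. The lookup table and rate below are invented for illustration and are not NeuSpell's actual resources:

```python
# Sketch of NeuSpell's data idea as summarized above: take a lookup of
# isolated misspellings harvested from real data and inject them into
# clean sentences, yielding in-context (noisy, clean) training pairs.
# This lookup table is a made-up example, not NeuSpell's actual data.
import random

MISSPELLINGS = {
    "received": ["recieved", "receved"],
    "their": ["thier"],
    "separate": ["seperate"],
}

def corrupt(sentence: str, rng: random.Random, rate: float = 0.5) -> str:
    words = sentence.split()
    for i, w in enumerate(words):
        variants = MISSPELLINGS.get(w.lower())
        if variants and rng.random() < rate:
            words[i] = rng.choice(variants)  # swap in a real misspelling
    return " ".join(words)

rng = random.Random(1)
clean = "they received their separate invitations"
print(corrupt(clean, rng))  # e.g. "they recieved thier seperate invitations"
```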
- A Multitask Learning Approach for Diacritic Restoration [21.288912928687186]
In many languages like Arabic, diacritics are used to specify pronunciations as well as meanings.
Such diacritics are often omitted in written text, increasing the number of possible pronunciations and meanings for a word.
We use Arabic as a case study since it has sufficient data resources for tasks that we consider in our joint modeling.
arXiv Detail & Related papers (2020-06-07T01:20:40Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
- Towards Zero-shot Learning for Automatic Phonemic Transcription [82.9910512414173]
A more challenging problem is to build phonemic transcribers for languages with zero training data.
Our model is able to recognize unseen phonemes in the target language without any training data.
It achieves 7.7% better phoneme error rate on average over a standard multilingual model.
arXiv Detail & Related papers (2020-02-26T20:38:42Z)