VSEC: Transformer-based Model for Vietnamese Spelling Correction
- URL: http://arxiv.org/abs/2111.00640v1
- Date: Mon, 1 Nov 2021 00:55:32 GMT
- Title: VSEC: Transformer-based Model for Vietnamese Spelling Correction
- Authors: Dinh-Truong Do, Ha Thanh Nguyen, Thang Ngoc Bui, Dinh Hieu Vo
- Abstract summary: We propose a novel method to correct Vietnamese spelling errors.
We tackle the problems of mistyped errors and misspelled errors by using a deep learning model.
The experimental results show that our method achieves encouraging performance, with 86.8% of errors detected and 81.5% of errors corrected.
- Score: 0.19116784879310028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Spelling error correction is one of the topics with a long history in
natural language processing. Although previous studies have achieved remarkable
results, challenges still exist. In the Vietnamese language, a state-of-the-art
method for the task infers a syllable's context from its adjacent syllables.
The method's accuracy can be unsatisfactory, however, because the model may
lose the context if two (or more) spelling mistakes stand near each other. In
this paper, we propose a novel method to correct Vietnamese spelling errors. We
tackle the problems of mistyped errors and misspelled errors by using a deep
learning model. The embedding layer, in particular, is powered by the byte pair
encoding technique. The sequence to sequence model based on the Transformer
architecture makes our approach different from the previous works on the same
problem. In the experiment, we train the model on a large synthetic dataset into
which spelling errors are randomly introduced. We test the performance of the
proposed method using a realistic dataset. This dataset contains 11,202
human-made misspellings in 9,341 different Vietnamese sentences. The
experimental results show that our method achieves encouraging performance with
86.8% of errors detected and 81.5% of errors corrected, improving on the
state-of-the-art approach by 5.6% and 2.2%, respectively.
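The synthetic training-data construction described above (randomly injecting spelling errors into clean Vietnamese text) can be illustrated with a short sketch. The snippet below is not the authors' released code: the confusion sets, error rate, and function name are illustrative assumptions about how mistyped errors (telex-style typing slips) and misspelled errors (common phonetic confusions) might be injected to build (noisy, clean) training pairs for a BPE-tokenized Transformer encoder-decoder.

```python
import random

# Illustrative confusion sets (assumptions, not the paper's exact lists).
# Mistyped errors imitate telex typing slips; misspelled errors imitate
# common Vietnamese phonetic confusions.
MISTYPED = {"ă": "aw", "â": "aa", "ê": "ee", "ô": "oo", "ơ": "ow", "ư": "uw", "đ": "dd"}
MISSPELLED = [("ch", "tr"), ("s", "x"), ("d", "gi"), ("l", "n"), ("r", "d")]

def inject_errors(sentence: str, error_rate: float = 0.15, seed=None) -> str:
    """Return a noisy copy of `sentence` with randomly injected spelling errors.

    Each syllable (whitespace token) is corrupted independently with
    probability `error_rate`, mirroring the synthetic-data idea in the abstract.
    """
    rng = random.Random(seed)
    noisy = []
    for syllable in sentence.split():
        if rng.random() < error_rate:
            if rng.random() < 0.5:
                # Mistyped: replace an accented character with raw keystrokes.
                for accented, keys in MISTYPED.items():
                    if accented in syllable:
                        syllable = syllable.replace(accented, keys, 1)
                        break
            else:
                # Misspelled: swap a commonly confused leading consonant.
                for a, b in MISSPELLED:
                    if syllable.startswith(a):
                        syllable = b + syllable[len(a):]
                        break
        noisy.append(syllable)
    return " ".join(noisy)

clean = "tôi đang học tiếng việt"
noisy = inject_errors(clean, error_rate=0.5, seed=0)
print(noisy, "->", clean)  # each (noisy, clean) pair is one seq2seq training example
```

In the full system, many such pairs drawn from a large clean corpus would be encoded with byte pair encoding and used to train a standard Transformer encoder-decoder that maps the noisy input back to the clean sentence.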
Related papers
- Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems [47.753284211200665]
We focus on understanding the usefulness of incorporating "error-correction" data directly into the pretraining stage.
This data consists of erroneous solution steps immediately followed by their corrections.
We show promising results: this type of pretraining data can help language models achieve higher reasoning accuracy.
arXiv Detail & Related papers (2024-08-29T06:49:20Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Persian Typographical Error Type Detection Using Deep Neural Networks on Algorithmically-Generated Misspellings [2.2503811834154104]
Typographical Error Type Detection in Persian is a relatively understudied area.
This paper presents a compelling approach for detecting typographical errors in Persian texts.
The outcomes of our final method proved to be highly competitive, achieving an accuracy of 97.62%, precision of 98.83%, recall of 98.61%, and surpassing others in terms of speed.
arXiv Detail & Related papers (2023-05-19T15:05:39Z)
- Towards Fine-Grained Information: Identifying the Type and Location of Translation Errors [80.22825549235556]
Existing approaches cannot simultaneously consider error position and type.
We build an FG-TED model to predict addition and omission errors.
Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z)
- Discovering Latent Knowledge in Language Models Without Supervision [72.95136739040676]
Existing techniques for training language models can be misaligned with the truth.
We propose directly finding latent knowledge inside the internal activations of a language model in a purely unsupervised way.
We show that despite using no supervision and no model outputs, our method can recover diverse knowledge represented in large language models.
arXiv Detail & Related papers (2022-12-07T18:17:56Z)
- A transformer-based spelling error correction framework for Bangla and resource scarce Indic languages [2.5874041837241304]
Spelling error correction is the task of identifying and rectifying misspelled words in texts.
Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods.
We propose a novel detector-purificator-corrector framework (DPC) based on denoising transformers that addresses these issues.
arXiv Detail & Related papers (2022-11-07T17:59:05Z)
- Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and so complex that humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z)
- Correcting diacritics and typos with ByT5 transformer model [0.0]
People tend to forgo using diacritics and make typographical errors (typos) when typing.
In this work, we tackle both problems at once by employing newly-developed ByT5 byte-level transformer models.
Our simultaneous diacritics restoration and typo correction approach demonstrates near state-of-the-art performance in 13 languages; a minimal API sketch of this byte-level setup appears after this list.
arXiv Detail & Related papers (2022-01-31T13:52:51Z)
- Hierarchical Transformer Encoders for Vietnamese Spelling Correction [1.0779600811805266]
We propose a Hierarchical Transformer model for the Vietnamese spelling correction problem.
The model consists of multiple Transformer encoders and utilizes both character-level and word-level representations to detect errors and make corrections.
arXiv Detail & Related papers (2021-05-28T04:09:15Z)
- Grammatical Error Generation Based on Translated Fragments [0.0]
We perform neural machine translation of sentence fragments in order to create large amounts of training data for English grammatical error correction.
Our method aims at simulating mistakes made by second language learners, and produces a wider range of non-native style language.
arXiv Detail & Related papers (2021-04-20T12:43:40Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
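As referenced in the ByT5 entry above, the snippet below loads a public ByT5 checkpoint with the Hugging Face transformers library and runs generation on text typed without diacritics. The pretrained checkpoint is not a ready-made corrector; in practice it would first be fine-tuned on (corrupted, clean) sentence pairs, so treat this purely as an API sketch under that assumption.

```python
# Minimal usage sketch. Assumption: a ByT5 model fine-tuned on (corrupted,
# clean) sentence pairs would be loaded here; "google/byt5-small" is only the
# public pretrained checkpoint, not a trained spelling/diacritics corrector.
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/byt5-small")
model = T5ForConditionalGeneration.from_pretrained("google/byt5-small")

corrupted = "co le toi se den muon"  # Vietnamese typed without diacritics
inputs = tokenizer(corrupted, return_tensors="pt")

# ByT5 operates directly on UTF-8 bytes, so no subword vocabulary is needed;
# diacritics and typos are handled at the byte level by the model itself.
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Because the model consumes raw bytes, the same code path covers both diacritics restoration and typo correction, which is consistent with the single-model treatment of both problems described in that entry.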