Synthetic Error Dataset Generation Mimicking Bengali Writing Pattern
- URL: http://arxiv.org/abs/2003.03484v2
- Date: Thu, 21 May 2020 15:49:16 GMT
- Title: Synthetic Error Dataset Generation Mimicking Bengali Writing Pattern
- Authors: Md. Habibur Rahman Sifat, Chowdhury Rafeed Rahman, Mohammad Rafsan,
Md. Hasibur Rahman
- Abstract summary: We present an algorithm for automatically generating misspelled Bengali words from correct words.
As part of our analysis, we have formed a list of the most commonly used Bengali words.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While writing Bengali using an English keyboard, users often make
spelling mistakes. The accuracy of any Bengali spell checker or paragraph
correction module largely depends on the kind of error dataset it is based on.
Manual generation of such an error dataset is a cumbersome process. In this
research, we present an algorithm for automatically generating misspelled
Bengali words from correct words by analyzing Bengali writing patterns on a
QWERTY-layout English keyboard. As part of our analysis, we have formed a list
of the most commonly used Bengali words, phonetically similar replaceable
clusters, frequently mispressed replaceable clusters, frequently mispressed
insertion-prone clusters, and some rules for handling Juktakkhar (consonant
letter clusters) while generating errors.
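The cluster-based generation described in the abstract can be illustrated with a minimal sketch. It assumes words are handled in their Romanized (QWERTY-typed) form, and every table and function name here (`PHONETIC_CLUSTERS`, `ADJACENT_KEYS`, `phonetic_variants`, `mispress_variants`, `generate_errors`) is hypothetical; the paper's actual cluster lists and Juktakkhar rules are not reproduced or modeled.

```python
import random

# Hypothetical phonetically similar clusters (illustrative only, not the paper's lists).
PHONETIC_CLUSTERS = {"sh": ["s"], "ee": ["i"], "o": ["a"]}

# Hypothetical subset of keys that sit next to each other on a QWERTY keyboard.
ADJACENT_KEYS = {"a": "sq", "k": "jl", "o": "ip", "i": "uo", "r": "et", "n": "bm"}


def phonetic_variants(word):
    """Return words formed by swapping one occurrence of a phonetic cluster."""
    variants = set()
    for cluster, replacements in PHONETIC_CLUSTERS.items():
        start = word.find(cluster)
        while start != -1:
            for rep in replacements:
                variants.add(word[:start] + rep + word[start + len(cluster):])
            start = word.find(cluster, start + 1)
    variants.discard(word)
    return sorted(variants)


def mispress_variants(word):
    """Return words with one key replaced by, or followed by, an adjacent key."""
    variants = set()
    for i, ch in enumerate(word):
        for neighbour in ADJACENT_KEYS.get(ch, ""):
            variants.add(word[:i] + neighbour + word[i + 1:])      # mispressed replacement
            variants.add(word[:i + 1] + neighbour + word[i + 1:])  # mispressed insertion
    variants.discard(word)
    return sorted(variants)


def generate_errors(word, n, seed=0):
    """Sample up to n distinct misspellings of word, reproducibly for a given seed."""
    rng = random.Random(seed)
    pool = sorted(set(phonetic_variants(word)) | set(mispress_variants(word)))
    return rng.sample(pool, min(n, len(pool)))


if __name__ == "__main__":
    print(phonetic_variants("bhalo"))
    print(generate_errors("bhalo", 3))
```

Each misspelling differs from the correct word by a single edit, so pairing it with the source word gives a labeled (error, correction) example of the kind a spell checker could be trained or evaluated on.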
Related papers
- Tibyan Corpus: Balanced and Comprehensive Error Coverage Corpus Using ChatGPT for Arabic Grammatical Error Correction [0.32885740436059047]
This study aims to develop an Arabic corpus called "Tibyan" for grammatical error correction using ChatGPT.
ChatGPT is used as a data augmenter tool based on a pair of Arabic sentences containing grammatical errors matched with a sentence free of errors extracted from Arabic books.
Our corpus contained 49 of errors, including seven types: orthography, syntax, semantics, punctuation, morphology, and split.
arXiv Detail & Related papers (2024-11-07T10:17:40Z)
- Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z)
- Probing for the Usage of Grammatical Number [103.8175326220026]
We try to find encodings that the model actually uses, introducing a usage-based probing setup.
We focus on how BERT encodes grammatical number, and on how it uses this encoding to solve the number agreement task.
arXiv Detail & Related papers (2022-04-19T11:59:52Z)
- Challenge Dataset of Cognates and False Friend Pairs from Indian Languages [54.6340870873525]
Cognates are present in multiple variants of the same text across different languages.
In this paper, we describe the creation of two cognate datasets for twelve Indian languages.
arXiv Detail & Related papers (2021-12-17T14:23:43Z)
- Scarecrow: A Framework for Scrutinizing Machine Text [69.26985439191151]
We introduce a new structured, crowdsourced error annotation schema called Scarecrow.
Scarecrow collects 13k annotations of 1.3k human- and machine-generated paragraphs of English-language news text.
These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems.
arXiv Detail & Related papers (2021-07-02T22:37:03Z)
- Spelling Correction with Denoising Transformer [0.0]
We present a novel method of performing spelling correction on short input strings, such as search queries or individual words.
At its core lies a procedure for generating artificial typos which closely follow the error patterns manifested by humans.
This procedure is used to train the production spelling correction model based on a transformer architecture.
arXiv Detail & Related papers (2021-05-12T21:35:18Z)
- Vartani Spellcheck -- Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance [3.0422254248414276]
Vartani Spellcheck is a context-sensitive approach for spelling correction of Hindi text.
With an accuracy of 81%, the results show a significant improvement over some of the previously established context-sensitive error correction mechanisms for Hindi.
arXiv Detail & Related papers (2020-12-14T15:49:54Z)
- NeuSpell: A Neural Spelling Correction Toolkit [88.79419580807519]
NeuSpell is an open-source toolkit for spelling correction in English.
It comprises ten different models, and benchmarks them on misspellings from multiple sources.
We train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings.
arXiv Detail & Related papers (2020-10-21T15:53:29Z)
- A Large Multi-Target Dataset of Common Bengali Handwritten Graphemes [1.009810782568186]
We propose a labeling scheme that makes segmentation inside alpha-syllabary words linear.
The dataset contains 411k curated samples of 1295 unique commonly used Bengali graphemes.
The dataset is open-sourced as a part of a public Handwritten Grapheme Classification Challenge on Kaggle.
arXiv Detail & Related papers (2020-10-01T01:51:45Z)
- Development of POS tagger for English-Bengali Code-Mixed data [14.298803822659934]
We have built a system that can POS tag English-Bengali code-mixed data where the Bengali words were written in Roman script.
Our system was checked using 100 manually POS tagged code-mixed sentences and it returned an accuracy of 75.29%.
arXiv Detail & Related papers (2020-07-29T03:42:07Z)
- On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)