Automatic Spell Checker and Correction for Under-represented Spoken
Languages: Case Study on Wolof
- URL: http://arxiv.org/abs/2305.12694v1
- Date: Mon, 22 May 2023 04:03:20 GMT
- Title: Automatic Spell Checker and Correction for Under-represented Spoken
Languages: Case Study on Wolof
- Authors: Thierno Ibrahima Cissé and Fatiha Sadat
- Abstract summary: This paper presents a spell checker and correction tool specifically designed for Wolof, an under-represented spoken language in Africa.
The proposed spell checker leverages a combination of a trie data structure, dynamic programming, and the weighted Levenshtein distance to generate suggestions for misspelled words.
Despite the limited data available for Wolof, the spell checker's performance showed a predictive accuracy of 98.31% and a suggestion accuracy of 93.33%.
- Score: 9.79241237464453
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a spell checker and correction tool specifically designed
for Wolof, an under-represented spoken language in Africa. The proposed spell
checker leverages a combination of a trie data structure, dynamic programming,
and the weighted Levenshtein distance to generate suggestions for misspelled
words. We created novel linguistic resources for Wolof, such as a lexicon and a
corpus of misspelled words, using a semi-automatic approach that combines
manual and automatic annotation methods. Despite the limited data available for
the Wolof language, the spell checker's performance showed a predictive
accuracy of 98.31% and a suggestion accuracy of 93.33%. Our primary focus
remains the revitalization and preservation of Wolof as an Indigenous and
spoken language in Africa, which motivates our efforts to develop novel linguistic
resources. This work represents a valuable contribution to the growth of
computational tools and resources for the Wolof language and provides a strong
foundation for future studies in the automatic spell checking and correction
field.
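The combination named in the abstract (a trie, dynamic programming, and a weighted Levenshtein distance) can be illustrated with a minimal sketch. This is not the authors' implementation: the class and function names, the five-word toy lexicon, the `cheap_pairs` parameter, and the weights (0.5 for a substitution between listed character pairs, 1.0 for every other edit) are all assumptions made for illustration. The sketch walks the trie once, carrying one row of the edit-distance table per trie node, and prunes any branch whose row already exceeds the cost budget.

```python
from typing import Dict, FrozenSet, Iterable, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.word: Optional[str] = None  # set only on nodes that end a lexicon word


class Trie:
    def __init__(self, words: Iterable[str] = ()) -> None:
        self.root = TrieNode()
        for w in words:
            self.insert(w)

    def insert(self, word: str) -> None:
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.word = word

    def __contains__(self, word: str) -> bool:
        node = self.root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.word is not None


def substitution_cost(a: str, b: str, cheap_pairs: FrozenSet[Tuple[str, str]]) -> float:
    # Hypothetical weighting: listed confusable pairs cost 0.5, others 1.0.
    if a == b:
        return 0.0
    if (a, b) in cheap_pairs or (b, a) in cheap_pairs:
        return 0.5
    return 1.0


def suggest(trie: Trie, word: str, max_cost: float = 1.0,
            cheap_pairs: FrozenSet[Tuple[str, str]] = frozenset()) -> List[Tuple[float, str]]:
    """Return (cost, candidate) pairs within max_cost, cheapest first."""
    results: List[Tuple[float, str]] = []
    # Row for the empty prefix: turning "" into word[:i] takes i unit-cost edits.
    first_row = [float(i) for i in range(len(word) + 1)]

    def walk(node: TrieNode, ch: str, prev_row: List[float]) -> None:
        # Dynamic programming: build this node's row from its parent's row.
        row = [prev_row[0] + 1.0]
        for i in range(1, len(word) + 1):
            row.append(min(
                row[i - 1] + 1.0,       # skip a character of the typed word
                prev_row[i] + 1.0,      # skip a character of the lexicon word
                prev_row[i - 1] + substitution_cost(word[i - 1], ch, cheap_pairs),
            ))
        if node.word is not None and row[-1] <= max_cost:
            results.append((row[-1], node.word))
        # Row minima never decrease with depth, so this pruning is safe.
        if min(row) <= max_cost:
            for next_ch, child in node.children.items():
                walk(child, next_ch, row)

    for ch, child in trie.root.children.items():
        walk(child, ch, first_row)
    return sorted(results)


# Toy lexicon for illustration only; a real lexicon would be far larger.
lexicon = Trie(["jàmm", "jàng", "xale", "kër", "ndox"])
diacritic_pairs = frozenset({("a", "à")})
print("jamm" in lexicon)                                  # False: flagged as misspelled
print(suggest(lexicon, "jamm", 1.0, diacritic_pairs))     # [(0.5, 'jàmm')]
```

Sharing one table row per trie node means that words with a common prefix reuse the same partial computation, which is the usual reason for pairing a trie with edit-distance search instead of comparing the input against every lexicon entry separately.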
Related papers
- Large corpora and large language models: a replicable method for automating grammatical annotation [0.0]
We introduce a methodological pipeline applied to the case study of formal variation in the English evaluative verb construction 'consider X (as) (to be) Y'.
We reach a model accuracy of over 90% on our held-out test samples with only a small amount of training data.
We discuss the generalisability of our results for a wider range of case studies of grammatical constructions and grammatical variation and change.
arXiv Detail & Related papers (2024-11-18T03:29:48Z)
- Neural spell-checker: Beyond words with synthetic data generation [0.0]
Spell-checkers are valuable tools that enhance communication by identifying misspelled words in written texts.
Recent improvements in deep learning have opened new opportunities to improve traditional spell-checkers with new functionalities.
We present and compare two new spell-checkers and evaluate them on synthetic, learner, and more general-domain Slovene datasets.
arXiv Detail & Related papers (2024-10-30T23:51:01Z)
- From English-Centric to Effective Bilingual: LLMs with Custom Tokenizers for Underrepresented Languages [0.5706164516481158]
We propose a model-agnostic cost-effective approach to developing bilingual base large language models (LLMs) to support English and any target language.
We performed experiments with three languages, each using a non-Latin script - Ukrainian, Arabic, and Georgian.
arXiv Detail & Related papers (2024-10-24T15:20:54Z)
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z)
- Beqi: Revitalize the Senegalese Wolof Language with a Robust Spelling Corrector [0.40611352512781856]
African languages in particular are still behind and lack automatic processing tools.
We present a way to address the constraint related to the lack of data by generating synthetic data.
We present sequence-to-sequence models using Deep Learning for spelling correction in Wolof.
arXiv Detail & Related papers (2023-05-15T10:28:36Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- Towards Zero-shot Language Modeling [90.80124496312274]
We construct a neural model that is inductively biased towards learning human languages.
We infer this distribution from a sample of typologically diverse training languages.
We harness additional language-specific side information as distant supervision for held-out languages.
arXiv Detail & Related papers (2021-08-06T23:49:18Z)
- Constrained Language Models Yield Few-Shot Semantic Parsers [73.50960967598654]
We explore the use of large pretrained language models as few-shot semantic parsers.
The goal in semantic parsing is to generate a structured meaning representation given a natural language input.
We use language models to paraphrase inputs into a controlled sublanguage resembling English that can be automatically mapped to a target meaning representation.
arXiv Detail & Related papers (2021-04-18T08:13:06Z)
- NeuSpell: A Neural Spelling Correction Toolkit [88.79419580807519]
NeuSpell is an open-source toolkit for spelling correction in English.
It comprises ten different models, and benchmarks them on misspellings from multiple sources.
We train neural models using spelling errors in context, synthetically constructed by reverse engineering isolated misspellings.
arXiv Detail & Related papers (2020-10-21T15:53:29Z)
- Improving Yorùbá Diacritic Restoration [3.301896537513352]
Yorùbá is a widely spoken West African language with a writing system rich in orthographic and tonal diacritics.
Diacritic marks are commonly excluded from electronic texts due to limited device and application support as well as general education on proper usage.
All pre-trained models, datasets and source-code have been released as an open-source project to advance efforts on Yorùbá language technology.
arXiv Detail & Related papers (2020-03-23T22:07:15Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit into the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.