Beqi: Revitalize the Senegalese Wolof Language with a Robust Spelling Corrector
- URL: http://arxiv.org/abs/2305.08518v1
- Date: Mon, 15 May 2023 10:28:36 GMT
- Title: Beqi: Revitalize the Senegalese Wolof Language with a Robust Spelling Corrector
- Authors: Derguene Mbaye, Moussa Diallo
- Abstract summary: African languages in particular are still behind and lack automatic processing tools.
We present a way to address the constraint related to the lack of data by generating synthetic data.
We present sequence-to-sequence models using Deep Learning for spelling correction in Wolof.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The progress of Natural Language Processing (NLP), although fast in recent
years, is not at the same pace for all languages. African languages in
particular are still behind and lack automatic processing tools. Some of these
tools are very important for the development of these languages but also have
an important role in many NLP applications. This is particularly the case for
automatic spell checkers. Several approaches have been studied to address this
task and the one modeling spelling correction as a translation task from
misspelled (noisy) text to well-spelled (correct) text shows promising results.
However, this approach requires a parallel corpus of noisy data on the one hand
and correct data on the other hand, whereas Wolof is a low-resource language
and does not have such a corpus. In this paper, we present a way to address the
constraint related to the lack of data by generating synthetic data and we
present sequence-to-sequence models using Deep Learning for spelling correction
in Wolof. We evaluated these models in three different scenarios depending on
the subwording method applied to the data and showed that the latter had a
significant impact on the performance of the models, which opens the way for
future research in Wolof spelling correction.
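The abstract's core idea is to build the missing parallel corpus synthetically: take clean Wolof text and inject artificial misspellings, yielding (noisy, correct) pairs a sequence-to-sequence model can learn to invert. A minimal sketch of such a noising step is shown below; the specific edit operations and rates are illustrative assumptions, not the paper's exact procedure, and the sample sentence is invented.

```python
import random

# Character-level noise operations. The operation set and probability
# are assumptions for illustration, not the paper's actual noise model.
def noise_word(word, p=0.3):
    """Return `word` unchanged, or with one random character-level edit."""
    if len(word) < 2 or random.random() > p:
        return word
    op = random.choice(["delete", "swap", "duplicate"])
    i = random.randrange(len(word) - 1)
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    return word[:i] + word[i] + word[i:]  # duplicate character at position i

def make_parallel_corpus(clean_sentences):
    """Pair each clean sentence with a synthetically misspelled copy."""
    return [(" ".join(noise_word(w) for w in s.split()), s)
            for s in clean_sentences]

corpus = make_parallel_corpus(["dama bëgg jàng wolof"])
noisy, clean = corpus[0]
print(noisy, "->", clean)
```

A seq2seq model trained on such pairs then learns the reverse mapping, from noisy input back to the clean side; the subwording method applied to both sides is what the paper's three evaluation scenarios vary.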
Related papers
- Corpus-Based Approaches to Igbo Diacritic Restoration [0.23552726065717702]
The capacity of computers to process natural languages is increasing because NLP researchers are pushing its boundaries.
Over 95% of the world's 7000 languages are low-resourced for NLP, i.e. they have little or no data, tools, and techniques for NLP work.
We present an overview of diacritic ambiguity and a review of previous diacritic disambiguation approaches on other languages.
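Diacritic restoration as described above can be sketched with a simple corpus-frequency baseline: strip diacritics to form lookup keys, then restore the most frequent attested diacritized form. This is a hedged sketch only; the example words are invented, and corpus-based approaches in practice use richer context than unigram counts.

```python
import unicodedata
from collections import Counter

def strip_diacritics(word):
    # Remove combining marks after canonical (NFD) decomposition.
    return "".join(c for c in unicodedata.normalize("NFD", word)
                   if not unicodedata.combining(c))

def build_restorer(corpus_words):
    """Map each diacritic-stripped key to its most frequent diacritized form."""
    counts = Counter(corpus_words)
    best = {}
    for word, n in counts.items():
        key = strip_diacritics(word)
        if key not in best or n > counts[best[key]]:
            best[key] = word
    return best

# Toy corpus of diacritized words; purely illustrative, not from the paper.
restorer = build_restorer(["akwà", "akwà", "akwá", "ụlọ"])
print(restorer["akwa"])  # most frequent diacritized form of "akwa"
```

The ambiguity the paper discusses shows up exactly where one stripped key (here "akwa") maps to several diacritized candidates, which is why context-aware disambiguation is needed.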
arXiv Detail & Related papers (2026-01-26T11:30:36Z)
- Automatic Correction of Writing Anomalies in Hausa Texts [0.0]
Hausa texts are often characterized by writing anomalies such as incorrect character substitutions and spacing errors.
This paper presents an approach to automatically correct the anomalies by finetuning transformer-based models.
arXiv Detail & Related papers (2025-06-04T10:46:19Z)
- Large corpora and large language models: a replicable method for automating grammatical annotation [0.0]
We introduce a methodological pipeline applied to the case study of formal variation in the English evaluative verb construction 'consider X (as) (to be) Y'.
We reach a model accuracy of over 90% on our held-out test samples with only a small amount of training data.
We discuss the generalisability of our results for a wider range of case studies of grammatical constructions and grammatical variation and change.
arXiv Detail & Related papers (2024-11-18T03:29:48Z)
- Constructing and Expanding Low-Resource and Underrepresented Parallel Datasets for Indonesian Local Languages [0.0]
We introduce Bhinneka Korpus, a multilingual parallel corpus featuring five Indonesian local languages.
Our goal is to enhance access and utilization of these resources, extending their reach within the country.
arXiv Detail & Related papers (2024-04-01T09:24:06Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is the ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Automatic Spell Checker and Correction for Under-represented Spoken Languages: Case Study on Wolof [9.79241237464453]
This paper presents a spell checker and correction tool specifically designed for Wolof, an under-represented spoken language in Africa.
The proposed spell checker leverages a combination of a trie data structure, dynamic programming, and the weighted Levenshtein distance to generate suggestions for misspelled words.
Despite the limited data available for Wolof, the spell checker's performance showed a predictive accuracy of 98.31% and a suggestion accuracy of 93.33%.
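Of the three components this related tool combines (trie, dynamic programming, weighted Levenshtein distance), the weighted distance is the easiest to illustrate: a standard edit-distance DP where substitution costs vary per character pair, so that likely confusions rank closer. A minimal sketch, with a hypothetical weight table; the real tool's weights and integration with the trie are not shown here.

```python
def weighted_levenshtein(source, target, sub_cost=None,
                         insert_cost=1.0, delete_cost=1.0):
    """Edit distance with per-pair substitution weights.

    `sub_cost` maps (source_char, target_char) to a cost; unlisted
    pairs default to 1.0.
    """
    sub_cost = sub_cost or {}
    m, n = len(source), len(target)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + delete_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + insert_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if source[i - 1] == target[j - 1]:
                sub = 0.0
            else:
                sub = sub_cost.get((source[i - 1], target[j - 1]), 1.0)
            d[i][j] = min(d[i - 1][j] + delete_cost,      # delete from source
                          d[i][j - 1] + insert_cost,      # insert into source
                          d[i - 1][j - 1] + sub)          # substitute
    return d[m][n]

# Hypothetical weights: 'e' vs 'ë' is a plausible Wolof confusion,
# so it is made cheaper than an arbitrary substitution.
weights = {("e", "ë"): 0.3, ("ë", "e"): 0.3}
print(weighted_levenshtein("begg", "bëgg", weights))  # 0.3
print(weighted_levenshtein("begg", "bagg", weights))  # 1.0
```

With such weights, the candidate "bëgg" ranks ahead of equally edit-distant but less plausible corrections when generating suggestions for a misspelled word.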
arXiv Detail & Related papers (2023-05-22T04:03:20Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- MuCoT: Multilingual Contrastive Training for Question-Answering in Low-resource Languages [4.433842217026879]
Multi-lingual BERT-based models (mBERT) are often used to transfer knowledge from high-resource languages to low-resource languages.
We augment the QA samples of the target language using translation and transliteration into other languages and use the augmented data to fine-tune an mBERT-based QA model.
Experiments on the Google ChAII dataset show that fine-tuning the mBERT model with translations from the same language family boosts the question-answering performance.
arXiv Detail & Related papers (2022-04-12T13:52:54Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages [81.90356787324481]
Spelling normalization for low resource languages is a challenging task because the patterns are hard to predict.
This work shows a comparison of a neural model and character language models with varying amounts on target language data.
Our usage scenario is interactive correction with nearly zero amounts of training examples, improving models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- Grounded Compositional Outputs for Adaptive Language Modeling [59.02706635250856]
A language model's vocabulary (typically selected before training and permanently fixed afterwards) affects its size.
We propose a fully compositional output embedding layer for language models.
To our knowledge, the result is the first word-level language model with a size that does not depend on the training vocabulary.
arXiv Detail & Related papers (2020-09-24T07:21:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.