Automatic Correction of Writing Anomalies in Hausa Texts
- URL: http://arxiv.org/abs/2506.03820v1
- Date: Wed, 04 Jun 2025 10:46:19 GMT
- Title: Automatic Correction of Writing Anomalies in Hausa Texts
- Authors: Ahmad Mustapha Wali, Sergiu Nisioi,
- Abstract summary: Hausa texts are often characterized by writing anomalies such as incorrect character substitutions and spacing errors.<n>This paper presents an approach to automatically correct the anomalies by finetuning transformer-based models.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hausa texts are often characterized by writing anomalies such as incorrect character substitutions and spacing errors, which sometimes hinder natural language processing (NLP) applications. This paper presents an approach to automatically correct the anomalies by finetuning transformer-based models. Using a corpus gathered from several public sources, we created a large-scale parallel dataset of over 450,000 noisy-clean Hausa sentence pairs by introducing synthetically generated noise, fine-tuned to mimic realistic writing errors. Moreover, we adapted several multilingual and African language-focused models, including M2M100, AfriTEVA, mBART, and Opus-MT variants for this correction task using SentencePiece tokenization. Our experimental results demonstrate significant increases in F1, BLEU and METEOR scores, as well as reductions in Character Error Rate (CER) and Word Error Rate (WER). This research provides a robust methodology, a publicly available dataset, and effective models to improve Hausa text quality, thereby advancing NLP capabilities for the language and offering transferable insights for other low-resource languages.
Related papers
- Detecting Spelling and Grammatical Anomalies in Russian Poetry Texts [0.0]
The quality of natural language texts in fine-tuning datasets plays a critical role in the performance of generative models.<n>We propose the use of automated linguistic anomaly detection to identify and filter out low-quality texts from training datasets for creative models.<n>Our work aims to empower the community with tools and insights to improve the quality of training datasets for generative models in creative domains.
arXiv Detail & Related papers (2025-05-07T15:27:59Z) - Tgea: An error-annotated dataset and benchmark tasks for text generation from pretrained language models [57.758735361535486]
TGEA is an error-annotated dataset for text generation from pretrained language models (PLMs)<n>We create an error taxonomy to cover 24 types of errors occurring in PLM-generated sentences.<n>This is the first dataset with comprehensive annotations for PLM-generated texts.
arXiv Detail & Related papers (2025-03-06T09:14:02Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation.<n>We introduce novel methodologies and datasets to overcome these challenges.<n>We propose MhBART, an encoder-decoder model designed to emulate human writing style.<n>We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - Fine-tuning Language Models for Factuality [96.5203774943198]
Large pre-trained language models (LLMs) have led to their widespread use, sometimes even as a replacement for traditional search engines.
Yet language models are prone to making convincing but factually inaccurate claims, often referred to as 'hallucinations'
In this work, we fine-tune language models to be more factual, without human labeling.
arXiv Detail & Related papers (2023-11-14T18:59:15Z) - The Devil is in the Errors: Leveraging Large Language Models for
Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
arXiv Detail & Related papers (2023-08-14T17:17:21Z) - Robustification of Multilingual Language Models to Real-world Noise with
Robust Contrastive Pretraining [14.087882550564169]
We assess the robustness of neural models on noisy data and suggest improvements are limited to the English language.
To benchmark the performance of pretrained multilingual models, we construct noisy datasets covering five languages and four NLP tasks.
We propose Robust Contrastive Pretraining (RCP) to boost the zero-shot cross-lingual robustness of multilingual pretrained models.
arXiv Detail & Related papers (2022-10-10T15:40:43Z) - Improving Speech Recognition for Indic Languages using Language Model [0.0]
We study the effect of applying a language model (LM) on the output of Automatic Speech Recognition (ASR) systems for Indic languages.
We fine-tune wav2vec $2.0$ models for $18$ Indic languages and adjust the results with language models trained on text derived from a variety of sources.
arXiv Detail & Related papers (2022-03-30T18:22:12Z) - Understanding Model Robustness to User-generated Noisy Texts [2.958690090551675]
In NLP, model performance often deteriorates with naturally occurring noise, such as spelling errors.
We propose to model the errors statistically from grammatical-error-correction corpora.
arXiv Detail & Related papers (2021-10-14T14:54:52Z) - TextFlint: Unified Multilingual Robustness Evaluation Toolkit for
Natural Language Processing [73.16475763422446]
We propose a multilingual robustness evaluation platform for NLP tasks (TextFlint)
It incorporates universal text transformation, task-specific transformation, adversarial attack, subpopulation, and their combinations to provide comprehensive robustness analysis.
TextFlint generates complete analytical reports as well as targeted augmented data to address the shortcomings of the model's robustness.
arXiv Detail & Related papers (2021-03-21T17:20:38Z) - Improving Translation Robustness with Visual Cues and Error Correction [58.97421756225425]
We introduce the idea of visual context to improve translation robustness against noisy texts.
We also propose a novel error correction training regime by treating error correction as an auxiliary task.
arXiv Detail & Related papers (2021-03-12T15:31:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.