Leveraging LLMs for Bangla Grammar Error Correction: Error Categorization, Synthetic Data, and Model Evaluation
- URL: http://arxiv.org/abs/2406.14284v2
- Date: Thu, 05 Jun 2025 14:17:05 GMT
- Title: Leveraging LLMs for Bangla Grammar Error Correction: Error Categorization, Synthetic Data, and Model Evaluation
- Authors: Pramit Bhattacharyya, Arnab Bhattacharya
- Abstract summary: Although Bangla is the fifth most-spoken language globally, Grammatical Error Correction (GEC) for it remains underdeveloped. We first develop an extensive categorization of 12 error classes in Bangla and survey native Bangla speakers to collect real-world errors. We then devise a rule-based noise injection method to create grammatically incorrect sentences corresponding to correct ones. The resulting dataset is used to instruction-tune LLMs for the task of GEC in Bangla.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) perform exceedingly well on Natural Language Understanding (NLU) tasks in many languages, including English. However, although Bangla is the fifth most-spoken language globally, Grammatical Error Correction (GEC) for it remains underdeveloped. In this work, we investigate how LLMs can be leveraged to improve Bangla GEC. To this end, we first develop an extensive categorization of 12 error classes in Bangla and survey native Bangla speakers to collect real-world errors. We next devise a rule-based noise injection method to create grammatically incorrect sentences corresponding to correct ones. The Vaiyakarana dataset thus created consists of 567,422 sentences, of which 227,119 are erroneous. This dataset is then used to instruction-tune LLMs for the task of GEC in Bangla. Evaluations show that instruction-tuning with Vaiyakarana improves the GEC performance of LLMs by 3-7 percentage points over the zero-shot setting, and makes them achieve human-like performance in grammatical error identification. Humans, though, remain superior at error correction.
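The rule-based noise injection step maps a correct sentence to an erroneous counterpart tagged with its error class. The Python sketch below illustrates the general idea only; the substitution rules and error-class names are hypothetical stand-ins, not the paper's actual 12-class rule set.

```python
import random

# Hypothetical substitution rules: (error_class, correct_form, corrupted_form).
# The paper's actual 12 error classes and rules differ; these are illustrations.
NOISE_RULES = [
    ("verb_inflection", "করেছি", "করেছে"),  # person-agreement error
    ("homophone",       "কী",     "কি"),     # confusable interrogative/particle
    ("postposition",    "থেকে",   "হতে"),    # register mismatch
]

def inject_noise(sentence, p=0.8):
    """Return (sentence, error_class); error_class is None if left clean."""
    applicable = [r for r in NOISE_RULES if r[1] in sentence]
    if applicable and random.random() < p:
        err_class, correct, corrupt = random.choice(applicable)
        return sentence.replace(correct, corrupt, 1), err_class
    return sentence, None  # the dataset also keeps correct sentences

correct_sentence = "আমি কাজটি করেছি।"
noisy, err = inject_noise(correct_sentence)
print(noisy, "| error class:", err)
```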
Related papers
- Performance Evaluation of Large Language Models in Bangla Consumer Health Query Summarization [1.2289361708127877]
This study investigates the zero-shot performance of nine advanced large language models (LLMs) on Bangla consumer health query summarization. We benchmarked these LLMs using ROUGE metrics against Bangla T5, a fine-tuned state-of-the-art model. The results demonstrate that zero-shot LLMs can rival fine-tuned models, achieving high-quality summaries even without task-specific training.
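A minimal sketch of this kind of ROUGE benchmarking, assuming a recent version of the rouge_score package (its default tokenizer drops non-Latin characters, so Bangla text needs a custom tokenizer); the sentence pair is an illustrative placeholder:

```python
from rouge_score import rouge_scorer

class WhitespaceTokenizer:
    # Simplifying assumption: whitespace tokenization for Bangla text.
    def tokenize(self, text):
        return text.split()

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL"], tokenizer=WhitespaceTokenizer()
)

reference = "রোগীর জ্বর ও মাথাব্যথার কারণ জানতে চাওয়া হয়েছে।"  # gold summary
candidate = "রোগী জ্বর ও মাথাব্যথার কারণ জানতে চান।"            # LLM output
scores = scorer.score(reference, candidate)
print({name: round(s.fmeasure, 3) for name, s in scores.items()})
```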
arXiv Detail & Related papers (2025-05-08T09:06:28Z) - Bridging Dialects: Translating Standard Bangla to Regional Variants Using Neural Models [1.472830326343432]
The work is motivated by the need to preserve linguistic diversity and improve communication among dialect speakers.
The models were fine-tuned using the "Vashantor" dataset, containing 32,500 sentences across various dialects.
BanglaT5 demonstrated superior performance, with a character error rate (CER) of 12.3% and a word error rate (WER) of 15.7%, highlighting its effectiveness in capturing dialectal nuances.
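For reference, CER and WER of the kind reported above can be computed with the jiwer library; the sentence pair below is an illustrative placeholder, not taken from the Vashantor dataset.

```python
import jiwer

reference  = "আমি ভাত খাই।"   # gold translation
hypothesis = "আমি ভাত খায়।"  # model output; one word differs
print("WER:", jiwer.wer(reference, hypothesis))  # word error rate
print("CER:", jiwer.cer(reference, hypothesis))  # character error rate
```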
arXiv Detail & Related papers (2025-01-10T06:50:51Z) - Bangla Grammatical Error Detection Leveraging Transformer-based Token Classification [0.0]
We study the development of an automated grammar checker in Bangla, the seventh most-spoken language in the world.
Our approach casts the task as a token classification problem and utilizes state-of-the-art transformer-based models.
Our system is evaluated on a dataset consisting of over 25,000 texts from various sources.
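A minimal sketch of error detection as token classification is shown below; the checkpoint name is a hypothetical placeholder, not the paper's released model, and the per-token OK/ERROR label scheme is an assumption made for illustration.

```python
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

MODEL = "your-org/bangla-ged"  # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL)

sentence = "আমরা ভাত খায়।"  # contains a subject-verb agreement error
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, seq_len, num_labels)
predictions = logits.argmax(dim=-1)[0].tolist()
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
for token, label_id in zip(tokens, predictions):
    print(token, model.config.id2label[label_id])  # e.g. OK / ERROR per token
```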
arXiv Detail & Related papers (2024-11-13T05:22:45Z) - Tibyan Corpus: Balanced and Comprehensive Error Coverage Corpus Using ChatGPT for Arabic Grammatical Error Correction [0.32885740436059047]
This study aims to develop an Arabic corpus called "Tibyan" for grammatical error correction using ChatGPT.
ChatGPT is used as a data-augmentation tool, operating on pairs that match an Arabic sentence containing grammatical errors with an error-free sentence extracted from Arabic books.
Our corpus contains 49 error types grouped into seven categories: orthography, morphology, syntax, semantics, punctuation, merge, and split.
arXiv Detail & Related papers (2024-11-07T10:17:40Z) - How Ready Are Generative Pre-trained Large Language Models for Explaining Bengali Grammatical Errors? [0.4857223913212445]
Grammatical error correction (GEC) tools, powered by advanced generative artificial intelligence (AI), competently correct linguistic inaccuracies in user input.
However, they often fall short in providing essential natural language explanations.
For low-resource languages such as Bengali, grammatical error explanation (GEE) systems should not only correct sentences but also provide explanations for the errors.
arXiv Detail & Related papers (2024-05-27T15:56:45Z) - Hire a Linguist!: Learning Endangered Languages with In-Context Linguistic Descriptions [49.97641297850361]
LINGOLLM is a training-free approach that enables an LLM to process unseen languages that scarcely occur in its pre-training data.
We implement LINGOLLM on top of two models, GPT-4 and Mixtral, and evaluate their performance on 5 tasks across 8 endangered or low-resource languages.
Our results show that LINGOLLM elevates translation capability from GPT-4's 0 to 10.5 BLEU for 10 language directions.
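A BLEU score of this kind can be reproduced with the sacrebleu library; the sentence pair below is an illustrative placeholder.

```python
import sacrebleu

hypotheses = ["the cat sat on the mat"]  # system translations
references = [["the cat sat on a mat"]]  # one aligned reference stream
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```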
arXiv Detail & Related papers (2024-02-28T03:44:01Z) - GEE! Grammar Error Explanation with Large Language Models [64.16199533560017]
We propose the task of grammar error explanation, where a system needs to provide one-sentence explanations for each grammatical error in a pair of erroneous and corrected sentences.
We analyze the capability of GPT-4 in grammar error explanation, and find that it only produces explanations for 60.2% of the errors using one-shot prompting.
We develop a two-step pipeline that leverages fine-tuned and prompted large language models to perform structured atomic token edit extraction.
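The atomic token-edit extraction step can be approximated with a word-level diff, as in the sketch below; the paper's actual pipeline uses fine-tuned and prompted LLMs for this step, so difflib is a simplifying stand-in.

```python
import difflib

def extract_edits(erroneous: str, corrected: str):
    """Return atomic token edits between an erroneous sentence and its fix."""
    src, tgt = erroneous.split(), corrected.split()
    matcher = difflib.SequenceMatcher(a=src, b=tgt)
    edits = []
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        if op != "equal":
            edits.append((op, " ".join(src[i1:i2]), " ".join(tgt[j1:j2])))
    return edits

print(extract_edits("She go to school yesterday",
                    "She went to school yesterday"))
# [('replace', 'go', 'went')]
```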
arXiv Detail & Related papers (2023-11-16T02:45:47Z) - Learning From Mistakes Makes LLM Better Reasoner [106.48571828587728]
Large language models (LLMs) have recently exhibited remarkable reasoning capabilities in solving math problems.
This work explores whether LLMs can LEarn from MistAkes (LEMA), akin to the human learning process.
arXiv Detail & Related papers (2023-10-31T17:52:22Z) - BanglaNLP at BLP-2023 Task 2: Benchmarking different Transformer Models for Sentiment Analysis of Bangla Social Media Posts [0.46040036610482665]
This paper presents our submission to Task 2 (Sentiment Analysis of Bangla Social Media Posts) of the BLP Workshop.
Our quantitative results show that transfer learning substantially improves model performance in this low-resource language scenario.
We obtain a micro-F1 of 67.02% on the test set, ranking 21st on the shared-task leaderboard.
arXiv Detail & Related papers (2023-10-13T16:46:38Z) - Crossing the Threshold: Idiomatic Machine Translation through Retrieval Augmentation and Loss Weighting [66.02718577386426]
We provide a simple characterization of idiomatic translation and related issues.
We conduct a synthetic experiment revealing a tipping point at which transformer-based machine translation models correctly default to idiomatic translations.
To improve translation of natural idioms, we introduce two straightforward yet effective techniques.
arXiv Detail & Related papers (2023-10-10T23:47:25Z) - Chinese Spelling Correction as Rephrasing Language Model [63.65217759957206]
We study Chinese Spelling Correction (CSC), which aims to detect and correct the potential spelling errors in a given sentence.
Current state-of-the-art methods regard CSC as a sequence tagging task and fine-tune BERT-based models on sentence pairs.
We propose Rephrasing Language Model (ReLM), where the model is trained to rephrase the entire sentence by infilling additional slots, instead of character-to-character tagging.
arXiv Detail & Related papers (2023-08-17T06:04:28Z) - GrammarGPT: Exploring Open-Source LLMs for Native Chinese Grammatical Error Correction with Supervised Fine-Tuning [46.75740002185691]
We introduce GrammarGPT, an open-source Large Language Model, to explore its potential for native Chinese grammatical error correction.
For grammatical errors with clues, we propose a method that guides ChatGPT to generate ungrammatical sentences by providing those clues.
For grammatical errors without clues, we collected ungrammatical sentences from publicly available websites and manually corrected them.
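A clue-guided prompt of this kind might be constructed as below; the prompt wording and example sentence are illustrative assumptions, not GrammarGPT's actual templates.

```python
def build_prompt(correct_sentence: str, clue: str) -> str:
    """Ask the model to corrupt a correct sentence using a known error clue."""
    return (
        "Rewrite the following Chinese sentence so that it contains a "
        f"grammatical error involving the redundant word '{clue}'. "
        "Change nothing else.\n"
        f"Sentence: {correct_sentence}\n"
        "Ungrammatical rewrite:"
    )

prompt = build_prompt("他昨天去了图书馆。", "了")
print(prompt)  # send to an LLM API of your choice to get the noisy counterpart
```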
arXiv Detail & Related papers (2023-07-26T02:45:38Z) - Democratizing LLMs for Low-Resource Languages by Leveraging their English Dominant Abilities with Linguistically-Diverse Prompts [75.33019401706188]
Large language models (LLMs) are known to perform tasks effectively by simply observing a few exemplars.
We propose to assemble synthetic exemplars from a diverse set of high-resource languages to prompt the LLMs to translate from any language into English.
Our unsupervised prompting method performs on par with supervised few-shot learning in LLMs of different sizes for translations between English and 13 Indic and 21 African low-resource languages.
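A linguistically diverse few-shot prompt in this spirit might be assembled as follows; the exemplar pairs are illustrative placeholders.

```python
# Exemplars come from unrelated high-resource languages; the query is in the
# target low-resource language. Pairs below are illustrative placeholders.
exemplars = [
    ("Bonjour le monde.", "Hello world."),          # French
    ("Hola, ¿cómo estás?", "Hello, how are you?"),  # Spanish
    ("Guten Morgen.", "Good morning."),             # German
]
query = "আজ আবহাওয়া ভালো।"  # Bangla source sentence to translate

prompt = "\n".join(f"Source: {s}\nEnglish: {t}" for s, t in exemplars)
prompt += f"\nSource: {query}\nEnglish:"
print(prompt)
```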
arXiv Detail & Related papers (2023-06-20T08:27:47Z) - Bangla Grammatical Error Detection Using T5 Transformer Model [0.0]
This paper presents a method for detecting grammatical errors in Bangla using a Text-to-Text Transfer Transformer (T5) language model.
The T5 model was designed primarily for translation rather than for this task, so extensive post-processing was necessary to adapt it to error detection.
Our experiments show that the T5 model can achieve a low Levenshtein distance in detecting grammatical errors in Bangla, but post-processing is essential for optimal performance.
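For reference, the Levenshtein distance used in such evaluations is the classic dynamic-programming edit distance; the annotation format below is an illustrative assumption.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

gold = "আমি ভাত $খায়$।"  # gold annotation marking the error span
pred = "আমি ভাত $খায়।"   # model output missing one marker
print(levenshtein(pred, gold))  # 1
```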
arXiv Detail & Related papers (2023-03-19T09:24:48Z) - A Syntax-Guided Grammatical Error Correction Model with Dependency Tree Correction [83.14159143179269]
Grammatical Error Correction (GEC) is a task of detecting and correcting grammatical errors in sentences.
We propose a syntax-guided GEC model (SG-GEC) which adopts the graph attention mechanism to utilize the syntactic knowledge of dependency trees.
We evaluate our model on public benchmarks of GEC task and it achieves competitive results.
arXiv Detail & Related papers (2021-11-05T07:07:48Z) - LM-Critic: Language Models for Unsupervised Grammatical Error Correction [128.9174409251852]
We show how to leverage a pretrained language model (LM) to define an LM-Critic, which judges whether a sentence is grammatical.
We apply this LM-Critic and BIFI (Break-It-Fix-It), along with a large set of unlabeled sentences, to bootstrap realistic ungrammatical/grammatical pairs for training a corrector.
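A much-simplified sketch of the critic idea: accept a sentence if no local perturbation of it scores a lower language-model loss. The real LM-Critic uses a richer perturbation set; the deletion-only neighborhood and GPT-2 scorer here are simplifying assumptions.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_loss(sentence: str) -> float:
    """Average token-level cross-entropy under the LM."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        return lm(ids, labels=ids).loss.item()

def is_grammatical(sentence: str) -> bool:
    # Neighborhood: all single-word deletions (a simplifying assumption).
    words = sentence.split()
    neighbors = [" ".join(words[:i] + words[i + 1:]) for i in range(len(words))]
    base = lm_loss(sentence)
    return all(lm_loss(n) >= base for n in neighbors if n)

print(is_grammatical("She went to school yesterday ."))
```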
arXiv Detail & Related papers (2021-09-14T17:06:43Z) - On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z) - Automatic Extraction of Bengali Root Verbs using Paninian Grammar [0.0]
The proposed system uses the tense, person, and morphological inflections of verbs to find their root forms.
The output achieves an accuracy of 98%, as verified by a linguistic expert.
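In its simplest form, inflection stripping of this sort reduces to longest-suffix matching, as sketched below; the suffix list is a hypothetical illustration, not the paper's Paninian rule set.

```python
# Hypothetical Bengali verb-inflection suffixes; the paper's actual rules,
# based on tense and person, are far more elaborate.
SUFFIXES = sorted(["ছিলাম", "েছি", "বে", "ছে", "ি"], key=len, reverse=True)

def root_verb(inflected: str) -> str:
    for suffix in SUFFIXES:  # try longest suffixes first
        if inflected.endswith(suffix):
            return inflected[: -len(suffix)]
    return inflected  # already a root form

print(root_verb("করেছি"))  # -> "কর" (root of "to do")
```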
arXiv Detail & Related papers (2020-03-31T20:22:10Z)