Bangla Grammatical Error Detection Leveraging Transformer-based Token Classification
- URL: http://arxiv.org/abs/2411.08344v1
- Date: Wed, 13 Nov 2024 05:22:45 GMT
- Title: Bangla Grammatical Error Detection Leveraging Transformer-based Token Classification
- Authors: Shayekh Bin Islam, Ridwanul Hasan Tanvir, Sihat Afnan
- Abstract summary: We study the development of an automated grammar checker in Bangla, the seventh most spoken language in the world.
Our approach involves breaking the task down into a token classification problem and utilizing state-of-the-art transformer-based models.
Our system is evaluated on a dataset consisting of over 25,000 texts from various sources.
- Abstract: Bangla is the seventh most spoken language in the world by total number of speakers, and yet the development of an automated grammar checker for this language is an understudied problem. Bangla grammatical error detection is the task of detecting sub-strings of a Bangla text that contain grammatical, punctuation, or spelling errors, which is crucial for developing an automated Bangla typing assistant. Our approach involves breaking the task down into a token classification problem and utilizing state-of-the-art transformer-based models. We then combine the outputs of these models and apply rule-based post-processing to generate a more reliable and comprehensive result. Our system is evaluated on a dataset consisting of over 25,000 texts from various sources. Our best model achieves a Levenshtein distance score of 1.04. Finally, we provide a detailed analysis of the different components of our system.
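To make the token-classification framing concrete, the sketch below labels each token of an input text as correct or erroneous with the Hugging Face transformers API. The checkpoint name, the two-label scheme, and the span-decoding logic are illustrative assumptions, not the authors' released system.

```python
# Hedged sketch: tag each token of a sentence as correct (0) or erroneous (1).
# "bert-base-multilingual-cased" and the 2-label scheme are illustrative
# choices, not the checkpoint or label set used in the paper.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL = "bert-base-multilingual-cased"  # assumption: any multilingual encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=2)

def detect_error_spans(text: str) -> list[tuple[int, int]]:
    """Return (start, end) character offsets of tokens predicted as erroneous."""
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt")
    offsets = enc.pop("offset_mapping")[0]
    with torch.no_grad():
        logits = model(**enc).logits  # shape (1, seq_len, 2)
    preds = logits.argmax(dim=-1)[0]
    spans = []
    for (start, end), label in zip(offsets.tolist(), preds.tolist()):
        if label == 1 and end > start:  # skip special tokens (empty offsets)
            spans.append((start, end))
    return spans
```

An untrained classification head produces arbitrary labels; in practice the model would first be fine-tuned on span-annotated Bangla text, and the paper additionally applies ensembling and rule-based post-processing on top of such predictions.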
Related papers
- BanglaNLP at BLP-2023 Task 1: Benchmarking different Transformer Models for Violence Inciting Text Detection in Bengali
This paper presents the system we developed for the shared task on violence-inciting text detection in Bangla.
We explain both the traditional and the recent approaches we used to train our models.
Our proposed system classifies whether a given text contains a threat.
arXiv Detail & Related papers (2023-10-16T19:35:04Z)
- Byte-Level Grammatical Error Correction Using Synthetic and Curated Corpora
Grammatical error correction (GEC) is the task of correcting typos, spelling, punctuation and grammatical issues in text.
We show that a byte-level model enables higher correction quality than a subword approach.
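To make the byte-level idea concrete: such a model operates on raw UTF-8 bytes rather than a learned subword vocabulary, which sidesteps out-of-vocabulary issues for rare misspellings. The snippet below is a generic illustration of the byte-level view of text, not the specific model from the paper.

```python
# Generic illustration of byte-level vs. character-level views of Bangla text;
# a byte-level model consumes raw UTF-8 bytes directly, so even an unseen
# misspelling maps to in-vocabulary symbols (there are only 256 byte values).
text = "আমি ভাত খাই"  # "I eat rice"
chars = list(text)
byte_tokens = list(text.encode("utf-8"))

print(len(chars), "characters ->", len(byte_tokens), "byte tokens")
# Each Bangla character occupies 3 UTF-8 bytes, so the byte sequence is
# roughly 3x longer: the cost of byte-level modeling is longer inputs; the
# benefit is a tiny, closed vocabulary with no out-of-vocabulary tokens.
```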
arXiv Detail & Related papers (2023-05-29T06:35:40Z)
- Bangla Grammatical Error Detection Using T5 Transformer Model
This paper presents a method for detecting grammatical errors in Bangla using a Text-to-Text Transfer Transformer (T5) language model.
The T5 model was primarily designed for translation and is not tailored to this task, so extensive post-processing was necessary to adapt it to error detection.
Our experiments show that the T5 model can achieve low Levenshtein Distance in detecting grammatical errors in Bangla, but post-processing is essential to achieve optimal performance.
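Both this paper and the main work above report performance as a Levenshtein distance between the predicted and the gold error-annotated strings (lower is better). For reference, a standard dynamic-programming implementation of the metric:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the edit distance between a[:i-1] and b[:j]; each pass
    # over b fills the next row of the DP table.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute ca -> cb
            ))
        prev = curr
    return prev[-1]

assert levenshtein("kitten", "sitting") == 3
```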
arXiv Detail & Related papers (2023-03-19T09:24:48Z)
- READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input Noises
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and asks annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
Experimenting with a series of strong pretrained language models as well as robust training methods, we find that these models often suffer significant performance drops on READIN.
arXiv Detail & Related papers (2023-02-14T20:14:39Z)
- A transformer-based spelling error correction framework for Bangla and resource-scarce Indic languages
Spelling error correction is the task of identifying and rectifying misspelled words in texts.
Earlier efforts on spelling error correction in Bangla and resource-scarce Indic languages focused on rule-based, statistical, and machine learning-based methods.
We propose a novel detector-purificator-corrector framework (DPC) based on denoising transformers that addresses these issues.
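The detector-purificator-corrector decomposition can be pictured as a three-stage pipeline in which each stage consumes the previous stage's output. The skeleton below is a schematic sketch of that control flow under assumed interfaces, not the paper's actual models.

```python
# Schematic sketch of a detector -> purificator -> corrector pipeline.
# The three callables stand in for the paper's denoising-transformer
# components; their names and signatures are assumptions for illustration.
from typing import Callable

def dpc_pipeline(
    text: str,
    detector: Callable[[str], list[int]],                # indices of suspicious words
    purificator: Callable[[str, list[int]], list[int]],  # filters false positives
    corrector: Callable[[str, list[int]], str],          # rewrites confirmed errors
) -> str:
    suspects = detector(text)
    confirmed = purificator(text, suspects)
    return corrector(text, confirmed)
```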
arXiv Detail & Related papers (2022-11-07T17:59:05Z)
- Improving Pre-trained Language Models with Syntactic Dependency Prediction Task for Chinese Semantic Error Recognition
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and so complex that even humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z)
- Speaker Embedding-aware Neural Diarization for Flexible Number of Speakers with Textual Information
We propose the speaker embedding-aware neural diarization (SEND) method, which predicts power-set-encoded labels.
Our method achieves a lower diarization error rate than target-speaker voice activity detection.
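Power-set encoding turns multi-label speaker activity (any subset of speakers may talk in a frame) into a single-label classification problem by giving each subset its own class index. A minimal illustration of that encoding, not the SEND code itself:

```python
# Each subset of active speakers maps to one class index via a bitmask,
# so a frame where speakers {0, 2} of 3 are talking becomes class 0b101 = 5.
def encode_powerset(active_speakers: set[int]) -> int:
    return sum(1 << s for s in active_speakers)

def decode_powerset(label: int, num_speakers: int) -> set[int]:
    return {s for s in range(num_speakers) if label & (1 << s)}

assert encode_powerset({0, 2}) == 5
assert decode_powerset(5, num_speakers=3) == {0, 2}
```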
arXiv Detail & Related papers (2021-11-28T12:51:04Z)
- Bangla Text Classification using Transformers
Text classification has been one of the earliest problems in NLP.
In this work, we fine-tune multilingual Transformer models for Bangla text classification tasks.
We obtain state-of-the-art results on six benchmark datasets, improving upon the previous results by 5-29% accuracy across different tasks.
arXiv Detail & Related papers (2020-11-09T14:12:07Z)
- Comparison of Interactive Knowledge Base Spelling Correction Models for Low-Resource Languages
Spelling normalization for low-resource languages is a challenging task because the patterns are hard to predict.
This work compares a neural model and character language models trained with varying amounts of target-language data.
Our usage scenario is interactive correction with nearly zero training examples, improving the models as more data is collected.
arXiv Detail & Related papers (2020-10-20T17:31:07Z)
- On the Robustness of Language Encoders against Grammatical Errors
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected, but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z)
- Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP).
In this paper, we explore the landscape of transfer learning techniques for NLP by introducing a unified framework that converts all text-based language problems into a text-to-text format.
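The unifying trick is that every task, classification included, is expressed as generating a target string from a prefixed input string. A brief sketch with the public t5-small checkpoint; the task prefixes follow the conventions described in the T5 paper:

```python
# Sketch of the text-to-text framing: every task is "string in, string out",
# selected by a task prefix, using the public t5-small checkpoint.
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

for prompt in [
    "translate English to German: The house is wonderful.",
    "cola sentence: The course is jumping well.",  # grammatical acceptability
    "summarize: state authorities dispatched emergency crews tuesday ...",
]:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=40)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
```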
arXiv Detail & Related papers (2019-10-23T17:37:36Z)