Related papers: BDA: Bangla Text Data Augmentation Framework

URL: http://arxiv.org/abs/2412.08753v2
Date: Thu, 26 Dec 2024 18:50:10 GMT
Title: BDA: Bangla Text Data Augmentation Framework
Authors: Md. Tariquzzaman, Audwit Nafi Anam, Naimul Haque, Mohsinul Kabir, Hasan Mahmud, Md Kamrul Hasan
Abstract summary: In resource-limited fields where high-quality data is scarce, augmentation plays a crucial role in increasing the volume of training data. This paper introduces a Bangla Text Data Augmentation Framework that uses both pre-trained models and rule-based methods to create new variants of the text.
Score: 3.639885019250394
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Data augmentation involves generating synthetic samples that resemble those in a given dataset. In resource-limited fields where high-quality data is scarce, augmentation plays a crucial role in increasing the volume of training data. This paper introduces a Bangla Text Data Augmentation (BDA) Framework that uses both pre-trained models and rule-based methods to create new variants of the text. A filtering process is included to ensure that the new text keeps the same meaning as the original while also adding variety in the words used. We conduct a comprehensive evaluation of the framework's effectiveness in Bangla text classification tasks. Our framework achieved significant improvement in F1 scores across five distinct datasets, delivering performance equivalent to models trained on 100% of the data while utilizing only 50% of the training dataset. Additionally, we explore the impact of data scarcity by progressively reducing the training data and augmenting it through BDA, resulting in notable F1 score enhancements. The study offers a thorough examination of BDA's performance, identifying key factors for optimal results and addressing its limitations through detailed analysis.

Related papers

Reducing and Exploiting Data Augmentation Noise through Meta Reweighting Contrastive Learning for Text Classification [3.9889306957591755]
We propose a novel framework to boost deep learning models' performance given augmented data/samples in text classification tasks. We propose novel weight-dependent enqueue and dequeue algorithms to utilize augmented samples' weight/quality information effectively. Our framework achieves an average of 1.6%, up to 4.3% absolute improvement on Text-CNN encoders and an average of 1.4%, up to 4.4% absolute improvement on RoBERTa-base encoders.
arXiv Detail & Related papers (2024-09-26T02:19:13Z)
On Evaluation Protocols for Data Augmentation in a Limited Data Scenario [11.09784120582206]
We show that classical data augmentation (which modifies sentences) is simply a way of performing better fine-tuning. We further show that zero- and few-shot DA via conversational agents such as ChatGPT or LLama2 can increase performance.
arXiv Detail & Related papers (2024-02-22T16:42:37Z)
Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries. Experimental results show that our method improves consistently over existing methods. Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z)
Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences. We formulate each task as a sequence-to-sequence problem and perform multi-task training. We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
Faithful Low-Resource Data-to-Text Generation through Cycle Training [14.375070014155817]
Methods to generate text from structured data have advanced significantly in recent years. Cycle training uses two models which are inverses of each other. We show that cycle training achieves nearly the same performance as fully supervised approaches.
arXiv Detail & Related papers (2023-05-24T06:44:42Z)
Data Augmentation using Transformers and Similarity Measures for Improving Arabic Text Classification [0.0]
We propose a new Arabic DA method that employs a recent, powerful modeling technique, namely AraGPT-2. The generated sentences are evaluated in terms of context, semantics, diversity, and novelty using the Euclidean, cosine, Jaccard, and BLEU distances. The experiments were conducted on four Arabic sentiment datasets: AraSarcasm, ASTD, ATT, and MOVIE.
arXiv Detail & Related papers (2022-12-28T16:38:43Z)
Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation. Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
Retrieval Enhanced Data Augmentation for Question Answering on Privacy Policies [74.01792675564218]
We develop a data augmentation framework based on ensembling retriever models that captures relevant text segments from unlabeled policy documents. To improve the diversity and quality of the augmented data, we leverage multiple pre-trained language models (LMs) and cascade them with noise reduction filter models. Using our augmented data on the PrivacyQA benchmark, we elevate the existing baseline by a large margin (10% F1) and achieve a new state-of-the-art F1 score of 50%.
arXiv Detail & Related papers (2022-04-19T15:45:23Z)
CoDA: Contrast-enhanced and Diversity-promoting Data Augmentation for Natural Language Understanding [67.61357003974153]
We propose a novel data augmentation framework dubbed CoDA. CoDA synthesizes diverse and informative augmented examples by integrating multiple transformations organically. A contrastive regularization objective is introduced to capture the global relationship among all the data samples.
arXiv Detail & Related papers (2020-10-16T23:57:03Z)
Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training. We experimentally verify that the new dataset can significantly improve the ability of the learned FER model. We also propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)

This list is automatically generated from the titles and abstracts of the papers on this site.