Iterative Mask Filling: An Effective Text Augmentation Method Using
Masked Language Modeling
- URL: http://arxiv.org/abs/2401.01830v1
- Date: Wed, 3 Jan 2024 16:47:13 GMT
- Title: Iterative Mask Filling: An Effective Text Augmentation Method Using
Masked Language Modeling
- Authors: Himmet Toprak Kesgin, Mehmet Fatih Amasyali
- Abstract summary: We propose a novel text augmentation method that leverages the Fill-Mask feature of the transformer-based BERT model.
Our method involves iteratively masking words in a sentence and replacing them with language model predictions.
Experimental results show that our proposed method significantly improves performance, especially on topic classification datasets.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data augmentation is an effective technique for improving the performance of
machine learning models. However, it has not been explored as extensively in
natural language processing (NLP) as it has in computer vision. In this paper,
we propose a novel text augmentation method that leverages the Fill-Mask
feature of the transformer-based BERT model. Our method involves iteratively
masking words in a sentence and replacing them with language model predictions.
We have tested our proposed method on various NLP tasks and found it to be
effective in many cases. Our results are presented along with a comparison to
existing augmentation methods. Experimental results show that our proposed
method significantly improves performance, especially on topic classification
datasets.
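As a concrete illustration of the iterative procedure described in the abstract, the sketch below masks one word at a time with the Hugging Face fill-mask pipeline and substitutes the model's top prediction. The model name, the whitespace tokenization, and keeping only the top-ranked candidate are illustrative assumptions, not necessarily the authors' exact configuration.

    # Minimal sketch of iterative mask filling with a BERT fill-mask pipeline.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model="bert-base-uncased")
    mask_token = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT

    def iterative_mask_fill(sentence: str) -> str:
        """Mask one word at a time and replace it with the model's top prediction."""
        words = sentence.split()                       # simple whitespace tokenization (assumption)
        for i in range(len(words)):
            words[i] = mask_token                      # mask the i-th position
            predictions = fill_mask(" ".join(words))   # candidate fillers, highest score first
            words[i] = predictions[0]["token_str"]     # keep the top-1 prediction
        return " ".join(words)

    print(iterative_mask_fill("data augmentation improves the performance of machine learning models"))

Each pass can yield a new sentence that preserves most of the original meaning while varying its wording, which serves as the augmented sample.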
Related papers
- A Simple yet Efficient Ensemble Approach for AI-generated Text Detection [0.5840089113969194]
Large Language Models (LLMs) have demonstrated remarkable capabilities in generating text that closely resembles human writing.
It is essential to build automated approaches capable of distinguishing between artificially generated text and human-authored text.
We propose a simple yet efficient solution by ensembling predictions from multiple constituent LLMs.
arXiv Detail & Related papers (2023-11-06T13:11:02Z) - Investigating Masking-based Data Generation in Language Models [0.0]
A feature of BERT and models with similar architecture is the objective of masked language modeling.
Data augmentation is a data-driven technique widely used in machine learning.
Recent studies have utilized masked language models to generate artificially augmented data for NLP downstream tasks.
arXiv Detail & Related papers (2023-06-16T16:48:27Z) - Improving Pre-trained Language Model Fine-tuning with Noise Stability
Regularization [94.4409074435894]
We propose a novel and effective fine-tuning framework named Layerwise Noise Stability Regularization (LNSR).
Specifically, we propose to inject standard Gaussian noise and regularize the hidden representations of the fine-tuned model.
We demonstrate the advantages of the proposed method over other state-of-the-art algorithms including L2-SP, Mixout and SMART.
arXiv Detail & Related papers (2022-06-12T04:42:49Z) - Sample Efficient Approaches for Idiomaticity Detection [6.481818246474555]
- Sample Efficient Approaches for Idiomaticity Detection [6.481818246474555]
This work explores sample efficient methods of idiomaticity detection.
In particular, we study the impact of Pattern Exploit Training (PET), a few-shot method of classification, and BERTRAM, an efficient method of creating contextual embeddings.
Our experiments show that while PET improves performance on English, it is much less effective on Portuguese and Galician, leading to overall performance about on par with vanilla mBERT.
arXiv Detail & Related papers (2022-05-23T13:46:35Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - Data Augmentation for Voice-Assistant NLU using BERT-based
Interchangeable Rephrase [39.09474362100266]
We introduce a data augmentation technique based on byte pair encoding and a BERT-like self-attention model to boost performance on spoken language understanding tasks.
We show our method performs strongly on domain and intent classification tasks for a voice assistant and in a user-study focused on utterance naturalness and semantic similarity.
arXiv Detail & Related papers (2021-04-16T17:53:58Z) - Neural Mask Generator: Learning to Generate Adaptive Word Maskings for
Language Model Adaptation [63.195935452646815]
We propose a method to automatically generate domain- and task-adaptive maskings of the given text for self-supervised pre-training.
We present a novel reinforcement learning-based framework which learns the masking policy.
We validate our Neural Mask Generator (NMG) on several question answering and text classification datasets.
arXiv Detail & Related papers (2020-10-06T13:27:01Z) - Syntax-aware Data Augmentation for Neural Machine Translation [76.99198797021454]
We propose a novel data augmentation strategy for neural machine translation.
We set sentence-specific probabilities for word selection by considering the words' roles in the sentence.
Our proposed method is evaluated on WMT14 English-to-German dataset and IWSLT14 German-to-English dataset.
arXiv Detail & Related papers (2020-04-29T13:45:30Z) - Masking as an Efficient Alternative to Finetuning for Pretrained
Language Models [49.64561153284428]
We learn selective binary masks for pretrained weights in lieu of modifying them through finetuning.
In intrinsic evaluations, we show that representations computed by masked language models encode information necessary for solving downstream tasks.
arXiv Detail & Related papers (2020-04-26T15:03:47Z) - Investigating the Effectiveness of Representations Based on Pretrained
- Investigating the Effectiveness of Representations Based on Pretrained
Transformer-based Language Models in Active Learning for Labelling Text
Datasets [4.7718339202518685]
The mechanism used to represent text documents during active learning has a significant influence on how effective the process will be.
This paper describes a comprehensive evaluation of the effectiveness of representations based on pre-trained neural network models for active learning.
Our experiments show that the limited label information acquired in active learning can not only be used to train a classifier but can also adaptively improve the embeddings generated by BERT-like language models.
arXiv Detail & Related papers (2020-04-21T02:37:44Z) - GridMask Data Augmentation [76.79300104795966]
We propose a novel data augmentation method, GridMask, in this paper.
It utilizes information removal to achieve state-of-the-art results in a variety of computer vision tasks.
arXiv Detail & Related papers (2020-01-13T07:27:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.