Investigating Masking-based Data Generation in Language Models
- URL: http://arxiv.org/abs/2307.00008v1
- Date: Fri, 16 Jun 2023 16:48:27 GMT
- Title: Investigating Masking-based Data Generation in Language Models
- Authors: Ed S. Ma
- Abstract summary: A feature of BERT and models with similar architecture is the objective of masked language modeling.
Data augmentation is a data-driven technique widely used in machine learning.
Recent studies have utilized masked language models to generate artificially augmented data for downstream NLP tasks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The current era of natural language processing (NLP) has been defined by the
prominence of pre-trained language models since the advent of BERT. A feature
of BERT and models with similar architecture is the objective of masked
language modeling, in which part of the input is intentionally masked and the
model is trained to predict this piece of masked information. Data augmentation
is a data-driven technique widely used in machine learning, including research
areas like computer vision and natural language processing, to improve model
performance by artificially augmenting the training data set through designated
techniques. Masked language modeling (MLM), an essential training feature of
BERT, has introduced a novel approach to effective pre-training of
Transformer-based models for natural language processing tasks. Recent studies
have utilized masked language models to generate artificially augmented data
for downstream NLP tasks. Experimental results show that mask-based data
augmentation provides a simple but efficient approach to improving model
performance. In this paper, we explore and discuss the broader utilization of
these MLM-based data augmentation methods.
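To make the recipe described in the abstract concrete, the following is a minimal sketch of mask-based data augmentation using the Hugging Face `transformers` fill-mask pipeline. It illustrates the general approach rather than the specific procedure evaluated in the paper; the model (`bert-base-uncased`), masking rate, sampling strategy, and example sentence are assumptions made for this example.

```python
# Minimal sketch of MLM-based data augmentation with a fill-mask pipeline.
# Illustrative only: model choice, masking rate, and sampling strategy are
# assumptions, not the procedure evaluated in the paper.
import random

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT


def augment(sentence: str, mask_rate: float = 0.15, num_variants: int = 3):
    """Create augmented variants by masking random words and letting the MLM refill them."""
    words = sentence.split()
    variants = []
    for _ in range(num_variants):
        tokens = list(words)
        n_mask = max(1, int(len(tokens) * mask_rate))  # mask at least one word
        for i in random.sample(range(len(tokens)), n_mask):
            original = tokens[i]
            tokens[i] = MASK  # keep exactly one mask in the text at a time
            candidates = fill_mask(" ".join(tokens), top_k=5)
            choice = random.choice(candidates)["token_str"].strip()
            # Fall back to the original word if the sampled piece is a subword fragment.
            tokens[i] = choice if choice and not choice.startswith("##") else original
        variants.append(" ".join(tokens))
    return variants


print(augment("the movie was surprisingly good and the acting was strong"))
```

Sampling from the top-k predictions rather than always taking the single best token keeps the variants diverse, which is the point of augmentation; the generated sentences are then added to the training set alongside the originals.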
Related papers
- Deep Learning and Machine Learning -- Natural Language Processing: From Theory to Application [17.367710635990083]
We focus on natural language processing (NLP) and the role of large language models (LLMs).
This paper discusses advanced data preprocessing techniques and the use of frameworks like Hugging Face for implementing transformer-based models.
It highlights challenges such as handling multilingual data, reducing bias, and ensuring model robustness.
arXiv Detail & Related papers (2024-10-30T09:35:35Z)
- Unsupervised Data Validation Methods for Efficient Model Training [0.0]
State-of-the-art models in natural language processing (NLP), text-to-speech (TTS), speech-to-text (STT) and vision-language models (VLM) rely heavily on large datasets.
This research explores key areas such as defining "quality data," developing methods for generating appropriate data and enhancing accessibility to model training.
arXiv Detail & Related papers (2024-10-10T13:00:53Z)
- Language Model Adaptation to Specialized Domains through Selective Masking based on Genre and Topical Characteristics [4.9639158834745745]
We introduce an innovative masking approach leveraging genre and topicality information to tailor language models to specialized domains.
Our method incorporates a ranking process that prioritizes words based on their significance, subsequently guiding the masking procedure.
Experiments with continual pre-training in the legal domain underscore the efficacy of our approach on the English-language LegalGLUE benchmark.
arXiv Detail & Related papers (2024-02-19T10:43:27Z)
- Iterative Mask Filling: An Effective Text Augmentation Method Using Masked Language Modeling [0.0]
We propose a novel text augmentation method that leverages the Fill-Mask feature of the transformer-based BERT model.
Our method involves iteratively masking words in a sentence and replacing them with language model predictions; a simplified sketch of this procedure appears after the related-papers list.
Experimental results show that our proposed method significantly improves performance, especially on topic classification datasets.
arXiv Detail & Related papers (2024-01-03T16:47:13Z)
- Masked Autoencoding for Scalable and Generalizable Decision Making [93.84855114717062]
MaskDP is a simple and scalable self-supervised pretraining method for reinforcement learning and behavioral cloning.
We find that a MaskDP model gains the capability of zero-shot transfer to new BC tasks, such as single and multiple goal reaching.
arXiv Detail & Related papers (2022-11-23T07:04:41Z)
- Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey [67.82942975834924]
Large, pre-trained language models such as BERT have drastically changed the Natural Language Processing (NLP) field.
We present a survey of recent work that uses these large language models to solve NLP tasks via pre-training then fine-tuning, prompting, or text generation approaches.
arXiv Detail & Related papers (2021-11-01T20:08:05Z)
- Reprogramming Language Models for Molecular Representation Learning [65.00999660425731]
We propose Representation Reprogramming via Dictionary Learning (R2DL) for adversarially reprogramming pretrained language models for molecular learning tasks.
The adversarial program learns a linear transformation between a dense source model input space (language data) and a sparse target model input space (e.g., chemical and biological molecule data) using a k-SVD solver.
R2DL achieves the baseline established by state-of-the-art toxicity prediction models trained on domain-specific data and outperforms the baseline in a limited training-data setting.
arXiv Detail & Related papers (2020-12-07T05:50:27Z)
- Fine-tuning BERT for Low-Resource Natural Language Understanding via Active Learning [30.5853328612593]
In this work, we explore fine-tuning methods of BERT -- a pre-trained Transformer-based language model.
Our experimental results show an advantage in model performance by maximizing the approximate knowledge gain of the model.
We analyze the benefits of freezing layers of the language model during fine-tuning to reduce the number of trainable parameters.
arXiv Detail & Related papers (2020-12-04T08:34:39Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Exploring Software Naturalness through Neural Language Models [56.1315223210742]
The Software Naturalness hypothesis argues that programming languages can be understood through the same techniques used in natural language processing.
We explore this hypothesis through the use of a pre-trained transformer-based language model to perform code analysis tasks.
arXiv Detail & Related papers (2020-06-22T21:56:14Z)
- UniLMv2: Pseudo-Masked Language Models for Unified Language Model Pre-Training [152.63467944568094]
We propose to pre-train a unified language model for both autoencoding and partially autoregressive language modeling tasks.
Our experiments show that the unified language models pre-trained using PMLM achieve new state-of-the-art results on a wide range of natural language understanding and generation tasks.
arXiv Detail & Related papers (2020-02-28T15:28:49Z)
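The Iterative Mask Filling entry above describes masking words one at a time and substituting the model's predictions. As a rough, assumption-laden sketch using the same Hugging Face fill-mask pipeline (not the authors' implementation), the idea can be approximated as follows:

```python
# Simplified sketch of iterative mask filling: walk through a sentence, mask one
# word at a time, and replace it with the MLM's top prediction before moving on.
# An illustrative approximation of the cited method, not its reference code.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token


def iterative_mask_fill(sentence: str) -> str:
    words = sentence.split()
    for i in range(len(words)):
        original = words[i]
        words[i] = MASK
        predictions = fill_mask(" ".join(words))
        best = predictions[0]["token_str"].strip()
        # Keep the prediction unless it is empty or a subword fragment.
        words[i] = best if best and not best.startswith("##") else original
    return " ".join(words)


print(iterative_mask_fill("the service at the restaurant was very slow"))
```

Because each position is re-predicted in the context of earlier replacements, the output can drift further from the source sentence than single-shot random masking, which the cited paper reports is especially helpful on topic classification datasets.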