SAS: Self-Augmented Strategy for Language Model Pre-training
- URL: http://arxiv.org/abs/2106.07176v1
- Date: Mon, 14 Jun 2021 05:57:46 GMT
- Title: SAS: Self-Augmented Strategy for Language Model Pre-training
- Authors: Yifei Xu, Jingqiao Zhang, Ru He, Liangzhu Ge, Chao Yang, Cheng Yang,
Ying Nian Wu
- Abstract summary: Most data augmentations in language model pre-training are context-independent.
We propose a self-augmented strategy (SAS) that uses a single forward pass through the model to augment the input data for model training in the next epoch.
Our SAS is able to outperform ELECTRA and other state-of-the-art models on the GLUE tasks with the same or less computation cost.
- Score: 31.69657711092598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The core of a self-supervised learning method for pre-training language
models includes the design of appropriate data augmentation and corresponding
pre-training task(s). Most data augmentations in language model pre-training
are context-independent. The seminal contextualized augmentation recently
proposed by ELECTRA requires a separate generator, which leads to extra
computation cost as well as the challenge in adjusting the capability of its
generator relative to that of the other model component(s). We propose a
self-augmented strategy (SAS) that uses a single forward pass through the model
to augment the input data for model training in the next epoch. Essentially our
strategy eliminates a separate generator network and uses only one network to
generate the data augmentation and undertake two pre-training tasks (the MLM
task and the RTD task) jointly, which naturally avoids the challenge in
adjusting the generator's capability as well as reduces the computation cost.
Additionally, our SAS is a general strategy such that it can seamlessly
incorporate many new techniques emerging recently or in the future, such as the
disentangled attention mechanism recently proposed by the DeBERTa model. Our
experiments show that our SAS is able to outperform ELECTRA and other
state-of-the-art models on the GLUE tasks with the same or less computation
cost.
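The strategy described above can be pictured as a single training step that shares one encoder between the MLM and RTD heads and reuses the same forward pass to sample the replacements that become the next epoch's corrupted input. Below is a minimal PyTorch-style sketch under that reading of the abstract; the `encoder`, `mlm_head`, and `rtd_head` modules, the loss weights, and the choice to keep the same replacement positions across epochs are illustrative assumptions, not the authors' released implementation.
```python
import torch
import torch.nn.functional as F

def sas_step(encoder, mlm_head, rtd_head, original_ids, corrupted_ids,
             replaced_mask, mlm_weight=1.0, rtd_weight=50.0):
    """One SAS-style training step (illustrative sketch only).

    `corrupted_ids` is the input built by the previous epoch's
    self-augmentation: positions marked True in `replaced_mask` hold
    tokens sampled from the model's own MLM predictions.
    """
    # Single forward pass through the one shared encoder.
    hidden = encoder(corrupted_ids)                        # (batch, seq, dim)

    # MLM task: recover the original tokens at the replaced positions.
    mlm_logits = mlm_head(hidden[replaced_mask])           # (n_replaced, vocab)
    mlm_loss = F.cross_entropy(mlm_logits, original_ids[replaced_mask])

    # RTD task: classify every token as original vs. replaced.
    rtd_logits = rtd_head(hidden).squeeze(-1)              # (batch, seq)
    rtd_loss = F.binary_cross_entropy_with_logits(
        rtd_logits, replaced_mask.float())

    # Self-augmentation: sample replacement tokens from the same forward
    # pass to build the corrupted input for the next epoch, so no separate
    # generator network is needed (same positions kept here for simplicity).
    with torch.no_grad():
        sampled = torch.multinomial(
            F.softmax(mlm_logits, dim=-1), num_samples=1).squeeze(-1)
    next_corrupted = original_ids.clone()
    next_corrupted[replaced_mask] = sampled

    # The loss weights are placeholders; ELECTRA-style setups typically
    # up-weight the RTD term.
    return mlm_weight * mlm_loss + rtd_weight * rtd_loss, next_corrupted
```
Because the augmentation is read off the same forward pass that serves the two pre-training losses, there is no generator whose capacity must be tuned against the rest of the model, which is the adjustment problem the abstract attributes to ELECTRA.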
Related papers
- Self-Supervised Radio Pre-training: Toward Foundational Models for Spectrogram Learning [6.1339395157466425]
Foundational deep learning (DL) models are general models trained on diverse and unlabelled datasets.
We introduce Masked Spectrogram Modeling, a novel self-supervised learning approach for pretraining foundational DL models on radio signals.
arXiv Detail & Related papers (2024-11-14T23:56:57Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- EsaCL: Efficient Continual Learning of Sparse Models [10.227171407348326]
A key challenge in the continual learning setting is to efficiently learn a sequence of tasks without forgetting how to perform previously learned tasks.
We propose a new method for efficient continual learning of sparse models (EsaCL) that can automatically prune redundant parameters without adversely impacting the model's predictive power.
arXiv Detail & Related papers (2024-01-11T04:59:44Z)
- Instructed Language Models with Retrievers Are Powerful Entity Linkers [87.16283281290053]
Instructed Generative Entity Linker (INSGENEL) is the first approach that enables causal language models to perform entity linking over knowledge bases.
INSGENEL outperforms previous generative alternatives with +6.8 F1 points gain on average.
arXiv Detail & Related papers (2023-11-06T16:38:51Z)
- How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z)
- Fast-ELECTRA for Efficient Pre-training [83.29484808667532]
ELECTRA pre-trains language models by detecting tokens in a sequence that have been replaced by an auxiliary model.
We propose Fast-ELECTRA, which leverages an existing language model as the auxiliary model.
Our approach rivals the performance of state-of-the-art ELECTRA-style pre-training methods while eliminating the computation and memory cost of jointly training the auxiliary model (see the sketch after this list).
arXiv Detail & Related papers (2023-10-11T09:55:46Z)
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
- Maximizing Efficiency of Language Model Pre-training for Learning Representation [6.518508607788086]
ELECTRA is a novel approach for improving the compute efficiency of pre-trained language models.
Our work proposes an adaptive early exit strategy to maximize the efficiency of the pre-training process.
arXiv Detail & Related papers (2021-10-13T10:25:06Z)
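As a companion to the Fast-ELECTRA entry above, here is a minimal sketch of ELECTRA-style replaced-token detection in which the auxiliary model is an existing, frozen language model rather than a jointly trained generator. All module names, shapes, and the masking convention are placeholder assumptions, not that paper's actual code.
```python
import torch
import torch.nn.functional as F

def rtd_step_with_frozen_aux(aux_lm, main_encoder, rtd_head,
                             original_ids, masked_ids, mask_positions):
    """One RTD step with a frozen auxiliary LM (illustrative sketch only)."""
    # The frozen auxiliary LM proposes plausible replacements at the masked
    # positions; it receives no gradient updates.
    with torch.no_grad():
        aux_logits = aux_lm(masked_ids)[mask_positions]        # (n_masked, vocab)
        sampled = torch.multinomial(
            F.softmax(aux_logits, dim=-1), num_samples=1).squeeze(-1)

    corrupted_ids = original_ids.clone()
    corrupted_ids[mask_positions] = sampled

    # Only the main model is trained: it classifies each token of the
    # corrupted sequence as original vs. replaced.
    rtd_logits = rtd_head(main_encoder(corrupted_ids)).squeeze(-1)  # (batch, seq)
    is_replaced = (corrupted_ids != original_ids).float()
    return F.binary_cross_entropy_with_logits(rtd_logits, is_replaced)
```
Keeping the auxiliary model frozen removes its backward pass and optimizer state, which is the source of the compute and memory savings the Fast-ELECTRA entry refers to.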