SAS: Self-Augmented Strategy for Language Model Pre-training
- URL: http://arxiv.org/abs/2106.07176v1
- Date: Mon, 14 Jun 2021 05:57:46 GMT
- Title: SAS: Self-Augmented Strategy for Language Model Pre-training
- Authors: Yifei Xu, Jingqiao Zhang, Ru He, Liangzhu Ge, Chao Yang, Cheng Yang,
Ying Nian Wu
- Abstract summary: Most data augmentations in language model pre-training are context-independent.
We propose a self-augmented strategy (SAS) that uses a single forward pass through the model to augment the input data for model training in the next epoch.
Our SAS is able to outperform ELECTRA and other state-of-the-art models on the GLUE tasks with the same or less computation cost.
- Score: 31.69657711092598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The core of a self-supervised learning method for pre-training language
models includes the design of appropriate data augmentation and corresponding
pre-training task(s). Most data augmentations in language model pre-training
are context-independent. The seminal contextualized augmentation recently
proposed by ELECTRA requires a separate generator, which leads to extra
computation cost as well as the challenge in adjusting the capability of its
generator relative to that of the other model component(s). We propose a
self-augmented strategy (SAS) that uses a single forward pass through the model
to augment the input data for model training in the next epoch. Essentially our
strategy eliminates a separate generator network and uses only one network to
generate the data augmentation and undertake two pre-training tasks (the MLM
task and the RTD task) jointly, which naturally avoids the challenge in
adjusting the generator's capability as well as reduces the computation cost.
Additionally, our SAS is a general strategy such that it can seamlessly
incorporate many new techniques emerging recently or in the future, such as the
disentangled attention mechanism recently proposed by the DeBERTa model. Our
experiments show that our SAS is able to outperform ELECTRA and other
state-of-the-art models on the GLUE tasks with the same or less computation
cost.
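The strategy described above can be pictured as a single training step that shares one encoder between the MLM and RTD heads and reuses the same forward pass to sample the replacements that become the next epoch's corrupted input. Below is a minimal PyTorch-style sketch under that reading of the abstract; the `encoder`, `mlm_head`, and `rtd_head` modules, the loss weights, and the choice to keep the same replacement positions across epochs are illustrative assumptions, not the authors' released implementation.
```python
import torch
import torch.nn.functional as F

def sas_step(encoder, mlm_head, rtd_head, original_ids, corrupted_ids,
             replaced_mask, mlm_weight=1.0, rtd_weight=50.0):
    """One SAS-style training step (illustrative sketch only).

    `corrupted_ids` is the input built by the previous epoch's
    self-augmentation: positions marked True in `replaced_mask` hold
    tokens sampled from the model's own MLM predictions.
    """
    # Single forward pass through the one shared encoder.
    hidden = encoder(corrupted_ids)                        # (batch, seq, dim)

    # MLM task: recover the original tokens at the replaced positions.
    mlm_logits = mlm_head(hidden[replaced_mask])           # (n_replaced, vocab)
    mlm_loss = F.cross_entropy(mlm_logits, original_ids[replaced_mask])

    # RTD task: classify every token as original vs. replaced.
    rtd_logits = rtd_head(hidden).squeeze(-1)              # (batch, seq)
    rtd_loss = F.binary_cross_entropy_with_logits(
        rtd_logits, replaced_mask.float())

    # Self-augmentation: sample replacement tokens from the same forward
    # pass to build the corrupted input for the next epoch, so no separate
    # generator network is needed (same positions kept here for simplicity).
    with torch.no_grad():
        sampled = torch.multinomial(
            F.softmax(mlm_logits, dim=-1), num_samples=1).squeeze(-1)
    next_corrupted = original_ids.clone()
    next_corrupted[replaced_mask] = sampled

    # The loss weights are placeholders; ELECTRA-style setups typically
    # up-weight the RTD term.
    return mlm_weight * mlm_loss + rtd_weight * rtd_loss, next_corrupted
```
Because the augmentation is read off the same forward pass that serves the two pre-training losses, there is no generator whose capacity must be tuned against the rest of the model, which is the adjustment problem the abstract attributes to ELECTRA.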
Related papers
- Self-Supervised Radio Pre-training: Toward Foundational Models for Spectrogram Learning [6.1339395157466425]
Foundational deep learning (DL) models are general models trained on diverse and unlabelled datasets.
We introduce Masked Spectrogram Modeling, a novel self-supervised learning approach for pretraining foundational DL models on radio signals.
arXiv Detail & Related papers (2024-11-14T23:56:57Z)
- Data Shapley in One Training Run [88.59484417202454]
Data Shapley provides a principled framework for attributing data's contribution within machine learning contexts.
Existing approaches require re-training models on different data subsets, which is computationally intensive.
This paper introduces In-Run Data Shapley, which addresses these limitations by offering scalable data attribution for a target model of interest.
arXiv Detail & Related papers (2024-06-16T17:09:24Z)
- Diffusion-Based Neural Network Weights Generation [80.89706112736353]
D2NWG is a diffusion-based neural network weights generation technique that efficiently produces high-performing weights for transfer learning.
Our method extends generative hyper-representation learning to recast the latent diffusion paradigm for neural network weights generation.
Our approach is scalable to large architectures such as large language models (LLMs), overcoming the limitations of current parameter generation techniques.
arXiv Detail & Related papers (2024-02-28T08:34:23Z)
- EsaCL: Efficient Continual Learning of Sparse Models [10.227171407348326]
A key challenge in the continual learning setting is to efficiently learn a sequence of tasks without forgetting how to perform previously learned tasks.
We propose a new method for efficient continual learning of sparse models (EsaCL) that can automatically prune redundant parameters without adversely impacting the model's predictive power.
arXiv Detail & Related papers (2024-01-11T04:59:44Z)
- Instructed Language Models with Retrievers Are Powerful Entity Linkers [87.16283281290053]
Instructed Generative Entity Linker (INSGENEL) is the first approach that enables causal language models to perform entity linking over knowledge bases.
INSGENEL outperforms previous generative alternatives with +6.8 F1 points gain on average.
arXiv Detail & Related papers (2023-11-06T16:38:51Z)
- How Many Pretraining Tasks Are Needed for In-Context Learning of Linear Regression? [92.90857135952231]
Transformers pretrained on diverse tasks exhibit remarkable in-context learning (ICL) capabilities.
We study ICL in one of its simplest setups: pretraining a linearly parameterized single-layer linear attention model for linear regression.
arXiv Detail & Related papers (2023-10-12T15:01:43Z)
- Fast-ELECTRA for Efficient Pre-training [83.29484808667532]
ELECTRA pre-trains language models by detecting tokens in a sequence that have been replaced by an auxiliary model.
We propose Fast-ELECTRA, which leverages an existing language model as the auxiliary model.
Our approach rivals the performance of state-of-the-art ELECTRA-style pre-training methods while eliminating the computation and memory cost of jointly training the auxiliary model (see the sketch after this list).
arXiv Detail & Related papers (2023-10-11T09:55:46Z)
- SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities [76.97949110580703]
We introduce SUPERB-SG, a new benchmark to evaluate pre-trained models across various speech tasks.
We use a lightweight methodology to test the robustness of representations learned by pre-trained models under shifts in data domain.
We also show that the task diversity of SUPERB-SG coupled with limited task supervision is an effective recipe for evaluating the generalizability of model representation.
arXiv Detail & Related papers (2022-03-14T04:26:40Z)
- Maximizing Efficiency of Language Model Pre-training for Learning Representation [6.518508607788086]
ELECTRA is a novel approach for improving the compute efficiency of pre-trained language models.
Our work proposes an adaptive early exit strategy to maximize the efficiency of the pre-training process.
arXiv Detail & Related papers (2021-10-13T10:25:06Z)
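As a companion to the Fast-ELECTRA entry above, here is a minimal sketch of ELECTRA-style replaced-token detection in which the auxiliary model is an existing, frozen language model rather than a jointly trained generator. All module names, shapes, and the masking convention are placeholder assumptions, not that paper's actual code.
```python
import torch
import torch.nn.functional as F

def rtd_step_with_frozen_aux(aux_lm, main_encoder, rtd_head,
                             original_ids, masked_ids, mask_positions):
    """One RTD step with a frozen auxiliary LM (illustrative sketch only)."""
    # The frozen auxiliary LM proposes plausible replacements at the masked
    # positions; it receives no gradient updates.
    with torch.no_grad():
        aux_logits = aux_lm(masked_ids)[mask_positions]        # (n_masked, vocab)
        sampled = torch.multinomial(
            F.softmax(aux_logits, dim=-1), num_samples=1).squeeze(-1)

    corrupted_ids = original_ids.clone()
    corrupted_ids[mask_positions] = sampled

    # Only the main model is trained: it classifies each token of the
    # corrupted sequence as original vs. replaced.
    rtd_logits = rtd_head(main_encoder(corrupted_ids)).squeeze(-1)  # (batch, seq)
    is_replaced = (corrupted_ids != original_ids).float()
    return F.binary_cross_entropy_with_logits(rtd_logits, is_replaced)
```
Keeping the auxiliary model frozen removes its backward pass and optimizer state, which is the source of the compute and memory savings the Fast-ELECTRA entry refers to.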