Learning to Skip for Language Modeling
- URL: http://arxiv.org/abs/2311.15436v1
- Date: Sun, 26 Nov 2023 21:45:53 GMT
- Title: Learning to Skip for Language Modeling
- Authors: Dewen Zeng, Nan Du, Tao Wang, Yuanzhong Xu, Tao Lei, Zhifeng Chen,
Claire Cui
- Abstract summary: We argue that in language model pretraining, a variable amount of computation should be assigned to different tokens.
In our evaluation across 24 NLP tasks, we demonstrate that the proposed method significantly improves 1-shot performance.
- Score: 33.51322197222855
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Overparameterized large-scale language models show impressive generalization
in in-context few-shot learning. However, most language models allocate the same
parameters and amount of computation to every token, disregarding the complexity or
importance of the input data. We argue that in language model pretraining, a variable
amount of computation should be assigned to different tokens, and that this can be
achieved efficiently via a simple routing mechanism. Unlike conventional early-stopping
techniques, where tokens can exit only at early layers, we propose a more general
method that dynamically skips the execution of a layer (or module) for any input token
using a binary router. In our extensive evaluation across 24 NLP tasks, we demonstrate
that the proposed method significantly improves 1-shot performance over other
competitive baselines, at only a mild extra inference cost.
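To make the routing idea concrete, below is a minimal, hypothetical PyTorch sketch of per-token layer skipping with a binary router. It is not the authors' implementation: the module names (SkipRouter, RoutedLayer), the sigmoid gate, the straight-through estimator, and the 0.5 threshold are all illustrative assumptions.

```python
# Illustrative sketch (not the paper's released code): per-token layer skipping
# with a binary router, in the spirit of the abstract above.
import torch
import torch.nn as nn


class SkipRouter(nn.Module):
    """Predicts, for each token, whether to execute or skip the wrapped layer."""

    def __init__(self, d_model: int):
        super().__init__()
        self.gate = nn.Linear(d_model, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) -> per-token execution probability in (0, 1)
        return torch.sigmoid(self.gate(x)).squeeze(-1)


class RoutedLayer(nn.Module):
    """Wraps any token-wise layer; skipped tokens pass through unchanged."""

    def __init__(self, layer: nn.Module, d_model: int, skip_threshold: float = 0.5):
        super().__init__()
        self.layer = layer
        self.router = SkipRouter(d_model)
        self.skip_threshold = skip_threshold  # assumed cutoff, not from the paper

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p_exec = self.router(x)                          # (batch, seq_len)
        hard = (p_exec > self.skip_threshold).float()    # binary execute/skip decision
        # Straight-through estimator: forward uses the hard decision,
        # gradients flow through the soft probability.
        gate = hard + p_exec - p_exec.detach()
        out = self.layer(x)                              # full layer output
        # Executed tokens take the layer output; skipped tokens keep the input.
        return gate.unsqueeze(-1) * out + (1.0 - gate.unsqueeze(-1)) * x


# Usage: wrap a feed-forward block so easy tokens can bypass it.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
block = RoutedLayer(ffn, d_model=512)
tokens = torch.randn(2, 16, 512)
print(block(tokens).shape)  # torch.Size([2, 16, 512])
```

Note that this sketch still runs the wrapped layer on every token; a compute-saving implementation would gather only the tokens the router selects before calling the layer, which is where the inference savings described in the abstract would come from.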
Related papers
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Bidirectional Representations for Low Resource Spoken Language Understanding [39.208462511430554]
We propose a representation model to encode speech in bidirectional rich encodings.
The approach uses a masked language modelling objective to learn the representations.
We show that the performance of the resulting encodings is better than comparable models on multiple datasets.
arXiv Detail & Related papers (2022-11-24T17:05:16Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Probing Structured Pruning on Multilingual Pre-trained Models: Settings, Algorithms, and Efficiency [62.0887259003594]
This work investigates three aspects of structured pruning on multilingual pre-trained language models: settings, algorithms, and efficiency.
Experiments on nine downstream tasks show several counter-intuitive phenomena.
We present Dynamic Sparsification, a simple approach that allows training the model once and adapting to different model sizes at inference.
arXiv Detail & Related papers (2022-04-06T06:29:52Z)
- Switch Point biased Self-Training: Re-purposing Pretrained Models for Code-Switching [44.034300203700234]
Code-switching is a ubiquitous phenomenon due to the ease of communication it offers in multilingual communities.
We propose a self-training method to repurpose existing pretrained models using a switch-point bias.
Our approach performs well on both tasks by reducing the performance gap at switch points.
arXiv Detail & Related papers (2021-11-01T19:42:08Z)
- Differentiable Prompt Makes Pre-trained Language Models Better Few-shot Learners [23.150999852147283]
This study proposes a novel, pluggable, and efficient approach named DifferentiAble pRompT (DART).
It can convert small language models into better few-shot learners without any prompt engineering.
A comprehensive evaluation of standard NLP tasks demonstrates that the proposed approach achieves a better few-shot performance.
arXiv Detail & Related papers (2021-08-30T12:29:25Z)
- Efficient Weight factorization for Multilingual Speech Recognition [67.00151881207792]
End-to-end multilingual speech recognition involves training a single model on a composite speech corpus covering many languages.
Because each language in the training data has different characteristics, the shared network may struggle to optimize for all languages simultaneously.
We propose a novel multilingual architecture that targets the core operation in neural networks: linear transformation functions.
arXiv Detail & Related papers (2021-05-07T00:12:02Z)
- WARP: Word-level Adversarial ReProgramming [13.08689221166729]
In many applications it is preferable to tune much smaller sets of parameters, so that the majority of parameters can be shared across multiple tasks.
We present an alternative approach based on adversarial reprogramming, which extends earlier work on automatic prompt generation.
We show that this approach outperforms other methods with a similar number of trainable parameters on SST-2 and MNLI datasets.
arXiv Detail & Related papers (2021-01-01T00:41:03Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- Learning Spoken Language Representations with Neural Lattice Language Modeling [39.50831917042577]
We propose a framework that trains neural lattice language models to provide contextualized representations for spoken language understanding tasks.
The proposed two-stage pre-training approach reduces the demand for speech data and improves efficiency.
arXiv Detail & Related papers (2020-07-06T10:38:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.