Patch-Level Training for Large Language Models
- URL: http://arxiv.org/abs/2407.12665v2
- Date: Fri, 13 Sep 2024 03:07:37 GMT
- Title: Patch-Level Training for Large Language Models
- Authors: Chenze Shao, Fandong Meng, Jie Zhou
- Abstract summary: This paper introduces patch-level training for Large Language Models (LLMs).
During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch.
Following this, the model continues token-level training on the remaining training data to align with the inference mode.
- Score: 69.67438563485887
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) achieve remarkable progress in language understanding and generation, their training efficiency has become a critical concern. Traditionally, LLMs are trained to predict the next token in a sequence. Despite the success of token-level training, it suffers from considerable computational costs due to the need to process an extensive number of tokens. To mitigate this issue, this paper introduces patch-level training for LLMs, which reduces the sequence length by compressing multiple tokens into a single patch. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced computational cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B parameters) demonstrate that patch-level training can reduce overall computational costs to 0.5$\times$, without compromising the model performance compared to token-level training. Source code: \url{https://github.com/shaochenze/PatchTrain}.
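To make the idea above concrete, here is a minimal PyTorch sketch of a patch-level training step. It assumes that a patch embedding is simply the mean of K consecutive token embeddings and that each patch position predicts all K tokens of the following patch through separate output heads; these choices, the toy model sizes, and the omission of positional encodings are illustrative assumptions rather than the paper's implementation (see the linked repository for the reference code). As a rough cost intuition: if a fraction λ of the training data is processed at patch level with patch size K, the sequence length in that phase shrinks by K, so overall compute is on the order of λ/K + (1 − λ) of pure token-level training; λ = 2/3 with K = 4, for example, is consistent with the reported 0.5×.

```python
# Minimal sketch of a patch-level training step (illustrative, not the paper's
# released implementation). Assumptions: a patch embedding is the mean of K
# consecutive token embeddings, and each patch position predicts all K tokens
# of the next patch through K separate output heads. Positional encodings and
# other practical details are omitted for brevity.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PatchLevelLM(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, n_layers=4,
                 n_heads=8, patch_size=4):
        super().__init__()
        self.patch_size = patch_size
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # One output head per position inside the next patch (hypothetical choice).
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(patch_size)])

    def forward(self, tokens):
        # tokens: (batch, seq_len), with seq_len divisible by patch_size
        b, t = tokens.shape
        k = self.patch_size
        p = t // k
        # Average K consecutive token embeddings into one patch embedding.
        x = self.embed(tokens).view(b, p, k, -1).mean(dim=2)      # (b, p, d_model)
        # Standard causal mask, but over patches instead of tokens.
        causal = torch.triu(torch.full((p, p), float("-inf"),
                                       device=tokens.device), diagonal=1)
        h = self.backbone(x, mask=causal)                          # (b, p, d_model)
        h = h[:, :-1, :]                                           # patch i predicts patch i + 1
        targets = tokens.view(b, p, k)[:, 1:, :]                   # (b, p - 1, k)
        # Average the cross-entropy over the K positions of the next patch.
        loss = sum(
            F.cross_entropy(self.heads[j](h).transpose(1, 2), targets[:, :, j])
            for j in range(k)) / k
        return loss


model = PatchLevelLM()
tokens = torch.randint(0, 32000, (2, 64))  # toy batch; 64 tokens -> 16 patches
loss = model(tokens)
loss.backward()
print(float(loss))
```

After this patch-level phase, the same weights would continue ordinary next-token training on the remaining data so that inference stays token-level, as described in the abstract.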
Related papers
- MultiTok: Variable-Length Tokenization for Efficient LLMs Adapted from LZW Compression [5.5795785998430185]
MultiTok is a new tokenizing tool inspired by universal Lempel-Ziv-Welch data compression.
We show that MultiTok achieves performance comparable to the standard BERT tokenizer.
arXiv Detail & Related papers (2024-10-28T21:24:51Z)
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current studies and how language models are trained.
Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
- MiLe Loss: a New Loss for Mitigating the Bias of Learning Difficulties in Generative Language Models [40.992566245706996]
We propose the MiLe Loss function to mitigate the bias caused by the varying learning difficulties of tokens.
We train generative language models at different scales of 468M, 1.2B, and 6.7B parameters.
Experiments reveal that models incorporating the proposed MiLe Loss can gain consistent performance improvement on downstream benchmarks.
arXiv Detail & Related papers (2023-10-30T13:33:21Z)
- Training Compute-Optimal Large Language Models [54.00424650998489]
We train language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens.
For compute-optimal training, the model size and the number of training tokens should be scaled equally (a numeric sketch of this rule appears after this list).
Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B).
arXiv Detail & Related papers (2022-03-29T13:38:03Z)
- Towards Efficient Post-training Quantization of Pre-trained Language Models [85.68317334241287]
We study post-training quantization (PTQ) of PLMs and propose module-wise quantization error minimization (MREM), an efficient solution to mitigate the performance drop of PTQ relative to quantization-aware training (QAT).
Experiments on GLUE and SQuAD benchmarks show that our proposed PTQ solution not only performs close to QAT, but also enjoys significant reductions in training time, memory overhead, and data consumption.
arXiv Detail & Related papers (2021-09-30T12:50:06Z)
- Exploring Unsupervised Pretraining Objectives for Machine Translation [99.5441395624651]
Unsupervised cross-lingual pretraining has achieved strong results in neural machine translation (NMT).
Most approaches adapt masked-language modeling (MLM) to sequence-to-sequence architectures, by masking parts of the input and reconstructing them in the decoder.
We compare masking with alternative objectives that produce inputs resembling real (full) sentences, by reordering and replacing words based on their context.
arXiv Detail & Related papers (2021-06-10T10:18:23Z)
- Token-wise Curriculum Learning for Neural Machine Translation [94.93133801641707]
Existing curriculum learning approaches to Neural Machine Translation (NMT) require sampling sufficient amounts of "easy" samples from the training data at the early training stage.
We propose a novel token-wise curriculum learning approach that creates sufficient amounts of easy samples.
Our approach can consistently outperform baselines on 5 language pairs, especially for low-resource languages.
arXiv Detail & Related papers (2021-03-20T03:57:59Z)
- Progressively Stacking 2.0: A Multi-stage Layerwise Training Method for BERT Training Speedup [13.50984315473865]
We propose an efficient multi-stage layerwise training (MSLT) approach to reduce the training time of BERT.
In the proposed training strategy, only the top few layers participate in backward computation, while most layers participate only in forward computation.
Experimental results show that the proposed method can achieve more than 110% training speedup without significant performance degradation.
arXiv Detail & Related papers (2020-11-27T10:00:22Z)
- Update Frequently, Update Fast: Retraining Semantic Parsing Systems in a Fraction of Time [11.035461657669096]
We show that it is possible to match the performance of a model trained from scratch in less than 10% of the time via fine-tuning.
We demonstrate the effectiveness of our method on multiple splits of the Facebook TOP and SNIPS datasets.
arXiv Detail & Related papers (2020-10-15T16:37:41Z)
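Numeric sketch of the compute-optimal scaling rule cited in the Training Compute-Optimal Large Language Models entry above. It uses the common C ≈ 6·N·D approximation for training FLOPs and the roughly 20-tokens-per-parameter ratio associated with Chinchilla; the constants are approximations for intuition, not values quoted from that paper's abstract.

```python
# Rough illustration of "scale model size and training tokens equally":
# with training FLOPs approximated as C ≈ 6 * N * D and the tokens-per-parameter
# ratio held fixed (~20 for Chinchilla), both N and D grow with the square root
# of the compute budget C. Constants are approximate.
def compute_optimal(c_flops, tokens_per_param=20.0):
    n_params = (c_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal(5.88e23)  # ~Chinchilla-scale budget: 6 * 70e9 * 1.4e12
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")  # ~7.0e10 params, ~1.4e12 tokens

# Doubling the budget scales both quantities by sqrt(2) ~ 1.41x, not by 2x.
n2, d2 = compute_optimal(2 * 5.88e23)
print(f"{n2 / n:.2f}x params, {d2 / d:.2f}x tokens")
```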