Transcending Scaling Laws with 0.1% Extra Compute
- URL: http://arxiv.org/abs/2210.11399v1
- Date: Thu, 20 Oct 2022 16:46:41 GMT
- Title: Transcending Scaling Laws with 0.1% Extra Compute
- Authors: Yi Tay, Jason Wei, Hyung Won Chung, Vinh Q. Tran, David R. So, Siamak
Shakeri, Xavier Garcia, Huaixiu Steven Zheng, Jinfeng Rao, Aakanksha
Chowdhery, Denny Zhou, Donald Metzler, Slav Petrov, Neil Houlsby, Quoc V. Le,
Mostafa Dehghani
- Abstract summary: Scaling language models improves performance but comes with significant computational costs.
This paper proposes UL2R, a method that substantially improves existing language models and their scaling curves with a relatively tiny amount of extra compute.
We show that, with almost negligible extra computational costs and no new sources of data, we are able to substantially improve the scaling properties of large language models on downstream metrics.
- Score: 128.13903265447675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling language models improves performance but comes with significant
computational costs. This paper proposes UL2R, a method that substantially
improves existing language models and their scaling curves with a relatively
tiny amount of extra compute. The key idea is to continue training a
state-of-the-art large language model (e.g., PaLM) on a few more steps with
UL2's mixture-of-denoiser objective. We show that, with almost negligible extra
computational costs and no new sources of data, we are able to substantially
improve the scaling properties of large language models on downstream metrics.
In this paper, we continue training PaLM with UL2R, introducing a new set of
models at 8B, 62B, and 540B scale which we call U-PaLM. Impressively, at 540B
scale, we show an approximately 2x computational savings rate where U-PaLM
achieves the same performance as the final PaLM 540B model at around half its
computational budget (i.e., saving $\sim$4.4 million TPUv4 hours). We further
show that this improved scaling curve leads to 'emergent abilities' on
challenging BIG-Bench tasks -- for instance, U-PaLM does much better than PaLM
on some tasks or demonstrates better quality at much smaller scale (62B as
opposed to 540B). Overall, we show that U-PaLM outperforms PaLM on many
few-shot setups, i.e., English NLP tasks (e.g., commonsense reasoning, question
answering), reasoning tasks with chain-of-thought (e.g., GSM8K), multilingual
tasks (MGSM, TydiQA), MMLU and challenging BIG-Bench tasks. Finally, we provide
qualitative examples showing the new capabilities of U-PaLM for single and
multi-span infilling.
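To make the key idea concrete, below is a minimal, self-contained sketch of the kind of mixture-of-denoisers data preparation that UL2R-style continued training relies on: each example is routed to a regular, sequential, or extreme denoiser, corrupted accordingly, and tagged with a mode token. The mode tokens, sentinel format, mixture weights, mask rates, and span lengths are illustrative placeholders, not the configuration used in the paper.

```python
import random

# Hypothetical sentinel/mode tokens; the real UL2 vocabulary differs.
MODE_TOKENS = {"R": "[NLU]", "S": "[S2S]", "X": "[NLG]"}
SENTINEL = "<extra_id_{}>"

def span_corrupt(tokens, mean_span_len, corruption_rate, rng):
    """Mask random spans and return (corrupted_input, denoising_target)."""
    n_to_mask = max(1, int(len(tokens) * corruption_rate))
    masked = set()
    while len(masked) < n_to_mask:
        start = rng.randrange(len(tokens))
        length = max(1, int(rng.expovariate(1.0 / mean_span_len)))
        masked.update(range(start, min(start + length, len(tokens))))
    inputs, targets, sid, in_span = [], [], 0, False
    for i, tok in enumerate(tokens):
        if i in masked:
            if not in_span:                      # open a new masked span
                inputs.append(SENTINEL.format(sid))
                targets.append(SENTINEL.format(sid))
                sid, in_span = sid + 1, True
            targets.append(tok)
        else:
            inputs.append(tok)
            in_span = False
    return inputs, targets

def sample_mixture_example(tokens, rng=random.Random(0)):
    """Pick one denoiser from an illustrative R/S/X mixture (not the paper's exact rates)."""
    mode = rng.choices(["R", "S", "X"], weights=[0.5, 0.25, 0.25])[0]
    if mode == "S":                              # sequential: predict the suffix given a prefix
        cut = len(tokens) // 2
        inp, tgt = tokens[:cut], tokens[cut:]
    elif mode == "R":                            # regular short-span denoising
        inp, tgt = span_corrupt(tokens, mean_span_len=3, corruption_rate=0.15, rng=rng)
    else:                                        # extreme denoising: long spans / high mask rate
        inp, tgt = span_corrupt(tokens, mean_span_len=12, corruption_rate=0.5, rng=rng)
    return [MODE_TOKENS[mode]] + inp, tgt

print(sample_mixture_example("the quick brown fox jumps over the lazy dog".split()))
```

Continued training in the UL2R sense would then simply resume optimizing an existing checkpoint for a comparatively small number of steps on (input, target) pairs produced this way, rather than training a new model from scratch.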
Related papers
- Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance [78.48606021719206]
Mini-InternVL is a series of multimodal large language models (MLLMs) with 1B to 4B parameters, achieving 90% of the performance of larger models with only 5% of the parameters.
We develop a unified adaptation framework for Mini-InternVL, which enables our models to transfer and outperform specialized models in downstream tasks.
arXiv Detail & Related papers (2024-10-21T17:58:20Z)
- Are Protein Language Models Compute Optimal? [0.0]
We investigate the optimal ratio between model parameters and training tokens within a fixed compute budget.
Our study reveals that pLM sizes scale sublinearly with compute budget, showing diminishing returns in performance as model size increases.
This work paves the way towards more compute-efficient pLMs, democratizing their training and practical application in computational biology.
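As a rough illustration of the parameter/token trade-off this summary refers to, here is a small sketch using the common C ≈ 6·N·D FLOPs approximation; the exponent controlling how model size grows with compute is a hypothetical placeholder, not a value estimated by the paper.

```python
# Minimal sketch of compute-optimal allocation under the common C ≈ 6·N·D
# FLOPs approximation. The exponent `a` is a hypothetical placeholder:
# a = 0.5 grows parameters and tokens at the same rate, while a < 0.5
# tilts the budget toward data (model size growing sublinearly in compute).
def optimal_allocation(compute_flops: float, a: float = 0.5):
    """Split a FLOP budget C into parameters N ∝ C^a and tokens D = C / (6N)."""
    n_params = (compute_flops / 6.0) ** a
    n_tokens = compute_flops / (6.0 * n_params)
    return n_params, n_tokens

for c in (1e20, 1e21, 1e22):
    n, d = optimal_allocation(c)
    print(f"C={c:.0e} FLOPs -> N≈{n:.2e} params, D≈{d:.2e} tokens")
```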
arXiv Detail & Related papers (2024-06-11T13:32:11Z)
- Tele-FLM Technical Report [96.19923831660266]
We introduce Tele-FLM (aka FLM-2), a 52B open-sourced multilingual large language model.
It features a stable, efficient pre-training paradigm and enhanced factual judgment capabilities.
It is comparable to strong open-sourced models that involve larger pre-training FLOPs, such as Llama2-70B and DeepSeek-67B.
arXiv Detail & Related papers (2024-04-25T14:34:47Z)
- ChatGPT for Arabic Grammatical Error Correction [5.945320097465418]
Large language models (LLMs) fine-tuned to follow human instruction have exhibited significant capabilities in English NLP tasks.
In this paper, we examine the abilities of instruction fine-tuned LLMs in Arabic grammatical error correction (GEC), a task made complex by Arabic's rich morphology.
We find that instruction fine-tuned models, regardless of their size, significantly underperform fully fine-tuned models that are much smaller.
arXiv Detail & Related papers (2023-08-08T18:00:39Z)
- Small Language Models Improve Giants by Rewriting Their Outputs [18.025736098795296]
We tackle the problem of leveraging training data to improve the performance of large language models (LLMs) without fine-tuning.
We create a pool of candidates from the LLM through few-shot prompting and we employ a compact model, the LM-corrector (LMCor), specifically trained to merge these candidates to produce an enhanced output.
Experiments on four natural language generation tasks demonstrate that even a small LMCor model (250M) substantially improves the few-shot performance of LLMs (62B), matching and even outperforming standard fine-tuning.
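A minimal sketch of that candidate-then-correct setup follows; `sample_from_llm` and `corrector` are hypothetical stubs standing in for the few-shot-prompted LLM and the trained LM-corrector, not the paper's actual interfaces.

```python
# Minimal sketch of the candidate-then-correct setup described above.
from collections import Counter

def sample_from_llm(prompt: str, k: int = 5) -> list[str]:
    """Placeholder: a frozen LLM would be sampled k times with a few-shot prompt."""
    return [f"candidate {i} for: {prompt}" for i in range(k)]

def corrector(source: str, candidates: list[str]) -> str:
    """Placeholder: a small trained model would merge/rewrite the candidates.
    Here we just return the most common candidate as a stand-in."""
    return Counter(candidates).most_common(1)[0][0]

def generate_with_lmcor(source: str) -> str:
    candidates = sample_from_llm(source)        # few-shot sampling from the frozen LLM
    return corrector(source, candidates)        # compact model produces the final output

print(generate_with_lmcor("Translate to German: good morning"))
```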
arXiv Detail & Related papers (2023-05-22T22:07:50Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We instead direct effort toward efficient adaptation of existing models by augmenting Language Models with perception.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
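The recipe in this summary can be sketched in a few lines of PyTorch, assuming toy stand-ins for the frozen backbones; the dimensions, module choices, and where the perceptual token is injected are illustrative, not eP-ALM's actual architecture.

```python
# Sketch of the "freeze almost everything" recipe: a frozen language model and
# frozen visual encoder (tiny stand-ins here), with only a linear projection
# and one soft prompt token left trainable.
import torch
import torch.nn as nn

d_vis, d_lm = 384, 512                           # toy dimensions

visual_encoder = nn.Linear(3 * 32 * 32, d_vis)   # stand-in for a frozen vision encoder
language_model = nn.TransformerEncoder(          # stand-in for a frozen LM
    nn.TransformerEncoderLayer(d_model=d_lm, nhead=8, batch_first=True), num_layers=2)

for p in list(visual_encoder.parameters()) + list(language_model.parameters()):
    p.requires_grad = False                      # freeze ~all backbone parameters

projection = nn.Linear(d_vis, d_lm)              # the only trained layer
soft_token = nn.Parameter(torch.zeros(1, 1, d_lm))  # the single trainable token

def forward(image_pixels, text_embeddings):
    vis = projection(visual_encoder(image_pixels)).unsqueeze(1)   # one visual token
    prompt = soft_token.expand(text_embeddings.size(0), -1, -1)
    return language_model(torch.cat([prompt, vis, text_embeddings], dim=1))

out = forward(torch.randn(2, 3 * 32 * 32), torch.randn(2, 6, d_lm))
trainable = [soft_token, *projection.parameters()]
print(out.shape, sum(p.numel() for p in trainable), "trainable parameters")
```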
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- TALM: Tool Augmented Language Models [28.483609366116525]
Transformer based language models (LMs) demonstrate increasing performance with scale across a wide variety of tasks.
We present Tool Augmented Language Models (TALM), which use a text-only approach to augment language models with non-differentiable tools.
TALM exhibits strong performance on both a knowledge-heavy QA task and a reasoning oriented math task with simple tools.
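A minimal sketch of the text-in/text-out tool loop such a model depends on is shown below; the call markers, the single calculator tool, and the `generate` stub are illustrative assumptions rather than TALM's actual interface.

```python
# Sketch of a tool-use loop: the LM emits a marked tool call, a
# non-differentiable tool runs, and its result is spliced back into the text.
import re

TOOLS = {"calculator": lambda expr: str(eval(expr, {"__builtins__": {}}))}

def generate(prompt: str) -> str:
    """Placeholder LM: a real model would continue the prompt with text that
    may contain a tool call such as '|calculator(17*23)|'."""
    return "The answer is |calculator(17*23)| ->"

def run_with_tools(prompt: str) -> str:
    text = generate(prompt)
    match = re.search(r"\|(\w+)\((.+?)\)\|", text)    # detect a tool call
    if match:
        name, arg = match.groups()
        result = TOOLS[name](arg)                     # run the non-differentiable tool
        text = text.replace(match.group(0), result)   # splice the result back in
    return text

print(run_with_tools("What is 17 * 23?"))
```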
arXiv Detail & Related papers (2022-05-24T17:58:13Z)
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion parameter, densely activated, Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- CPM-2: Large-scale Cost-effective Pre-trained Language Models [71.59893315671997]
We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference.
We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch.
We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources.
arXiv Detail & Related papers (2021-06-20T15:43:54Z)