OPT: Open Pre-trained Transformer Language Models
- URL: http://arxiv.org/abs/2205.01068v3
- Date: Thu, 5 May 2022 11:44:30 GMT
- Title: OPT: Open Pre-trained Transformer Language Models
- Authors: Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen,
Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, Todor
Mihaylov, Myle Ott, Sam Shleifer, Kurt Shuster, Daniel Simig, Punit Singh
Koura, Anjali Sridhar, Tianlu Wang, Luke Zettlemoyer
- Abstract summary: We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters.
We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop.
- Score: 99.60254017109551
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models, which are often trained for hundreds of thousands of
compute days, have shown remarkable capabilities for zero- and few-shot
learning. Given their computational cost, these models are difficult to
replicate without significant capital. For the few that are available through
APIs, no access is granted to the full model weights, making them difficult to
study. We present Open Pre-trained Transformers (OPT), a suite of decoder-only
pre-trained transformers ranging from 125M to 175B parameters, which we aim to
fully and responsibly share with interested researchers. We show that OPT-175B
is comparable to GPT-3, while requiring only 1/7th the carbon footprint to
develop. We are also releasing our logbook detailing the infrastructure
challenges we faced, along with code for experimenting with all of the released
models.
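For readers who want to experiment with the released checkpoints, a minimal sketch follows. It assumes the smaller OPT checkpoints mirrored on the Hugging Face Hub under facebook/opt-* (e.g. facebook/opt-125m) and an installed `transformers` library, rather than the authors' own metaseq codebase; access to OPT-175B itself is granted to researchers on request.

```python
# Minimal sketch: greedy generation with a small OPT checkpoint.
# Assumes the Hugging Face `transformers` mirror of the released weights
# (facebook/opt-125m); the paper's own experiments use the metaseq codebase.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # smallest model in the 125M-175B suite
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Large language models are"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Swapping model_name for a larger checkpoint (e.g. facebook/opt-1.3b) is the only change needed to move up the suite, at the cost of memory and latency.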
Related papers
- Pretraining Billion-scale Geospatial Foundational Models on Frontier [0.16492989697868893]
Foundation Models (FMs) are trained with internet-scale unlabeled data via self-supervised learning.
We investigate billion-scale FMs and HPC training profiles for geospatial applications by pretraining on publicly available data.
Our larger 3B-parameter model achieves up to a 30% improvement in top-1 scene classification accuracy.
arXiv Detail & Related papers (2024-04-17T19:16:32Z)
- Textbooks Are All You Need [66.17192488876695]
phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s.
phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP.
arXiv Detail & Related papers (2023-06-20T16:14:25Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Transformer-based World Models Are Happy With 100k Interactions [0.4588028371034407]
We apply a transformer to real-world episodes in an autoregressive manner to build a sample-efficient world model.
The transformer allows our world model to access previous states directly, instead of viewing them through a compressed recurrent state.
By utilizing the Transformer-XL architecture, it is able to learn long-term dependencies while staying computationally efficient.
arXiv Detail & Related papers (2023-03-13T13:43:59Z)
- GLM-130B: An Open Bilingual Pre-trained Model [56.694470924635624]
We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters.
It is an attempt to open-source a 100B-scale model at least as good as GPT-3 (davinci) and unveil how models of such a scale can be successfully pre-trained.
arXiv Detail & Related papers (2022-10-05T17:34:44Z)
- Prune Once for All: Sparse Pre-Trained Language Models [0.6063525456640462]
We present a new method for training sparse pre-trained Transformer language models by integrating weight pruning and model distillation.
These sparse pre-trained models can be used for transfer learning on a wide range of tasks while maintaining their sparsity pattern.
We show how the compressed sparse pre-trained models we trained transfer their knowledge to five different downstream natural language tasks with minimal accuracy loss.
arXiv Detail & Related papers (2021-11-10T15:52:40Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of roughly half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
- Escaping the Big Data Paradigm with Compact Transformers [7.697698018200631]
We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets.
Our method is flexible in terms of model size and can achieve reasonable results with as few as 0.28M parameters.
arXiv Detail & Related papers (2021-04-12T17:58:56Z)
- It's Not Just Size That Matters: Small Language Models Are Also Few-Shot Learners [14.264737570114631]
We show that performance similar to GPT-3 can be obtained with language models that are much "greener".
We identify key factors required for successful natural language understanding with small language models.
arXiv Detail & Related papers (2020-09-15T14:18:53Z)
- MiniLM: Deep Self-Attention Distillation for Task-Agnostic Compression of Pre-Trained Transformers [117.67424061746247]
We present a simple and effective approach to compress large Transformer based pre-trained models.
We propose distilling the self-attention module of the last Transformer layer of the teacher, which is effective and flexible for the student (an illustrative sketch of this last-layer attention distillation follows the list).
Experimental results demonstrate that our monolingual model outperforms state-of-the-art baselines across different student model sizes.
arXiv Detail & Related papers (2020-02-25T15:21:10Z)
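To make the MiniLM entry above concrete, a hedged sketch of last-layer self-attention distillation is shown below. This is an illustration under assumed shapes, not the authors' implementation: `teacher_attn` and `student_attn` are hypothetical tensors of per-head attention probabilities taken from the final Transformer layer of each model, and only the attention-distribution term described in the summary is shown.

```python
# Hedged sketch of last-layer self-attention distillation (MiniLM-style).
# Assumptions (not taken from the paper's code): teacher_attn and student_attn
# are attention probability tensors of shape [batch, heads, seq_len, seq_len]
# from the last Transformer layer, and teacher and student share the same
# number of attention heads. Softmax outputs are strictly positive, so the
# log below is well defined.
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attn: torch.Tensor,
                                teacher_attn: torch.Tensor,
                                eps: float = 1e-12) -> torch.Tensor:
    """KL(teacher || student) over last-layer attention distributions,
    averaged over batch, heads, and query positions."""
    student_log = torch.log(student_attn + eps)
    # Per-row KL over the key dimension, then mean over the remaining dims.
    kl = F.kl_div(student_log, teacher_attn, reduction="none").sum(-1)
    return kl.mean()

# Toy usage with random tensors standing in for real attention maps.
b, h, n = 2, 12, 16
teacher_attn = torch.softmax(torch.randn(b, h, n, n), dim=-1)
student_attn = torch.softmax(torch.randn(b, h, n, n), dim=-1)
print(attention_distillation_loss(student_attn, teacher_attn))
```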