Textbooks Are All You Need
- URL: http://arxiv.org/abs/2306.11644v2
- Date: Mon, 2 Oct 2023 06:12:30 GMT
- Title: Textbooks Are All You Need
- Authors: Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes,
Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo
de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin
Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee,
Yuanzhi Li
- Abstract summary: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s.
phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP.
- Score: 66.17192488876695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce phi-1, a new large language model for code, with significantly
smaller size than competing models: phi-1 is a Transformer-based model with
1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook
quality" data from the web (6B tokens) and synthetically generated textbooks
and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains
pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays
surprising emergent properties compared to phi-1-base, our model before our
finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller
model with 350M parameters trained with the same pipeline as phi-1 that still
achieves 45% on HumanEval.
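For context on the metric: the HumanEval and MBPP numbers above are pass@1 scores, i.e., the expected fraction of problems for which a single generated solution passes the benchmark's unit tests. The snippet below is a minimal sketch of the standard unbiased pass@k estimator commonly used with HumanEval-style evaluation; the per-problem sample counts are hypothetical, and this is not the paper's evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations is correct, given c of the n generations pass the
    unit tests. For k = 1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical (n, c) pairs: 20 samples per problem, c of them correct.
per_problem = [(20, 14), (20, 0), (20, 6), (20, 20)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)
print(f"pass@1 = {pass_at_1:.3f}")  # averaged over problems
```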
Related papers
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone [289.9290405258526]
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens.
It achieves 69% on MMLU and 8.38 on MT-bench, despite being small enough to be deployed on a phone.
We introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision.
arXiv Detail & Related papers (2024-04-22T14:32:33Z)
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated.
Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
- Unraveling the Mystery of Scaling Laws: Part I [39.967120253159614]
Scaling law principles indicate a power-law correlation between loss and variables such as model size, dataset size, and computational resources utilized during training.
The original scaling law paper by OpenAI did not disclose the complete details necessary to derive the precise scaling law formulas.
We provide step-by-step instructions to estimate all constant terms in scaling-law formulas by training models with only 1M to 60M parameters (a toy curve-fitting sketch follows this list).
arXiv Detail & Related papers (2024-03-11T10:05:29Z)
- Textbooks Are All You Need II: phi-1.5 technical report [55.6940110946465]
We create a new 1.3 billion parameter model named phi-1.5 with performance on natural language tasks comparable to models 5x larger.
phi-1.5 exhibits many of the traits of much larger Large Language Models.
We open-source phi-1.5 to promote further research on these urgent topics.
arXiv Detail & Related papers (2023-09-11T14:01:45Z)
- Predicting Issue Types with seBERT [85.74803351913695]
seBERT is a model based on the BERT architecture but trained from scratch on software engineering data.
We fine-tuned this model for the NLBSE challenge for the task of issue type prediction.
Our model outperforms the fastText baseline for all three issue types in both recall and precision, achieving an overall F1-score of 85.7%.
arXiv Detail & Related papers (2022-05-03T06:47:13Z)
- OPT: Open Pre-trained Transformer Language Models [99.60254017109551]
We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters.
We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop.
arXiv Detail & Related papers (2022-05-02T17:49:50Z)
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [35.84448624327473]
We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs (a schematic top-1 routing sketch follows this list).
We show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats.
We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
arXiv Detail & Related papers (2021-01-11T16:11:52Z)
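As a loose illustration of the scaling-law entries above (estimating constant terms from small models and extrapolating), the sketch below fits a simple power law, loss ≈ A · N^(−α), to made-up (model size, loss) points in log space. The actual papers use richer parameterizations (e.g., an irreducible-loss term and a data-size dependence); every number here is hypothetical.

```python
import numpy as np

# Simplest power-law form: L(N) ≈ A * N**(-alpha). Taking logs gives a
# straight line, log L = log A - alpha * log N, so the constants can be
# estimated by ordinary least squares on small-model runs.
N = np.array([1e6, 5e6, 2e7, 6e7])   # model sizes (parameters), hypothetical
L = np.array([4.6, 3.9, 3.4, 3.1])   # validation losses, hypothetical

slope, intercept = np.polyfit(np.log(N), np.log(L), deg=1)
alpha, A = -slope, np.exp(intercept)
print(f"fitted: L(N) = {A:.1f} * N^(-{alpha:.3f})")

# Extrapolate the fitted law to a larger model size.
N_big = 1.3e9
print(f"predicted loss at {N_big:.1e} params: {A * N_big**(-alpha):.2f}")
```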
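For the "simplified MoE routing" mentioned in the Switch Transformers entry, here is a schematic NumPy sketch of top-1 ("switch") routing: each token is dispatched to the single highest-probability expert, and that expert's output is scaled by the router probability. This is an illustration only; capacity factors, load-balancing losses, and the bfloat16 details from the paper are omitted, and all shapes are made up.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def switch_top1_route(tokens, router_weights):
    """Top-1 ("switch") routing: pick one expert per token and return the
    router probability used to scale that expert's output."""
    logits = tokens @ router_weights                  # [num_tokens, num_experts]
    probs = softmax(logits, axis=-1)
    expert_idx = probs.argmax(axis=-1)                # one expert per token
    gate = probs[np.arange(tokens.shape[0]), expert_idx]
    return expert_idx, gate

# Toy usage: 4 tokens, d_model = 8, 3 experts (all dimensions hypothetical).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))
router_weights = rng.standard_normal((8, 3))
expert_idx, gate = switch_top1_route(tokens, router_weights)
print(expert_idx, gate)
```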
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.