Textbooks Are All You Need
- URL: http://arxiv.org/abs/2306.11644v2
- Date: Mon, 2 Oct 2023 06:12:30 GMT
- Title: Textbooks Are All You Need
- Authors: Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes,
Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo
de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin
Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee,
Yuanzhi Li
- Abstract summary: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s.
phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce phi-1, a new large language model for code, with significantly
smaller size than competing models: phi-1 is a Transformer-based model with
1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook
quality" data from the web (6B tokens) and synthetically generated textbooks
and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains
pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays
surprising emergent properties compared to phi-1-base, our model before our
finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller
model with 350M parameters trained with the same pipeline as phi-1 that still
achieves 45% on HumanEval.
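The pass@1 numbers quoted above come from the standard pass@k family of metrics for code generation. A minimal sketch of the unbiased pass@k estimator (as commonly used with HumanEval; the function name and interface here are illustrative, not taken from the paper's code):

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: the probability that at least one of k
    samples drawn without replacement from n generated solutions passes,
    given that c of the n generations pass the unit tests."""
    if n - c < k:
        # Fewer failing samples than k: every size-k draw contains a pass.
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 10 generations per problem, 5 of which pass the tests.
# pass@1 reduces to the plain fraction of correct generations: 0.5.
print(pass_at_k(10, 5, 1))
```

With k=1 this reduces to the fraction of generations that pass, which is what the 50.6% (HumanEval) and 55.5% (MBPP) figures report.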
Related papers
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens.
It achieves 69% on MMLU and 8.38 on MT-bench, despite being small enough to be deployed on a phone.
We also introduce phi-3-vision, a 4.2 billion parameter model based on phi-3-mini with strong reasoning capabilities for image and text prompts.
arXiv Detail & Related papers (2024-04-22T14:32:33Z)
- Pre-training Small Base LMs with Fewer Tokens
We study the effectiveness of a simple approach to develop a small base language model (LM) starting from an existing large base LM.
We call our simple recipe Inheritune and first demonstrate it for building a small base LM with 1.5B parameters using 1B tokens.
We show that smaller LMs trained utilizing some of the layers of GPT2-medium (355M) and GPT-2-large (770M) can effectively match the val loss of their bigger counterparts when trained from scratch.
arXiv Detail & Related papers (2024-04-12T17:53:34Z)
- Textbooks Are All You Need II: phi-1.5 technical report
We create a new 1.3 billion parameter model named phi-1.5 with performance on natural language tasks comparable to models 5x larger.
phi-1.5 exhibits many of the traits of much larger Large Language Models.
We open-source phi-1.5 to promote further research on these urgent topics.
arXiv Detail & Related papers (2023-09-11T14:01:45Z)
- Predicting Issue Types with seBERT
seBERT is a model that was developed based on the BERT architecture, but trained from scratch with software engineering data.
We fine-tuned this model for the NLBSE challenge for the task of issue type prediction.
Our model outperforms the fastText baseline for all three issue types in both recall and precision, achieving an overall F1-score of 85.7%.
arXiv Detail & Related papers (2022-05-03T06:47:13Z)
- OPT: Open Pre-trained Transformer Language Models
We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters.
We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop.
arXiv Detail & Related papers (2022-05-02T17:49:50Z)
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity
We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs.
We show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats.
We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
arXiv Detail & Related papers (2021-01-11T16:11:52Z)
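The simplified routing described in the Switch Transformers summary above sends each token to a single expert (top-1 routing). A minimal NumPy sketch of that idea, with illustrative names and shapes not taken from the paper's code:

```python
import numpy as np

def switch_route(x: np.ndarray, w_router: np.ndarray):
    """Top-1 (Switch-style) routing sketch.

    x:        (n_tokens, d_model) token representations
    w_router: (d_model, n_experts) router weights

    Returns, per token, the index of the single selected expert and the
    router probability (gate) used to scale that expert's output.
    """
    logits = x @ w_router
    # Numerically stable softmax over experts.
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    expert = probs.argmax(axis=-1)                 # one expert per token
    gate = probs[np.arange(x.shape[0]), expert]    # gate value in (0, 1]
    return expert, gate
```

Because each token touches only one expert's parameters, compute and cross-device communication stay roughly constant as the number of experts (and hence total parameters) grows, which is how the paper scales to very large parameter counts.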
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.