Textbooks Are All You Need
- URL: http://arxiv.org/abs/2306.11644v2
- Date: Mon, 2 Oct 2023 06:12:30 GMT
- Title: Textbooks Are All You Need
- Authors: Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes,
Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo
de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin
Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee,
Yuanzhi Li
- Abstract summary: phi-1 is a Transformer-based model with 1.3B parameters, trained for 4 days on 8 A100s.
phi-1 attains pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP.
- Score: 66.17192488876695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce phi-1, a new large language model for code, with significantly
smaller size than competing models: phi-1 is a Transformer-based model with
1.3B parameters, trained for 4 days on 8 A100s, using a selection of "textbook
quality" data from the web (6B tokens) and synthetically generated textbooks
and exercises with GPT-3.5 (1B tokens). Despite this small scale, phi-1 attains
pass@1 accuracy 50.6% on HumanEval and 55.5% on MBPP. It also displays
surprising emergent properties compared to phi-1-base, our model before our
finetuning stage on a dataset of coding exercises, and phi-1-small, a smaller
model with 350M parameters trained with the same pipeline as phi-1 that still
achieves 45% on HumanEval.
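For context on the metric: the HumanEval and MBPP numbers above are pass@1 scores, i.e., the expected fraction of problems for which a single generated solution passes the benchmark's unit tests. The snippet below is a minimal sketch of the standard unbiased pass@k estimator commonly used with HumanEval-style evaluation; the per-problem sample counts are hypothetical, and this is not the paper's evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    from n generations is correct, given c of the n generations pass the
    unit tests. For k = 1 this reduces to c / n."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical (n, c) pairs: 20 samples per problem, c of them correct.
per_problem = [(20, 14), (20, 0), (20, 6), (20, 20)]
pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in per_problem) / len(per_problem)
print(f"pass@1 = {pass_at_1:.3f}")  # averaged over problems
```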
Related papers
- Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone [289.9290405258526]
We introduce phi-3-mini, a 3.8 billion parameter language model trained on 3.3 trillion tokens.
It achieves 69% on MMLU and 8.38 on MT-bench, despite being small enough to be deployed on a phone.
We introduce three models in the phi-3.5 series: phi-3.5-mini, phi-3.5-MoE, and phi-3.5-Vision.
arXiv Detail & Related papers (2024-04-22T14:32:33Z)
- Language models scale reliably with over-training and on downstream tasks [121.69867718185125]
Scaling laws are useful guides for derisking expensive training runs.
However, there remain gaps between current scaling studies and how language models are ultimately trained and evaluated.
Moreover, scaling laws mostly predict loss on next-token prediction, but models are usually compared on downstream task performance.
arXiv Detail & Related papers (2024-03-13T13:54:00Z)
- Unraveling the Mystery of Scaling Laws: Part I [39.967120253159614]
Scaling law principles indicate a power-law correlation between loss and variables such as model size, dataset size, and computational resources utilized during training.
The original scaling law paper by OpenAI did not disclose the complete details necessary to derive the precise scaling law formulas.
We provide step-by-step instructions to estimate all constant terms in scaling-law formulas by training models with only 1M to 60M parameters (a toy curve-fitting sketch follows this list).
arXiv Detail & Related papers (2024-03-11T10:05:29Z)
- Textbooks Are All You Need II: phi-1.5 technical report [55.6940110946465]
We create a new 1.3 billion parameter model named phi-1.5 with performance on natural language tasks comparable to models 5x larger.
phi-1.5 exhibits many of the traits of much larger Large Language Models.
We open-source phi-1.5 to promote further research on these urgent topics.
arXiv Detail & Related papers (2023-09-11T14:01:45Z)
- Predicting Issue Types with seBERT [85.74803351913695]
seBERT is a model based on the BERT architecture but trained from scratch on software engineering data.
We fine-tuned this model for the NLBSE challenge for the task of issue type prediction.
Our model outperforms the fastText baseline for all three issue types in both recall and precision, achieving an overall F1-score of 85.7%.
arXiv Detail & Related papers (2022-05-03T06:47:13Z)
- OPT: Open Pre-trained Transformer Language Models [99.60254017109551]
We present Open Pre-trained Transformers (OPT), a suite of decoder-only pre-trained transformers ranging from 125M to 175B parameters.
We show that OPT-175B is comparable to GPT-3, while requiring only 1/7th the carbon footprint to develop.
arXiv Detail & Related papers (2022-05-02T17:49:50Z)
- Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity [35.84448624327473]
We simplify the MoE routing algorithm and design intuitive improved models with reduced communication and computational costs (a schematic top-1 routing sketch follows this list).
We show large sparse models may be trained, for the first time, with lower precision (bfloat16) formats.
We design models based on T5-Base and T5-Large to obtain up to 7x increases in pre-training speed with the same computational resources.
arXiv Detail & Related papers (2021-01-11T16:11:52Z)
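As a loose illustration of the scaling-law entries above (estimating constant terms from small models and extrapolating), the sketch below fits a simple power law, loss ≈ A · N^(−α), to made-up (model size, loss) points in log space. The actual papers use richer parameterizations (e.g., an irreducible-loss term and a data-size dependence); every number here is hypothetical.

```python
import numpy as np

# Simplest power-law form: L(N) ≈ A * N**(-alpha). Taking logs gives a
# straight line, log L = log A - alpha * log N, so the constants can be
# estimated by ordinary least squares on small-model runs.
N = np.array([1e6, 5e6, 2e7, 6e7])   # model sizes (parameters), hypothetical
L = np.array([4.6, 3.9, 3.4, 3.1])   # validation losses, hypothetical

slope, intercept = np.polyfit(np.log(N), np.log(L), deg=1)
alpha, A = -slope, np.exp(intercept)
print(f"fitted: L(N) = {A:.1f} * N^(-{alpha:.3f})")

# Extrapolate the fitted law to a larger model size.
N_big = 1.3e9
print(f"predicted loss at {N_big:.1e} params: {A * N_big**(-alpha):.2f}")
```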
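For the "simplified MoE routing" mentioned in the Switch Transformers entry, here is a schematic NumPy sketch of top-1 ("switch") routing: each token is dispatched to the single highest-probability expert, and that expert's output is scaled by the router probability. This is an illustration only; capacity factors, load-balancing losses, and the bfloat16 details from the paper are omitted, and all shapes are made up.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def switch_top1_route(tokens, router_weights):
    """Top-1 ("switch") routing: pick one expert per token and return the
    router probability used to scale that expert's output."""
    logits = tokens @ router_weights                  # [num_tokens, num_experts]
    probs = softmax(logits, axis=-1)
    expert_idx = probs.argmax(axis=-1)                # one expert per token
    gate = probs[np.arange(tokens.shape[0]), expert_idx]
    return expert_idx, gate

# Toy usage: 4 tokens, d_model = 8, 3 experts (all dimensions hypothetical).
rng = np.random.default_rng(0)
tokens = rng.standard_normal((4, 8))
router_weights = rng.standard_normal((8, 3))
expert_idx, gate = switch_top1_route(tokens, router_weights)
print(expert_idx, gate)
```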
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.