The Falcon Series of Open Language Models
- URL: http://arxiv.org/abs/2311.16867v2
- Date: Wed, 29 Nov 2023 19:45:10 GMT
- Title: The Falcon Series of Open Language Models
- Authors: Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro
Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel
Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune,
Baptiste Pannier, Guilherme Penedo
- Abstract summary: We introduce the Falcon series: 7B, 40B, and 180B parameter causal decoder-only models trained on diverse, high-quality corpora.
The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text, the largest openly documented pretraining run.
Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1.
- Score: 36.93493444130304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the Falcon series: 7B, 40B, and 180B parameter causal
decoder-only models trained on diverse, high-quality corpora predominantly
assembled from web data. The largest model, Falcon-180B, has been trained on
over 3.5 trillion tokens of text, the largest openly documented pretraining
run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla,
and improves upon concurrently developed models such as LLaMA 2 or
Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining
and inference cost, making it, to our knowledge, one of the three best language
models in the world along with GPT-4 and PaLM-2-Large. We report detailed
evaluations, as well as a deep dive into the methods and custom tooling
employed to pretrain Falcon. Notably, we report on our custom distributed
training codebase, allowing us to efficiently pretrain these models on up to
4,096 A100s on AWS cloud infrastructure with limited interconnect. We release a
600B-token extract of our web dataset, as well as the Falcon-7/40/180B models
under a permissive license to foster open science and accelerate the
development of an open ecosystem of large language models.
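For context, the released Falcon checkpoints are distributed through the Hugging Face Hub and can be loaded with the transformers library. The sketch below is a minimal example, assuming the repository id tiiuae/falcon-7b, bfloat16 weights, and device_map="auto" (which requires the accelerate package); these details are assumptions not stated in the abstract itself.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumed Hub repository id for the released 7B checkpoint.
    model_id = "tiiuae/falcon-7b"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory use
        device_map="auto",           # requires the accelerate package
    )

    prompt = "The Falcon series of open language models"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The same pattern applies to the larger checkpoints, subject to available GPU memory.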
Related papers
- Falcon2-11B Technical Report [12.473984346805011]
We introduce Falcon2-11B, a foundation model trained on over five trillion tokens, and Falcon2-11B-vlm, which is a vision-to-text model.
We report our findings during the training of Falcon2-11B, which follows a multi-stage approach.
We also report the effect of doubling the batch size mid-training and how training loss spikes are affected by the learning rate.
arXiv Detail & Related papers (2024-07-20T14:23:15Z)
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
- Xmodel-LM Technical Report [13.451816134545163]
Xmodel-LM is a compact and efficient 1.1B language model pre-trained on around 2 trillion tokens.
It exhibits remarkable performance despite its smaller size.
arXiv Detail & Related papers (2024-06-05T02:12:06Z)
- Yi: Open Foundation Models by 01.AI [42.94680878285869]
The Yi model family is based on 6B and 34B pretrained language models, which we then extend to chat models, 200K long-context models, depth-upscaled models, and vision-language models.
Our base models achieve strong performance on a wide range of benchmarks such as MMLU, and our finetuned chat models deliver strong human preference rates on major evaluation platforms such as AlpacaEval and Arena.
arXiv Detail & Related papers (2024-03-07T16:52:49Z)
- Tandem Transformers for Inference Efficient LLMs [49.75726447408795]
We introduce a novel architecture, Tandem Transformers, to improve LLM inference efficiency.
This architecture uniquely combines a small autoregressive model and a large model operating in block mode.
On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy.
arXiv Detail & Related papers (2024-02-13T18:24:08Z)
- "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow [5.036273913335737]
We train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $187 and $800 respectively.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
arXiv Detail & Related papers (2023-06-05T21:38:30Z)
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only [48.498376125522114]
We show that properly filtered and deduplicated web data alone can lead to powerful models; a minimal exact-deduplication sketch is given after this list.
We release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3B and 7.5B parameter language models trained on it.
arXiv Detail & Related papers (2023-06-01T20:03:56Z)
- Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization [108.09419317477986]
Z-Code++ is a new pre-trained language model optimized for abstractive text summarization.
The model is first pre-trained using text corpora for language understanding, and then is continually pre-trained on summarization corpora for grounded text generation.
Our model is parameter-efficient in that it outperforms the 600x larger PaLM-540B on XSum, and the finetuned 200x larger GPT3-175B on SAMSum.
arXiv Detail & Related papers (2022-08-21T01:00:54Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
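The RefinedWeb entry above rests on the claim that filtering and deduplication alone can make web data competitive with curated corpora. As a minimal illustration, the sketch below shows only the exact-deduplication step via content hashing; the normalization and hashing choices are assumptions for illustration, and the actual pipeline also applies heuristic filtering and fuzzy deduplication.

    import hashlib

    def normalize(text: str) -> str:
        # Collapse whitespace and lowercase so trivially different copies hash identically.
        return " ".join(text.lower().split())

    def exact_dedup(documents):
        # Keep the first occurrence of each distinct normalized document.
        seen = set()
        kept = []
        for doc in documents:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
        return kept

    corpus = [
        "Falcon is a series of open language models.",
        "Falcon is a series of   open language models.",  # duplicate after normalization
        "RefinedWeb is built from filtered web data.",
    ]
    print(exact_dedup(corpus))  # the normalized duplicate is dropped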