The Falcon Series of Open Language Models
- URL: http://arxiv.org/abs/2311.16867v2
- Date: Wed, 29 Nov 2023 19:45:10 GMT
- Title: The Falcon Series of Open Language Models
- Authors: Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro
Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel
Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune,
Baptiste Pannier, Guilherme Penedo
- Abstract summary: We introduce the Falcon series: 7B, 40B, and 180B parameter causal decoder-only models trained on diverse, high-quality corpora.
The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text, the largest openly documented pretraining run.
Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1.
- Score: 36.93493444130304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce the Falcon series: 7B, 40B, and 180B parameter causal
decoder-only models trained on diverse, high-quality corpora predominantly
assembled from web data. The largest model, Falcon-180B, has been trained on
over 3.5 trillion tokens of text, the largest openly documented pretraining
run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla,
and improves upon concurrently developed models such as LLaMA 2 or
Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining
and inference cost, making it, to our knowledge, one of the three best language
models in the world along with GPT-4 and PaLM-2-Large. We report detailed
evaluations, as well as a deep dive into the methods and custom tooling
employed to pretrain Falcon. Notably, we report on our custom distributed
training codebase, allowing us to efficiently pretrain these models on up to
4,096 A100s on AWS cloud infrastructure with limited interconnect. We release a
600B-token extract of our web dataset, as well as the Falcon-7/40/180B models
under a permissive license to foster open science and accelerate the
development of an open ecosystem of large language models.
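For context, the released Falcon checkpoints are distributed through the Hugging Face Hub and can be loaded with the transformers library. The sketch below is a minimal example, assuming the repository id tiiuae/falcon-7b, bfloat16 weights, and device_map="auto" (which requires the accelerate package); these details are assumptions not stated in the abstract itself.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Assumed Hub repository id for the released 7B checkpoint.
    model_id = "tiiuae/falcon-7b"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # half-precision weights to reduce memory use
        device_map="auto",           # requires the accelerate package
    )

    prompt = "The Falcon series of open language models"
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=False)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

The same pattern applies to the larger checkpoints, subject to available GPU memory.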
Related papers
- Falcon2-11B Technical Report [12.473984346805011]
We introduce Falcon2-11B, a foundation model trained on over five trillion tokens, and Falcon2-11B-vlm, which is a vision-to-text model.
We report our findings during the training of Falcon2-11B, which follows a multi-stage approach.
We also report the effect of doubling the batch size mid-training and how training loss spikes are affected by the learning rate.
arXiv Detail & Related papers (2024-07-20T14:23:15Z)
- Compact Language Models via Pruning and Knowledge Distillation [61.56557874432008]
Minitron models exhibit up to a 16% improvement in MMLU scores compared to training from scratch.
Deriving 8B and 4B models from an already pretrained 15B model using our approach requires up to 40x fewer training tokens per model compared to training from scratch.
arXiv Detail & Related papers (2024-07-19T21:47:57Z)
- Xmodel-LM Technical Report [13.451816134545163]
Xmodel-LM is a compact and efficient 1.1B language model pre-trained on around 2 trillion tokens.
It exhibits remarkable performance despite its smaller size.
arXiv Detail & Related papers (2024-06-05T02:12:06Z)
- Yi: Open Foundation Models by 01.AI [42.94680878285869]
The Yi model family is based on 6B and 34B pretrained language models, which we then extend to chat models, 200K long-context models, depth-upscaled models, and vision-language models.
Our base models achieve strong performance on a wide range of benchmarks such as MMLU, and our finetuned chat models deliver strong human preference rates on major evaluation platforms such as AlpacaEval and Arena.
arXiv Detail & Related papers (2024-03-07T16:52:49Z)
- Tandem Transformers for Inference Efficient LLMs [49.75726447408795]
We introduce a novel architecture, Tandem Transformers, to improve LLM inference efficiency.
This architecture uniquely combines a small autoregressive model and a large model operating in block mode.
On the PaLM2 pretraining dataset, a tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy.
arXiv Detail & Related papers (2024-02-13T18:24:08Z)
- "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow [5.036273913335737]
We train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $187 and $800 respectively.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
arXiv Detail & Related papers (2023-06-05T21:38:30Z)
- The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only [48.498376125522114]
We show that properly filtered and deduplicated web data alone can lead to powerful models; a minimal exact-deduplication sketch is given after this list.
We release an extract of 600 billion tokens from our RefinedWeb dataset, and 1.3B and 7.5B parameter language models trained on it.
arXiv Detail & Related papers (2023-06-01T20:03:56Z)
- Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization [108.09419317477986]
Z-Code++ is a new pre-trained language model optimized for abstractive text summarization.
The model is first pre-trained using text corpora for language understanding, and then is continually pre-trained on summarization corpora for grounded text generation.
Our model is parameter-efficient in that it outperforms the 600x larger PaLM-540B on XSum, and the finetuned 200x larger GPT3-175B on SAMSum.
arXiv Detail & Related papers (2022-08-21T01:00:54Z)
- bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE, respectively, by reusing models of almost half their size.
arXiv Detail & Related papers (2021-10-14T04:05:25Z)
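The RefinedWeb entry above rests on the claim that filtering and deduplication alone can make web data competitive with curated corpora. As a minimal illustration, the sketch below shows only the exact-deduplication step via content hashing; the normalization and hashing choices are assumptions for illustration, and the actual pipeline also applies heuristic filtering and fuzzy deduplication.

    import hashlib

    def normalize(text: str) -> str:
        # Collapse whitespace and lowercase so trivially different copies hash identically.
        return " ".join(text.lower().split())

    def exact_dedup(documents):
        # Keep the first occurrence of each distinct normalized document.
        seen = set()
        kept = []
        for doc in documents:
            digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
        return kept

    corpus = [
        "Falcon is a series of open language models.",
        "Falcon is a series of   open language models.",  # duplicate after normalization
        "RefinedWeb is built from filtered web data.",
    ]
    print(exact_dedup(corpus))  # the normalized duplicate is dropped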