OpenELM: An Efficient Language Model Family with Open Training and Inference Framework
- URL: http://arxiv.org/abs/2404.14619v2
- Date: Thu, 2 May 2024 00:30:57 GMT
- Title: OpenELM: An Efficient Language Model Family with Open Training and Inference Framework
- Authors: Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari
- Abstract summary: We release OpenELM, a state-of-the-art open language model.
With a parameter budget of approximately one billion parameters, OpenELM exhibits a 2.36% improvement in accuracy compared to OLMo.
- Score: 26.741510071520658
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The reproducibility and transparency of large language models are crucial for advancing open research, ensuring the trustworthiness of results, and enabling investigations into data and model biases, as well as potential risks. To this end, we release OpenELM, a state-of-the-art open language model. OpenELM uses a layer-wise scaling strategy to efficiently allocate parameters within each layer of the transformer model, leading to enhanced accuracy. For example, with a parameter budget of approximately one billion parameters, OpenELM exhibits a 2.36% improvement in accuracy compared to OLMo while requiring $2\times$ fewer pre-training tokens. Diverging from prior practices that only provide model weights and inference code, and pre-train on private datasets, our release includes the complete framework for training and evaluation of the language model on publicly available datasets, including training logs, multiple checkpoints, and pre-training configurations. We also release code to convert models to the MLX library for inference and fine-tuning on Apple devices. This comprehensive release aims to empower and strengthen the open research community, paving the way for future open research endeavors. Our source code along with pre-trained model weights and training recipes is available at \url{https://github.com/apple/corenet}. Additionally, OpenELM models can be found on HuggingFace at: \url{https://huggingface.co/apple/OpenELM}.
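The layer-wise scaling idea can be illustrated in a few lines. The sketch below is a hypothetical rendering, assuming linear interpolation of per-layer attention-head counts and FFN multipliers between configurable bounds; the ranges and function names are illustrative, not OpenELM's actual configuration.

```python
# Illustrative sketch of layer-wise scaling: instead of giving every
# transformer layer the same width, interpolate the number of attention
# heads and the FFN expansion ratio from the first to the last layer.
# The alpha/beta ranges below are hypothetical, not the paper's values.

def layer_wise_config(num_layers: int, d_model: int, d_head: int,
                      alpha: tuple = (0.5, 1.0),   # head-count scaling range
                      beta: tuple = (2.0, 4.0)):   # FFN multiplier range
    configs = []
    for i in range(num_layers):
        t = i / max(num_layers - 1, 1)             # 0 at first layer, 1 at last
        a = alpha[0] + (alpha[1] - alpha[0]) * t
        b = beta[0] + (beta[1] - beta[0]) * t
        n_heads = max(1, round(a * d_model / d_head))
        ffn_dim = round(b * d_model)
        configs.append({"layer": i, "n_heads": n_heads, "ffn_dim": ffn_dim})
    return configs

for cfg in layer_wise_config(num_layers=4, d_model=1024, d_head=64):
    print(cfg)
```

Early layers thus get fewer parameters and later layers more, under the same total budget as a uniform allocation.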
Related papers
- Mixture-Models: a one-stop Python Library for Model-based Clustering using various Mixture Models [4.60168321737677]
Mixture-Models is an open-source Python library for fitting Gaussian Mixture Models (GMMs) and their variants.
It streamlines the implementation and analysis of these models using various first/second order optimization routines.
The library provides user-friendly model evaluation tools, such as BIC, AIC, and log-likelihood estimation.
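As an illustration of this evaluation workflow (the Mixture-Models API itself is not reproduced here), the sketch below fits GMMs of increasing order and compares them via BIC, AIC, and log-likelihood using scikit-learn's GaussianMixture:

```python
# Model selection for a GMM via BIC/AIC, shown with scikit-learn's
# GaussianMixture as a stand-in for the Mixture-Models library.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (200, 2)), rng.normal(3, 0.5, (200, 2))])

for k in range(1, 5):
    gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
    # score() returns the mean per-sample log-likelihood.
    print(f"k={k}  BIC={gmm.bic(X):.1f}  AIC={gmm.aic(X):.1f}  "
          f"logL={gmm.score(X) * len(X):.1f}")
```

Lower BIC/AIC indicates a better trade-off between fit and model complexity; here k=2 should win.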
arXiv Detail & Related papers (2024-02-08T19:34:24Z) - A Split-and-Privatize Framework for Large Language Model Fine-Tuning [7.399324195843467]
In parameter-efficient fine-tuning, only a small subset of modules is trained on the downstream datasets.
We propose a Split-and-Privatize (SAP) framework, which mitigates privacy issues by adapting the existing split learning architecture.
The results indicate that it can enhance empirical privacy by 62% at the cost of a 1% degradation in model performance.
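A minimal sketch of the split-learning setup SAP builds on, assuming a client-side bottom module, a server-side top module, and simple Gaussian noise as a stand-in for SAP's actual privatization mechanism:

```python
# Split learning sketch: a client-side "bottom" module produces
# intermediate representations, a noise step stands in for the
# privatization (the real SAP mechanism may differ), and a server-side
# "top" module completes the forward pass.
import torch
import torch.nn as nn

bottom = nn.Sequential(nn.Embedding(1000, 64), nn.Linear(64, 64))    # client
top = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))  # server

tokens = torch.randint(0, 1000, (8, 16))           # a toy batch
h = bottom(tokens)                                 # client-side forward
h_private = h + 0.1 * torch.randn_like(h)          # stand-in perturbation
logits = top(h_private).mean(dim=1)                # server-side forward
print(logits.shape)                                # torch.Size([8, 2])
```

The client never sends raw text to the server, only (perturbed) intermediate activations.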
arXiv Detail & Related papers (2023-12-25T03:53:33Z) - The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
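The accounting behind compute-equivalent comparisons is simple: an equal budget of accelerator hours translates into a different token budget for each model, depending on its measured throughput. A toy sketch with made-up numbers:

```python
# Compute-equivalent comparison in accelerator hours: given each model's
# measured throughput, an equal hour budget implies a different token
# budget per model. All numbers below are illustrative.
def token_budget(accelerator_hours: float, tokens_per_second: float) -> float:
    return accelerator_hours * 3600 * tokens_per_second

budget_hours = 48.0
for name, tps in [("gpt2-style", 25_000.0), ("fast-lstm", 250_000.0)]:
    print(f"{name}: {token_budget(budget_hours, tps):,.0f} tokens in {budget_hours}h")
```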
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model Pre-trained from Scratch [41.45002811060755]
This report presents OpenBA, an open-sourced 15B bilingual asymmetric seq2seq model.
We enhance OpenBA with effective and efficient techniques and adopt a three-stage training strategy to train the model from scratch.
Our solution can achieve very competitive performance with only 380B tokens.
arXiv Detail & Related papers (2023-09-19T15:46:40Z) - "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow [5.036273913335737]
We train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at budgets of just $187 and $800 respectively.
Results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
arXiv Detail & Related papers (2023-06-05T21:38:30Z) - CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
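The prefix-LM unification hinges on the attention mask: positions in the prefix attend bidirectionally to each other, while suffix positions remain causal. A minimal sketch of such a mask (not CodeGen2's actual implementation):

```python
# Prefix-LM attention mask sketch: bidirectional attention inside the
# prefix, causal attention everywhere else.
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    # Causal (lower-triangular) base: position i may attend to j <= i.
    mask = torch.ones(seq_len, seq_len).tril().bool()
    # Prefix block attends bidirectionally to itself.
    mask[:prefix_len, :prefix_len] = True
    return mask  # True = attention allowed

print(prefix_lm_mask(seq_len=6, prefix_len=3).int())
```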
arXiv Detail & Related papers (2023-05-03T17:55:25Z) - Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster [0.14291940946857257]
We introduce Cerebras-GPT, a family of open compute-optimal language models scaled from 111M to 13B parameters.
We characterize the predictable power-law scaling and compare Cerebras-GPT with other publicly available models.
We release our pre-trained models and code, making this paper the first open and reproducible work comparing compute-optimal model scaling to models trained on fixed dataset sizes.
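Power-law scaling of this kind is typically characterized by fitting loss(N) = a * N^(-b) in log-log space. A sketch on synthetic data (the constants are made up, not Cerebras-GPT's measurements):

```python
# Fit a power law loss(N) = a * N**(-b) to (parameter count, loss) pairs
# via linear regression in log-log space. Data here is synthetic.
import numpy as np

params = np.array([111e6, 256e6, 590e6, 1.3e9, 2.7e9, 6.7e9, 13e9])
loss = 20.0 * params ** -0.08 + np.random.default_rng(0).normal(0, 0.01, 7)

# log L = log a - b * log N, so the slope of the fit is -b.
slope, intercept = np.polyfit(np.log(params), np.log(loss), 1)
print(f"fitted exponent b = {-slope:.3f}, prefactor a = {np.exp(intercept):.2f}")
```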
arXiv Detail & Related papers (2023-04-06T16:43:16Z) - Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
Often the fine-tuned models are available but their training data is not, creating a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
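The simplest instance of merging in parameter space is uniform averaging of state dicts; the paper's method is more sophisticated, but this sketch conveys the dataless setting:

```python
# Dataless parameter-space fusion in its simplest form: uniform averaging
# of state dicts. The paper's actual method reweights parameters rather
# than averaging them uniformly.
import torch

def average_state_dicts(state_dicts):
    keys = state_dicts[0].keys()
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(0)
            for k in keys}

m1 = torch.nn.Linear(4, 2)
m2 = torch.nn.Linear(4, 2)
merged = average_state_dicts([m1.state_dict(), m2.state_dict()])

m3 = torch.nn.Linear(4, 2)
m3.load_state_dict(merged)   # the fused model, built without any data
```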
arXiv Detail & Related papers (2022-12-19T20:46:43Z) - DSEE: Dually Sparsity-embedded Efficient Tuning of Pre-trained Language Models [152.29364079385635]
As pre-trained models grow bigger, the fine-tuning process can be time-consuming and computationally expensive.
We propose a framework for resource- and parameter-efficient fine-tuning by leveraging the sparsity prior in both weight updates and the final model weights.
Our proposed framework, dubbed Dually Sparsity-Embedded Efficient Tuning (DSEE), aims to achieve two key objectives: (i) parameter efficient fine-tuning and (ii) resource-efficient inference.
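A toy rendering of the sparsity prior on weight updates, assuming a low-rank factor plus a sparse correction on top of frozen pre-trained weights (DSEE's actual decomposition and training procedure are not reproduced here):

```python
# Sketch of a sparsity-embedded update: the fine-tuning delta is a
# low-rank term plus a sparse correction, so only a small fraction of
# parameters is trained while the base weights stay frozen.
import torch

d_out, d_in, rank = 64, 64, 4
W0 = torch.randn(d_out, d_in)                     # frozen pre-trained weight
U = torch.zeros(d_out, rank, requires_grad=True)  # low-rank factors (trained)
V = torch.randn(rank, d_in, requires_grad=True)
S = torch.zeros(d_out, d_in)                      # sparse correction
mask = torch.rand(d_out, d_in) < 0.01             # ~1% of entries trainable
S[mask] = 0.01 * torch.randn(int(mask.sum()))

W = W0 + U @ V + S                                # effective fine-tuned weight
print(f"trainable params: {U.numel() + V.numel() + int(mask.sum())} "
      f"vs dense {W0.numel()}")
```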
arXiv Detail & Related papers (2021-10-30T03:29:47Z) - bert2BERT: Towards Reusable Pretrained Language Models [51.078081486422896]
We propose bert2BERT, which can effectively transfer the knowledge of an existing smaller pre-trained model to a large model.
bert2BERT saves about 45% and 47% of the computational cost of pre-training BERT_BASE and GPT_BASE by reusing pre-trained models of about half their size.
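Knowledge transfer of this kind builds on function-preserving expansion (Net2Net-style): duplicated hidden units have their outgoing weights split so the enlarged network computes the same function as the small one. A minimal sketch, not bert2BERT's full procedure:

```python
# Function-preserving width expansion: duplicate hidden units of a small
# two-layer network and split their outgoing weights, so the widened
# network computes exactly the same function before further training.
import torch

W1 = torch.randn(4, 8)    # layer 1: hidden(4) x in(8)
W2 = torch.randn(3, 4)    # layer 2: out(3) x hidden(4)

idx = torch.tensor([0, 1, 2, 3, 0, 1])            # duplicate units 0 and 1
W1_big = W1[idx]                                   # new hidden size 6
counts = torch.bincount(idx, minlength=4).float()  # how often each unit appears
W2_big = (W2 / counts)[:, idx]                     # split outgoing weights

x = torch.randn(8)
y_small = W2 @ torch.relu(W1 @ x)
y_big = W2_big @ torch.relu(W1_big @ x)
print(torch.allclose(y_small, y_big, atol=1e-6))   # True: same function
```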
arXiv Detail & Related papers (2021-10-14T04:05:25Z) - Efficient Nearest Neighbor Language Models [114.40866461741795]
Non-parametric neural language models (NLMs) learn predictive distributions of text utilizing an external datastore.
We show how to achieve up to a 6x speed-up in inference while retaining comparable performance.
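A kNN-LM-style prediction interpolates the parametric model's next-token distribution with one induced by retrieved neighbors from the datastore. A toy sketch with made-up probabilities and distances:

```python
# kNN-LM sketch: mix the parametric LM's distribution with a distribution
# built from nearest neighbors in an external datastore. All values are
# illustrative.
import numpy as np

vocab = 5
p_lm = np.array([0.1, 0.4, 0.2, 0.2, 0.1])       # parametric model

# Retrieved neighbors: (distance, next_token) pairs from the datastore.
neighbors = [(0.5, 1), (0.8, 1), (1.2, 3)]
weights = np.exp(-np.array([d for d, _ in neighbors]))  # closer = heavier
p_knn = np.zeros(vocab)
for w, (_, tok) in zip(weights, neighbors):
    p_knn[tok] += w
p_knn /= p_knn.sum()

lam = 0.25                                        # interpolation weight
p = lam * p_knn + (1 - lam) * p_lm
print(p, p.sum())                                 # still a valid distribution
```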
arXiv Detail & Related papers (2021-09-09T12:32:28Z)