OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model
Pre-trained from Scratch
- URL: http://arxiv.org/abs/2309.10706v2
- Date: Sun, 1 Oct 2023 18:21:58 GMT
- Title: OpenBA: An Open-sourced 15B Bilingual Asymmetric seq2seq Model
Pre-trained from Scratch
- Authors: Juntao Li, Zecheng Tang, Yuyang Ding, Pinzheng Wang, Pei Guo, Wangjie
You, Dan Qiao, Wenliang Chen, Guohong Fu, Qiaoming Zhu, Guodong Zhou, Min
Zhang
- Abstract summary: This report presents OpenBA, an open-sourced 15B bilingual asymmetric seq2seq model.
We enhance OpenBA with effective and efficient techniques as well as adopt a three-stage training strategy to train the model from scratch.
Our solution can achieve very competitive performance with only 380B tokens.
- Score: 41.45002811060755
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) with billions of parameters have demonstrated
outstanding performance on various natural language processing tasks. This
report presents OpenBA, an open-sourced 15B bilingual asymmetric seq2seq model,
to contribute an LLM variant to the Chinese-oriented open-source model
community. We enhance OpenBA with effective and efficient techniques as well as
adopt a three-stage training strategy to train the model from scratch. Our
solution can also achieve very competitive performance with only 380B tokens,
which is better than LLaMA-70B on the BELEBELE benchmark, BLOOM-176B on the
MMLU benchmark, and GLM-130B on the C-Eval (hard) benchmark. This report provides
the main details to pre-train an analogous model, including pre-training data
processing, Bilingual Flan data collection, the empirical observations that
inspire our model architecture design, training objectives of different stages,
and other enhancement techniques. Additionally, we also provide the fine-tuning
details of OpenBA on four downstream tasks. We have refactored our code to
follow the design principles of the Huggingface Transformers Library, making it
more convenient for developers to use, and released checkpoints of different
training stages at https://huggingface.co/openBA. More details of our project
are available at https://github.com/OpenNLG/openBA.git.
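Usage sketch (illustrative, not from the paper): since the released code follows the design principles of the Hugging Face Transformers library, loading a checkpoint should look roughly like the Python below. The repository id, the AutoModelForSeq2SeqLM class choice, and the prompt are assumptions made for illustration; see https://huggingface.co/openBA for the actual checkpoint names.

# Minimal sketch of loading a released OpenBA checkpoint with Hugging Face
# Transformers. The repository id is hypothetical; consult
# https://huggingface.co/openBA for the real training-stage checkpoints.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_id = "OpenBA/OpenBA-LM"  # hypothetical id, chosen for illustration only

# trust_remote_code is assumed here because the asymmetric encoder-decoder
# architecture may ship with custom modeling code.
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id, trust_remote_code=True)

# Simple bilingual prompt; the prompt format is illustrative, not documented.
inputs = tokenizer("Translate to Chinese: Large language models are powerful.",
                   return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))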
Related papers
- LBC: Language-Based-Classifier for Out-Of-Variable Generalization [14.033963471962823]
Large Language Models (LLMs) have great success in natural language processing tasks such as response generation.
We find that the pre-trained knowledge of LLMs enables them to interpret new variables that appear in a test without additional training.
We propose a Language-Based-Classifier (LBC) to maximize the benefits of LLMs and outperform traditional machine learning models (TMLs) on Out-of-Variable tasks.
arXiv Detail & Related papers (2024-08-20T15:05:02Z)
- xGen-MM (BLIP-3): A Family of Open Large Multimodal Models [157.44696790158784]
This report introduces xGen-MM, a framework for developing Large Multimodal Models (LMMs).
The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs.
Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks.
arXiv Detail & Related papers (2024-08-16T17:57:01Z)
- OpenBA-V2: Reaching 77.3% High Compression Ratio with Fast Multi-Stage Pruning [47.37717859805702]
We introduce OpenBA-V2, a 3.4B model derived from multi-stage compression and continual pre-training from the original 15B OpenBA model.
OpenBA-V2 utilizes more data, more flexible training objectives, and techniques such as layer pruning, neural pruning, and vocabulary pruning to achieve a compression rate of 77.3% with minimal performance loss.
arXiv Detail & Related papers (2024-05-09T17:53:28Z)
- OpenELM: An Efficient Language Model Family with Open Training and Inference Framework [26.741510071520658]
We release OpenELM, a state-of-the-art open language model.
With a parameter budget of approximately one billion, OpenELM exhibits a 2.36% improvement in accuracy compared to OLMo.
arXiv Detail & Related papers (2024-04-22T23:12:03Z)
- YAYI 2: Multilingual Open-Source Large Language Models [53.92832054643197]
We propose YAYI 2, including both base and chat models, with 30 billion parameters.
YAYI 2 is pre-trained from scratch on a multilingual corpus which contains 2.65 trillion tokens filtered by our pre-training data processing pipeline.
The base model is aligned with human values through supervised fine-tuning with millions of instructions and reinforcement learning from human feedback.
arXiv Detail & Related papers (2023-12-22T17:34:47Z)
- Prompt2Model: Generating Deployable Models from Natural Language Instructions [74.19816829003729]
Large language models (LLMs) enable system builders to create competent NLP systems through prompting.
In some ways, however, LLMs are a step backward from traditional special-purpose NLP models.
We propose Prompt2Model, a general-purpose method that takes a natural language task description, like the prompts provided to LLMs, and uses it to train a special-purpose model suited to deployment.
arXiv Detail & Related papers (2023-08-23T17:28:21Z)
- PolyLM: An Open Source Polyglot Large Language Model [57.64420154135178]
We present PolyLM, a multilingual large language model (LLM) trained on 640 billion (B) tokens, available in two model sizes: 1.7B and 13B.
To enhance its multilingual capabilities, we 1) integrate bilingual data into training data; and 2) adopt a curriculum learning strategy that increases the proportion of non-English data from 30% in the first stage to 60% in the final stage during pre-training.
Further, we propose a multilingual self-instruct method which automatically generates 132.7K diverse multilingual instructions for model fine-tuning.
arXiv Detail & Related papers (2023-07-12T09:00:37Z)
- Data-Efficient French Language Modeling with CamemBERTa [0.0]
We introduce CamemBERTa, a French DeBERTa model that builds upon the DeBERTaV3 architecture and training objective.
We evaluate our model's performance on a variety of French downstream tasks and datasets.
arXiv Detail & Related papers (2023-06-02T12:45:34Z)
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder-based and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)