Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and
Few-Shot Learning
- URL: http://arxiv.org/abs/2110.04725v2
- Date: Tue, 12 Oct 2021 02:25:35 GMT
- Title: Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and
Few-Shot Learning
- Authors: Shaohua Wu, Xudong Zhao, Tong Yu, Rongguo Zhang, Chong Shen, Hongli
Liu, Feng Li, Hong Zhu, Jiangang Luo, Liang Xu, Xuanwei Zhang
- Abstract summary: Recent work like GPT-3 has demonstrated excellent Zero-Shot and Few-Shot learning performance on many natural language processing (NLP) tasks.
We propose a method that incorporates large-scale distributed training performance into model architecture design.
- Score: 18.932100477957462
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work like GPT-3 has demonstrated excellent performance of
Zero-Shot and Few-Shot learning on many natural language processing (NLP)
tasks by scaling up model size, dataset size and the amount of computation.
However, training a model like GPT-3 requires a huge amount of computational
resources, which makes it challenging for researchers. In this work, we
propose a method that incorporates large-scale distributed training
performance into model architecture design. With this method, Yuan 1.0,
currently the largest singleton language model with 245B parameters, achieves
excellent performance on thousands of GPUs during training and
state-of-the-art results on NLP tasks. A data processing method is designed
to efficiently filter massive amounts of raw data, and the current largest
high-quality Chinese corpus, with 5 TB of high-quality text, is built with
this method. In addition, a calibration and label expansion method is
proposed to improve Zero-Shot and Few-Shot performance, and steady
improvement is observed in the accuracy of various tasks. Yuan 1.0 shows a
strong capacity for natural language generation, and the generated articles
are difficult to distinguish from human-written ones.
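The abstract mentions a calibration and label expansion method for Zero-Shot and Few-Shot prediction but does not describe it in detail. The following is a minimal, hypothetical sketch of what such a scheme could look like, assuming calibration against a content-free input and expansion of each class into several label words; the `score_label` callable and the expansion dictionary are placeholders for illustration, not Yuan 1.0's actual implementation.

```python
# Illustrative sketch only: assumes (a) each class is expanded into several
# label surface forms and (b) label scores are calibrated against a
# content-free input to remove the model's prior bias toward certain labels.
from typing import Callable, Dict, List


def calibrated_zero_shot_classify(
    score_label: Callable[[str, str], float],  # hypothetical: log p(label | prompt)
    prompt: str,
    label_expansions: Dict[str, List[str]],    # class -> alternative label words
    content_free_prompt: str = "N/A",
) -> str:
    """Pick the class whose best expanded label has the highest calibrated score."""
    best_class, best_score = None, float("-inf")
    for cls, surface_forms in label_expansions.items():
        for form in surface_forms:
            raw = score_label(prompt, form)
            bias = score_label(content_free_prompt, form)  # prior toward this label word
            calibrated = raw - bias                        # subtract the label bias
            if calibrated > best_score:
                best_class, best_score = cls, calibrated
    return best_class


if __name__ == "__main__":
    # Toy scorer standing in for a real language-model query.
    toy_scores = {
        ("The movie was wonderful.", "good"): -1.0,
        ("The movie was wonderful.", "great"): -0.8,
        ("The movie was wonderful.", "bad"): -4.0,
        ("The movie was wonderful.", "terrible"): -5.0,
        ("N/A", "good"): -2.0, ("N/A", "great"): -2.5,
        ("N/A", "bad"): -2.0, ("N/A", "terrible"): -2.5,
    }
    scorer = lambda prompt, label: toy_scores[(prompt, label)]
    expansions = {"positive": ["good", "great"], "negative": ["bad", "terrible"]}
    print(calibrated_zero_shot_classify(scorer, "The movie was wonderful.", expansions))
```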
Related papers
- Benchmarking Pre-trained Large Language Models' Potential Across Urdu NLP tasks [0.9786690381850356]
Large Language Models (LLMs) pre-trained on multilingual data have revolutionized natural language processing research.
This study presents an in-depth examination of prominent LLMs, across 14 tasks using 15 Urdu datasets.
Experiments show that SOTA models surpass encoder-decoder pre-trained language models on all Urdu NLP tasks under zero-shot learning.
arXiv Detail & Related papers (2024-05-24T11:30:37Z)
- Text Quality-Based Pruning for Efficient Training of Language Models [66.66259229732121]
We propose a novel method for numerically evaluating text quality in large unlabelled NLP datasets.
Using this text quality metric, the paper establishes a framework for identifying and eliminating low-quality text instances.
Experimental results over multiple models and datasets demonstrate the efficacy of this approach.
arXiv Detail & Related papers (2024-04-26T18:01:25Z)
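The entry above prunes training data with a numerical text-quality metric, but the metric itself is not described in the summary. The sketch below only illustrates threshold-based pruning, using two cheap heuristics (alphabetic ratio and line repetition) as assumed stand-ins for the paper's actual metric.

```python
# Hedged sketch of quality-based corpus pruning; the scoring function is a toy
# heuristic, not the metric proposed in the paper above.
def quality_score(text: str) -> float:
    """Toy quality score: mix of alphabetic/whitespace ratio and line uniqueness."""
    if not text:
        return 0.0
    alpha_ratio = sum(ch.isalpha() or ch.isspace() for ch in text) / len(text)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    unique_ratio = len(set(lines)) / len(lines) if lines else 0.0
    return 0.7 * alpha_ratio + 0.3 * unique_ratio  # weights chosen arbitrarily


def prune_corpus(docs, threshold: float = 0.8):
    """Keep only documents whose heuristic quality score clears the threshold."""
    return [doc for doc in docs if quality_score(doc) >= threshold]


if __name__ == "__main__":
    corpus = [
        "A well formed paragraph about language models and their training data.",
        "buy now!!! $$$ click>>> http spam spam spam 1234567890 %%%",
    ]
    # The noisy second document scores below the threshold and is dropped.
    print(prune_corpus(corpus))
```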
- Smaller Language Models are capable of selecting Instruction-Tuning Training Data for Larger Language Models [39.65879784788677]
We introduce a novel training data selection method based on the learning percentage of the samples.
We assert that current language models possess the capability to autonomously select high-quality training data.
Our paper introduces a novel approach to training data selection, showcasing a more efficient alternative.
arXiv Detail & Related papers (2024-02-16T03:39:37Z)
- CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder- and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z)
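CodeGen2's unification of encoder- and decoder-style training into a single prefix-LM rests on an attention pattern that is bidirectional over the prefix and causal over the suffix. The sketch below builds such a mask as a rough illustration; it is not taken from the CodeGen2 codebase.

```python
# Rough illustration of a prefix-LM attention mask: the first `prefix_len`
# tokens attend to each other bidirectionally, while the remaining tokens
# attend causally (to the prefix and to earlier suffix tokens only).
import numpy as np


def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Return a (seq_len, seq_len) boolean mask; True means attention is allowed."""
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # standard causal mask
    mask[:prefix_len, :prefix_len] = True  # full bidirectional attention inside the prefix
    return mask


if __name__ == "__main__":
    # 6 tokens total, the first 3 form the bidirectional prefix.
    print(prefix_lm_mask(6, 3).astype(int))
```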
- Training Large Language Models Efficiently with Sparsity and Dataflow [3.1780195670658378]
This paper demonstrates an end-to-end training flow for a large language model, a 13-billion-parameter GPT, using sparsity and dataflow.
We show that we can successfully train GPT 13B to the same quality as the dense GPT 13B model, while achieving an end-to-end speedup of 4.5x over a dense A100 baseline.
arXiv Detail & Related papers (2023-04-11T21:37:13Z)
- Selective In-Context Data Augmentation for Intent Detection using Pointwise V-Information [100.03188187735624]
We introduce a novel approach based on PLMs and pointwise V-information (PVI), a metric that can measure the usefulness of a datapoint for training a model.
Our method first fine-tunes a PLM on a small seed set of training data and then synthesizes new datapoints - utterances that correspond to given intents.
Our method is thus able to leverage the expressive power of large language models to produce diverse training data.
arXiv Detail & Related papers (2023-02-10T07:37:49Z)
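The entry above filters synthesized utterances by pointwise V-information (PVI). As a hedged sketch, the PVI of a pair (x, y) can be computed as the log-probability of y given x under one model minus the log-probability of y given an empty input under another; the two probability callables below are hypothetical placeholders rather than a real model API.

```python
# Hedged sketch of PVI-based filtering: PVI(x -> y) = log2 p_full(y | x)
# - log2 p_null(y | empty input). The probability functions are placeholders.
import math
from typing import Callable, List, Tuple


def pvi(p_full: float, p_null: float) -> float:
    """PVI in bits for one (input, label) pair, given the two model probabilities."""
    return math.log2(p_full) - math.log2(p_null)


def filter_by_pvi(
    examples: List[Tuple[str, str]],
    prob_with_input: Callable[[str, str], float],  # hypothetical: p(intent | utterance)
    prob_without_input: Callable[[str], float],    # hypothetical: p(intent | empty input)
    threshold: float = 0.0,
) -> List[Tuple[str, str]]:
    """Keep synthesized (utterance, intent) pairs whose PVI exceeds the threshold."""
    kept = []
    for utterance, intent in examples:
        if pvi(prob_with_input(utterance, intent), prob_without_input(intent)) > threshold:
            kept.append((utterance, intent))
    return kept


if __name__ == "__main__":
    # Toy probabilities standing in for two fine-tuned classifiers.
    examples = [("book a table for two", "restaurant_booking"),
                ("umm hello there", "restaurant_booking")]
    keep = filter_by_pvi(examples,
                         prob_with_input=lambda u, y: 0.9 if "table" in u else 0.2,
                         prob_without_input=lambda y: 0.3)
    print(keep)  # only the informative utterance survives
```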
- PaLM: Scaling Language Modeling with Pathways [180.69584031908113]
We trained a 540-billion-parameter, densely activated Transformer language model, which we call the Pathways Language Model (PaLM).
We trained PaLM on 6144 TPU v4 chips using Pathways, a new ML system which enables highly efficient training across multiple TPU Pods.
We demonstrate continued benefits of scaling by achieving state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks.
arXiv Detail & Related papers (2022-04-05T16:11:45Z)
- GLaM: Efficient Scaling of Language Models with Mixture-of-Experts [84.33607245023049]
We propose and develop a family of language models named GLaM (Generalist Language Model).
GLaM uses a sparsely activated mixture-of-experts architecture to scale the model capacity while also incurring substantially less training cost compared to dense variants.
It consumes only 1/3 of the energy used to train GPT-3 and requires half the FLOPs for inference, while still achieving better overall zero-shot and one-shot performance across 29 NLP tasks.
arXiv Detail & Related papers (2021-12-13T18:58:19Z)
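GLaM's sparsely activated mixture-of-experts routes each token to a small subset of expert networks, so most expert parameters are untouched for any given token. The following is a minimal top-2 gating sketch with toy experts and a toy gating matrix; it only illustrates sparse activation, not GLaM's production routing.

```python
# Minimal, illustrative top-2 mixture-of-experts layer: each token is routed to
# only its two highest-scoring experts (sparse activation), not to all of them.
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def moe_top2(tokens: np.ndarray, gate_w: np.ndarray, experts) -> np.ndarray:
    """tokens: (n, d); gate_w: (d, n_experts); experts: list of callables (d,) -> (d,)."""
    gate_probs = softmax(tokens @ gate_w)              # (n, n_experts) routing weights
    out = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        top2 = np.argsort(gate_probs[i])[-2:]          # indices of the two best experts
        weights = gate_probs[i][top2]
        weights = weights / weights.sum()              # renormalise over the chosen pair
        for w, e_idx in zip(weights, top2):
            out[i] += w * experts[e_idx](tok)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n_experts = 8, 4
    experts = [lambda x, W=rng.normal(size=(d, d)): np.tanh(x @ W) for _ in range(n_experts)]
    tokens = rng.normal(size=(3, d))
    gate_w = rng.normal(size=(d, n_experts))
    print(moe_top2(tokens, gate_w, experts).shape)     # expected: (3, 8)
```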
- ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [25.430130072811075]
We propose a unified framework named ERNIE 3.0 for pre-training large-scale knowledge enhanced models.
It fuses an auto-regressive network and an auto-encoding network, so that the trained model can be easily tailored to both natural language understanding and generation tasks.
We trained the model with 10 billion parameters on a 4TB corpus consisting of plain texts and a large-scale knowledge graph.
arXiv Detail & Related papers (2021-07-05T16:54:59Z)
- Language Models are Few-Shot Learners [61.36677350504291]
We show that scaling up language models greatly improves task-agnostic, few-shot performance.
We train GPT-3, an autoregressive language model with 175 billion parameters, and test its performance in the few-shot setting.
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks.
arXiv Detail & Related papers (2020-05-28T17:29:03Z)
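GPT-3's few-shot setting conditions the model on a handful of demonstrations placed directly in the prompt, with no gradient updates. The sketch below assembles such a k-shot prompt; the Q/A template is an assumption for illustration, not GPT-3's exact evaluation format.

```python
# Minimal sketch of few-shot prompt construction in the GPT-3 style: k labelled
# demonstrations are concatenated in front of the query, and the model is asked
# to continue the text. The template here is illustrative only.
from typing import List, Tuple


def build_few_shot_prompt(
    demonstrations: List[Tuple[str, str]],  # (input, output) pairs
    query: str,
    input_prefix: str = "Q: ",
    output_prefix: str = "A: ",
) -> str:
    parts = []
    for x, y in demonstrations:
        parts.append(f"{input_prefix}{x}\n{output_prefix}{y}")
    parts.append(f"{input_prefix}{query}\n{output_prefix}")  # the model completes this line
    return "\n\n".join(parts)


if __name__ == "__main__":
    demos = [
        ("Translate 'bonjour' to English.", "hello"),
        ("Translate 'merci' to English.", "thank you"),
    ]
    print(build_few_shot_prompt(demos, "Translate 'chat' to English."))
```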
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.