Z-Code++: A Pre-trained Language Model Optimized for Abstractive
Summarization
- URL: http://arxiv.org/abs/2208.09770v2
- Date: Wed, 7 Jun 2023 17:13:29 GMT
- Title: Z-Code++: A Pre-trained Language Model Optimized for Abstractive
Summarization
- Authors: Pengcheng He, Baolin Peng, Liyang Lu, Song Wang, Jie Mei, Yang Liu,
Ruochen Xu, Hany Hassan Awadalla, Yu Shi, Chenguang Zhu, Wayne Xiong, Michael
Zeng, Jianfeng Gao, Xuedong Huang
- Abstract summary: Z-Code++ is a new pre-trained language model optimized for abstractive text summarization.
The model is first pre-trained using text corpora for language understanding, and then is continually pre-trained on summarization corpora for grounded text generation.
Our model is parameter-efficient in that it outperforms the 600x larger PaLM-540B on XSum, and the finetuned 200x larger GPT3-175B on SAMSum.
- Score: 108.09419317477986
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: This paper presents Z-Code++, a new pre-trained language model optimized for
abstractive text summarization. The model extends the state-of-the-art
encoder-decoder model using three techniques. First, we use a two-phase
pre-training process to improve the model's performance on low-resource
summarization tasks. The model is first pre-trained using text corpora for
language understanding, and then is continually pre-trained on summarization
corpora for grounded text generation. Second, we replace self-attention layers
in the encoder with disentangled attention layers, where each word is
represented using two vectors that encode its content and position,
respectively. Third, we use fusion-in-encoder, a simple yet effective method of
encoding long sequences in a hierarchical manner. Z-Code++ creates a new state
of the art on 9 out of 13 text summarization tasks across 5 languages. Our model
is parameter-efficient in that it outperforms the 600x larger PaLM-540B on
XSum, and the finetuned 200x larger GPT3-175B on SAMSum. In zero-shot and
few-shot settings, our model substantially outperforms the competing models.
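Of the three techniques, fusion-in-encoder is the most self-contained to illustrate. Below is a minimal PyTorch sketch of the idea as the abstract describes it: lower encoder layers process fixed-size chunks of a long input independently, and the top layers attend over the concatenated chunk representations so information is fused across chunks. The class name, layer counts, and chunk size are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionInEncoder(nn.Module):
    """Illustrative sketch (not Z-Code++'s released code): lower layers encode
    fixed-size chunks independently; upper layers attend over the re-assembled
    sequence so information is fused globally."""

    def __init__(self, d_model=512, nhead=8, n_local=4, n_global=2, chunk_len=512):
        super().__init__()
        local_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        global_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.local_layers = nn.TransformerEncoder(local_layer, num_layers=n_local)
        self.global_layers = nn.TransformerEncoder(global_layer, num_layers=n_global)
        self.chunk_len = chunk_len

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        b, n, d = x.shape
        pad = (-n) % self.chunk_len                # pad so chunks divide evenly
        x = F.pad(x, (0, 0, 0, pad))
        n_chunks = (n + pad) // self.chunk_len
        chunks = x.reshape(b * n_chunks, self.chunk_len, d)
        local = self.local_layers(chunks)          # each chunk encoded in isolation
        fused = local.reshape(b, n + pad, d)       # concatenate chunk outputs
        return self.global_layers(fused)[:, :n]    # global attention fuses chunks
```

Under these assumptions, `FusionInEncoder()(torch.randn(2, 4096, 512))` encodes a 4096-token input while the quadratic attention cost of the lower layers is paid only within 512-token chunks.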
Related papers
- Promises and Pitfalls of Generative Masked Language Modeling: Theoretical Framework and Practical Guidelines [74.42485647685272]
We focus on Generative Masked Language Models (GMLMs).
We train a model to fit conditional probabilities of the data distribution via masking, which are subsequently used as inputs to a Markov Chain to draw samples from the model.
We adapt the T5 model for iteratively-refined parallel decoding, achieving 2-3x speedup in machine translation with minimal sacrifice in quality.
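To make "iteratively-refined parallel decoding" concrete, here is a minimal Mask-Predict-style sketch: start from a fully masked sequence, predict every position in parallel, then re-mask the least-confident predictions and repeat. The `model_fn` interface, the linear re-masking schedule, and the function name are assumptions for illustration, not the paper's API.

```python
import torch

def mask_predict(model_fn, length, mask_id, n_iters=10):
    """Hypothetical sketch of iteratively-refined parallel decoding.
    model_fn(tokens) is assumed to return (length, vocab) log-probabilities."""
    tokens = torch.full((length,), mask_id, dtype=torch.long)
    for step in range(n_iters):
        log_probs = model_fn(tokens)                        # predict all positions at once
        conf, pred = log_probs.max(dim=-1)                  # per-position confidence
        tokens = pred
        n_mask = int(length * (1 - (step + 1) / n_iters))   # anneal the masked count
        if n_mask == 0:
            break
        lowest = conf.argsort()[:n_mask]                    # least-confident positions
        tokens[lowest] = mask_id                            # re-mask and refine next pass
    return tokens
```

Because every position is predicted in one forward pass per iteration, a handful of iterations replaces length-many autoregressive steps, which is where a speedup of the kind reported above comes from.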
arXiv Detail & Related papers (2024-07-22T18:00:00Z) - Generate to Understand for Representation [3.5325087487696463]
GUR is a pretraining framework that combines language modeling and contrastive learning objectives in a single training step.
GUR achieves impressive results without any labeled training data, outperforming all other pretrained baselines as a retriever at the recall benchmark in a zero-shot setting.
arXiv Detail & Related papers (2023-06-14T06:00:18Z) - CodeGen2: Lessons for Training LLMs on Programming and Natural Languages [116.74407069443895]
We unify encoder and decoder-based models into a single prefix-LM.
For learning methods, we explore the claim of a "free lunch" hypothesis.
For data distributions, the effect of a mixture distribution and multi-epoch training of programming and natural languages on model performance is explored.
arXiv Detail & Related papers (2023-05-03T17:55:25Z) - Designing BERT for Convolutional Networks: Sparse and Hierarchical
Masked Modeling [23.164631160130092]
We extend the success of BERT-style pre-training, i.e. masked image modeling, to convolutional networks (convnets).
We treat unmasked pixels as sparse voxels of 3D point clouds and use sparse convolution to encode them.
This is the first use of sparse convolution for 2D masked modeling.
arXiv Detail & Related papers (2023-01-09T18:59:50Z) - What Language Model Architecture and Pretraining Objective Work Best for
Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z) - UniXcoder: Unified Cross-Modal Pre-training for Code Representation [65.6846553962117]
We present UniXcoder, a unified cross-modal pre-trained model for programming languages.
We propose a one-to-one mapping method that transforms an AST into a sequence structure retaining all structural information from the tree.
We evaluate UniXcoder on five code-related tasks over nine datasets.
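To make the "one-to-one mapping" of an AST into a sequence concrete, here is a small Python sketch that serializes a tree with explicit open/close markers so the structure and leaf values can be recovered from the flat token sequence. UniXcoder's actual serialization and special tokens differ; this only demonstrates the idea of a lossless tree-to-sequence transform.

```python
import ast

def ast_to_sequence(node):
    """Illustrative lossless AST-to-sequence mapping (not UniXcoder's exact
    scheme): open/close markers preserve the tree shape, leaf fields keep names."""
    name = type(node).__name__
    tokens = [f"<{name}>"]
    # Keep leaf values (identifiers, constants) so the mapping stays invertible.
    for field, value in ast.iter_fields(node):
        if field in ("id", "arg", "name", "value", "attr") and isinstance(value, (str, int, float)):
            tokens.append(str(value))
    for child in ast.iter_child_nodes(node):
        tokens.extend(ast_to_sequence(child))
    tokens.append(f"</{name}>")
    return tokens

print(" ".join(ast_to_sequence(ast.parse("def add(a, b):\n    return a + b"))))
```

Because the markers nest exactly like the original nodes, the flat sequence can be parsed back into the tree, which is what makes the mapping one-to-one.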
arXiv Detail & Related papers (2022-03-08T04:48:07Z) - Meta Learning for Code Summarization [10.403206672504664]
We show that three SOTA models for code summarization work well on largely disjoint subsets of a large code-base.
We propose three meta-models that select the best candidate summary for a given code segment.
arXiv Detail & Related papers (2022-01-20T17:23:34Z) - Pre-training for Abstractive Document Summarization by Reinstating
Source Text [105.77348528847337]
This paper presents three pre-training objectives that allow us to pre-train a Seq2Seq-based abstractive summarization model on unlabeled text.
Experiments on two benchmark summarization datasets show that all three objectives can improve performance upon baselines.
arXiv Detail & Related papers (2020-04-04T05:06:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences of its use.