AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization
- URL: http://arxiv.org/abs/2008.11869v4
- Date: Thu, 27 May 2021 10:39:47 GMT
- Title: AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization
- Authors: Xinsong Zhang, Pengshuai Li, and Hang Li
- Abstract summary: We propose a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT).
For English, AMBERT takes both the sequence of words (fine-grained tokens) and the sequence of phrases (coarse-grained tokens) as input after tokenization.
Experiments have been conducted on benchmark datasets for Chinese and English, including CLUE, GLUE, SQuAD and RACE.
- Score: 13.082435183692393
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-trained language models such as BERT have exhibited remarkable
performances in many tasks in natural language understanding (NLU). The tokens
in the models are usually fine-grained in the sense that for languages like
English they are words or sub-words and for languages like Chinese they are
characters. In English, for example, there are multi-word expressions which
form natural lexical units and thus the use of coarse-grained tokenization also
appears to be reasonable. In fact, both fine-grained and coarse-grained
tokenizations have advantages and disadvantages for learning of pre-trained
language models. In this paper, we propose a novel pre-trained language model,
referred to as AMBERT (A Multi-grained BERT), on the basis of both fine-grained
and coarse-grained tokenizations. For English, AMBERT takes both the sequence
of words (fine-grained tokens) and the sequence of phrases (coarse-grained
tokens) as input after tokenization, employs one encoder for processing the
sequence of words and the other encoder for processing the sequence of the
phrases, utilizes shared parameters between the two encoders, and finally
creates a sequence of contextualized representations of the words and a
sequence of contextualized representations of the phrases. Experiments have
been conducted on benchmark datasets for Chinese and English, including CLUE,
GLUE, SQuAD and RACE. The results show that AMBERT can outperform BERT in all
cases, with particularly significant improvements for Chinese. We also develop
a method to improve the efficiency of AMBERT at inference time; the resulting
model still performs better than BERT at the same computational cost.
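To make the two-encoder idea above concrete, here is a minimal PyTorch sketch of a multi-grained encoder with shared parameters. It is only an illustration under assumptions, not the authors' released model: the class name, vocabulary sizes, layer counts, and the choice to share only the Transformer stack (with separate word and phrase embedding tables) are placeholders.

```python
import torch
import torch.nn as nn

class MultiGrainedEncoder(nn.Module):
    """Toy AMBERT-style model: separate embedding tables for the word-level
    and phrase-level vocabularies, but a single Transformer stack reused for
    both inputs (the shared encoder parameters described in the abstract)."""

    def __init__(self, fine_vocab=30000, coarse_vocab=50000,
                 d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.fine_embed = nn.Embedding(fine_vocab, d_model)
        self.coarse_embed = nn.Embedding(coarse_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)  # shared

    def forward(self, fine_ids, coarse_ids):
        # Two sequences of contextualized representations, one per granularity.
        fine_repr = self.encoder(self.fine_embed(fine_ids))
        coarse_repr = self.encoder(self.coarse_embed(coarse_ids))
        return fine_repr, coarse_repr

model = MultiGrainedEncoder()
fine = torch.randint(0, 30000, (1, 16))   # e.g. word/sub-word token ids
coarse = torch.randint(0, 50000, (1, 8))  # e.g. phrase token ids
f, c = model(fine, coarse)
print(f.shape, c.shape)  # (1, 16, 256) and (1, 8, 256)
```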
Related papers
- READIN: A Chinese Multi-Task Benchmark with Realistic and Diverse Input
Noises [87.70001456418504]
We construct READIN: a Chinese multi-task benchmark with REalistic And Diverse Input Noises.
READIN contains four diverse tasks and requests annotators to re-enter the original test data with two commonly used Chinese input methods: Pinyin input and speech input.
We experiment with a series of strong pretrained language models as well as robust training methods, and find that these models often suffer significant performance drops on READIN.
arXiv Detail & Related papers (2023-02-14T20:14:39Z) - CLOWER: A Pre-trained Language Model with Contrastive Learning over Word
and Character Representations [18.780841483220986]
Pre-trained Language Models (PLMs) have achieved remarkable performance gains across numerous downstream tasks in natural language understanding.
Most current models use Chinese characters as inputs and are not able to encode semantic information contained in Chinese words.
We propose a simple yet effective PLM, CLOWER, which adopts Contrastive Learning Over Word and charactER representations.
arXiv Detail & Related papers (2022-08-23T09:52:34Z) - PERT: Pre-training BERT with Permuted Language Model [24.92527883997854]
PERT is an auto-encoding model (like BERT) trained with a Permuted Language Model (PerLM) objective.
We permute a proportion of the input text, and the training objective is to predict the position of the original token.
We carried out extensive experiments on both Chinese and English NLU benchmarks.
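As a rough, self-contained illustration of the permutation objective just described, the sketch below permutes a proportion of toy token ids and records, for each permuted position, where its original token ended up. The function name, the 15% ratio, and the target encoding are assumptions made for illustration, not PERT's implementation.

```python
import random

def make_perlm_example(token_ids, permute_ratio=0.15, seed=0):
    """Toy permuted-LM example: shuffle a proportion of positions and,
    for each shuffled source position, record the index where its
    original token now sits (the prediction target)."""
    rng = random.Random(seed)
    n = len(token_ids)
    k = max(2, int(n * permute_ratio))      # how many positions to permute
    chosen = sorted(rng.sample(range(n), k))
    shuffled = chosen[:]
    while shuffled == chosen:               # ensure a non-trivial permutation
        rng.shuffle(shuffled)
    permuted = list(token_ids)
    for src, dst in zip(chosen, shuffled):
        permuted[dst] = token_ids[src]      # move the token at src to dst
    targets = {src: dst for src, dst in zip(chosen, shuffled)}
    return permuted, targets

ids = [101, 2009, 2003, 1037, 3231, 102]    # toy token ids
print(make_perlm_example(ids))              # permuted sequence + position targets
```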
arXiv Detail & Related papers (2022-03-14T07:58:34Z) - Language Identification of Hindi-English tweets using code-mixed BERT [0.0]
The work utilizes a data collection of Hindi-English-Urdu codemixed text for language pre-training and Hindi-English codemixed text for subsequent word-level language classification.
The results show that representations pre-trained over codemixed data produce better results than their monolingual counterparts.
arXiv Detail & Related papers (2021-07-02T17:51:36Z) - SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language
Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose linguistically informed tokenizers, including SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z) - Looking for Clues of Language in Multilingual BERT to Improve
Cross-lingual Generalization [56.87201892585477]
Token embeddings in multilingual BERT (m-BERT) contain both language and semantic information.
We control the output languages of multilingual BERT by manipulating the token embeddings.
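A minimal sketch of what "manipulating the token embeddings" can look like in practice, assuming Hugging Face transformers and m-BERT: shift a sentence's input embeddings along the difference between the average embeddings of two small language samples. The sample sentences, the unscaled shift, and the overall recipe are illustrative assumptions, not the paper's exact procedure.

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)
emb = model.get_input_embeddings()            # m-BERT's token embedding table

def mean_embedding(sentences):
    # Average token embedding over a small sample of one language.
    ids = tok(sentences, return_tensors="pt", padding=True)["input_ids"]
    return emb(ids).mean(dim=(0, 1))

with torch.no_grad():
    shift = mean_embedding(["今天天气很好。"]) - mean_embedding(["The weather is nice today."])
    enc = tok("I like this movie.", return_tensors="pt")
    shifted = emb(enc["input_ids"]) + shift   # manipulate the token embeddings
    out = model(inputs_embeds=shifted, attention_mask=enc["attention_mask"])

print(out.last_hidden_state.shape)            # contextual representations of the shifted input
```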
arXiv Detail & Related papers (2020-10-20T05:41:35Z) - CERT: Contrastive Self-supervised Learning for Language Understanding [20.17416958052909]
We propose CERT: Contrastive self-supervised Representations from Transformers.
CERT pretrains language representation models using contrastive self-supervised learning at the sentence level.
We evaluate CERT on 11 natural language understanding tasks in the GLUE benchmark where CERT outperforms BERT on 7 tasks, achieves the same performance as BERT on 2 tasks, and performs worse than BERT on 2 tasks.
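For orientation, the core of a sentence-level contrastive objective such as the one CERT builds on can be written in a few lines. The sketch below is a generic InfoNCE loss over two views of a batch of sentence embeddings, with random tensors standing in for real encoder outputs; CERT's own augmentation and training recipe are not reproduced here.

```python
import torch
import torch.nn.functional as F

def sentence_info_nce(view_a, view_b, temperature=0.07):
    """Generic sentence-level contrastive loss: the two views of the same
    sentence are positives; every other sentence in the batch is a negative."""
    a = F.normalize(view_a, dim=-1)           # (batch, dim)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature          # pairwise cosine similarities
    labels = torch.arange(a.size(0))          # positives sit on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage with random stand-ins for the embeddings of two augmented views.
a = torch.randn(8, 256)
b = a + 0.05 * torch.randn(8, 256)
print(sentence_info_nce(a, b).item())
```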
arXiv Detail & Related papers (2020-05-16T16:20:38Z) - 2kenize: Tying Subword Sequences for Chinese Script Conversion [54.33749520569979]
We propose a model that can disambiguate between mappings and convert between the two scripts.
Our proposed method outperforms previous Chinese character conversion approaches by 6 points in accuracy.
arXiv Detail & Related papers (2020-05-07T10:53:05Z) - BURT: BERT-inspired Universal Representation from Twin Structure [89.82415322763475]
BURT (BERT inspired Universal Representation from Twin Structure) is capable of generating universal, fixed-size representations for input sequences of any granularity.
Our proposed BURT adopts the Siamese network, learning sentence-level representations from natural language inference dataset and word/phrase-level representations from paraphrasing dataset.
We evaluate BURT across different granularities of text similarity tasks, including STS tasks, SemEval2013 Task 5(a) and some commonly used word similarity tasks.
arXiv Detail & Related papers (2020-04-29T04:01:52Z) - Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE).
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
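A quick way to see the difference discussed in the last entry is to train both schemes with the sentencepiece library and compare their segmentations. The sketch below is a hedged example: the corpus file name and vocabulary size are placeholders, and this is not the paper's experimental setup.

```python
# Train a BPE and a unigram LM tokenizer on the same corpus and compare them.
import sentencepiece as spm

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",               # placeholder: one sentence per line
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,                  # placeholder vocabulary size
        model_type=model_type,
    )

bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")
text = "Byte pair encoding greedily merges frequent symbol pairs."
print("BPE:    ", bpe.encode(text, out_type=str))
print("Unigram:", uni.encode(text, out_type=str))
```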