"Is Whole Word Masking Always Better for Chinese BERT?": Probing on
Chinese Grammatical Error Correction
- URL: http://arxiv.org/abs/2203.00286v2
- Date: Wed, 2 Mar 2022 12:16:17 GMT
- Title: "Is Whole Word Masking Always Better for Chinese BERT?": Probing on
Chinese Grammatical Error Correction
- Authors: Yong Dai, Linyang Li, Cong Zhou, Zhangyin Feng, Enbo Zhao, Xipeng Qiu,
Piji Li, Duyu Tang
- Abstract summary: We investigate whether whole word masking (WWM) leads to better context understanding ability for Chinese BERT.
We construct a dataset including labels for 19,075 tokens in 10,448 sentences.
We train three Chinese BERT models with standard character-level masking (CLM), WWM, and a combination of CLM and WWM, respectively.
- Score: 58.40808660657153
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whole word masking (WWM), which masks all subwords corresponding to a word at
once, makes a better English BERT model. For the Chinese language, however,
there are no subwords because each token is an atomic character. The meaning of a
word in Chinese is different in that a word is a compositional unit consisting
of multiple characters. This difference motivates us to investigate whether WWM
leads to better context understanding ability for Chinese BERT. To achieve
this, we introduce two probing tasks related to grammatical error correction
and ask pretrained models to revise or insert tokens in a masked language
modeling manner. We construct a dataset including labels for 19,075 tokens in
10,448 sentences. We train three Chinese BERT models with standard
character-level masking (CLM), WWM, and a combination of CLM and WWM,
respectively. Our major findings are as follows: First, when one character
needs to be inserted or replaced, the model trained with CLM performs the best.
Second, when more than one character needs to be handled, WWM is the key to
better performance. Finally, when fine-tuned on sentence-level downstream
tasks, models trained with different masking strategies perform comparably.
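To make the contrast between the two masking strategies concrete, below is a minimal sketch (not taken from the paper) of how character-level masking and whole word masking would select tokens to mask in a Chinese sentence. The example sentence, its word segmentation, the function names, and the 15% masking rate are illustrative assumptions; the authors' actual pretraining setup may differ.

```python
import random

MASK = "[MASK]"

def char_level_masking(chars, mask_prob=0.15, rng=random):
    # CLM: each character is an independent masking candidate.
    return [MASK if rng.random() < mask_prob else c for c in chars]

def whole_word_masking(words, mask_prob=0.15, rng=random):
    # WWM: if a word is selected, every character it contains is masked.
    masked = []
    for word in words:  # each word is a string of one or more characters
        if rng.random() < mask_prob:
            masked.extend(MASK for _ in word)
        else:
            masked.extend(word)
    return masked

# Hypothetical pre-segmented sentence (segmentation assumed for illustration).
words = ["我", "喜欢", "自然", "语言", "处理"]
chars = [c for w in words for c in w]

print(char_level_masking(chars))  # may mask a single character inside a word
print(whole_word_masking(words))  # masks all characters of a selected word together
```

The probing tasks themselves can be run with any masked-language-model head, for example the `fill-mask` pipeline in Hugging Face Transformers: replacing a suspect character with `[MASK]` corresponds to the "revise" probe, while inserting an additional `[MASK]` between two characters corresponds to the "insert" probe.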
Related papers
- Translate to Disambiguate: Zero-shot Multilingual Word Sense
Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be fine-tuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
- Character, Word, or Both? Revisiting the Segmentation Granularity for
Chinese Pre-trained Language Models [42.75756994523378]
We propose a mixed-granularity Chinese BERT (MigBERT) by considering both characters and words.
We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT.
MigBERT achieves new SOTA performance on all these tasks.
arXiv Detail & Related papers (2023-03-20T06:20:03Z)
- PERT: Pre-training BERT with Permuted Language Model [24.92527883997854]
PERT is an auto-encoding model (like BERT) trained with a Permuted Language Model (PerLM) objective (a toy sketch of PerLM follows this list).
We permute a proportion of the input text, and the training objective is to predict the position of the original token.
We carried out extensive experiments on both Chinese and English NLU benchmarks.
arXiv Detail & Related papers (2022-03-14T07:58:34Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet
Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word
Alignment [49.45399359826453]
Cross-lingual language models are typically pretrained with language modeling on multilingual text or parallel sentences.
We introduce denoising word alignment as a new cross-lingual pre-training task.
Experimental results show that our method improves cross-lingual transferability on various datasets.
arXiv Detail & Related papers (2021-06-11T13:36:01Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language
Model Pretraining [48.880840711568425]
We study the influence of three main factors on Chinese tokenization for pretrained language models.
We propose linguistically informed tokenizers, including 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab
Pretraining [5.503321733964237]
We first propose a novel method, seg_tok, to form the vocabulary of Chinese BERT with the help of Chinese word segmentation (CWS) and subword tokenization.
Experiments show that seg_tok not only improves the performance of Chinese PLMs on sentence-level tasks, but also improves efficiency.
arXiv Detail & Related papers (2020-11-17T10:15:36Z)
- AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization [13.082435183692393]
We propose a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT).
For English, AMBERT takes both the sequence of words (fine-grained tokens) and the sequence of phrases (coarse-grained tokens) as input after tokenization.
Experiments have been conducted on benchmark datasets for Chinese and English, including CLUE, GLUE, SQuAD and RACE.
arXiv Detail & Related papers (2020-08-27T00:23:48Z)
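As a rough illustration of the PerLM objective from the PERT entry above, the sketch below permutes a fraction of the tokens in a sentence and records, for every original position, where its token ends up; predicting these indices is the training signal described in that abstract. This is a toy reconstruction based only on the one-sentence summary, not the authors' implementation, and the function name and permutation rate are assumptions.

```python
import random

def build_perlm_example(tokens, permute_prob=0.15, rng=random):
    """Toy PerLM data construction (an assumption, not PERT's actual recipe).

    A subset of positions is selected and their tokens are shuffled among
    themselves. Returns the permuted sequence and, for every position i,
    the index where the token originally at i now sits -- the label a
    position-prediction objective would try to recover.
    """
    n = len(tokens)
    chosen = [i for i in range(n) if rng.random() < permute_prob]
    shuffled = chosen[:]
    rng.shuffle(shuffled)

    permuted = list(tokens)
    labels = list(range(n))            # unchanged positions point to themselves
    for src, dst in zip(chosen, shuffled):
        permuted[dst] = tokens[src]    # token from position src moves to dst
        labels[src] = dst
    return permuted, labels

tokens = list("自然语言处理很有趣")
permuted, labels = build_perlm_example(tokens, permute_prob=0.4)
print(permuted)
print(labels)
```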
This list is automatically generated from the titles and abstracts of the papers on this site.