MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab
Pretraining
- URL: http://arxiv.org/abs/2011.08539v1
- Date: Tue, 17 Nov 2020 10:15:36 GMT
- Title: MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab
Pretraining
- Authors: Wei Zhu
- Abstract summary: We first propose a novel method, seg_tok, to form the vocabulary of Chinese BERT with the help of Chinese word segmentation (CWS) and subword tokenization.
Experiments show that seg_tok not only improves the performance of Chinese PLMs on sentence-level tasks but also improves efficiency.
- Score: 5.503321733964237
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Although pre-trained language models (PLMs) have significantly raised
the performance of various Chinese natural language processing (NLP) tasks, the
vocabulary of these Chinese PLMs remains the one provided by Google Chinese BERT
\cite{devlin2018bert}, which is based on Chinese characters. In addition, the
masked language model pre-training is based on a single vocabulary, which limits
downstream task performance. In this work, we first propose a novel method,
\emph{seg\_tok}, to form the vocabulary of Chinese BERT with the help of Chinese
word segmentation (CWS) and subword tokenization. We then propose three versions
of multi-vocabulary pretraining (MVP) to improve the model's expressiveness.
Experiments show that: (a) compared with a character-based vocabulary,
\emph{seg\_tok} not only improves the performance of Chinese PLMs on
sentence-level tasks but also improves efficiency; (b) MVP improves PLMs'
downstream performance; in particular, it improves \emph{seg\_tok}'s performance
on sequence labeling tasks.
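The abstract describes seg_tok only at a high level: segment raw text with a CWS tool, then learn a subword vocabulary over the segmented words. The sketch below illustrates that pipeline under stated assumptions; it uses jieba for CWS and a WordPiece trainer from the HuggingFace tokenizers library, both illustrative choices rather than the tools the paper actually used, and it does not cover the MVP pretraining variants.

```python
# Minimal sketch of a seg_tok-style vocabulary: run Chinese word segmentation
# (CWS) first, then learn a subword vocabulary over the segmented words.
# Assumptions: jieba as the CWS tool, WordPiece as the subword algorithm,
# and a toy two-sentence corpus; none of these come from the paper itself.
import jieba
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

def segment_corpus(lines):
    """Yield each sentence as whitespace-joined CWS output."""
    for line in lines:
        yield " ".join(jieba.cut(line.strip()))

corpus = [
    "预训练语言模型显著提升了中文自然语言处理任务的效果。",
    "词表的设计会影响模型的性能和效率。",
]

# Subwords are learned inside segmented words, so frequent words stay whole
# while rare words fall back to shorter pieces or single characters.
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.WordPieceTrainer(
    vocab_size=30000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"],
)
tokenizer.train_from_iterator(segment_corpus(corpus), trainer)

# Tokenize new text with the same segment-then-subword pipeline.
print(tokenizer.encode(" ".join(jieba.cut("中文预训练模型"))).tokens)
```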
Related papers
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantics understanding ability of the Chinese PLMs with dictionary knowledge and structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
- Character, Word, or Both? Revisiting the Segmentation Granularity for Chinese Pre-trained Language Models [42.75756994523378]
We propose a mixed-granularity Chinese BERT (MigBERT) by considering both characters and words.
We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT.
MigBERT achieves new SOTA performance on all these tasks.
arXiv Detail & Related papers (2023-03-20T06:20:03Z)
- CLOWER: A Pre-trained Language Model with Contrastive Learning over Word and Character Representations [18.780841483220986]
Pre-trained Language Models (PLMs) have achieved remarkable performance gains across numerous downstream tasks in natural language understanding.
Most current models use Chinese characters as inputs and are not able to encode semantic information contained in Chinese words.
We propose a simple yet effective PLM, CLOWER, which adopts Contrastive Learning Over Word and charactER representations.
arXiv Detail & Related papers (2022-08-23T09:52:34Z)
- "Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction [58.40808660657153]
We investigate whether whole word masking (WWM) leads to better context understanding ability for Chinese BERT.
We construct a dataset including labels for 19,075 tokens in 10,448 sentences.
We train three Chinese BERT models with standard character-level masking (CLM), WWM, and a combination of CLM and WWM, respectively; a minimal sketch contrasting CLM and WWM is given after this list.
arXiv Detail & Related papers (2022-03-01T08:24:56Z)
- ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps.
arXiv Detail & Related papers (2021-06-30T13:06:00Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influence of three main factors on Chinese tokenization for pretrained language models.
We propose linguistically informed tokenizers: SHUOWEN (meaning Talk Word), pronunciation-based tokenizers, and JIEZI (meaning Solve Character), glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- AMBERT: A Pre-trained Language Model with Multi-Grained Tokenization [13.082435183692393]
We propose a novel pre-trained language model, referred to as AMBERT (A Multi-grained BERT).
For English, AMBERT takes both the sequence of words (fine-grained tokens) and the sequence of phrases (coarse-grained tokens) as input after tokenization.
Experiments have been conducted on benchmark datasets for Chinese and English, including CLUE, GLUE, SQuAD and RACE.
arXiv Detail & Related papers (2020-08-27T00:23:48Z)
- Byte Pair Encoding is Suboptimal for Language Model Pretraining [49.30780227162387]
We analyze differences between unigram LM tokenization and byte-pair encoding (BPE); a minimal sentencepiece sketch contrasting the two is given after this list.
We find that the unigram LM tokenization method matches or outperforms BPE across downstream tasks and two languages.
We hope that developers of future pretrained LMs will consider adopting the unigram LM method over the more prevalent BPE.
arXiv Detail & Related papers (2020-04-07T21:21:06Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
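For the "Is Whole Word Masking Always Better for Chinese BERT?" entry above, the following is a minimal sketch of how character-level masking (CLM) and whole word masking (WWM) differ for Chinese, assuming word boundaries come from a CWS tool. It is a toy illustration, not the training code of that paper or of MVP-BERT; real pretraining also applies the 80/10/10 replacement rule and handles special tokens.

```python
# Toy contrast between character-level masking (CLM) and whole word masking
# (WWM) for Chinese. Word boundaries are assumed to come from a CWS tool.
import random

def clm_mask(chars, ratio=0.15, mask="[MASK]"):
    """Mask individual characters independently of word boundaries."""
    picked = set(random.sample(range(len(chars)), max(1, int(len(chars) * ratio))))
    return [mask if i in picked else c for i, c in enumerate(chars)]

def wwm_mask(words, ratio=0.15, mask="[MASK]"):
    """Pick whole words at random, then mask every character inside them."""
    budget = max(1, int(sum(len(w) for w in words) * ratio))
    picked, out = set(), []
    for i in random.sample(range(len(words)), len(words)):
        if budget <= 0:
            break
        picked.add(i)
        budget -= len(words[i])
    for i, w in enumerate(words):
        out.extend([mask] * len(w) if i in picked else list(w))
    return out

words = ["自然", "语言", "处理"]           # CWS output
chars = [c for w in words for c in w]      # plain character stream

print(clm_mask(chars))  # masks isolated characters, e.g. ['自', '然', '语', '[MASK]', '处', '理']
print(wwm_mask(words))  # masks whole words, e.g. ['[MASK]', '[MASK]', '语', '言', '处', '理']
```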
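For the "Byte Pair Encoding is Suboptimal for Language Model Pretraining" entry, the sketch below trains both tokenizers it compares using the sentencepiece library, which implements BPE and the unigram LM algorithm. The corpus path, vocabulary size, and other settings are placeholders, not that paper's experimental setup.

```python
# Train a BPE and a unigram LM tokenizer on the same corpus and compare
# their segmentations. "corpus.txt" (one sentence per line) is a placeholder.
import sentencepiece as spm

for model_type in ("bpe", "unigram"):
    spm.SentencePieceTrainer.train(
        input="corpus.txt",
        model_prefix=f"tok_{model_type}",
        vocab_size=8000,
        model_type=model_type,
        character_coverage=0.9995,  # common setting for CJK corpora
    )

bpe = spm.SentencePieceProcessor(model_file="tok_bpe.model")
uni = spm.SentencePieceProcessor(model_file="tok_unigram.model")

text = "预训练语言模型的分词方式会影响下游任务的表现。"
print("BPE:    ", bpe.encode(text, out_type=str))
print("Unigram:", uni.encode(text, out_type=str))
```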