Character, Word, or Both? Revisiting the Segmentation Granularity for
Chinese Pre-trained Language Models
- URL: http://arxiv.org/abs/2303.10893v2
- Date: Wed, 22 Mar 2023 03:20:27 GMT
- Title: Character, Word, or Both? Revisiting the Segmentation Granularity for
Chinese Pre-trained Language Models
- Authors: Xinnian Liang, Zefan Zhou, Hui Huang, Shuangzhi Wu, Tong Xiao, Muyun
Yang, Zhoujun Li, Chao Bian
- Abstract summary: We propose a mixed-granularity Chinese BERT (MigBERT) by considering both characters and words.
We conduct extensive experiments on various Chinese NLP tasks to evaluate existing PLMs as well as the proposed MigBERT.
MigBERT achieves new SOTA performance on all these tasks.
- Score: 42.75756994523378
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretrained language models (PLMs) have shown marvelous improvements across
various NLP tasks. Most Chinese PLMs simply treat an input text as a sequence
of characters, and completely ignore word information. Although Whole Word
Masking can alleviate this, the semantics of words are still not well
represented. In this paper, we revisit the segmentation granularity of Chinese
PLMs. We propose a mixed-granularity Chinese BERT (MigBERT) by considering both
characters and words. To achieve this, we design objective functions for
learning both character and word-level representations. We conduct extensive
experiments on various Chinese NLP tasks to evaluate existing PLMs as well as
the proposed MigBERT. Experimental results show that MigBERT achieves new SOTA
performance on all these tasks. Further analysis demonstrates that words are
semantically richer than characters. More interestingly, we show that MigBERT
also works with Japanese. Our code and model have been released at
https://github.com/xnliang98/MigBERT.
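The abstract does not spell out the objective functions, so the sketch below is only an illustration of the mixed-granularity idea, not the released MigBERT code: given a pre-segmented sentence, it masks some whole words and some individual characters, so a model trained on its output must make predictions at both levels. The masking rates and [MASK] handling are assumptions.
```python
# Minimal sketch of mixed-granularity masking (an illustrative assumption,
# not the authors' implementation): whole words are masked as units, while
# remaining characters are masked independently.
import random

MASK = "[MASK]"

def mixed_granularity_mask(words, word_mask_rate=0.15, char_mask_rate=0.15, seed=0):
    """Return (tokens, labels): character tokens where some whole words and
    some individual characters have been replaced by [MASK]."""
    rng = random.Random(seed)
    tokens, labels = [], []
    for word in words:
        chars = list(word)
        if len(chars) > 1 and rng.random() < word_mask_rate:
            # word-level target: hide every character of the word together
            tokens.extend([MASK] * len(chars))
            labels.extend(chars)
        else:
            for ch in chars:
                if rng.random() < char_mask_rate:
                    # character-level target: hide a single character
                    tokens.append(MASK)
                    labels.append(ch)
                else:
                    tokens.append(ch)
                    labels.append(None)  # not a prediction target
    return tokens, labels

# Toy usage with a pre-segmented sentence (自然 / 语言 / 处理 / 很 / 有趣).
tokens, labels = mixed_granularity_mask(["自然", "语言", "处理", "很", "有趣"])
print(tokens)
print(labels)
```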
Related papers
- Translate to Disambiguate: Zero-shot Multilingual Word Sense
Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
- LERT: A Linguistically-motivated Pre-trained Language Model [67.65651497173998]
We propose LERT, a pre-trained language model that is trained on three types of linguistic features along with the original pre-training task.
We carried out extensive experiments on ten Chinese NLU tasks, and the experimental results show that LERT could bring significant improvements.
arXiv Detail & Related papers (2022-11-10T05:09:16Z)
- CLOWER: A Pre-trained Language Model with Contrastive Learning over Word and Character Representations [18.780841483220986]
Pre-trained Language Models (PLMs) have achieved remarkable performance gains across numerous downstream tasks in natural language understanding.
Most current models use Chinese characters as inputs and are not able to encode semantic information contained in Chinese words.
We propose a simple yet effective PLM, CLOWER, which adopts Contrastive Learning Over Word and charactER representations.
arXiv Detail & Related papers (2022-08-23T09:52:34Z)
- "Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction [58.40808660657153]
We investigate whether whole word masking (WWM) leads to better context understanding ability for Chinese BERT.
We construct a dataset including labels for 19,075 tokens in 10,448 sentences.
We train three Chinese BERT models with standard character-level masking (CLM), WWM, and a combination of CLM and WWM, respectively.
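For readers unfamiliar with the distinction, the toy sketch below (not the paper's training code) shows that CLM and WWM differ only in how masking positions are selected: per character for CLM, per segmented word for WWM; the combined setup simply mixes the two.
```python
# Toy contrast between the masking setups mentioned above (illustrative
# assumptions, not the paper's code): only position selection differs.
import random

def select_mask_positions(words, strategy="wwm", mask_rate=0.15, seed=0):
    """words: CWS output, e.g. ["语言", "模型"]; returns character indices to mask."""
    rng = random.Random(seed)
    positions, offset = [], 0
    for word in words:
        n = len(word)
        if strategy == "wwm":
            # WWM: one decision per word; all of its characters are masked together
            if rng.random() < mask_rate:
                positions.extend(range(offset, offset + n))
        else:
            # CLM: one decision per character, ignoring word boundaries
            positions.extend(i for i in range(offset, offset + n)
                             if rng.random() < mask_rate)
        offset += n
    return positions

words = ["语言", "模型", "很", "强大"]
print(select_mask_positions(words, "clm"))   # character-level masking
print(select_mask_positions(words, "wwm"))   # whole word masking
```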
arXiv Detail & Related papers (2022-03-01T08:24:56Z)
- SHUOWEN-JIEZI: Linguistically Informed Tokenizers For Chinese Language Model Pretraining [48.880840711568425]
We study the influences of three main factors on the Chinese tokenization for pretrained language models.
We propose three kinds of tokenizers, including 1) SHUOWEN (meaning Talk Word), the pronunciation-based tokenizers, and 2) JIEZI (meaning Solve Character), the glyph-based tokenizers.
We find that SHUOWEN and JIEZI tokenizers can generally outperform conventional single-character tokenizers.
arXiv Detail & Related papers (2021-06-01T11:20:02Z)
- LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching [29.318730227080675]
We introduce HowNet as an external knowledge base and propose a Linguistic knowledge Enhanced graph Transformer (LET) to deal with word ambiguity.
Experimental results on two Chinese datasets show that our models outperform various typical text matching approaches.
arXiv Detail & Related papers (2021-02-25T04:01:51Z)
- MVP-BERT: Redesigning Vocabularies for Chinese BERT and Multi-Vocab Pretraining [5.503321733964237]
We first propose a novel method, seg_tok, to form the vocabulary of Chinese BERT with the help of Chinese word segmentation (CWS) and subword tokenization.
Experiments show that seg_tok not only improves the performance of Chinese PLMs on sentence-level tasks, but also improves efficiency.
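The summary only says that seg_tok builds on CWS plus subword tokenization, so the snippet below is a minimal sketch of such a pipeline under stated assumptions: the tiny vocabulary and the greedy WordPiece-style matcher are toy stand-ins, and the word list passed in stands in for the output of a real CWS tool.
```python
# Toy sketch of a seg_tok-style pipeline (illustrative assumptions, not the
# paper's implementation): segment into words first, then apply a
# WordPiece-style subword split within each word.
VOCAB = {"自然", "语言", "模型", "很", "强", "##大"}

def wordpiece(word, vocab):
    """Greedy longest-match-first subword split; falls back to single characters."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            # no vocabulary match: back off to one character
            pieces.append(word[start] if start == 0 else "##" + word[start])
            end = start + 1
        start = end
    return pieces

def seg_tok_tokenize(cws_words, vocab=VOCAB):
    # cws_words stands in for the output of a CWS tool
    return [piece for word in cws_words for piece in wordpiece(word, vocab)]

print(seg_tok_tokenize(["自然", "语言", "模型", "很", "强大"]))
# -> ['自然', '语言', '模型', '很', '强', '##大']
```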
arXiv Detail & Related papers (2020-11-17T10:15:36Z)
- It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT [54.84185432755821]
Multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages.
We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning.
arXiv Detail & Related papers (2020-10-16T09:49:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.