Pretraining without Wordpieces: Learning Over a Vocabulary of Millions
of Words
- URL: http://arxiv.org/abs/2202.12142v1
- Date: Thu, 24 Feb 2022 15:15:48 GMT
- Title: Pretraining without Wordpieces: Learning Over a Vocabulary of Millions
of Words
- Authors: Zhangyin Feng, Duyu Tang, Cong Zhou, Junwei Liao, Shuangzhi Wu,
Xiaocheng Feng, Bing Qin, Yunbo Cao, Shuming Shi
- Abstract summary: We explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces.
Results show that, compared to standard wordpiece-based BERT, WordBERT achieves significant improvements on cloze tests and machine reading comprehension.
Since the pipeline is language-independent, we train WordBERT for Chinese and obtain significant gains on five natural language understanding datasets.
- Score: 50.11559460111882
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The standard BERT adopts subword-based tokenization, which may break a word into two or more wordpieces (e.g., converting "lossless" to "loss" and "less"). This brings inconvenience in the following situations: (1) what is the best way to obtain the contextual vector of a word that is divided into multiple wordpieces? (2) how can a word be predicted via a cloze test without knowing the number of wordpieces in advance? In this work, we explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces. We call this word-level BERT model WordBERT. We train models with different vocabulary sizes, initialization configurations and languages. Results show that, compared to standard wordpiece-based BERT, WordBERT yields significant improvements on cloze tests and machine reading comprehension. On many other natural language understanding tasks, including POS tagging, chunking and NER, WordBERT consistently performs better than BERT. Model analysis indicates that the major advantage of WordBERT over BERT lies in its understanding of low-frequency and rare words. Furthermore, since the pipeline is language-independent, we train WordBERT for Chinese and obtain significant gains on five natural language understanding datasets. Lastly, an analysis of inference speed shows that WordBERT's time cost is comparable to BERT's on natural language understanding tasks.
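To make the two problems above concrete, here is a minimal, illustrative sketch (not the authors' code). It assumes the HuggingFace transformers and torch packages; mean pooling is only one common heuristic for problem (1), and the word2id mapping, vocabulary size and dimensions used for the word-level alternative are placeholders rather than WordBERT's actual configuration.

```python
# Illustrative sketch only (not the paper's code); assumes the HuggingFace
# `transformers` and `torch` packages are installed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Problem (1): a wordpiece tokenizer may split "lossless" into pieces such as
# ['loss', '##less'] (the exact split depends on the vocabulary), so the
# contextual vector of the word has to be assembled from several piece
# vectors; mean pooling below is just one common heuristic.
print(tokenizer.tokenize("lossless"))

inputs = tokenizer("a lossless codec", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state        # (1, seq_len, 768)
piece_to_word = inputs.word_ids()                     # maps each piece to a word index
positions = [i for i, w in enumerate(piece_to_word) if w == 1]  # pieces of "lossless"
word_vector = hidden[0, positions].mean(dim=0)        # pooled word representation

# Word-level alternative in the spirit of WordBERT: one embedding row per word,
# so a word maps to a single vector and a cloze prediction is a single softmax
# over whole words. `word2id`, the vocabulary size and the embedding dimension
# below are placeholders, not the paper's actual configuration.
word2id = {"[UNK]": 0, "lossless": 17}
word_embeddings = torch.nn.Embedding(num_embeddings=1_000_000, embedding_dim=128)
single_vector = word_embeddings(torch.tensor(word2id.get("lossless", 0)))
```

The single-lookup path is what removes the need to know how many wordpieces a cloze answer would occupy: prediction reduces to one softmax over whole words.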
Related papers
- Translate to Disambiguate: Zero-shot Multilingual Word Sense Disambiguation with Pretrained Language Models [67.19567060894563]
Pretrained Language Models (PLMs) learn rich cross-lingual knowledge and can be finetuned to perform well on diverse tasks.
We present a new study investigating how well PLMs capture cross-lingual word sense with Contextual Word-Level Translation (C-WLT).
We find that as the model size increases, PLMs encode more cross-lingual word sense knowledge and better use context to improve WLT performance.
arXiv Detail & Related papers (2023-04-26T19:55:52Z)
- MarkBERT: Marking Word Boundaries Improves Chinese BERT [67.53732128091747]
MarkBERT keeps the vocabulary as Chinese characters and inserts boundary markers between contiguous words.
Compared to previous word-based BERT models, MarkBERT achieves better accuracy on text classification, keyword recognition, and semantic similarity tasks.
arXiv Detail & Related papers (2022-03-12T08:43:06Z)
- Lacking the embedding of a word? Look it up into a traditional dictionary [0.2624902795082451]
We propose to use definitions retrieved from traditional dictionaries to produce word embeddings for rare words.
DefiNNet and DefBERT significantly outperform state-of-the-art as well as baseline methods for producing embeddings of unknown words.
arXiv Detail & Related papers (2021-09-24T06:27:58Z)
- CharBERT: Character-aware Pre-trained Language Model [36.9333890698306]
We propose a character-aware pre-trained language model named CharBERT.
We first construct the contextual word embedding for each token from the sequential character representations.
We then fuse the character representations and the subword representations via a novel heterogeneous interaction module.
arXiv Detail & Related papers (2020-11-03T07:13:06Z)
- CharacterBERT: Reconciling ELMo and BERT for Word-Level Open-Vocabulary Representations From Characters [14.956626084281638]
We propose a new variant of BERT that drops the wordpiece system altogether and uses a Character-CNN module instead to represent entire words by consulting their characters (a generic sketch of this idea appears after the related-papers list below).
We show that this new model improves the performance of BERT on a variety of medical domain tasks while at the same time producing robust, word-level and open-vocabulary representations.
arXiv Detail & Related papers (2020-10-20T15:58:53Z)
- It's not Greek to mBERT: Inducing Word-Level Translations from Multilingual BERT [54.84185432755821]
Multilingual BERT (mBERT) learns rich cross-lingual representations that allow for transfer across languages.
We study the word-level translation information embedded in mBERT and present two simple methods that expose remarkable translation capabilities with no fine-tuning.
arXiv Detail & Related papers (2020-10-16T09:49:32Z)
- BERT for Monolingual and Cross-Lingual Reverse Dictionary [56.8627517256663]
We propose a simple but effective method to make BERT generate the target word for this specific task.
By using multilingual BERT (mBERT), we can efficiently perform cross-lingual reverse dictionary lookup with one subword embedding.
arXiv Detail & Related papers (2020-09-30T17:00:10Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)
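As a postscript to the CharacterBERT entry above, the following is a generic character-CNN word encoder in PyTorch. It is a minimal sketch of the general "represent a whole word from its characters" idea, not the paper's actual module; the character-vocabulary size, channel counts and kernel widths are placeholder assumptions.

```python
# Generic character-CNN word encoder (illustrative; not CharacterBERT's code).
# All sizes below are placeholder assumptions.
import torch
import torch.nn as nn

class CharCNNWordEncoder(nn.Module):
    def __init__(self, n_chars=262, char_dim=16, out_dim=768, kernel_sizes=(2, 3, 4)):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
        self.convs = nn.ModuleList(
            [nn.Conv1d(char_dim, out_dim // len(kernel_sizes), k) for k in kernel_sizes]
        )

    def forward(self, char_ids):
        # char_ids: (batch, n_words, max_word_len) integer character codes
        b, w, c = char_ids.shape
        x = self.char_emb(char_ids.view(b * w, c)).transpose(1, 2)    # (b*w, char_dim, c)
        pooled = [conv(x).max(dim=-1).values for conv in self.convs]  # max over positions
        return torch.cat(pooled, dim=-1).view(b, w, -1)               # one vector per word

# Toy usage: 1 sentence, 3 words, each padded to 10 character ids.
char_ids = torch.randint(1, 262, (1, 3, 10))
print(CharCNNWordEncoder()(char_ids).shape)  # torch.Size([1, 3, 768])
```

Because every word vector is built from characters, the encoder stays open-vocabulary: unseen words still get a representation without any wordpiece splitting.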