MarkBERT: Marking Word Boundaries Improves Chinese BERT
- URL: http://arxiv.org/abs/2203.06378v1
- Date: Sat, 12 Mar 2022 08:43:06 GMT
- Title: MarkBERT: Marking Word Boundaries Improves Chinese BERT
- Authors: Linyang Li, Yong Dai, Duyu Tang, Zhangyin Feng, Cong Zhou, Xipeng Qiu,
Zenglin Xu, Shuming Shi
- Abstract summary: MarkBERT keeps the vocabulary as Chinese characters and inserts boundary markers between contiguous words.
Compared to previous word-based BERT models, MarkBERT achieves better accuracy on text classification, keyword recognition, and semantic similarity tasks.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a Chinese BERT model dubbed MarkBERT that uses word information.
Existing word-based BERT models regard words as basic units; however, due to the vocabulary limit of BERT, they only cover high-frequency words and fall back to the character level when encountering out-of-vocabulary (OOV) words. Unlike existing works, MarkBERT keeps the vocabulary as Chinese characters and inserts boundary markers between contiguous words. This design enables the model to handle any word in the same way, whether or not it is an OOV word. Moreover, our model has two additional benefits: first, it is
convenient to add word-level learning objectives over markers, which is
complementary to traditional character and sentence-level pre-training tasks;
second, it can easily incorporate richer semantics such as POS tags of words by
replacing generic markers with POS tag-specific markers. MarkBERT pushes the
state-of-the-art of Chinese named entity recognition from 95.4% to 96.5% on the MSRA dataset and from 82.8% to 84.2% on the OntoNotes dataset,
respectively. Compared to previous word-based BERT models, MarkBERT achieves
better accuracy on text classification, keyword recognition, and semantic
similarity tasks.
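The abstract describes the marker mechanism only at a high level, so the sketch below illustrates what boundary-marker insertion could look like in practice. It is a minimal, illustrative example rather than the authors' implementation: the reserved marker token "[unused1]", the POS-specific marker naming scheme, and the pluggable segmenter interface are all assumptions made here.

```python
# Minimal sketch of MarkBERT-style input construction. The marker token names
# and the segmenter interface are illustrative assumptions, not the authors' code.
from typing import Callable, List, Tuple


def mark_boundaries(sentence: str,
                    segment: Callable[[str], List[str]],
                    marker: str = "[unused1]") -> List[str]:
    """Tokenize into characters and insert a generic boundary marker between
    contiguous words, so OOV and in-vocabulary words are handled identically."""
    words = segment(sentence)
    tokens: List[str] = []
    for i, word in enumerate(words):
        tokens.extend(word)        # characters remain the vocabulary units
        if i < len(words) - 1:     # marker only *between* words
            tokens.append(marker)
    return tokens


def mark_boundaries_with_pos(sentence: str,
                             segment_pos: Callable[[str], List[Tuple[str, str]]],
                             marker_fmt: str = "[unused-{pos}]") -> List[str]:
    """Variant with POS-tag-specific markers (e.g. "[unused-NN]"); the naming
    scheme is purely illustrative."""
    pairs = segment_pos(sentence)
    tokens: List[str] = []
    for i, (word, pos) in enumerate(pairs):
        tokens.extend(word)
        if i < len(pairs) - 1:
            tokens.append(marker_fmt.format(pos=pos))
    return tokens


if __name__ == "__main__":
    # Toy segmenter standing in for a real Chinese word segmenter.
    toy_segment = lambda s: ["自然", "语言", "处理"]
    print(mark_boundaries("自然语言处理", toy_segment))
    # ['自', '然', '[unused1]', '语', '言', '[unused1]', '处', '理']
```

The resulting character-plus-marker sequence can be encoded with a standard character-vocabulary BERT tokenizer, and, as the abstract notes, word-level learning objectives can then be attached at the marker positions.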
Related papers
- Exploring Structured Semantic Prior for Multi Label Recognition with Incomplete Labels [60.675714333081466]
Multi-label recognition (MLR) with incomplete labels is very challenging.
Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations.
We advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior.
arXiv Detail & Related papers (2023-03-23T12:39:20Z)
- Label Semantics for Few Shot Named Entity Recognition [68.01364012546402]
We study the problem of few shot learning for named entity recognition.
We leverage the semantic information in the names of the labels as a way of giving the model additional signal and enriched priors.
Our model learns to match the representations of named entities computed by the first encoder with label representations computed by the second encoder.
arXiv Detail & Related papers (2022-03-16T23:21:05Z)
- Pretraining without Wordpieces: Learning Over a Vocabulary of Millions of Words [50.11559460111882]
We explore the possibility of developing a BERT-style pretrained model over a vocabulary of words instead of wordpieces.
Results show that, compared to standard wordpiece-based BERT, WordBERT makes significant improvements on cloze test and machine reading comprehension.
Since the pipeline is language-independent, we train WordBERT for the Chinese language and obtain significant gains on five natural language understanding datasets.
arXiv Detail & Related papers (2022-02-24T15:15:48Z)
- Joint Chinese Word Segmentation and Part-of-speech Tagging via Two-stage Span Labeling [0.2624902795082451]
We propose a neural model named SpanSegTag for joint Chinese word segmentation and part-of-speech tagging.
Our experiments show that our BERT-based model SpanSegTag achieved competitive performance on the CTB5, CTB6, and UD datasets.
arXiv Detail & Related papers (2021-12-17T12:59:02Z)
- More Than Words: Collocation Tokenization for Latent Dirichlet Allocation Models [71.42030830910227]
We propose a new metric for measuring the clustering quality in settings where the models differ.
We show that topics trained with merged tokens result in topic keys that are clearer, more coherent, and more effective at distinguishing topics than those of unmerged models.
arXiv Detail & Related papers (2021-08-24T14:08:19Z)
- Lexicon Enhanced Chinese Sequence Labelling Using BERT Adapter [15.336753753889035]
Existing methods solely fuse lexicon features via a shallow, randomly initialized sequence layer and do not integrate them into the bottom layers of BERT.
In this paper, we propose Lexicon Enhanced BERT (LEBERT) for Chinese sequence labelling.
Compared with existing methods, our model achieves deep lexicon knowledge fusion at the lower layers of BERT.
arXiv Detail & Related papers (2021-05-15T06:13:39Z)
- Lex-BERT: Enhancing BERT based NER with lexicons [1.6884834576352221]
We present Lex-BERT, which incorporates lexicon information into Chinese BERT for named entity recognition tasks.
Our model does not introduce any new parameters and is more efficient than FLAT.
arXiv Detail & Related papers (2021-01-02T07:43:21Z)
- Does Chinese BERT Encode Word Structure? [17.836131968160917]
Contextualized representations give significantly improved results for a wide range of NLP tasks.
Much work has been dedicated to analyzing the features captured by representative models such as BERT.
We investigate Chinese BERT using both attention weight distribution statistics and probing tasks, finding that (1) word information is captured by BERT; (2) word-level features are mostly in the middle representation layers; (3) downstream tasks make different use of word features in BERT.
arXiv Detail & Related papers (2020-10-15T12:40:56Z)
- BERT for Monolingual and Cross-Lingual Reverse Dictionary [56.8627517256663]
We propose a simple but effective method to make BERT generate the target word for this specific task.
By using multilingual BERT (mBERT), we can efficiently perform the cross-lingual reverse dictionary task with one subword embedding.
arXiv Detail & Related papers (2020-09-30T17:00:10Z)