Unsupervised Boundary-Aware Language Model Pretraining for Chinese
Sequence Labeling
- URL: http://arxiv.org/abs/2210.15231v1
- Date: Thu, 27 Oct 2022 07:38:50 GMT
- Title: Unsupervised Boundary-Aware Language Model Pretraining for Chinese
Sequence Labeling
- Authors: Peijie Jiang, Dingkun Long, Yanzhao Zhang, Pengjun Xie, Meishan Zhang,
Min Zhang
- Abstract summary: Boundary information is critical for various Chinese language processing tasks, such as word segmentation, part-of-speech tagging, and named entity recognition.
We propose an architecture to encode the information directly into pre-trained language models, resulting in Boundary-Aware BERT (BABERT)
Experimental results on ten benchmarks of Chinese sequence labeling demonstrate that BABERT can provide consistent improvements on all datasets.
- Score: 25.58155857967128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Boundary information is critical for various Chinese language processing
tasks, such as word segmentation, part-of-speech tagging, and named entity
recognition. Previous studies usually resorted to the use of a high-quality
external lexicon, where lexicon items can offer explicit boundary information.
However, to ensure the quality of the lexicon, great human effort is always
necessary, which has been generally ignored. In this work, we suggest
unsupervised statistical boundary information instead, and propose an
architecture to encode the information directly into pre-trained language
models, resulting in Boundary-Aware BERT (BABERT). We apply BABERT for feature
induction of Chinese sequence labeling tasks. Experimental results on ten
benchmarks of Chinese sequence labeling demonstrate that BABERT can provide
consistent improvements on all datasets. In addition, our method can complement
previous supervised lexicon exploration, where further improvements can be
achieved when integrated with external lexicon information.
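The abstract does not spell out which statistics constitute the "unsupervised statistical boundary information"; for Chinese, such boundary cues are typically corpus-level measures such as pointwise mutual information (PMI) and left/right branching entropy of character n-grams. The sketch below is a minimal illustration of how such measures could be computed from raw, unlabeled text; the function name and toy corpus are hypothetical, and the exact statistics BABERT uses and how they enter the pretraining objective are defined in the paper, not here.

```python
# Minimal sketch (assumption): boundary cues illustrated as PMI plus
# left/right branching entropy of character bigrams from raw text.
# This is not BABERT's actual pretraining pipeline, only an illustration.
import math
from collections import Counter, defaultdict


def boundary_statistics(corpus):
    """Compute PMI and branching entropies for character bigrams.

    High PMI and high left/right entropy suggest the bigram is word-internal;
    low values suggest a likely word boundary between the two characters.
    """
    unigrams, bigrams = Counter(), Counter()
    left_ctx, right_ctx = defaultdict(Counter), defaultdict(Counter)
    for sent in corpus:
        for i, ch in enumerate(sent):
            unigrams[ch] += 1
            if i + 1 < len(sent):
                bigram = sent[i:i + 2]
                bigrams[bigram] += 1
                if i > 0:
                    left_ctx[bigram][sent[i - 1]] += 1
                if i + 2 < len(sent):
                    right_ctx[bigram][sent[i + 2]] += 1
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())

    def entropy(counter):
        total = sum(counter.values())
        return -sum(c / total * math.log(c / total) for c in counter.values()) if total else 0.0

    stats = {}
    for bigram, count in bigrams.items():
        p_xy = count / n_bi
        p_x, p_y = unigrams[bigram[0]] / n_uni, unigrams[bigram[1]] / n_uni
        stats[bigram] = {
            "pmi": math.log(p_xy / (p_x * p_y)),
            "left_entropy": entropy(left_ctx[bigram]),
            "right_entropy": entropy(right_ctx[bigram]),
        }
    return stats


if __name__ == "__main__":
    # Toy corpus; real boundary statistics require a large raw Chinese corpus.
    corpus = ["我爱北京天安门", "北京是中国的首都", "天安门广场很大"]
    print(boundary_statistics(corpus)["北京"])
```

Per the abstract, BABERT encodes such corpus statistics directly into the language model during pretraining rather than consuming them as lexicon features at fine-tuning time, which is why it can still be combined with external lexicon information for further gains.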
Related papers
- Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training [45.40634271936031]
Current pre-trained language models rarely explicitly incorporate boundary information into the modeling process.
BABERT incorporates unsupervised statistical boundary information into Chinese BERT's pre-training objectives.
We introduce a novel "Boundary Information Metric" that is both simple and effective.
arXiv Detail & Related papers (2024-04-08T14:32:52Z)
- Constrained Decoding for Cross-lingual Label Projection [27.567195418950966]
Cross-lingual transfer using multilingual LLMs has become a popular learning paradigm for low-resource languages with no labeled training data.
However, for NLP tasks that involve fine-grained predictions on words and phrases, the performance of zero-shot cross-lingual transfer learning lags far behind supervised fine-tuning methods.
arXiv Detail & Related papers (2024-02-05T15:57:32Z)
- Do We Need Language-Specific Fact-Checking Models? The Case of Chinese [15.619421104102516]
This paper investigates the potential benefits of language-specific fact-checking models, focusing on the case of Chinese.
We first demonstrate the limitations of translation-based methods and multilingual large language models, highlighting the need for language-specific systems.
We propose a Chinese fact-checking system that can better retrieve evidence from a document by incorporating context information.
arXiv Detail & Related papers (2024-01-27T20:26:03Z)
- CROP: Zero-shot Cross-lingual Named Entity Recognition with Multilingual Labeled Sequence Translation [113.99145386490639]
Cross-lingual NER can transfer knowledge between languages via aligned cross-lingual representations or machine translation results.
We propose a Cross-lingual Entity Projection framework (CROP) to enable zero-shot cross-lingual NER.
We adopt a multilingual labeled sequence translation model to project the tagged sequence back to the target language and label the target raw sentence.
arXiv Detail & Related papers (2022-10-13T13:32:36Z)
- Exposing Cross-Lingual Lexical Knowledge from Multilingual Sentence Encoders [85.80950708769923]
We probe multilingual language models for the amount of cross-lingual lexical knowledge stored in their parameters, and compare lexically specialized variants against the original multilingual LMs.
We also devise a novel method to expose this knowledge by additionally fine-tuning multilingual models.
We report substantial gains on standard benchmarks.
arXiv Detail & Related papers (2022-04-30T13:23:16Z)
- Teach me how to Label: Labeling Functions from Natural Language with Text-to-text Transformers [0.5330240017302619]
This paper focuses on the task of turning natural language descriptions into Python labeling functions.
We follow a novel approach to semantic parsing with pre-trained text-to-text Transformers.
Our approach can be regarded as a stepping stone towards models that are taught how to label in natural language.
arXiv Detail & Related papers (2021-01-18T16:04:15Z)
- How Context Affects Language Models' Factual Predictions [134.29166998377187]
We integrate information from a retrieval system with a pre-trained language model in a purely unsupervised way.
We report that augmenting pre-trained language models in this way dramatically improves performance and that the resulting system, despite being unsupervised, is competitive with a supervised machine reading baseline.
arXiv Detail & Related papers (2020-05-10T09:28:12Z)
- A Study of Cross-Lingual Ability and Language-specific Information in Multilingual BERT [60.9051207862378]
Multilingual BERT works remarkably well on cross-lingual transfer tasks.
Data size and context window size are crucial factors for transferability.
There is a computationally cheap but effective approach to improving the cross-lingual ability of multilingual BERT.
arXiv Detail & Related papers (2020-04-20T11:13:16Z)
- On the Importance of Word Order Information in Cross-lingual Sequence Labeling [80.65425412067464]
Cross-lingual models that fit the word order of the source language might fail to handle target languages.
We investigate whether making models insensitive to the word order of the source language can improve the adaptation performance in target languages.
arXiv Detail & Related papers (2020-01-30T03:35:44Z)