Unified Multi-Criteria Chinese Word Segmentation with BERT
- URL: http://arxiv.org/abs/2004.05808v1
- Date: Mon, 13 Apr 2020 07:50:04 GMT
- Title: Unified Multi-Criteria Chinese Word Segmentation with BERT
- Authors: Zhen Ke, Liang Shi, Erli Meng, Bin Wang, Xipeng Qiu, Xuanjing Huang
- Abstract summary: Multi-Criteria Chinese Word Segmentation (MCCWS) aims at finding word boundaries in a Chinese sentence composed of continuous characters while multiple segmentation criteria exist.
In this paper, we combine the strengths of the unified framework and the pre-trained language model, and propose a unified MCCWS model based on BERT.
Experiments on eight datasets with diverse criteria demonstrate that our method achieves new state-of-the-art results for MCCWS.
- Score: 82.16846720508748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-Criteria Chinese Word Segmentation (MCCWS) aims at finding word
boundaries in a Chinese sentence composed of continuous characters while
multiple segmentation criteria exist. The unified framework has been widely
used in MCCWS and has shown its effectiveness. In addition, the pre-trained BERT
language model has also been introduced into the MCCWS task within a multi-task
learning framework. In this paper, we combine the strengths of the unified
framework and the pre-trained language model, and propose a unified MCCWS model
based on BERT. Moreover, we augment the unified BERT-based MCCWS model with
bigram features and an auxiliary criterion classification task. Experiments on
eight datasets with diverse criteria demonstrate that our method achieves
new state-of-the-art results for MCCWS.
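To make the described architecture concrete, the following is a minimal sketch (not the authors' released code) of how a unified BERT-based MCCWS model can be set up: a criterion marker token is prepended to the character sequence so that a single model serves all criteria, a token-level head predicts B/M/E/S segmentation tags, and an auxiliary head classifies the criterion from the pooled representation. The model name, the criterion marker tokens, and the omission of the bigram features are illustrative assumptions.

```python
# Hedged sketch of a unified BERT-based MCCWS model, assuming the
# "criterion token + BMES tagging + auxiliary criterion classification"
# setup described in the abstract. Not the authors' implementation.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

TAGS = ["B", "M", "E", "S"]                 # word-boundary tag set
CRITERIA = ["pku", "msr", "as", "cityu"]    # example segmentation criteria

class UnifiedMCCWS(nn.Module):
    def __init__(self, model_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        hidden = self.bert.config.hidden_size
        self.tag_head = nn.Linear(hidden, len(TAGS))            # segmentation tags
        self.criterion_head = nn.Linear(hidden, len(CRITERIA))  # auxiliary task

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        token_logits = self.tag_head(out.last_hidden_state)        # (B, L, 4)
        criterion_logits = self.criterion_head(out.pooler_output)  # (B, |criteria|)
        return token_logits, criterion_logits

# Prepend a per-criterion marker token so one model handles every criterion.
tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
markers = ["[" + c.upper() + "]" for c in CRITERIA]   # e.g. "[PKU]"
tokenizer.add_special_tokens({"additional_special_tokens": markers})

model = UnifiedMCCWS()
model.bert.resize_token_embeddings(len(tokenizer))

sentence = "今天天气很好"
marked = "[PKU] " + " ".join(sentence)     # criterion marker + character-split input
enc = tokenizer(marked, return_tensors="pt")
tag_logits, crit_logits = model(enc["input_ids"], enc["attention_mask"])
```

At training time, a cross-entropy loss over the tag logits would be combined with the auxiliary criterion-classification loss, which corresponds to the multi-task objective the abstract refers to.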
Related papers
- Rich Semantic Knowledge Enhanced Large Language Models for Few-shot Chinese Spell Checking [21.799697177859898]
In this paper, we explore using an in-context learning method named RS-LLM (Rich Semantic based LLMs) to introduce large language models (LLMs) as the foundation model.
We found that by introducing a small number of specific Chinese rich semantic structures, LLMs achieve better performance than the BERT-based model on few-shot CSC task.
arXiv Detail & Related papers (2024-03-13T12:55:43Z) - FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as inputs the sentences of textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - Multi-level Distillation of Semantic Knowledge for Pre-training
Multilingual Language Model [15.839724725094916]
Multi-level Multilingual Knowledge Distillation (MMKD) is a novel method for improving multilingual language models.
We employ a teacher-student framework to adopt rich semantic representation knowledge in English BERT.
We conduct experiments on cross-lingual evaluation benchmarks including XNLI, PAWS-X, and XQuAD.
arXiv Detail & Related papers (2022-11-02T15:23:13Z) - SMTCE: A Social Media Text Classification Evaluation Benchmark and
BERTology Models for Vietnamese [3.0938904602244355]
We introduce the Social Media Text Classification Evaluation (SMTCE) benchmark, as a collection of datasets and models across a diverse set of SMTC tasks.
We implement and analyze the effectiveness of a variety of multilingual BERT-based models and monolingual BERT-based models for tasks in the benchmark.
It provides an objective assessment of multilingual and monolingual BERT-based models on the benchmark, which will benefit future studies about BERTology in the Vietnamese language.
arXiv Detail & Related papers (2022-09-21T16:33:46Z) - Many-Class Text Classification with Matching [65.74328417321738]
We formulate Text Classification as a Matching problem between the text and the labels, and propose a simple yet effective framework named TCM.
Compared with previous text classification approaches, TCM takes advantage of the fine-grained semantic information of the classification labels.
arXiv Detail & Related papers (2022-05-23T15:51:19Z) - A Variational Hierarchical Model for Neural Cross-Lingual Summarization [85.44969140204026]
Cross-lingual summarization (CLS) converts a document in one language into a summary in another language.
Existing studies on CLS mainly focus on utilizing pipeline methods or jointly training an end-to-end model.
We propose a hierarchical model for the CLS task, based on the conditional variational auto-encoder.
arXiv Detail & Related papers (2022-03-08T02:46:11Z) - On Cross-Lingual Retrieval with Multilingual Text Encoders [51.60862829942932]
We study the suitability of state-of-the-art multilingual encoders for cross-lingual document and sentence retrieval tasks.
We benchmark their performance in unsupervised ad-hoc sentence- and document-level CLIR experiments.
We evaluate multilingual encoders fine-tuned in a supervised fashion (i.e., we learn to rank) on English relevance data in a series of zero-shot language and domain transfer CLIR experiments.
arXiv Detail & Related papers (2021-12-21T08:10:27Z) - Combining Deep Generative Models and Multi-lingual Pretraining for
Semi-supervised Document Classification [49.47925519332164]
We combine semi-supervised deep generative models and multi-lingual pretraining to form a pipeline for the document classification task.
Our framework is highly competitive and outperforms the state-of-the-art counterparts in low-resource settings across several languages.
arXiv Detail & Related papers (2021-01-26T11:26:14Z) - Pre-training with Meta Learning for Chinese Word Segmentation [44.872788258481755]
We propose a CWS-specific pre-trained model METASEG, which employs a unified architecture and incorporates a meta-learning algorithm into a multi-criteria pre-training task.
METASEG can achieve new state-of-the-art performance on twelve widely-used CWS datasets.
arXiv Detail & Related papers (2020-10-23T10:00:46Z) - Cross-lingual Information Retrieval with BERT [8.052497255948046]
We explore the use of the popular bidirectional language model, BERT, to model and learn the relevance between English queries and foreign-language documents.
A deep relevance matching model based on BERT is introduced and trained by finetuning a pretrained multilingual BERT model with weak supervision.
Experimental results of the retrieval of Lithuanian documents against short English queries show that our model is effective and outperforms the competitive baseline approaches.
arXiv Detail & Related papers (2020-04-24T23:32:13Z)