Make BERT-based Chinese Spelling Check Model Enhanced by Layerwise
Attention and Gaussian Mixture Model
- URL: http://arxiv.org/abs/2312.16623v1
- Date: Wed, 27 Dec 2023 16:11:07 GMT
- Title: Make BERT-based Chinese Spelling Check Model Enhanced by Layerwise
Attention and Gaussian Mixture Model
- Authors: Yongchang Cao, Liang He, Zhen Wu, Xinyu Dai
- Abstract summary: We design a heterogeneous knowledge-infused framework to strengthen BERT-based CSC models.
We propose a novel form of n-gram-based layerwise self-attention to generate a multilayer representation.
Experimental results show that our proposed framework yields a stable performance boost over four strong baseline models.
- Score: 33.446533426654995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: BERT-based models have shown a remarkable ability in the Chinese Spelling
Check (CSC) task recently. However, traditional BERT-based methods still suffer
from two limitations. First, although previous works have identified that
explicit prior knowledge like Part-Of-Speech (POS) tagging can benefit the
CSC task, they neglected the fact that spelling errors inherent in CSC data can
lead to incorrect tags and therefore mislead models. Additionally, they ignored
the correlation between the implicit hierarchical information encoded by BERT's
intermediate layers and different linguistic phenomena. This results in
sub-optimal accuracy. To alleviate the above two issues, we design a
heterogeneous knowledge-infused framework to strengthen BERT-based CSC models.
To incorporate explicit POS knowledge, we utilize an auxiliary task strategy
driven by a Gaussian mixture model. Meanwhile, to incorporate implicit
hierarchical linguistic knowledge within the encoder, we propose a novel form
of n-gram-based layerwise self-attention to generate a multilayer
representation. Experimental results show that our proposed framework yields a
stable performance boost over four strong baseline models and outperforms the
previous state-of-the-art methods on two datasets.
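The abstract does not spell out the implementation, but the layerwise-fusion idea can be illustrated with a minimal PyTorch sketch: a small learned attention scores each of BERT's intermediate hidden states per token and mixes them into one representation for the correction head. All module names, shapes, the example sentence, and the bert-base-chinese checkpoint below are illustrative assumptions; the paper's n-gram-based windowing and its GMM-driven POS auxiliary task are not reproduced here.

```python
# Illustrative sketch only (not the paper's code): fuse BERT's intermediate
# layers with a learned, per-token attention over layers. The n-gram windowing
# and the GMM-weighted POS auxiliary objective from the paper are omitted.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast


class LayerwiseFusion(nn.Module):
    """Score every encoder layer per token and mix them into one vector."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)  # shared layer-scoring head

    def forward(self, hidden_states):
        # hidden_states: tuple of (batch, seq, hidden), one entry per layer
        stacked = torch.stack(hidden_states, dim=2)   # (B, T, L, H)
        weights = self.score(stacked).softmax(dim=2)  # (B, T, L, 1)
        return (weights * stacked).sum(dim=2)         # (B, T, H)


tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese", output_hidden_states=True)
fusion = LayerwiseFusion(encoder.config.hidden_size)

batch = tokenizer(["我们明天去公圆玩"], return_tensors="pt")  # "公圆" misspells "公园"
with torch.no_grad():
    fused = fusion(encoder(**batch).hidden_states)  # multilayer token representation
print(fused.shape)  # (1, seq_len, 768); would feed a CSC correction classifier
```

In a full model, the correction classifier and the paper's auxiliary POS objective would be trained on top of this fused representation; that part is not sketched here.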
Related papers
- FLIP: Fine-grained Alignment between ID-based Models and Pretrained Language Models for CTR Prediction [49.510163437116645]
Click-through rate (CTR) prediction serves as a core function module in personalized online services.
Traditional ID-based models for CTR prediction take as inputs the one-hot encoded ID features of tabular modality.
Pretrained Language Models (PLMs) have given rise to another paradigm, which takes as inputs the sentences of textual modality.
We propose to conduct Fine-grained feature-level ALignment between ID-based Models and Pretrained Language Models (FLIP) for CTR prediction.
arXiv Detail & Related papers (2023-10-30T11:25:03Z) - Pre-training Code Representation with Semantic Flow Graph for Effective
Bug Localization [4.159296619915587]
We propose a novel directed, multiple-label code graph representation named Semantic Flow Graph (SFG)
We show that our method achieves state-of-the-art performance in bug localization.
arXiv Detail & Related papers (2023-08-24T13:25:17Z) - Scalable Learning of Latent Language Structure With Logical Offline
Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z) - Rethinking Masked Language Modeling for Chinese Spelling Correction [70.85829000570203]
We study Chinese Spelling Correction (CSC) as a joint decision made by two separate models: a language model and an error model.
We find that fine-tuning BERT tends to over-fit the error model while under-fitting the language model, resulting in poor generalization to out-of-distribution error patterns.
We demonstrate that a very simple strategy, randomly masking 20% of the non-error tokens from the input sequence during fine-tuning, is sufficient for learning a much better language model without sacrificing the error model (a minimal sketch of this masking step appears after this list).
arXiv Detail & Related papers (2023-05-28T13:19:12Z) - Pretraining Without Attention [114.99187017618408]
This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs).
BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation.
arXiv Detail & Related papers (2022-12-20T18:50:08Z) - CSCD-NS: a Chinese Spelling Check Dataset for Native Speakers [62.61866477815883]
We present CSCD-NS, the first Chinese spelling check dataset designed for native speakers.
Compared with existing CSC datasets, CSCD-NS is ten times larger in scale and exhibits a distinct error distribution.
We propose a novel method that simulates the input process through an input method.
arXiv Detail & Related papers (2022-11-16T09:25:42Z) - CascadER: Cross-Modal Cascading for Knowledge Graph Link Prediction [22.96768147978534]
We propose a tiered ranking architecture CascadER to maintain the ranking accuracy of full ensembling while improving efficiency considerably.
CascadER uses LMs to rerank the outputs of more efficient base KGEs, relying on an adaptive subset selection scheme aimed at invoking the LMs minimally while maximizing accuracy gain over the KGE.
Our empirical analyses reveal that diversity of models across modalities and preservation of individual models' confidence signals help explain the effectiveness of CascadER.
arXiv Detail & Related papers (2022-05-16T22:55:45Z) - Roof-BERT: Divide Understanding Labour and Join in Work [7.523253052992842]
Roof-BERT is a model with two underlying BERTs and a fusion layer on them.
One of the underlying BERTs encodes the knowledge resources and the other one encodes the original input sentences.
Experimental results on the QA task reveal the effectiveness of the proposed model.
arXiv Detail & Related papers (2021-12-13T15:40:54Z) - BERT4GCN: Using BERT Intermediate Layers to Augment GCN for Aspect-based
Sentiment Classification [2.982218441172364]
Graph-based Aspect-based Sentiment Classification (ABSC) approaches have yielded state-of-the-art results, especially when equipped with contextual word embeddings from pre-trained language models (PLMs).
We propose a novel model, BERT4GCN, which integrates the grammatical sequential features from the PLM of BERT, and the syntactic knowledge from dependency graphs.
arXiv Detail & Related papers (2021-10-01T02:03:43Z) - Explaining and Improving BERT Performance on Lexical Semantic Change
Detection [22.934650688233734]
The recent success of type-based models in SemEval-2020 Task 1 has raised the question of why the success of token-based models does not translate to lexical semantic change detection.
We investigate the influence of a range of variables on clusterings of BERT vectors and show that its low performance is largely due to orthographic information on the target word.
arXiv Detail & Related papers (2021-03-12T13:29:30Z) - Incorporating BERT into Parallel Sequence Decoding with Adapters [82.65608966202396]
We propose to take two different BERT models as the encoder and decoder respectively, and fine-tune them by introducing simple and lightweight adapter modules.
We obtain a flexible and efficient model which is able to jointly leverage the information contained in the source-side and target-side BERT models.
Our framework is based on a parallel sequence decoding algorithm named Mask-Predict, considering the bidirectional and conditionally independent nature of BERT.
arXiv Detail & Related papers (2020-10-13T03:25:15Z)
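As referenced in the Rethinking Masked Language Modeling entry above, the 20% non-error masking strategy is simple enough to sketch. The helper below is a hypothetical illustration under stated assumptions (function name, toy token ids, and the [MASK] handling are not the authors' code).

```python
# Hypothetical sketch of the masking strategy summarized above: during
# fine-tuning, randomly replace ~20% of the non-error input tokens with
# [MASK] so the model keeps learning a language model, not only an error model.
import random


def mask_non_error_tokens(input_ids, target_ids, mask_id, rate=0.2, seed=None):
    """Mask a random subset of positions where the input token already matches the target."""
    rng = random.Random(seed)
    masked = list(input_ids)
    for i, (src, tgt) in enumerate(zip(input_ids, target_ids)):
        # Real code would also skip special tokens such as [CLS] and [SEP].
        if src == tgt and rng.random() < rate:  # non-error position
            masked[i] = mask_id
    return masked


# Toy usage with made-up ids; 103 is [MASK] in the bert-base-chinese vocabulary.
src = [101, 21, 22, 23, 99, 102]  # input sequence; index 4 holds a misspelled token
tgt = [101, 21, 22, 23, 88, 102]  # gold sequence with the correction at index 4
print(mask_non_error_tokens(src, tgt, mask_id=103))  # masks ~20% of matching positions
```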
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content shown here (including all information) and is not responsible for any consequences of its use.