Block the Label and Noise: An N-Gram Masked Speller for Chinese Spell Checking
- URL: http://arxiv.org/abs/2305.03314v1
- Date: Fri, 5 May 2023 06:43:56 GMT
- Title: Block the Label and Noise: An N-Gram Masked Speller for Chinese Spell Checking
- Authors: Haiyun Yang
- Abstract summary: This paper proposes an n-gram masking layer that masks current and/or surrounding tokens to avoid label leakage and error disturbance.
Experiments on SIGHAN datasets have demonstrated that the pluggable n-gram masking mechanism can improve the performance of prevalent CSC models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, Chinese Spell Checking (CSC), a task to detect erroneous characters
in a sentence and correct them, has attracted extensive interest because of its
wide applications in various NLP tasks. Most of the existing methods have
utilized BERT to extract semantic information for the CSC task. However, these
methods directly take sentences with only a few errors as inputs, where the
correct characters may leak answers to the model and dampen its ability to
capture distant context, while the erroneous characters may disturb the
semantic encoding process and result in poor representations. Based on such
observations, this paper proposes an n-gram masking layer that masks current
and/or surrounding tokens to avoid label leakage and error disturbance.
Moreover, considering that the mask strategy may ignore multi-modal information
indicated by errors, a novel dot-product gating mechanism is proposed to
integrate the phonological and morphological information with semantic
representation. Extensive experiments on SIGHAN datasets have demonstrated that
the pluggable n-gram masking mechanism can improve the performance of prevalent
CSC models and the proposed methods in this paper outperform multiple powerful
state-of-the-art models.
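To make the masking idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's implementation: the function name `ngram_mask`, the window placement, and the masking probability are all invented for this example. The idea is to replace a token together with its surrounding n-gram by [MASK], so that neither the correct character (label leakage) nor an erroneous character (error disturbance) reaches the encoder directly.

```python
import random

MASK_ID = 103  # [MASK] id in the bert-base-chinese vocabulary

def ngram_mask(token_ids, n=2, mask_prob=0.15, mask_id=MASK_ID):
    """Hypothetical sketch: randomly hide each token together with its
    n-gram neighbourhood, so the encoder must rely on distant context
    rather than the (possibly leaked or erroneous) local characters."""
    masked = list(token_ids)
    for i in range(len(token_ids)):
        if random.random() < mask_prob:
            lo = max(0, i - n // 2)           # window start around position i
            hi = min(len(token_ids), lo + n)  # window end (n tokens wide)
            for j in range(lo, hi):
                masked[j] = mask_id
    return masked

# Example: mask a toy id sequence before feeding it to a BERT encoder.
print(ngram_mask([672, 1962, 686, 4518, 2769], n=2, mask_prob=0.5))
```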
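The abstract does not give a formula for the dot-product gate, so the PyTorch sketch below is only one plausible reading: a per-position scalar gate computed from the dot product between the semantic state and projected phonological/morphological features, used to mix the two streams. The module name `DotProductGate`, the sigmoid, and the residual form are assumptions for illustration.

```python
import torch
import torch.nn as nn

class DotProductGate(nn.Module):
    """Hypothetical dot-product gating: the more the multi-modal
    features agree with the semantic state, the more they are mixed in."""

    def __init__(self, hidden: int):
        super().__init__()
        self.proj = nn.Linear(hidden, hidden)

    def forward(self, semantic: torch.Tensor, modal: torch.Tensor) -> torch.Tensor:
        # semantic, modal: (batch, seq_len, hidden)
        gate = torch.sigmoid((semantic * self.proj(modal)).sum(-1, keepdim=True))
        return semantic + gate * modal  # residual keeps semantics intact

fuse = DotProductGate(hidden=768)
sem = torch.randn(2, 8, 768)   # BERT semantic states
pho = torch.randn(2, 8, 768)   # phonological + morphological features
print(fuse(sem, pho).shape)    # torch.Size([2, 8, 768])
```

A residual formulation like this lets the gate fall back to the plain semantic representation when the phonological and morphological evidence is uninformative.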
Related papers
- ForgeryGPT: Multimodal Large Language Model For Explainable Image Forgery Detection and Localization [49.992614129625274]
ForgeryGPT is a novel framework that advances the Image Forgery Detection and Localization task.
It captures high-order correlations of forged images from diverse linguistic feature spaces.
It enables explainable generation and interactive dialogue through a newly customized Large Language Model (LLM) architecture.
arXiv Detail & Related papers (2024-10-14T07:56:51Z)
- Understanding and Mitigating Classification Errors Through Interpretable Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors.
We propose to discover those patterns of tokens that distinguish correct and erroneous predictions.
We show that our method, Premise, performs well in practice.
arXiv Detail & Related papers (2023-11-18T00:24:26Z)
- Improving Input-label Mapping with Demonstration Replay for In-context Learning [67.57288926736923]
In-context learning (ICL) is an emerging capability of large autoregressive language models.
We propose a novel ICL method called Sliding Causal Attention (RdSca).
We show that our method significantly improves the input-label mapping in ICL demonstrations.
arXiv Detail & Related papers (2023-10-30T14:29:41Z)
- Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, using digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z)
- Incremental Image Labeling via Iterative Refinement [4.7590051176368915]
In particular, the existence of the semantic gap problem leads to a many-to-many mapping between the information extracted from an image and its linguistic description.
This unavoidable bias further leads to poor performance on current computer vision tasks.
We introduce a Knowledge Representation (KR)-based methodology to provide guidelines driving the labeling process.
arXiv Detail & Related papers (2023-04-18T13:37:22Z)
- Hard Nominal Example-aware Template Mutual Matching for Industrial Anomaly Detection [74.9262846410559]
Hard Nominal Example-aware Template Mutual Matching (HETMM) aims to construct a robust prototype-based decision boundary, which can precisely distinguish between hard-nominal examples and anomalies.
arXiv Detail & Related papers (2023-03-28T17:54:56Z)
- Error-Robust Retrieval for Chinese Spelling Check [43.56073620728942]
Chinese Spelling Check (CSC) aims to detect and correct error tokens in Chinese contexts.
Previous methods may not fully leverage the existing datasets.
We introduce our plug-and-play retrieval method with error-robust information for Chinese Spelling Check.
arXiv Detail & Related papers (2022-11-15T01:55:34Z)
- uChecker: Masked Pretrained Language Models as Unsupervised Chinese Spelling Checkers [23.343006562849126]
We propose a framework named uChecker to conduct unsupervised spelling error detection and correction.
Masked pretrained language models such as BERT are introduced as the backbone model.
Benefiting from the various and flexible MASKing operations, we propose a Confusionset-guided masking strategy to fine-train the masked language model.
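As a toy illustration of what a confusionset-guided masking strategy could look like (the confusion set, the replacement probability, and the function name below are invented for this sketch, not uChecker's actual data or procedure):

```python
import random

# Toy confusion set: characters easily mistaken for one another
# (e.g. homophones). Real confusion sets are far larger.
CONFUSION = {"他": ["她", "它"], "在": ["再"], "的": ["得", "地"]}

def confusion_mask(chars, replace_prob=0.15):
    """Hypothetical sketch: corrupt some characters with confusable
    ones so the masked language model learns to restore the original."""
    return [
        random.choice(CONFUSION[ch])
        if ch in CONFUSION and random.random() < replace_prob else ch
        for ch in chars
    ]

print("".join(confusion_mask(list("他在看书的时候"), replace_prob=0.5)))
```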
arXiv Detail & Related papers (2022-09-15T05:57:12Z)
- Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z)
- Disentangling Representations of Text by Masking Transformers [27.6903196190087]
We learn binary masks over transformer weights or hidden units to uncover subsets of features that correlate with a specific factor of variation.
We evaluate this method with respect to its ability to disentangle representations of sentiment from genre in movie reviews, "toxicity" from dialect in Tweets, and syntax from semantics.
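A rough PyTorch sketch of the binary-mask idea, assuming a straight-through estimator over a frozen linear layer's weights; the paper's actual parameterization may differ:

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Hypothetical sketch: learn a binary mask over a frozen layer's
    weights via a straight-through estimator, keeping only the weight
    subset that correlates with one factor of variation."""

    def __init__(self, layer: nn.Linear):
        super().__init__()
        self.layer = layer
        for p in self.layer.parameters():
            p.requires_grad = False            # pretrained weights stay fixed
        self.logits = nn.Parameter(torch.ones_like(layer.weight))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        soft = torch.sigmoid(self.logits)
        hard = (soft > 0.5).float()
        mask = hard + soft - soft.detach()     # binary forward, soft gradient
        return nn.functional.linear(x, self.layer.weight * mask, self.layer.bias)

layer = MaskedLinear(nn.Linear(768, 768))
print(layer(torch.randn(1, 768)).shape)        # torch.Size([1, 768])
```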
arXiv Detail & Related papers (2021-04-14T22:45:34Z)