Exploring the Capacity of a Large-scale Masked Language Model to
Recognize Grammatical Errors
- URL: http://arxiv.org/abs/2108.12216v1
- Date: Fri, 27 Aug 2021 10:37:14 GMT
- Title: Exploring the Capacity of a Large-scale Masked Language Model to
Recognize Grammatical Errors
- Authors: Ryo Nagata, Manabu Kimura, and Kazuaki Hanawa
- Abstract summary: We show that 5 to 10% of training data are enough for a BERT-based error detection method to achieve performance equivalent to that of a non-language model-based method trained on the full data.
We also show with pseudo error data that it actually exhibits such nice properties in learning rules for recognizing various types of error.
- Score: 3.55517579369797
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore the capacity of a language model-based method for
grammatical error detection in detail. We first show that 5 to 10% of training
data are enough for a BERT-based error detection method to achieve performance
equivalent to what a non-language model-based method can achieve with the full
training data; recall improves much faster with respect to training data size
in the BERT-based method than in the non-language-model-based method, while
precision behaves similarly. These results suggest that (i) the BERT-based
method should have a good knowledge of the grammar required to recognize
certain types of error and that
(ii) it can transform the knowledge into error detection rules by fine-tuning
with a few training samples, which explains its high generalization ability in
grammatical error detection. We further show with pseudo error data that it
actually exhibits such nice properties in learning rules for recognizing
various types of error. Finally, based on these findings, we explore a
cost-effective method for detecting grammatical errors with feedback comments
explaining relevant grammatical rules to learners.
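The abstract gives no implementation details, but the setup it describes is commonly realized as token-level binary classification on top of BERT. The sketch below is a minimal illustration of that kind of fine-tuning step using the Hugging Face transformers library; the checkpoint name, toy sentence, label scheme, and learning rate are assumptions for illustration, not details from the paper.

```python
# Minimal sketch (not the authors' code): BERT fine-tuned as a token-level
# grammatical error detector. Labels: 0 = correct token, 1 = erroneous token.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForTokenClassification.from_pretrained("bert-base-cased", num_labels=2)

# One toy training sentence with word-level error labels.
words = ["He", "go", "to", "school", "yesterday", "."]
word_labels = [0, 1, 0, 0, 0, 0]   # "go" should be "went"

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
# Align word labels to subword tokens: special tokens ([CLS]/[SEP]) get -100 so
# the loss ignores them; subword pieces inherit their word's label.
labels = torch.tensor([[-100 if wid is None else word_labels[wid]
                        for wid in enc.word_ids(0)]])

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**enc, labels=labels).loss   # token-level cross-entropy
loss.backward()
optimizer.step()
```

In practice the labels would come from an annotated learner corpus; the paper's finding is that only 5 to 10% of such data may be needed to match a non-language model-based baseline trained on all of it.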
Related papers
- Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method [108.56493934296687]
We introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection.
We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text.
arXiv Detail & Related papers (2024-09-23T07:55:35Z) - Understanding and Mitigating Classification Errors Through Interpretable
Token Patterns [58.91023283103762]
Characterizing errors in easily interpretable terms gives insight into whether a classifier is prone to making systematic errors.
We propose to discover those patterns of tokens that distinguish correct and erroneous predictions.
We show that our method, Premise, performs well in practice.
arXiv Detail & Related papers (2023-11-18T00:24:26Z) - uChecker: Masked Pretrained Language Models as Unsupervised Chinese
Spelling Checkers [23.343006562849126]
We propose a framework named uChecker to conduct unsupervised spelling error detection and correction.
Masked pretrained language models such as BERT are introduced as the backbone model.
Benefiting from the various and flexible MASKing operations, we propose a Confusionset-guided masking strategy to fine-train the masked language model (a generic masked-LM detection sketch appears after this list).
arXiv Detail & Related papers (2022-09-15T05:57:12Z) - Detecting Label Errors using Pre-Trained Language Models [37.82128817976385]
We show that large pre-trained language models are extremely capable of identifying label errors in datasets.
We contribute a novel method to produce highly realistic, human-originated label noise from crowdsourced data, and demonstrate the effectiveness of this method on TweetNLP.
arXiv Detail & Related papers (2022-05-25T11:59:39Z) - Improving Pre-trained Language Models with Syntactic Dependency
Prediction Task for Chinese Semantic Error Recognition [52.55136323341319]
Existing Chinese text error detection mainly focuses on spelling and simple grammatical errors.
Chinese semantic errors are understudied and so complex that humans cannot easily recognize them.
arXiv Detail & Related papers (2022-04-15T13:55:32Z) - A Syntax-Guided Grammatical Error Correction Model with Dependency Tree
Correction [83.14159143179269]
Grammatical Error Correction (GEC) is a task of detecting and correcting grammatical errors in sentences.
We propose a syntax-guided GEC model (SG-GEC) which adopts the graph attention mechanism to utilize the syntactic knowledge of dependency trees.
We evaluate our model on public benchmarks of GEC task and it achieves competitive results.
arXiv Detail & Related papers (2021-11-05T07:07:48Z) - Grammatical Error Generation Based on Translated Fragments [0.0]
We perform neural machine translation of sentence fragments in order to create large amounts of training data for English grammatical error correction.
Our method aims at simulating mistakes made by second language learners, and produces a wider range of non-native style language.
arXiv Detail & Related papers (2021-04-20T12:43:40Z) - Spelling Error Correction with Soft-Masked BERT [11.122964733563117]
A state-of-the-art method for the task selects a character from a list of candidates for correction at each position of the sentence on the basis of BERT.
The accuracy of the method can be sub-optimal because BERT does not have sufficient capability to detect whether there is an error at each position.
We propose a novel neural architecture to address the aforementioned issue, which consists of a network for error detection and a network for error correction based on BERT (a rough sketch of this soft-masking idea appears after this list).
arXiv Detail & Related papers (2020-05-15T09:02:38Z) - On the Robustness of Language Encoders against Grammatical Errors [66.05648604987479]
We collect real grammatical errors from non-native speakers and conduct adversarial attacks to simulate these errors on clean text data.
Results confirm that the performance of all tested models is affected but the degree of impact varies.
arXiv Detail & Related papers (2020-05-12T11:01:44Z) - Towards Minimal Supervision BERT-based Grammar Error Correction [81.90356787324481]
We try to incorporate contextual information from a pre-trained language model to leverage annotations and benefit multilingual scenarios.
Results show strong potential of Bidirectional Encoder Representations from Transformers (BERT) in the grammatical error correction task.
arXiv Detail & Related papers (2020-01-10T15:45:59Z)
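As a generic illustration of the masked-language-model-as-checker idea behind the uChecker entry above (not the uChecker method itself, which additionally fine-trains the model with Confusionset-guided masking), the sketch below masks each position in turn and flags characters that the pretrained model finds improbable in context. The checkpoint, probability threshold, and toy sentence are illustrative assumptions.

```python
# Rough sketch: unsupervised error flagging with an off-the-shelf masked LM.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")
model.eval()

def flag_suspicious(text, threshold=0.01):
    """Mask each position in turn; flag tokens the MLM assigns low probability."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    flagged = []
    for i in range(1, len(ids) - 1):              # skip [CLS] and [SEP]
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
        probs = logits.softmax(dim=-1)
        if probs[ids[i]] < threshold:             # observed token is unlikely here
            best = tokenizer.convert_ids_to_tokens(int(probs.argmax()))
            flagged.append((i, tokenizer.convert_ids_to_tokens(int(ids[i])), best))
    return flagged

print(flag_suspicious("我今天去学校上颗。"))      # toy input; 颗 should be 课
```

uChecker goes beyond this off-the-shelf scoring by fine-training the masked language model with a Confusionset-guided masking strategy, as the entry above notes.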
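The Soft-Masked BERT entry above describes a detection network paired with a BERT-based correction network. The sketch below is a rough rendering of that soft-masking idea, assuming a bidirectional GRU detector; the residual connection and joint training used in the full model are omitted, and this is not the authors' implementation.

```python
# Rough sketch of soft masking: the detector's per-token error probability p
# mixes the original token embedding with the [MASK] embedding before BERT.
import torch
import torch.nn as nn
from transformers import BertTokenizer, BertForMaskedLM

class SoftMaskedCorrector(nn.Module):
    def __init__(self, bert_mlm, hidden=256):
        super().__init__()
        self.bert = bert_mlm                                   # correction network (MLM head)
        self.word_emb = bert_mlm.bert.embeddings.word_embeddings
        self.gru = nn.GRU(self.word_emb.embedding_dim, hidden,
                          batch_first=True, bidirectional=True)
        self.detect = nn.Linear(2 * hidden, 1)                 # per-token error probability

    def forward(self, input_ids, mask_token_id):
        e = self.word_emb(input_ids)                           # (B, L, D)
        p = torch.sigmoid(self.detect(self.gru(e)[0]))         # (B, L, 1)
        e_mask = self.word_emb(torch.full_like(input_ids, mask_token_id))
        soft = p * e_mask + (1 - p) * e                        # soft-masked embeddings
        logits = self.bert(inputs_embeds=soft).logits          # correction logits over vocab
        return p.squeeze(-1), logits

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = SoftMaskedCorrector(BertForMaskedLM.from_pretrained("bert-base-chinese"))
ids = tokenizer("我今天去学校上颗。", return_tensors="pt")["input_ids"]
error_probs, correction_logits = model(ids, tokenizer.mask_token_id)
```

The detector and corrector would normally be trained jointly on paired erroneous and corrected sentences; the sketch only shows the forward pass.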