A Unified Strategy for Multilingual Grammatical Error Correction with
Pre-trained Cross-Lingual Language Model
- URL: http://arxiv.org/abs/2201.10707v1
- Date: Wed, 26 Jan 2022 02:10:32 GMT
- Title: A Unified Strategy for Multilingual Grammatical Error Correction with
Pre-trained Cross-Lingual Language Model
- Authors: Xin Sun, Tao Ge, Shuming Ma, Jingjing Li, Furu Wei, Houfeng Wang
- Abstract summary: We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
- Score: 100.67378875773495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Synthetic data construction of Grammatical Error Correction (GEC) for
non-English languages relies heavily on human-designed and language-specific
rules, which produce limited error-corrected patterns. In this paper, we
propose a generic and language-independent strategy for multilingual GEC, which
can train a GEC system effectively for a new non-English language with only two
easy-to-access resources: 1) a pretrained cross-lingual language model (PXLM)
and 2) parallel translation data between English and the language. Our approach
creates diverse parallel GEC data without any language-specific operations by
taking the non-autoregressive translation generated by PXLM and the gold
translation as error-corrected sentence pairs. Then, we reuse PXLM to
initialize the GEC model and pretrain it with the synthetic data generated by
itself, which yields further improvement. We evaluate our approach on three
public benchmarks of GEC in different languages. It achieves state-of-the-art
results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains
competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
Further analysis demonstrates that our data construction method is
complementary to rule-based approaches.
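To make the data-construction idea concrete, below is a minimal sketch of how a pretrained cross-lingual masked LM could turn parallel translation data into synthetic GEC pairs: the model's one-pass, non-autoregressive "translation" of the English source plays the role of the erroneous sentence, and the gold translation is its correction. This is an illustrative approximation, not the authors' released pipeline; xlm-roberta-base stands in for the PXLM, and the helper name synthetic_gec_pair is hypothetical.
```python
# Sketch (assumption-laden): build an (erroneous, corrected) pair from an
# English sentence and its gold translation using a masked cross-lingual LM.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForMaskedLM.from_pretrained("xlm-roberta-base")
model.eval()


def synthetic_gec_pair(english_src: str, gold_translation: str):
    """Return (noisy_translation, gold_translation) as an error-corrected pair."""
    src_ids = tokenizer(english_src, add_special_tokens=False)["input_ids"]
    gold_ids = tokenizer(gold_translation, add_special_tokens=False)["input_ids"]

    # Condition on the English source and predict every target token in a
    # single pass (non-autoregressive) from a fully masked target whose
    # length matches the gold translation.
    input_ids = (
        [tokenizer.cls_token_id]
        + src_ids
        + [tokenizer.sep_token_id]
        + [tokenizer.mask_token_id] * len(gold_ids)
        + [tokenizer.sep_token_id]
    )
    with torch.no_grad():
        logits = model(input_ids=torch.tensor([input_ids])).logits[0]

    start = len(src_ids) + 2  # index of the first masked target position
    pred_ids = logits[start : start + len(gold_ids)].argmax(dim=-1).tolist()
    noisy_translation = tokenizer.decode(pred_ids, skip_special_tokens=True)
    return noisy_translation, gold_translation


# The noisy hypothesis serves as the source and the gold translation as the
# target of a synthetic GEC pair.
print(synthetic_gec_pair("I like playing football.", "Ich spiele gern Fußball."))
```
Per the abstract, the second step then reuses the same PXLM to initialize the GEC model and pretrains it on the synthetic pairs; the sketch above only covers the data-construction half.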
Related papers
- LLM-based Code-Switched Text Generation for Grammatical Error Correction [3.4457319208816224]
This work explores the complexities of applying Grammatical Error Correction systems to code-switching (CSW) texts.
We evaluate state-of-the-art GEC systems on an authentic CSW dataset from English as a Second Language learners.
We develop a model capable of correcting grammatical errors in monolingual and CSW texts.
arXiv Detail & Related papers (2024-10-14T10:07:29Z)
- Open Generative Large Language Models for Galician [1.3049334790726996]
Large language models (LLMs) have transformed natural language processing.
Yet, their predominantly English-centric training has led to biases and performance disparities across languages.
This imbalance marginalizes minoritized languages, making equitable access to NLP technologies more difficult for languages with lower resources, such as Galician.
We present the first two generative LLMs focused on Galician to bridge this gap.
arXiv Detail & Related papers (2024-06-19T23:49:56Z)
- Grammatical Error Correction for Code-Switched Sentences by Learners of English [5.653145656597412]
We conduct the first exploration into the use of Grammar Error Correction systems on CSW text.
We generate synthetic CSW GEC datasets by translating different spans of text within existing GEC corpora.
We then investigate different methods of selecting these spans based on CSW ratio, switch-point factor and linguistic constraints.
Our best model achieves an average increase of 1.57 $F_{0.5}$ across 3 CSW test sets without affecting the model's performance on a monolingual dataset.
arXiv Detail & Related papers (2024-04-18T20:05:30Z)
- Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z)
- Are Pre-trained Language Models Useful for Model Ensemble in Chinese Grammatical Error Correction? [10.302225525539003]
We explore several ensemble strategies based on strong PLMs with four sophisticated single models.
Performance does not improve and even degrades with the PLM-based ensemble.
arXiv Detail & Related papers (2023-05-24T14:18:52Z)
- Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation [12.15509670220182]
Grammatical error correction (GEC) is a well-explored problem in English.
Research on GEC in morphologically rich languages has been limited due to challenges such as data scarcity and language complexity.
We present the first results on Arabic GEC using two newly developed Transformer-based pretrained sequence-to-sequence models.
arXiv Detail & Related papers (2023-05-24T05:12:58Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (CLS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z)
- Cross-lingual Machine Reading Comprehension with Language Branch Knowledge Distillation [105.41167108465085]
Cross-lingual Machine Reading Comprehension (CLMRC) remains a challenging problem due to the lack of large-scale datasets in low-resource languages.
We propose a novel augmentation approach named Language Branch Machine Reading Comprehension (LBMRC).
LBMRC trains multiple machine reading comprehension (MRC) models, each proficient in an individual language.
We devise a multilingual distillation approach to amalgamate knowledge from multiple language branch models to a single model for all target languages.
arXiv Detail & Related papers (2020-10-27T13:12:17Z)
- Mixed-Lingual Pre-training for Cross-lingual Summarization [54.4823498438831]
Cross-lingual Summarization aims at producing a summary in the target language for an article in the source language.
We propose a solution based on mixed-lingual pre-training that leverages both cross-lingual tasks like translation and monolingual tasks like masked language models.
Our model achieves an improvement of 2.82 (English to Chinese) and 1.15 (Chinese to English) ROUGE-1 scores over state-of-the-art results.
arXiv Detail & Related papers (2020-10-18T00:21:53Z)