A Simple Recipe for Multilingual Grammatical Error Correction
- URL: http://arxiv.org/abs/2106.03830v1
- Date: Mon, 7 Jun 2021 17:47:04 GMT
- Title: A Simple Recipe for Multilingual Grammatical Error Correction
- Authors: Sascha Rothe, Jonathan Mallinson, Eric Malmi, Sebastian Krause,
Aliaksei Severyn
- Abstract summary: This paper presents a recipe to train state-of-the-art multilingual Grammatical Error Correction (GEC) models.
We first propose a language-agnostic method to generate a large number of synthetic examples.
The second ingredient is to use large-scale multilingual language models.
- Score: 6.262434757334487
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a simple recipe to train state-of-the-art multilingual
Grammatical Error Correction (GEC) models. We achieve this by first proposing a
language-agnostic method to generate a large number of synthetic examples. The
second ingredient is to use large-scale multilingual language models (up to 11B
parameters). Once fine-tuned on language-specific supervised sets, we surpass
the previous state-of-the-art results on GEC benchmarks in four languages:
English, Czech, German and Russian. Having established a new set of baselines
for GEC, we make our results easily reproducible and accessible by releasing a
cLang-8 dataset. It is produced by using our best model, which we call gT5, to
clean the targets of a widely used yet noisy lang-8 dataset. cLang-8 greatly
simplifies typical GEC training pipelines composed of multiple fine-tuning
stages -- we demonstrate that performing a single fine-tuning step on cLang-8
with the off-the-shelf language models yields further accuracy improvements
over an already top-performing gT5 model for English.
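To make the first ingredient concrete, below is a minimal sketch of language-agnostic synthetic example generation: corrupt clean monolingual text with random edits and train a seq2seq model to reverse them. The edit types and rates are illustrative assumptions; the abstract does not specify the exact corruption procedure used for gT5.

```python
import random

def corrupt_sentence(sentence, word_rate=0.1, char_rate=0.02):
    """Return a synthetically corrupted version of a clean sentence."""
    alphabet = [c for c in sentence if not c.isspace()]  # stay script-agnostic
    corrupted = []
    for word in sentence.split():
        r = random.random()
        if r < word_rate / 2:
            continue                          # drop the word entirely
        if r < word_rate:
            corrupted.extend([word, word])    # duplicate the word
            continue
        chars = list(word)
        for i in range(len(chars)):
            if random.random() < char_rate:
                chars[i] = random.choice(alphabet)  # replace one character
        corrupted.append("".join(chars))
    if len(corrupted) > 1 and random.random() < 0.05:
        i = random.randrange(len(corrupted) - 1)
        corrupted[i], corrupted[i + 1] = corrupted[i + 1], corrupted[i]  # swap neighbors
    return " ".join(corrupted)

def synthetic_pairs(clean_corpus):
    """Yield (corrupted_source, clean_target) pairs for GEC pretraining."""
    for sentence in clean_corpus:
        yield corrupt_sentence(sentence), sentence
```

Because replacement characters are sampled from the sentence itself rather than a fixed alphabet, the same procedure applies unchanged to any language or script.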
Related papers
- Using Language Models to Disambiguate Lexical Choices in Translation [13.795280427753648]
In translation, a concept represented by a single word in a source language can have multiple variations in a target language.
We work with native speakers of nine languages to create DTAiLS, a dataset of 1,377 sentence pairs that exhibit cross-lingual concept variation when translating from English.
arXiv Detail & Related papers (2024-11-08T18:48:57Z) - ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction [6.220415006158471]
We introduce a new dataset for grammatical error correction tasks, named ChatLang-8.
ChatLang-8 consists of 1 million pairs featuring human-like grammatical errors.
We observe improved model performance when using ChatLang-8 instead of existing GEC datasets.
arXiv Detail & Related papers (2024-06-05T12:35:00Z) - mmT5: Modular Multilingual Pre-Training Solves Source Language
Hallucinations [54.42422445568523]
mmT5 is a modular multilingual sequence-to-sequence model.
It disentangles language-specific information from language-agnostic information.
Compared to mT5, mmT5 raises the rate of generating text in the correct language under zero-shot settings from 7% to 99%.
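A toy version of that modular idea is sketched below: shared transformer weights plus a small bottleneck module per language, selected by language ID. The sizes and placement are assumptions; mmT5's actual modules differ in detail.

```python
import torch

class LanguageAdapter(torch.nn.Module):
    """One small bottleneck module per language, routed by language ID."""

    def __init__(self, d_model=512, bottleneck=64, num_languages=100):
        super().__init__()
        self.adapters = torch.nn.ModuleList(
            torch.nn.Sequential(
                torch.nn.Linear(d_model, bottleneck),
                torch.nn.ReLU(),
                torch.nn.Linear(bottleneck, d_model),
            )
            for _ in range(num_languages)
        )

    def forward(self, hidden_states, lang_id):
        # Route through the module owned by this language; residual connection
        # keeps the shared representation intact.
        return hidden_states + self.adapters[lang_id](hidden_states)
```

Because only the routed module receives language-specific gradients, language-agnostic knowledge stays in the shared weights, which is the property the abstract credits for reducing source-language hallucinations.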
arXiv Detail & Related papers (2023-05-23T16:38:01Z) - QAmeleon: Multilingual QA with Only 5 Examples [71.80611036543633]
We show how to leverage pre-trained language models (PLMs) in a few-shot learning setting.
Our approach, QAmeleon, uses a PLM to automatically generate multilingual data upon which QA models are trained.
Prompt tuning the PLM for data synthesis with only five examples per language delivers accuracy superior to translation-based baselines.
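As a rough illustration of prompt tuning for data synthesis, the sketch below freezes a pretrained seq2seq model and learns only a short soft prompt prepended to the input embeddings. The model name, prompt length, and learning rate are assumptions; QAmeleon itself prompt-tunes a much larger PLM.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# "google/mt5-small" stands in for the much larger PLM used in the paper.
tokenizer = AutoTokenizer.from_pretrained("google/mt5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("google/mt5-small")

for p in model.parameters():
    p.requires_grad = False                  # the PLM itself stays frozen

PROMPT_LEN, d_model = 20, model.config.d_model
soft_prompt = torch.nn.Parameter(0.02 * torch.randn(PROMPT_LEN, d_model))
optimizer = torch.optim.Adam([soft_prompt], lr=0.3)  # only the prompt trains

def step(source: str, target: str) -> float:
    """One gradient step: prepend the soft prompt and fit the target text."""
    input_ids = tokenizer(source, return_tensors="pt").input_ids
    labels = tokenizer(target, return_tensors="pt").input_ids
    token_embeds = model.get_input_embeddings()(input_ids)         # (1, T, D)
    inputs_embeds = torch.cat([soft_prompt.unsqueeze(0), token_embeds], dim=1)
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()                          # gradient reaches only soft_prompt
    optimizer.step()
    return loss.item()
```

After tuning on the five examples per language, the prompted model is sampled to produce synthetic multilingual QA pairs for training downstream QA models.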
arXiv Detail & Related papers (2022-11-15T16:14:39Z) - A Unified Strategy for Multilingual Grammatical Error Correction with
Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z) - From Good to Best: Two-Stage Training for Cross-lingual Machine Reading
Comprehension [51.953428342923885]
We develop a two-stage approach to enhance model performance.
The first stage targets recall: we design a hard-learning (HL) algorithm to maximize the likelihood that the top-k predictions contain the accurate answer.
The second stage focuses on precision: an answer-aware contrastive learning mechanism is developed to learn the fine difference between the accurate answer and other candidates.
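Read literally, the two stages suggest losses like the sketch below, assuming a reader that scores a fixed set of candidate answer spans. Both formulations are reconstructions from this summary, not the paper's exact objectives.

```python
import torch
import torch.nn.functional as F

def hard_learning_loss(scores, gold_idx, k=5):
    """Recall stage (sketch): reward the top-k set for containing the gold
    answer by maximizing the probability mass it captures."""
    probs = F.softmax(scores, dim=-1)
    topk = torch.topk(scores, k).indices
    if gold_idx in topk:
        return -torch.log(probs[topk].sum() + 1e-9)  # gold is in: reinforce the set
    # Otherwise fall back to standard cross-entropy to pull the gold answer up.
    return F.cross_entropy(scores.unsqueeze(0), torch.tensor([gold_idx]))

def answer_aware_contrastive_loss(question_emb, span_embs, gold_idx, tau=0.1):
    """Precision stage (sketch): InfoNCE-style loss pulling the question
    representation toward the gold span and away from other candidates."""
    sims = F.cosine_similarity(question_emb.unsqueeze(0), span_embs) / tau
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([gold_idx]))
```

In the paper these two objectives drive separate training stages rather than a single joint loss.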
arXiv Detail & Related papers (2021-12-09T07:31:15Z) - Continual Learning in Multilingual NMT via Language-Specific Embeddings [92.91823064720232]
The proposed approach consists of replacing the shared vocabulary with a small language-specific vocabulary and fine-tuning the new embeddings on the new language's parallel data.
Because the parameters of the original model are not modified, its performance on the initial languages does not degrade.
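A minimal sketch of that recipe follows, assuming a Hugging Face seq2seq model; the model name and vocabulary size are placeholders, and in practice the new vocabulary would come from a tokenizer trained on the new language.

```python
import torch
from transformers import AutoModelForSeq2SeqLM

model = AutoModelForSeq2SeqLM.from_pretrained("facebook/mbart-large-50")

NEW_VOCAB_SIZE = 8000                                    # small language-specific vocab
model.resize_token_embeddings(NEW_VOCAB_SIZE)            # swap to the new table size
torch.nn.init.normal_(model.get_input_embeddings().weight, std=0.02)  # fresh weights

for p in model.parameters():
    p.requires_grad = False                              # original weights stay fixed
for p in model.get_input_embeddings().parameters():
    p.requires_grad = True                               # only new embeddings train

optimizer = torch.optim.Adam(
    [p for p in model.parameters() if p.requires_grad], lr=1e-4
)
# Fine-tune on the new language's parallel data; since the shared parameters
# are frozen, performance on the original languages cannot degrade.
```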
arXiv Detail & Related papers (2021-10-20T10:38:57Z) - UNKs Everywhere: Adapting Multilingual Language Models to New Scripts [103.79021395138423]
Massively multilingual language models such as multilingual BERT (mBERT) and XLM-R offer state-of-the-art cross-lingual transfer performance on a range of NLP tasks.
Due to the models' limited capacity and large differences in pretraining data, there is a substantial performance gap between resource-rich and resource-poor target languages.
We propose novel data-efficient methods that enable quick and effective adaptation of pretrained multilingual models to such low-resource languages and unseen scripts.
arXiv Detail & Related papers (2020-12-31T11:37:28Z) - MTOP: A Comprehensive Multilingual Task-Oriented Semantic Parsing
Benchmark [31.91964553419665]
We present a new multilingual dataset, called MTOP, comprising 100k annotated utterances in 6 languages across 11 domains.
We achieve an average improvement of +6.3 points on Slot F1 for the two existing multilingual datasets, over the best results reported in their experiments.
We demonstrate strong zero-shot performance using pre-trained models combined with automatic translation and alignment, and a proposed distant supervision method to reduce the noise in slot label projection.
arXiv Detail & Related papers (2020-08-21T07:02:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated information and is not responsible for any consequences arising from its use.