Efficient Grammatical Error Correction Via Multi-Task Training and
Optimized Training Schedule
- URL: http://arxiv.org/abs/2311.11813v1
- Date: Mon, 20 Nov 2023 14:50:12 GMT
- Title: Efficient Grammatical Error Correction Via Multi-Task Training and
Optimized Training Schedule
- Authors: Andrey Bout, Alexander Podolskiy, Sergey Nikolenko, Irina
Piontkovskaya
- Abstract summary: We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
- Score: 55.08778142798106
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Progress in neural grammatical error correction (GEC) is hindered by the lack
of annotated training data. Sufficient amounts of high-quality manually
annotated data are not available, so recent research has relied on generating
synthetic data, pretraining on it, and then fine-tuning on real datasets;
performance gains have been achieved either by ensembling or by using huge
pretrained models such as XXL-T5 as the backbone. In this work, we explore an
orthogonal direction: how to use available data more efficiently. First, we
propose auxiliary tasks that exploit the alignment between the original and
corrected sentences, such as predicting a sequence of corrections. We formulate
each task as a sequence-to-sequence problem and perform multi-task training.
Second, we discover that the order of datasets used for training and even
individual instances within a dataset may have important effects on the final
performance, so we set out to find the best training schedule. Together, these
two ideas lead to significant improvements, producing results that improve
state of the art with much smaller models; in particular, we outperform the
best models based on T5-XXL (11B parameters) with a BART-based model (400M
parameters).
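The abstract describes the auxiliary tasks and the training schedule only at a high level. As an illustration of the general idea rather than the authors' implementation, the sketch below derives an "edit sequence" target from an aligned (source, corrected) pair with Python's difflib, frames both the main correction task and the auxiliary task as prefixed seq2seq examples so that a single encoder-decoder (e.g. BART) could be trained on all of them, and lists a hypothetical dataset ordering. The prefixes, the edit-operation format, and the dataset names are assumptions, not details from the paper.
```python
# Illustrative sketch only: build multi-task seq2seq examples from an aligned
# (source, corrected) sentence pair. Prefixes and dataset names are hypothetical.
import difflib

def edit_sequence(source: str, corrected: str) -> str:
    """Describe the corrections as a flat sequence of edit operations."""
    src, tgt = source.split(), corrected.split()
    ops = []
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=src, b=tgt).get_opcodes():
        if tag == "equal":
            ops.append("KEEP " + " ".join(src[i1:i2]))
        elif tag == "replace":
            ops.append("REPLACE " + " ".join(src[i1:i2]) + " -> " + " ".join(tgt[j1:j2]))
        elif tag == "delete":
            ops.append("DELETE " + " ".join(src[i1:i2]))
        elif tag == "insert":
            ops.append("INSERT " + " ".join(tgt[j1:j2]))
    return " ; ".join(ops)

def make_multitask_examples(source: str, corrected: str):
    """One aligned pair yields several seq2seq training examples."""
    return [
        ("correct: " + source, corrected),                       # main GEC task
        ("edits: " + source, edit_sequence(source, corrected)),  # auxiliary task
    ]

# Hypothetical training schedule: noisier synthetic data first, cleaner
# human-annotated data last (the paper searches for the best ordering).
schedule = ["synthetic", "crawled_edits", "high_quality_annotated"]

if __name__ == "__main__":
    for x, y in make_multitask_examples("She go to school yesterday .",
                                        "She went to school yesterday ."):
        print(x, "=>", y)
```
In an actual setup, these (input, target) pairs would simply be tokenized and fed to the seq2seq model alongside ordinary correction examples, with the datasets visited in the chosen order.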
Related papers
- Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods rely on slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
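The summary only names the idea, so the following is a simplified reading rather than the paper's algorithm: keep per-domain sampling weights and update them online from the training loss observed on each domain, so that more informative domains are sampled more often. The exponential-weights rule and domain names below are assumptions.
```python
# Simplified illustration of online data mixing: exponential-weights update of
# per-domain sampling probabilities driven by recent training loss.
# This is a generic bandit-style rule, not necessarily the paper's algorithm.
import math
import random

domains = ["web", "books", "code"]          # hypothetical data domains
weights = {d: 1.0 for d in domains}
lr = 0.1                                    # mixing learning rate

def sample_domain() -> str:
    total = sum(weights.values())
    r, acc = random.uniform(0, total), 0.0
    for d, w in weights.items():
        acc += w
        if r <= acc:
            return d
    return domains[-1]

def update(domain: str, loss: float) -> None:
    # Higher loss -> domain is treated as more informative -> sampled more often.
    weights[domain] *= math.exp(lr * loss)

# Training loop sketch (train_step and next_batch are assumed to exist elsewhere):
# d = sample_domain(); loss = train_step(next_batch(d)); update(d, loss)
```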
arXiv Detail & Related papers (2023-12-05T00:42:35Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that repeating data intelligently consistently outperforms baseline training.
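As a rough sketch of what embedding-based selection could look like (the paper's exact de-duplication and diversification pipeline is not reproduced here), documents can be embedded with a pre-trained encoder and near-duplicates above a cosine-similarity threshold dropped. The encoder is replaced by random vectors in this toy example.
```python
# Toy illustration of embedding-based near-duplicate removal; random vectors
# stand in for embeddings from a real pre-trained document encoder.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def deduplicate(embeddings: list, threshold: float = 0.95) -> list:
    """Greedily keep documents that are not too similar to any kept one."""
    kept = []
    for i, e in enumerate(embeddings):
        if all(cosine(e, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

rng = np.random.default_rng(0)
docs = [rng.normal(size=16) for _ in range(5)]
docs.append(docs[0] + 1e-3 * rng.normal(size=16))   # near-duplicate of doc 0
print(deduplicate(docs))                             # the near-duplicate is dropped
```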
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
- Investigating Pre-trained Language Models on Cross-Domain Datasets, a Step Closer to General AI [0.8889304968879164]
We investigate the ability of pre-trained language models to generalize to different non-language tasks.
The four pre-trained models that we used, T5, BART, BERT, and GPT-2, achieve outstanding results.
arXiv Detail & Related papers (2023-06-21T11:55:17Z)
- Iterative Loop Learning Combining Self-Training and Active Learning for Domain Adaptive Semantic Segmentation [1.827510863075184]
Self-training and active learning have been proposed to alleviate this problem.
This paper proposes an iterative loop learning method combining self-training and active learning.
arXiv Detail & Related papers (2023-01-31T01:31:43Z)
- Self-Distillation for Further Pre-training of Transformers [83.84227016847096]
We propose self-distillation as a regularization for a further pre-training stage.
We empirically validate the efficacy of self-distillation on a variety of benchmark datasets for image and text classification tasks.
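A minimal sketch of self-distillation as a regularizer, assuming the common formulation in which a frozen copy of the model acts as its own teacher during the further pre-training stage and a KL term keeps the student's output distribution close to the teacher's. The loss weighting and the toy classifier below are illustrative, not taken from the paper.
```python
# Illustrative self-distillation regularizer in PyTorch: the frozen teacher is a
# copy of the student taken before further pre-training. Weighting is made up.
import copy
import torch
import torch.nn.functional as F

def self_distillation_loss(student, teacher, inputs, labels, alpha=0.5):
    """Task loss plus a KL term pulling the student toward the frozen teacher."""
    student_logits = student(inputs)
    with torch.no_grad():
        teacher_logits = teacher(inputs)
    task_loss = F.cross_entropy(student_logits, labels)
    kd_loss = F.kl_div(F.log_softmax(student_logits, dim=-1),
                       F.softmax(teacher_logits, dim=-1),
                       reduction="batchmean")
    return task_loss + alpha * kd_loss

# Toy usage with a linear classifier standing in for a transformer:
student = torch.nn.Linear(16, 4)
teacher = copy.deepcopy(student).eval()          # frozen copy acts as teacher
for p in teacher.parameters():
    p.requires_grad_(False)
loss = self_distillation_loss(student, teacher,
                              torch.randn(8, 16), torch.randint(0, 4, (8,)))
```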
arXiv Detail & Related papers (2022-09-30T02:25:12Z)
- Curriculum-Based Self-Training Makes Better Few-Shot Learners for Data-to-Text Generation [56.98033565736974]
We propose Curriculum-Based Self-Training (CBST) to leverage unlabeled data in a rearranged order determined by the difficulty of text generation.
Our method can outperform fine-tuning and task-adaptive pre-training methods, and achieve state-of-the-art performance in the few-shot setting of data-to-text generation.
arXiv Detail & Related papers (2022-06-06T16:11:58Z)
- Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, requiring 10x less data and 5x less pre-training time.
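One common way to distill feature representations, whether or not it matches the paper's exact objective, is to regress the student's features onto a frozen teacher's through a small projection head; the sketch below shows only that generic idea, with toy MLP encoders standing in for real backbones.
```python
# Generic feature-distillation sketch (not the paper's exact method): align
# student features with a frozen teacher's via an MSE loss on projected features.
import torch
import torch.nn as nn

class FeatureDistiller(nn.Module):
    def __init__(self, student, teacher, student_dim, teacher_dim):
        super().__init__()
        self.student, self.teacher = student, teacher
        self.proj = nn.Linear(student_dim, teacher_dim)   # bridge dimension gap
        for p in self.teacher.parameters():               # teacher stays frozen
            p.requires_grad_(False)

    def forward(self, x):
        with torch.no_grad():
            t_feat = self.teacher(x)
        s_feat = self.proj(self.student(x))
        return nn.functional.mse_loss(s_feat, t_feat)

# Toy usage:
student = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
teacher = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 128))
loss = FeatureDistiller(student, teacher, 64, 128)(torch.randn(8, 32))
```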
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
- Improved Fine-tuning by Leveraging Pre-training Data: Theory and Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch can achieve final performance that is no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z)
- GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning [11.220278271829699]
We introduce Glister, a GeneraLIzation based data Subset selecTion for Efficient and Robust learning framework.
We propose an iterative online algorithm Glister-Online, which performs data selection iteratively along with the parameter updates.
We show that our framework improves upon the state of the art in both efficiency and accuracy (in cases (a) and (c)) and is more efficient than other state-of-the-art robust learning algorithms.
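The following is a rough sketch in the spirit of generalization-based subset selection, using a simplified gradient-alignment heuristic rather than GLISTER's actual greedy algorithm: periodically keep the training points whose gradients align best with the gradient of the loss on a held-out validation set, and train on that subset until the next selection round. The toy model, loss, and data are assumptions for illustration.
```python
# Simplified gradient-alignment heuristic, not GLISTER's exact algorithm.
import torch

def flat_grad(model, loss):
    """Flattened gradient over trainable parameters (a real implementation
    would restrict this to the last layer for efficiency)."""
    params = [p for p in model.parameters() if p.requires_grad]
    return torch.cat([g.reshape(-1) for g in torch.autograd.grad(loss, params)])

def select_subset(model, loss_fn, train_points, val_point, k):
    """Keep the k training points whose gradients align best with validation."""
    val_grad = flat_grad(model, loss_fn(model, val_point))
    scores = [(torch.dot(flat_grad(model, loss_fn(model, p)), val_grad).item(), i)
              for i, p in enumerate(train_points)]
    return [i for _, i in sorted(scores, reverse=True)[:k]]

# Toy usage: a linear model, squared-error loss, scalar (x, y) data points.
model = torch.nn.Linear(1, 1)
loss_fn = lambda m, pt: ((m(torch.tensor([pt[0]])) - pt[1]) ** 2).sum()
train = [(1.0, 1.0), (2.0, 2.0), (3.0, -3.0)]
print(select_subset(model, loss_fn, train, (1.5, 1.5), k=2))
```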
arXiv Detail & Related papers (2020-12-19T08:41:34Z)
- Data Weighted Training Strategies for Grammatical Error Correction [8.370770440898454]
We show how to incorporate delta-log-perplexity, a type of example scoring, into a training schedule for Grammatical Error Correction (GEC).
Models trained on scored data achieve state-of-the-art results on common GEC test sets.
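The blurb names delta-log-perplexity scoring but not its form. One way to realize the general idea, stated here as an assumption rather than the paper's exact definition, is to score each example by the difference between its log-perplexity under a base checkpoint and under a checkpoint fine-tuned on trusted data, then order or weight training examples by that score.
```python
# Hedged sketch of delta-log-perplexity-style example scoring; the exact
# definition and its use in the paper may differ. Per-example log-perplexities
# under the two checkpoints are made-up numbers here.
def delta_log_perplexity(logppl_base: float, logppl_finetuned: float) -> float:
    # Positive score: the fine-tuned model finds the example easier than the
    # base model did, i.e. it looks more like the trusted data.
    return logppl_base - logppl_finetuned

def schedule_by_score(examples, scores):
    """Curriculum sketch: present lowest-scoring (noisiest) examples first."""
    return [ex for _, ex in sorted(zip(scores, examples))]

examples = ["sent_a", "sent_b", "sent_c"]
scores = [delta_log_perplexity(b, f) for b, f in [(3.1, 2.0), (2.2, 2.4), (4.0, 1.5)]]
print(schedule_by_score(examples, scores))   # noisiest first, cleanest last
```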
arXiv Detail & Related papers (2020-08-07T03:30:14Z)