Ensembling and Knowledge Distilling of Large Sequence Taggers for
Grammatical Error Correction
- URL: http://arxiv.org/abs/2203.13064v1
- Date: Thu, 24 Mar 2022 13:18:36 GMT
- Title: Ensembling and Knowledge Distilling of Large Sequence Taggers for
Grammatical Error Correction
- Authors: Maksym Tarnavskyi, Artem Chernodub, Kostiantyn Omelianchuk
- Abstract summary: We investigate improvements to the GEC sequence tagging architecture with a focus on ensembling cutting-edge Transformer-based encoders in Large configurations.
Our best ensemble achieves a new SOTA result with an $F_{0.5}$ score of 76.05 on BEA-2019 (test).
In addition, we perform knowledge distillation with a trained ensemble to generate new synthetic training datasets, "Troy-Blogs" and "Troy-1BW".
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we investigate improvements to the GEC sequence tagging
architecture with a focus on ensembling of recent cutting-edge
Transformer-based encoders in Large configurations. We encourage ensembling
models by majority votes on span-level edits because this approach is tolerant
to differences in model architecture and vocabulary size. Our best ensemble achieves a new
SOTA result with an $F_{0.5}$ score of 76.05 on BEA-2019 (test), even without
pre-training on synthetic datasets. In addition, we perform knowledge
distillation with a trained ensemble to generate new synthetic training
datasets, "Troy-Blogs" and "Troy-1BW". Our best single sequence tagging model,
pretrained on the generated Troy-datasets in combination with the publicly
available synthetic PIE dataset, achieves a near-SOTA result with an $F_{0.5}$
score of 73.21 on BEA-2019 (test). To the best of our knowledge, our best
single model gives way only to the much heavier T5 model. The code, datasets,
and trained models are publicly available.
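The span-level majority voting described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' released implementation: the tuple-based edit encoding `(start, end, replacement)` and the strict-majority threshold are assumptions made for the example.

```python
from collections import Counter

def ensemble_edits(model_edits, min_votes=None):
    """Keep only span-level edits proposed by a majority of the models.

    model_edits: one collection of edits per model; each edit is a hashable
    tuple (start_token, end_token, replacement). Because edits are compared
    as opaque spans, the scheme is indifferent to each model's architecture
    or vocabulary, which is the property the abstract highlights.
    """
    if min_votes is None:
        min_votes = len(model_edits) // 2 + 1  # strict majority by default
    votes = Counter(edit for edits in model_edits for edit in set(edits))
    return {edit for edit, n in votes.items() if n >= min_votes}

# Three hypothetical taggers propose edits for "She go to school yesterday".
m1 = {(1, 2, "went"), (4, 5, "")}
m2 = {(1, 2, "went")}
m3 = {(1, 2, "went"), (0, 1, "she")}
print(ensemble_edits([m1, m2, m3]))  # {(1, 2, 'went')}
```

Only the edit proposed by at least two of the three models survives; minority edits are discarded, which trades some recall for precision and tends to raise $F_{0.5}$.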
Related papers
- Iceberg: Enhancing HLS Modeling with Synthetic Data [61.48659845413156]
Iceberg is a synthetic data augmentation approach that expands both large language model (LLM)-generated programs and weak labels of unseen design configurations.
Our weak label generation method is integrated with an in-context model architecture, enabling meta-learning from actual and proximate labels.
arXiv Detail & Related papers (2025-07-14T05:48:09Z)
- Scaling Laws of Synthetic Data for Language Models [132.67350443447611]
We introduce SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets.
Our approach achieves this by automatically extracting and recombining high-level concepts across multiple documents using a graph algorithm.
arXiv Detail & Related papers (2025-03-25T11:07:12Z)
- UnitCoder: Scalable Iterative Code Synthesis with Unit Test Guidance [65.01483640267885]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various tasks, yet code generation remains a major challenge.
We introduce UnitCoder, a systematic pipeline leveraging model-generated unit tests to guide and validate the code generation process.
Our work presents a scalable approach that leverages model-generated unit tests to guide the synthesis of high-quality code data from pre-training corpora.
arXiv Detail & Related papers (2025-02-17T05:37:02Z)
- Generating Realistic Tabular Data with Large Language Models [49.03536886067729]
Large language models (LLMs) have been used for diverse tasks but do not capture the correct correlation between the features and the target variable.
We propose an LLM-based method with three important improvements to correctly capture the ground-truth feature-class correlation in the real data.
Our experiments show that our method significantly outperforms 10 SOTA baselines on 20 datasets in downstream tasks.
arXiv Detail & Related papers (2024-10-29T04:14:32Z)
- Machine Unlearning using a Multi-GAN based Model [0.0]
This article presents a new machine unlearning approach that utilizes multiple Generative Adversarial Network (GAN) based models.
The proposed method comprises two phases: i) data reorganization, in which GAN-generated synthetic data is introduced with inverted class labels for the forget set, and ii) fine-tuning of the pre-trained model.
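The data-reorganization phase above can be illustrated with a minimal sketch. The binary-label setup and the `synthesize` stand-in (random vectors in place of the paper's GAN generator) are hypothetical simplifications for illustration only.

```python
import random

def synthesize(n_samples, n_features=4, seed=0):
    """Stand-in for a GAN generator: random feature vectors (hypothetical)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n_features)] for _ in range(n_samples)]

def build_unlearning_set(forget_labels, n_classes=2):
    """Phase i): pair synthetic samples with inverted labels of the forget set.

    Training on deliberately wrong labels for the forget classes pushes the
    model away from what it previously learned about them.
    """
    samples = synthesize(len(forget_labels))
    inverted = [(n_classes - 1) - y for y in forget_labels]  # flip binary labels
    return list(zip(samples, inverted))

# Phase ii) would then fine-tune the pre-trained model on this set.
data = build_unlearning_set([0, 1, 1])
print([y for _, y in data])  # [1, 0, 0]
```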
arXiv Detail & Related papers (2024-07-26T02:28:32Z)
- A synthetic data approach for domain generalization of NLI models [13.840374911669167]
Natural Language Inference (NLI) remains an important benchmark task for LLMs.
We show that synthetic high-quality datasets can adapt NLI models for zero-shot use in downstream applications.
We show that models trained on this data have the best generalization to completely new downstream test settings.
arXiv Detail & Related papers (2024-02-19T18:55:16Z)
- MosaicBERT: A Bidirectional Encoder Optimized for Fast Pretraining [10.421048804389343]
We introduce MosaicBERT, a BERT-style encoder architecture and training recipe that is empirically optimized for fast pretraining.
When pretrained from scratch on the C4 dataset, this base model achieves a downstream average GLUE (dev) score of 79.6 in 1.13 hours on 8 A100 80 GB GPUs at a cost of roughly $20.
This empirical speedup in pretraining enables researchers and engineers to pretrain custom BERT-style models at low cost instead of finetuning existing generic models.
arXiv Detail & Related papers (2023-12-29T06:05:19Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- TarGEN: Targeted Data Generation with Large Language Models [51.87504111286201]
TarGEN is a multi-step prompting strategy for generating high-quality synthetic datasets.
We augment TarGEN with a method known as self-correction empowering LLMs to rectify inaccurately labeled instances.
A comprehensive analysis of the synthetic dataset compared to the original dataset reveals similar or higher levels of dataset complexity and diversity.
arXiv Detail & Related papers (2023-10-27T03:32:17Z)
- Generalizing Dataset Distillation via Deep Generative Prior [75.9031209877651]
We propose to distill an entire dataset's knowledge into a few synthetic images.
The idea is to synthesize a small number of synthetic data points that, when given to a learning algorithm as training data, result in a model approximating one trained on the original data.
We present a new optimization algorithm that distills a large number of images into a few intermediate feature vectors in the generative model's latent space.
arXiv Detail & Related papers (2023-05-02T17:59:31Z)
- Pretraining Without Attention [114.99187017618408]
This work explores pretraining without attention by using recent advances in sequence routing based on state-space models (SSMs).
BiGS is able to match BERT pretraining accuracy on GLUE and can be extended to long-form pretraining of 4096 tokens without approximation.
arXiv Detail & Related papers (2022-12-20T18:50:08Z)
- Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z)
- Machine Learning Techniques to Construct Patched Analog Ensembles for Data Assimilation [0.0]
We study general and variational autoencoders for the machine learning component of cAnEnOI.
We propose using patching schemes to divide the global spatial domain into digestible chunks.
Testing this new algorithm on a 1D toy model, we find that larger patch sizes make it harder to train an accurate generative model.
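The patching scheme above, dividing a global domain into digestible chunks, can be sketched for the 1D case; the non-overlapping split and the patch size are illustrative choices, not the paper's exact scheme.

```python
def patch_1d(field, patch_size):
    """Split a 1D field into consecutive non-overlapping patches.

    Each patch can then be handled by its own generative model; the last
    patch may be shorter when patch_size does not divide the domain length.
    """
    return [field[i:i + patch_size] for i in range(0, len(field), patch_size)]

# A 10-point toy domain split into patches of size 4.
print(patch_1d(list(range(10)), 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

The trade-off the summary reports follows directly: larger `patch_size` means each generative model must capture more of the domain's variability, making it harder to train accurately.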
arXiv Detail & Related papers (2021-02-27T20:47:27Z)
- A Comparison of LSTM and BERT for Small Corpus [0.0]
Recent advancements in the NLP field have shown that transfer learning helps achieve state-of-the-art results on new tasks by tuning pre-trained models instead of training from scratch.
In this paper, we focus on a real-life scenario that scientists in academia and industry face frequently: given a small dataset, can we use a large pre-trained model like BERT and get better results than with simple models?
Our experimental results show that bidirectional LSTM models can achieve significantly better results than a BERT model on a small dataset, and these simple models train in much less time than it takes to tune the pre-trained counterparts.
arXiv Detail & Related papers (2020-09-11T14:01:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.