Data Augmentation for Spoken Grammatical Error Correction
- URL: http://arxiv.org/abs/2507.19374v1
- Date: Fri, 25 Jul 2025 15:25:17 GMT
- Title: Data Augmentation for Spoken Grammatical Error Correction
- Authors: Penny Karanasou, Mengjie Qian, Stefano Bannò, Mark J. F. Gales, Kate M. Knill
- Abstract summary: We propose a fully automated method to generate audio-text pairs with grammatical errors and disfluencies. Our experiments are conducted on the S&I Corpus, the first publicly available speech dataset with grammar error annotations.
- Score: 33.192165163181315
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While there exist strong benchmark datasets for grammatical error correction (GEC), high-quality annotated spoken datasets for Spoken GEC (SGEC) are still under-resourced. In this paper, we propose a fully automated method to generate audio-text pairs with grammatical errors and disfluencies. Moreover, we propose a series of objective metrics that can be used to evaluate the generated data and choose the more suitable dataset for SGEC. The goal is to generate an augmented dataset that maintains the textual and acoustic characteristics of the original data while providing new types of errors. This augmented dataset should augment and enrich the original corpus without altering the language assessment scores of the second language (L2) learners. We evaluate the use of the augmented corpus both for written GEC (the text part) and for SGEC (the audio-text pairs). Our experiments are conducted on the S&I Corpus, the first publicly available speech dataset with grammar error annotations.
Related papers
- Enhanced Hybrid Transducer and Attention Encoder Decoder with Text Data [10.662138902171497]
A joint transducer and attention-based encoder decoder (TAED) model is proposed to leverage large amounts of text corpus and enhance ASR accuracy. Experiments show J-TAED successfully integrates speech and linguistic information into one model, and reduces the WER by 5.8 to 12.8% on the Librispeech dataset.
arXiv Detail & Related papers (2025-06-23T21:51:39Z) - TGEA: An error-annotated dataset and benchmark tasks for text generation from pretrained language models [57.758735361535486]
TGEA is an error-annotated dataset for text generation from pretrained language models (PLMs). We create an error taxonomy to cover 24 types of errors occurring in PLM-generated sentences. This is the first dataset with comprehensive annotations for PLM-generated texts.
arXiv Detail & Related papers (2025-03-06T09:14:02Z) - Detecting Document-level Paraphrased Machine Generated Content: Mimicking Human Writing Style and Involving Discourse Features [57.34477506004105]
Machine-generated content poses challenges such as academic plagiarism and the spread of misinformation. We introduce novel methodologies and datasets to overcome these challenges. We propose MhBART, an encoder-decoder model designed to emulate human writing style. We also propose DTransformer, a model that integrates discourse analysis through PDTB preprocessing to encode structural features.
arXiv Detail & Related papers (2024-12-17T08:47:41Z) - ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction [6.220415006158471]
We introduce a new dataset for grammatical error correction tasks, named ChatLang-8.
ChatLang-8 consists of 1 million pairs featuring human-like grammatical errors.
We observe improved model performance when using ChatLang-8 instead of existing GEC datasets.
arXiv Detail & Related papers (2024-06-05T12:35:00Z) - Towards End-to-End Spoken Grammatical Error Correction [33.116296120680296]
Spoken grammatical error correction (GEC) aims to supply feedback to L2 learners on their use of grammar when speaking.
This paper introduces an alternative "end-to-end" approach to spoken GEC, exploiting a speech recognition foundation model, Whisper.
arXiv Detail & Related papers (2023-11-09T17:49:02Z) - RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE).
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z) - Speech collage: code-switched audio generation by collaging monolingual corpora [50.356820349870986]
Speech Collage is a method that synthesizes CS data from monolingual corpora by splicing audio segments.
We investigate the impact of generated data on speech recognition in two scenarios.
arXiv Detail & Related papers (2023-09-27T14:17:53Z) - GECTurk: Grammatical Error Correction and Detection Dataset for Turkish [1.804922416527064]
Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners.
Synthetic data generation is a common practice to overcome the scarcity of such data.
We present a flexible and synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules.
arXiv Detail & Related papers (2023-09-20T14:25:44Z) - Code-Switching Text Generation and Injection in Mandarin-English ASR [57.57570417273262]
We investigate text generation and injection for improving the performance of a streaming model commonly used in industry, Transformer-Transducer (T-T).
We first propose a strategy to generate code-switching text data and then investigate injecting generated text into the T-T model explicitly by Text-To-Speech (TTS) conversion or implicitly by tying speech and text latent spaces.
Experimental results on the T-T model trained with a dataset containing 1,800 hours of real Mandarin-English code-switched speech show that our approaches to inject generated code-switching text significantly boost the performance of T-T models.
arXiv Detail & Related papers (2023-03-20T09:13:27Z) - A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z) - ErAConD: Error Annotated Conversational Dialog Dataset for Grammatical Error Correction [30.917993017459615]
We present a novel parallel grammatical error correction (GEC) dataset drawn from open-domain conversations.
This dataset is, to our knowledge, the first GEC dataset targeted to a conversational setting.
To demonstrate the utility of the dataset, we use our annotated data to fine-tune a state-of-the-art GEC model.
arXiv Detail & Related papers (2021-12-15T20:27:40Z)