ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction
- URL: http://arxiv.org/abs/2406.03202v2
- Date: Tue, 11 Jun 2024 07:06:34 GMT
- Title: ChatLang-8: An LLM-Based Synthetic Data Generation Framework for Grammatical Error Correction
- Authors: Jeiyoon Park, Chanjun Park, Heuiseok Lim
- Abstract summary: We introduce a new dataset for grammatical error correction tasks, named ChatLang-8.
ChatLang-8 consists of 1 million pairs featuring human-like grammatical errors.
We observe improved model performance when using ChatLang-8 instead of existing GEC datasets.
- Score: 6.220415006158471
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We explore and improve the capabilities of LLMs to generate data for grammatical error correction (GEC). When merely producing parallel sentences, their patterns are too simplistic to be valuable as a corpus. To address this issue, we propose an automated framework that includes a Subject Selector, Grammar Selector, Prompt Manager, and Evaluator. Additionally, we introduce a new dataset for GEC tasks, named ChatLang-8, which encompasses eight types of subject nouns and 23 types of grammar. It consists of 1 million pairs featuring human-like grammatical errors. Our experiments reveal that ChatLang-8 exhibits a more uniform pattern composition compared to existing GEC datasets. Furthermore, we observe improved model performance when using ChatLang-8 instead of existing GEC datasets. The experimental results suggest that our framework and ChatLang-8 are valuable resources for enhancing ChatGPT's data generation capabilities.
Related papers
- Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation [73.9145653659403]
We show that Generative Error Correction models struggle to generalize beyond the specific types of errors encountered during training.
We propose DARAG, a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios.
Our approach is simple, scalable, and both domain- and language-agnostic.
arXiv Detail & Related papers (2024-10-17T04:00:29Z)
- LLM-based Code-Switched Text Generation for Grammatical Error Correction [3.4457319208816224]
This work explores the complexities of applying Grammatical Error Correction systems to code-switching (CSW) texts.
We evaluate state-of-the-art GEC systems on an authentic CSW dataset from English as a Second Language learners.
We develop a model capable of correcting grammatical errors in monolingual and CSW texts.
arXiv Detail & Related papers (2024-10-14T10:07:29Z)
- Exploiting Contextual Target Attributes for Target Sentiment Classification [53.30511968323911]
Existing PTLM-based models for TSC can be categorized into two groups: 1) fine-tuning-based models that adopt PTLM as the context encoder; 2) prompting-based models that transfer the classification task to the text/word generation task.
We present a new perspective of leveraging PTLM for TSC: simultaneously leveraging the merits of both language modeling and explicit target-context interactions via contextual target attributes.
arXiv Detail & Related papers (2023-12-21T11:45:28Z)
- RegaVAE: A Retrieval-Augmented Gaussian Mixture Variational Auto-Encoder for Language Modeling [79.56442336234221]
We introduce RegaVAE, a retrieval-augmented language model built upon the variational auto-encoder (VAE).
It encodes the text corpus into a latent space, capturing current and future information from both source and target text.
Experimental results on various datasets demonstrate significant improvements in text generation quality and hallucination removal.
arXiv Detail & Related papers (2023-10-16T16:42:01Z)
- Advancements in Arabic Grammatical Error Detection and Correction: An Empirical Investigation [12.15509670220182]
Grammatical error correction (GEC) is a well-explored problem in English.
Research on GEC in morphologically rich languages has been limited due to challenges such as data scarcity and language complexity.
We present the first results on Arabic GEC using two newly developed Transformer-based pretrained sequence-to-sequence models.
arXiv Detail & Related papers (2023-05-24T05:12:58Z)
- BenchCLAMP: A Benchmark for Evaluating Language Models on Syntactic and Semantic Parsing [55.058258437125524]
We introduce BenchCLAMP, a Benchmark to evaluate Constrained LAnguage Model Parsing.
We benchmark eight language models, including two GPT-3 variants available only through an API.
Our experiments show that encoder-decoder pretrained language models can achieve similar performance or surpass state-of-the-art methods for syntactic and semantic parsing when the model output is constrained to be valid.
arXiv Detail & Related papers (2022-06-21T18:34:11Z)
- A Unified Strategy for Multilingual Grammatical Error Correction with Pre-trained Cross-Lingual Language Model [100.67378875773495]
We propose a generic and language-independent strategy for multilingual Grammatical Error Correction.
Our approach creates diverse parallel GEC data without any language-specific operations.
It achieves the state-of-the-art results on the NLPCC 2018 Task 2 dataset (Chinese) and obtains competitive performance on Falko-Merlin (German) and RULEC-GEC (Russian).
arXiv Detail & Related papers (2022-01-26T02:10:32Z)
- ErAConD: Error Annotated Conversational Dialog Dataset for Grammatical Error Correction [30.917993017459615]
We present a novel parallel grammatical error correction (GEC) dataset drawn from open-domain conversations.
This dataset is, to our knowledge, the first GEC dataset targeted to a conversational setting.
To demonstrate the utility of the dataset, we use our annotated data to fine-tune a state-of-the-art GEC model.
arXiv Detail & Related papers (2021-12-15T20:27:40Z)
- A Syntax-Guided Grammatical Error Correction Model with Dependency Tree Correction [83.14159143179269]
Grammatical Error Correction (GEC) is a task of detecting and correcting grammatical errors in sentences.
We propose a syntax-guided GEC model (SG-GEC) which adopts the graph attention mechanism to utilize the syntactic knowledge of dependency trees.
We evaluate our model on public benchmarks of GEC task and it achieves competitive results.
arXiv Detail & Related papers (2021-11-05T07:07:48Z)
- A Simple Recipe for Multilingual Grammatical Error Correction [6.262434757334487]
This paper presents a recipe to train state-of-the-art multilingual Grammatical Error Correction (GEC) models.
We first propose a language-agnostic method to generate a large number of synthetic examples.
The second ingredient is to use large-scale multilingual language models.
arXiv Detail & Related papers (2021-06-07T17:47:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.