NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from
Native Speaker Texts
- URL: http://arxiv.org/abs/2305.16023v1
- Date: Thu, 25 May 2023 13:05:52 GMT
- Title: NaSGEC: a Multi-Domain Chinese Grammatical Error Correction Dataset from
Native Speaker Texts
- Authors: Yue Zhang, Bo Zhang, Haochen Jiang, Zhenghua Li, Chen Li, Fei Huang,
Min Zhang
- Abstract summary: We introduce NaSGEC, a new dataset to facilitate research on Chinese grammatical error correction (CGEC) for native speaker texts from multiple domains.
To broaden the target domain, we annotate multiple references for 12,500 sentences from three native domains, i.e., social media, scientific writing, and examination.
We provide solid benchmark results for NaSGEC by employing cutting-edge CGEC models and different training data.
- Score: 51.64770549988806
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce NaSGEC, a new dataset to facilitate research on Chinese
grammatical error correction (CGEC) for native speaker texts from multiple
domains. Previous CGEC research primarily focuses on correcting texts from a
single domain, especially learner essays. To broaden the target domain, we
annotate multiple references for 12,500 sentences from three native domains,
i.e., social media, scientific writing, and examination. We provide solid
benchmark results for NaSGEC by employing cutting-edge CGEC models and
different training data. We further perform detailed analyses of the
connections and gaps between our domains from both empirical and statistical
views. We hope this work can inspire future studies on an important but
under-explored direction: cross-domain GEC.
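The benchmark results referred to above are conventionally reported as precision, recall, and F0.5 over extracted edits. Below is a minimal sketch of that scoring step, assuming per-sentence edit sets have already been extracted by an alignment tool (for Chinese GEC, ChERRANT is a common choice); the tuple edit representation and the toy example are illustrative assumptions, not taken from the paper.

    # Minimal sketch of span-level GEC scoring. Gold and system edits are
    # assumed to be per-sentence sets of (start, end, replacement) tuples;
    # producing those edits (e.g., with ChERRANT) is out of scope here.

    def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
        """F_beta from true positives, false positives, false negatives."""
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        if p + r == 0.0:
            return 0.0
        b2 = beta * beta
        return (1 + b2) * p * r / (b2 * p + r)

    def score_corpus(gold_edits, hyp_edits):
        """gold_edits, hyp_edits: parallel lists of per-sentence edit sets."""
        tp = fp = fn = 0
        for gold, hyp in zip(gold_edits, hyp_edits):
            tp += len(gold & hyp)  # edits the system got right
            fp += len(hyp - gold)  # spurious system edits
            fn += len(gold - hyp)  # gold edits the system missed
        return f_beta(tp, fp, fn)

    # Toy usage: gold has two edits, the system recovers one of them.
    gold = [{(0, 1, "他"), (3, 4, "的")}]
    hyp = [{(0, 1, "他"), (5, 6, "了")}]
    print(round(score_corpus(gold, hyp), 3))  # P = R = 0.5, so F0.5 = 0.5

With multiple references per sentence, as NaSGEC provides, scorers conventionally compare each system output against its best-matching reference before aggregating.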
Related papers
- Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning [55.107329995417786]
Large language models (LLMs) have demonstrated impressive general understanding and generation abilities.
We establish a benchmark for multi-domain translation, featuring 25 German$\Leftrightarrow$English and 22 Chinese$\Leftrightarrow$English test sets.
We propose a domain Chain of Thought (CoT) fine-tuning technique that utilizes the intrinsic multi-domain intelligence of LLMs to improve translation performance.
arXiv Detail & Related papers (2024-10-03T16:15:04Z)
- A Unified Data Augmentation Framework for Low-Resource Multi-Domain Dialogue Generation [52.0964459842176]
Current state-of-the-art dialogue systems heavily rely on extensive training datasets.
We propose a novel data Augmentation framework for Multi-Domain Dialogue Generation, referred to as AMD$^2$G.
The AMD$^2$G framework consists of a data augmentation process and a two-stage training approach: domain-agnostic training and domain adaptation training.
arXiv Detail & Related papers (2024-06-14T09:52:27Z)
- Improving Retrieval Augmented Neural Machine Translation by Controlling
Source and Fuzzy-Match Interactions [15.845071122977158]
We build on the idea of Retrieval Augmented Translation (RAT), where the top-k in-domain fuzzy matches are found for the source sentence.
We propose a novel architecture to control interactions between a source sentence and the top-k fuzzy target-language matches.
arXiv Detail & Related papers (2022-10-10T23:33:15Z)
- Multi-CPR: A Multi Domain Chinese Dataset for Passage Retrieval [19.000263567641817]
We present a novel multi-domain Chinese dataset for passage retrieval (Multi-CPR).
The dataset is collected from three different domains: E-commerce, Entertainment video, and Medical.
We find that the performance of retrieval models trained on general-domain data inevitably decreases on specific domains.
arXiv Detail & Related papers (2022-03-07T13:20:46Z)
- Discover, Hallucinate, and Adapt: Open Compound Domain Adaptation for
Semantic Segmentation [91.30558794056056]
Unsupervised domain adaptation (UDA) for semantic segmentation has been attracting attention recently.
We present a novel framework based on three main design principles: discover, hallucinate, and adapt.
We evaluate our solution on the standard GTA to C-driving benchmark and achieve new state-of-the-art results.
arXiv Detail & Related papers (2021-10-08T13:20:09Z)
- Can BERT Dig It? -- Named Entity Recognition for Information Retrieval
in the Archaeology Domain [3.928604516640069]
ArcheoBERTje is a BERT model pre-trained on Dutch archaeological texts.
We analyse the differences between the vocabulary and output of the BERT models on the full collection.
arXiv Detail & Related papers (2021-06-14T20:26:19Z)
- Few-Shot Domain Adaptation for Grammatical Error Correction via
Meta-Learning [7.63233690743613]
Sequence-to-sequence Grammatical Error Correction (GEC) methods mainly focus on generating more pseudo data to obtain better performance.
We treat different GEC domains as different GEC tasks and propose to extend meta-learning to few-shot GEC domain adaptation without using any pseudo data.
arXiv Detail & Related papers (2021-01-29T05:28:55Z)
- FDMT: A Benchmark Dataset for Fine-grained Domain Adaptation in Machine
Translation [53.87731008029645]
We present a real-world fine-grained domain adaptation task in machine translation (FDMT).
The FDMT dataset consists of four sub-domains of information technology: autonomous vehicles, AI education, real-time networks, and smartphones.
We conduct quantitative experiments and deep analyses in this new setting, benchmarking the fine-grained domain adaptation task.
arXiv Detail & Related papers (2020-12-31T17:15:09Z)
- Curriculum CycleGAN for Textual Sentiment Domain Adaptation with
Multiple Sources [68.31273535702256]
We propose a novel instance-level MDA framework, named curriculum cycle-consistent generative adversarial network (C-CycleGAN).
C-CycleGAN consists of three components: (1) a pre-trained text encoder, which encodes textual input from different domains into a continuous representation space; (2) an intermediate domain generator with curriculum instance-level adaptation, which bridges the gap between source and target domains; and (3) a task classifier trained on the intermediate domain for final sentiment classification.
We conduct extensive experiments on three benchmark datasets and achieve substantial gains over state-of-the-art DA approaches.
arXiv Detail & Related papers (2020-11-17T14:50:55Z)
- Grammatical Error Correction in Low Error Density Domains: A New
Benchmark and Analyses [17.57265480823457]
We release CWEB, a new benchmark for GEC consisting of website text generated by English speakers of varying levels of proficiency.
Website data is a common and important domain that contains far fewer grammatical errors than learner essays.
We demonstrate that one factor behind systems' weaker performance on such data is their inability to rely on a strong internal language model in low error density domains; a toy illustration of error density follows this list.
arXiv Detail & Related papers (2020-10-15T07:52:01Z)
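To make the "error density" notion in the last entry concrete, here is a toy sketch that measures the fraction of source tokens changed by the gold correction; the difflib-based alignment, the error_density helper, and the example sentences are hypothetical illustrations, not taken from the CWEB paper.

    import difflib

    def error_density(pairs):
        """pairs: list of (source_tokens, corrected_tokens) sentence pairs."""
        changed = total = 0
        for src, cor in pairs:
            total += len(src)
            matcher = difflib.SequenceMatcher(a=src, b=cor)
            for op, i1, i2, j1, j2 in matcher.get_opcodes():
                if op != "equal":  # replace, delete, or insert
                    changed += max(i2 - i1, j2 - j1)
        return changed / total if total else 0.0

    # Learner-style text tends to score high; clean web text scores low,
    # which is what makes low error density domains hard for GEC systems.
    learner = [("He go to school yesterday .".split(),
                "He went to school yesterday .".split())]
    web = [("The site was updated last week .".split(),
            "The site was updated last week .".split())]
    print(round(error_density(learner), 3))  # 0.167 (1 of 6 tokens changed)
    print(round(error_density(web), 3))      # 0.0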
This list is automatically generated from the titles and abstracts of the papers in this site.