GECTurk: Grammatical Error Correction and Detection Dataset for Turkish
- URL: http://arxiv.org/abs/2309.11346v1
- Date: Wed, 20 Sep 2023 14:25:44 GMT
- Title: GECTurk: Grammatical Error Correction and Detection Dataset for Turkish
- Authors: Atakan Kara, Farrin Marouf Sofian, Andrew Bond and Gözde Gül Şahin
- Abstract summary: Grammatical Error Detection and Correction (GEC) tools have proven useful for native speakers and second language learners.
Synthetic data generation is a common practice to overcome the scarcity of such data.
We present a flexible and extensible synthetic data generation pipeline for Turkish covering more than 20 expert-curated grammar and spelling rules.
- Score: 1.804922416527064
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Grammatical Error Detection and Correction (GEC) tools have proven useful for
native speakers and second language learners. Developing such tools requires a
large amount of parallel, annotated data, which is unavailable for most
languages. Synthetic data generation is a common practice to overcome the
scarcity of such data. However, it is not straightforward for morphologically
rich languages like Turkish due to complex writing rules that require
phonological, morphological, and syntactic information. In this work, we
present a flexible and extensible synthetic data generation pipeline for
Turkish covering more than 20 expert-curated grammar and spelling rules
(a.k.a., writing rules) implemented through complex transformation functions.
Using this pipeline, we derive 130,000 high-quality parallel sentences from
professionally edited articles. Additionally, we create a more realistic test
set by manually annotating a set of movie reviews. We implement three baselines
formulating the task as i) neural machine translation, ii) sequence tagging,
and iii) prefix tuning with a pretrained decoder-only model, achieving strong
results. Furthermore, we perform exhaustive experiments on out-of-domain
datasets to gain insights on the transferability and robustness of the proposed
approaches. Our results suggest that our corpus, GECTurk, is high-quality and
allows knowledge transfer for the out-of-domain setting. To encourage further
research on Turkish GEC, we release our datasets, baseline models, and the
synthetic data generation pipeline at https://github.com/GGLAB-KU/gecturk.
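The abstract describes each writing rule as a transformation function applied to clean text. As a rough illustration of that design only (not the paper's actual implementation; the rule names and function signatures below are hypothetical), the sketch corrupts clean Turkish sentences with two simplified rules, the separately written conjunction "de/da" and the question particle "mı/mi/mu/mü", yielding (erroneous, clean, rule) triples:

```python
import random
import re
from typing import Callable, List, Optional, Tuple

# A writing rule is modelled as a transformation that injects one
# rule-specific error into a clean sentence, or returns None when the
# rule does not apply. Both rules below are simplified, hypothetical
# stand-ins for the expert-curated rules described in the paper.
Rule = Callable[[str], Optional[str]]

def fuse_de_da(sentence: str) -> Optional[str]:
    # The conjunction "de/da" must be written separately in Turkish;
    # a frequent error is attaching it to the preceding word.
    m = re.search(r"(\w+) (de|da)\b", sentence)
    if m is None:
        return None
    return sentence[:m.start()] + m.group(1) + m.group(2) + sentence[m.end():]

def fuse_question_particle(sentence: str) -> Optional[str]:
    # The question particle "mı/mi/mu/mü" is likewise written separately;
    # corrupt by fusing it onto the previous word.
    m = re.search(r"(\w+) (mı|mi|mu|mü)\b", sentence)
    if m is None:
        return None
    return sentence[:m.start()] + m.group(1) + m.group(2) + sentence[m.end():]

RULES: List[Tuple[str, Rule]] = [
    ("de_da_conjunction", fuse_de_da),
    ("question_particle", fuse_question_particle),
]

def corrupt(sentence: str) -> Optional[Tuple[str, str, str]]:
    """Apply one randomly chosen applicable rule to a clean sentence,
    returning an (erroneous, clean, rule_id) training triple."""
    candidates = [(name, rule(sentence)) for name, rule in RULES]
    candidates = [(name, out) for name, out in candidates if out is not None]
    if not candidates:
        return None
    name, erroneous = random.choice(candidates)
    return erroneous, sentence, name

print(corrupt("Kitabı sen de okudun mu?"))
# e.g. ('Kitabı sende okudun mu?', 'Kitabı sen de okudun mu?', 'de_da_conjunction')
```

Triples of this shape can serve all three baseline formulations: the (erroneous, clean) pairs feed a translation-style model directly, while the rule identifiers can be projected to per-token labels for the sequence-tagging formulation.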
Related papers
- Organic Data-Driven Approach for Turkish Grammatical Error Correction and LLMs [0.0]
We introduce a new organic data-driven approach, clean insertions, to build parallel Turkish Grammatical Error Correction datasets.
We achieve state-of-the-art results on two Turkish Grammatical Error Correction test sets out of the three publicly available ones.
arXiv Detail & Related papers (2024-05-24T08:00:24Z)
- Pipeline and Dataset Generation for Automated Fact-checking in Almost Any Language [0.0]
This article presents a pipeline for automated fact-checking leveraging publicly available Language Models and data.
The pipeline consists of two main modules -- the evidence retrieval and the claim veracity evaluation.
We provide open access to all data and fine-tuned models for Czech, English, Polish, and Slovak pipelines.
arXiv Detail & Related papers (2023-12-15T19:43:41Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a framework for data augmentation based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z)
- Neural Pipeline for Zero-Shot Data-to-Text Generation [3.42658286826597]
We propose to generate text by transforming single-item descriptions with a sequence of modules trained on general-domain text-based operations.
Our experiments on two major triple-to-text datasets -- WebNLG and E2E -- show that our approach enables D2T generation from RDF triples in zero-shot settings.
arXiv Detail & Related papers (2022-03-30T13:14:35Z)
- From Universal Language Model to Downstream Task: Improving RoBERTa-Based Vietnamese Hate Speech Detection [8.602181445598776]
We propose a pipeline to adapt the general-purpose RoBERTa language model to a specific text classification task: Vietnamese Hate Speech Detection.
Our experiments show that the proposed pipeline boosts performance significantly, achieving a new state of the art on the Vietnamese Hate Speech Detection campaign with an F1 score of 0.7221.
arXiv Detail & Related papers (2021-02-24T09:30:55Z)
- GATE: Graph Attention Transformer Encoder for Cross-lingual Relation and Event Extraction [107.8262586956778]
We introduce graph convolutional networks (GCNs) with universal dependency parses to learn language-agnostic sentence representations.
GCNs struggle to model words with long-range dependencies or words that are not directly connected in the dependency tree.
We propose to utilize the self-attention mechanism to learn the dependencies between words with different syntactic distances.
arXiv Detail & Related papers (2020-10-06T20:30:35Z)
- Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG).
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models and verifies the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
- Consecutive Decoding for Speech-to-text Translation [51.155661276936044]
COnSecutive Transcription and Translation (COSTT) is an integral approach for speech-to-text translation.
The key idea is to generate source transcript and target translation text with a single decoder.
Our method is verified on three mainstream datasets.
arXiv Detail & Related papers (2020-09-21T10:10:45Z)
- POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training [93.79766670391618]
We present POINTER, a novel insertion-based approach for hard-constrained text generation.
The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner (a toy sketch of this insertion loop follows the list below).
The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable.
arXiv Detail & Related papers (2020-05-01T18:11:54Z)
- Have Your Text and Use It Too! End-to-End Neural Data-to-Text Generation with Semantic Fidelity [3.8673630752805432]
We present DataTuner, a neural, end-to-end data-to-text generation system.
We take a two-stage generation-reranking approach, combining a fine-tuned language model with a semantic fidelity classifier.
We show that DataTuner achieves state-of-the-art results on the automated metrics across four major D2T datasets.
arXiv Detail & Related papers (2020-04-08T11:16:53Z)
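The POINTER entry above describes generation by repeatedly inserting tokens, in parallel, between the tokens already present. As a toy illustration only, the loop below starts from hard keyword constraints and refines until no slot is filled; the lookup table is a hypothetical stand-in for the pretrained insertion transformer, which scores every slot in parallel:

```python
from typing import Dict, List, Optional, Tuple

# Stand-in for the insertion model: maps a (left, right) neighbour pair
# to a token to insert between them, or nothing.
SlotTable = Dict[Tuple[str, str], str]

def refine(tokens: List[str], table: SlotTable) -> List[str]:
    """One stage: consider every slot (including the sentence edges)
    and insert at most one new token into each."""
    out: List[str] = []
    padded = ["<s>"] + tokens + ["</s>"]
    for left, right in zip(padded, padded[1:]):
        if left != "<s>":
            out.append(left)          # keep existing tokens
        new = table.get((left, right))
        if new is not None:
            out.append(new)           # fill the slot
    return out

# Hypothetical slot predictions for the demo.
TABLE: SlotTable = {
    ("<s>", "weather"): "the",
    ("weather", "nice"): "is",
    ("nice", "</s>"): "today",
}

tokens = ["weather", "nice"]          # hard lexical constraints
while True:
    refined = refine(tokens, TABLE)
    if refined == tokens:             # no slot filled: generation done
        break
    tokens = refined
print(" ".join(tokens))               # -> the weather is nice today
```

The coarse-to-fine behaviour the entry mentions falls out of this loop: early stages place high-level content words, later stages fill in function words around them.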
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.