YACLC: A Chinese Learner Corpus with Multidimensional Annotation
- URL: http://arxiv.org/abs/2112.15043v1
- Date: Thu, 30 Dec 2021 13:07:08 GMT
- Title: YACLC: A Chinese Learner Corpus with Multidimensional Annotation
- Authors: Yingying Wang, Cunliang Kong, Liner Yang, Yijun Wang, Xiaorong Lu,
Renfen Hu, Shan He, Zhenghao Liu, Yun Chen, Erhong Yang, Maosong Sun
- Abstract summary: We construct a large-scale Chinese learner corpus with multidimensional annotation.
Analysis of the original sentences and annotations shows that YACLC is both large and of very high annotation quality.
- Score: 45.304130762057945
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A learner corpus collects language data produced by L2 learners, that is,
second- or foreign-language learners. This resource is of great relevance for second
language acquisition research, foreign-language teaching, and automatic
grammatical error correction. However, there has been little focus on learner corpora
for Chinese as a Foreign Language (CFL) learners. Therefore, we propose to
construct a large-scale Chinese learner corpus with multidimensional annotation. To
construct the corpus, we first obtain a large number of topic-rich texts
generated by CFL learners. Then we design an annotation scheme including a
sentence acceptability score as well as grammatical error and fluency-based
corrections. We build a crowdsourcing platform to perform the annotation
effectively (https://yaclc.wenmind.net). We name the corpus YACLC (Yet Another
Chinese Learner Corpus) and release it as part of the CUGE benchmark
(http://cuge.baai.ac.cn). By analyzing the original sentences and annotations
in the corpus, we found that YACLC has a considerable size and very high
annotation quality. We hope this corpus can further enhance the studies on
Chinese International Education and Chinese automatic grammatical error
correction.
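To make the annotation scheme described in the abstract more concrete, the sketch below shows one possible way a YACLC-style record could be represented in Python. The field names, score range, example sentences, and record layout are hypothetical illustrations of the described dimensions (a sentence acceptability score plus grammatical-error and fluency-based corrections); they are not the released YACLC format.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Correction:
    """One crowdsourced correction of a learner sentence (hypothetical schema)."""
    corrected_text: str   # the rewritten sentence
    correction_type: str  # "grammatical" (minimal edit) or "fluency" (free rewrite)


@dataclass
class LearnerSentence:
    """A learner-produced sentence with multidimensional annotations (hypothetical schema)."""
    original_text: str          # sentence as written by the CFL learner
    acceptability_score: float  # e.g. an averaged annotator rating, here assumed in [0, 1]
    corrections: List[Correction] = field(default_factory=list)


# Toy record with one grammatical and one fluency-based correction (invented example).
example = LearnerSentence(
    original_text="我昨天去了商店买东西了。",
    acceptability_score=0.6,
    corrections=[
        Correction("我昨天去商店买东西了。", "grammatical"),
        Correction("昨天我去商店买了点东西。", "fluency"),
    ],
)

if __name__ == "__main__":
    print(example.original_text, example.acceptability_score)
    for c in example.corrections:
        print(f"[{c.correction_type}] {c.corrected_text}")

Separating minimal grammatical edits from free fluency rewrites, as in this sketch, lets the same sentence serve both strict grammatical error correction and more open-ended fluency-oriented rewriting tasks.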
Related papers
- Decoupled Vocabulary Learning Enables Zero-Shot Translation from Unseen Languages [55.157295899188476]
Neural machine translation systems learn to map sentences of different languages into a common representation space.
In this work, we test this hypothesis by zero-shot translating from unseen languages.
We demonstrate that this setup enables zero-shot translation from entirely unseen languages.
arXiv Detail & Related papers (2024-08-05T07:58:58Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- A Corpus for Sentence-level Subjectivity Detection on English News Articles [49.49218203204942]
We use our guidelines to collect NewsSD-ENG, a corpus of 638 objective and 411 subjective sentences extracted from English news articles on controversial topics.
Our corpus paves the way for subjectivity detection in English without relying on language-specific tools, such as lexicons or machine translation.
arXiv Detail & Related papers (2023-05-29T11:54:50Z)
- Romanization-based Large-scale Adaptation of Multilingual Language Models [124.57923286144515]
Large multilingual pretrained language models (mPLMs) have become the de facto state of the art for cross-lingual transfer in NLP.
We study and compare a plethora of data- and parameter-efficient strategies for adapting the mPLMs to romanized and non-romanized corpora of 14 diverse low-resource languages.
Our results reveal that UROMAN-based transliteration can offer strong performance for many languages, with particular gains achieved in the most challenging setups.
arXiv Detail & Related papers (2023-04-18T09:58:34Z)
- CLSE: Corpus of Linguistically Significant Entities [58.29901964387952]
We release a Corpus of Linguistically Significant Entities (CLSE) annotated by experts.
CLSE covers 74 different semantic types to support various applications from airline ticketing to video games.
We create a linguistically representative NLG evaluation benchmark in three languages: French, Marathi, and Russian.
arXiv Detail & Related papers (2022-11-04T12:56:12Z)
- Multilingual Coreference Resolution with Harmonized Annotations [0.0]
We present coreference resolution experiments with a newly created multilingual corpus CorefUD.
We focus on the following languages: Czech, Russian, Polish, German, Spanish, and Catalan.
We combine the training data in multilingual experiments and train two joint models: one for the Slavic languages and one for all the languages together.
arXiv Detail & Related papers (2021-07-26T10:11:06Z)
- Kosp2e: Korean Speech to English Translation Corpus [11.44330742875498]
We introduce kosp2e, a corpus that allows Korean speech to be translated into English text in an end-to-end manner.
We adopt open-license speech recognition, translation, and spoken-language corpora to make our dataset freely available to the public.
arXiv Detail & Related papers (2021-07-06T20:34:06Z)
- UA-GEC: Grammatical Error Correction and Fluency Corpus for the Ukrainian Language [0.0]
This is the first grammatical error correction corpus for the Ukrainian language.
Professional proofreaders corrected and annotated the corpus for errors relating to fluency, grammar, punctuation, and spelling.
This corpus can be used for developing and evaluating GEC systems in Ukrainian.
arXiv Detail & Related papers (2021-03-31T11:18:36Z)
- CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model [15.469228003507919]
We introduce the Chinese corpus from the CLUE organization, CLUECorpus2020.
It contains 100 GB of raw text with 35 billion Chinese characters, retrieved from Common Crawl.
We release a new Chinese vocabulary with a size of 8K, only one-third of the vocabulary size used in the Chinese BERT released by Google.
arXiv Detail & Related papers (2020-03-03T06:39:27Z)