Native Chinese Reader: A Dataset Towards Native-Level Chinese Machine
Reading Comprehension
- URL: http://arxiv.org/abs/2112.06494v2
- Date: Tue, 14 Dec 2021 04:25:40 GMT
- Title: Native Chinese Reader: A Dataset Towards Native-Level Chinese Machine
Reading Comprehension
- Authors: Shusheng Xu, Yichen Liu, Xiaoyu Yi, Siyuan Zhou, Huizi Li and Yi Wu
- Abstract summary: Native Chinese Reader is a new machine reading comprehension dataset with particularly long articles in both modern and classical Chinese.
NCR is collected from the exam questions for the Chinese course in China's high schools, which are designed to evaluate the language proficiency of native Chinese youth.
- Score: 9.66226932673554
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present Native Chinese Reader (NCR), a new machine reading comprehension
(MRC) dataset with particularly long articles in both modern and classical
Chinese. NCR is collected from the exam questions for the Chinese course in
China's high schools, which are designed to evaluate the language proficiency
of native Chinese youth. Existing Chinese MRC datasets are either
domain-specific or focus on short contexts of a few hundred characters
in modern Chinese only. By contrast, NCR contains 8390 documents with an
average length of 1024 characters covering a wide range of Chinese writing
styles, including modern articles, classical literature and classical poetry. A
total of 20477 questions on these documents also require strong reasoning
abilities and common sense to figure out the correct answers. We implemented
multiple baseline models using popular Chinese pre-trained models and
additionally launched an online competition using our dataset to examine the
limit of current methods. The best model achieves 59% test accuracy while human
evaluation shows an average accuracy of 79%, which indicates a significant
performance gap between current MRC models and native Chinese speakers. We
release the dataset at https://sites.google.com/view/native-chinese-reader/.
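Since NCR poses multiple-choice questions over documents, evaluation reduces to per-question accuracy, the metric behind the 59% vs. 79% gap reported above. A minimal sketch, assuming a hypothetical record layout (the released dataset's actual field names may differ):

```python
# Hypothetical NCR-style record: one document with its multiple-choice
# questions. Field names here are illustrative, not the official schema.
sample = {
    "content": "...",  # article text, ~1024 characters on average
    "questions": [
        {"question": "...", "choices": ["A", "B", "C", "D"], "answer": 1},
    ],
}

def accuracy(predictions, records):
    """Fraction of questions answered correctly across all documents.

    `predictions` holds one list of chosen-option indices per document,
    aligned with the `questions` list of the matching record.
    """
    correct = total = 0
    for preds, rec in zip(predictions, records):
        for pred, q in zip(preds, rec["questions"]):
            correct += int(pred == q["answer"])
            total += 1
    return correct / total if total else 0.0
```

Human evaluation in the paper uses the same question-level metric, which makes the model/human comparison direct.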
Related papers
- When Does Classical Chinese Help? Quantifying Cross-Lingual Transfer in Hanja and Kanbun [48.07219104902607]
We question the assumption of cross-lingual transferability from Classical Chinese to Hanja and Kanbun.
Our experiments show minimal impact of Classical Chinese datasets on language model performance for ancient Korean documents written in Hanja.
arXiv Detail & Related papers (2024-11-07T15:59:54Z)
- CVLUE: A New Benchmark Dataset for Chinese Vision-Language Understanding Evaluation [49.41531871253317]
We present a new Chinese Vision-Language Understanding Evaluation benchmark dataset.
The selection of object categories and images is entirely driven by Chinese native speakers.
We find that fine-tuning on Chinese culture-related VL datasets effectively enhances VLMs' understanding of Chinese culture.
arXiv Detail & Related papers (2024-07-01T08:35:37Z)
- Skywork: A More Open Bilingual Foundation Model [55.927396986873816]
We present Skywork-13B, a family of large language models (LLMs) trained on a corpus of over 3.2 trillion tokens drawn from both English and Chinese texts.
We show that our model not only excels on popular benchmarks, but also achieves state-of-the-art performance in Chinese language modeling on diverse domains.
arXiv Detail & Related papers (2023-10-30T08:31:47Z)
- Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning [61.34060587461462]
We propose a two-stage framework for Chinese Text Recognition (CTR).
We pre-train a CLIP-like model by aligning printed character images with Ideographic Description Sequences (IDS).
This pre-training stage simulates how humans recognize Chinese characters and yields a canonical representation of each character.
The learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition.
arXiv Detail & Related papers (2023-09-03T05:33:16Z)
- Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese [55.95225353842118]
We construct a large-scale dataset of image-text pairs in Chinese, where most data are retrieved from publicly available datasets.
We develop 5 Chinese CLIP models of multiple sizes, spanning from 77 to 958 million parameters.
Our experiments demonstrate that Chinese CLIP achieves state-of-the-art performance on MUGE, Flickr30K-CN, and COCO-CN.
arXiv Detail & Related papers (2022-11-02T17:47:23Z)
- CINO: A Chinese Minority Pre-trained Language Model [30.447739293695026]
We propose CINO (Chinese Minority Pre-trained Language Model), a multilingual pre-trained language model for Chinese minority languages.
It covers Standard Chinese, Cantonese, and six other Chinese minority languages.
arXiv Detail & Related papers (2022-02-28T06:02:06Z)
- ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information [32.70080326854314]
We propose ChineseBERT, which incorporates the glyph and pinyin information of Chinese characters into language model pretraining.
The proposed ChineseBERT model yields a significant performance boost over baseline models with fewer training steps.
arXiv Detail & Related papers (2021-06-30T13:06:00Z)
- Investigating Transfer Learning in Multilingual Pre-trained Language Models through Chinese Natural Language Inference [11.096793445651313]
We investigate the cross-lingual transfer abilities of XLM-R for Chinese and English natural language inference (NLI).
To better understand linguistic transfer, we created 4 categories of challenge and adversarial tasks for Chinese.
We find that cross-lingual models trained on English NLI do transfer well across our Chinese tasks.
arXiv Detail & Related papers (2021-06-07T22:00:18Z)
- Hippocampus-heuristic Character Recognition Network for Zero-shot Learning [3.720802292070508]
This paper proposes a novel Hippocampus-heuristic Character Recognition Network (HCRN).
HCRN can recognize unseen Chinese characters (i.e., zero-shot learning) after training on only a subset of radicals.
It accurately predicts about 16,330 unseen Chinese characters while relying on only 500 trained Chinese characters.
arXiv Detail & Related papers (2021-04-06T01:57:20Z)
- A Sentence Cloze Dataset for Chinese Machine Reading Comprehension [64.07894249743767]
We propose a new task called Sentence Cloze-style Machine Reading (SC-MRC).
The proposed task aims to fill the right candidate sentences into a passage that contains several blanks.
We built a Chinese dataset called CMRC 2019 to evaluate the difficulty of the SC-MRC task.
arXiv Detail & Related papers (2020-04-07T04:09:00Z)
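The sentence-cloze task above can be illustrated with a toy example. The blank markup, candidate list, and scoring function below are hypothetical illustrations, not the actual CMRC 2019 schema:

```python
def blank_accuracy(prediction, gold):
    """Fraction of blanks filled with the gold candidate sentence.

    Both arguments are lists of candidate indices, one per blank.
    """
    assert len(prediction) == len(gold)
    return sum(int(p == g) for p, g in zip(prediction, gold)) / len(gold)

# A passage with two blanks and three candidate sentences (one distractor).
passage = "[BLANK1] 考试很难。[BLANK2]"
candidates = ["她通宵复习。", "她还是通过了。", "下雨了。"]
gold = [0, 1]        # gold candidate index for each blank
prediction = [0, 2]  # a model picks the distractor for the second blank
```

Scoring a model then amounts to comparing its per-blank choices against the gold assignment, e.g. `blank_accuracy(prediction, gold)` for the example above.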
This list is automatically generated from the titles and abstracts of the papers in this site.