A Sentence Cloze Dataset for Chinese Machine Reading Comprehension
- URL: http://arxiv.org/abs/2004.03116v2
- Date: Mon, 2 Nov 2020 06:41:34 GMT
- Title: A Sentence Cloze Dataset for Chinese Machine Reading Comprehension
- Authors: Yiming Cui, Ting Liu, Ziqing Yang, Zhipeng Chen, Wentao Ma, Wanxiang
Che, Shijin Wang, Guoping Hu
- Abstract summary: We propose a new task called Sentence Cloze-style Machine Reading (SC-MRC)
The proposed task aims to fill the right candidate sentence into the passage that has several blanks.
We built a Chinese dataset called CMRC 2019 to evaluate the difficulty of the SC-MRC task.
- Score: 64.07894249743767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Owing to the continuous efforts by the Chinese NLP community, more and more
Chinese machine reading comprehension datasets become available. To add
diversity in this area, in this paper, we propose a new task called Sentence
Cloze-style Machine Reading Comprehension (SC-MRC). The proposed task aims to
fill the right candidate sentence into the passage that has several blanks. We
built a Chinese dataset called CMRC 2019 to evaluate the difficulty of the
SC-MRC task. Moreover, to add more difficulties, we also made fake candidates
that are similar to the correct ones, which requires the machine to judge their
correctness in the context. The proposed dataset contains over 100K blanks
(questions) within over 10K passages, which was originated from Chinese
narrative stories. To evaluate the dataset, we implement several baseline
systems based on the pre-trained models, and the results show that the
state-of-the-art model still underperforms human performance by a large margin.
We release the dataset and baseline system to further facilitate our community.
Resources available through https://github.com/ymcui/cmrc2019
Related papers
- MahaSQuAD: Bridging Linguistic Divides in Marathi Question-Answering [0.4194295877935868]
This research endeavors to bridge the gap of the absence of efficient QnA datasets in low-resource languages.
We introduce MahaSQuAD, the first-ever full SQuAD dataset for the Indic language Marathi, consisting of 118,516 training, 11,873 validation, and 11,803 test samples.
arXiv Detail & Related papers (2024-04-20T12:16:35Z) - Cross-lingual Contextualized Phrase Retrieval [63.80154430930898]
We propose a new task formulation of dense retrieval, cross-lingual contextualized phrase retrieval.
We train our Cross-lingual Contextualized Phrase Retriever (CCPR) using contrastive learning.
On the phrase retrieval task, CCPR surpasses baselines by a significant margin, achieving a top-1 accuracy that is at least 13 points higher.
arXiv Detail & Related papers (2024-03-25T14:46:51Z) - A Multiple Choices Reading Comprehension Corpus for Vietnamese Language
Education [2.5199066832791535]
ViMMRC 2.0 is an extension of the previous ViMMRC for the task of multiple-choice reading comprehension in Vietnamese Textbooks.
This dataset has 699 reading passages which are prose and poems, and 5,273 questions.
Our multi-stage models achieved 58.81% by Accuracy on the test set, which is 5.34% better than the highest BERTology models.
arXiv Detail & Related papers (2023-03-31T15:54:54Z) - Ensemble Transfer Learning for Multilingual Coreference Resolution [60.409789753164944]
A problem that frequently occurs when working with a non-English language is the scarcity of annotated training data.
We design a simple but effective ensemble-based framework that combines various transfer learning techniques.
We also propose a low-cost TL method that bootstraps coreference resolution models by utilizing Wikipedia anchor texts.
arXiv Detail & Related papers (2023-01-22T18:22:55Z) - Understanding Translationese in Cross-Lingual Summarization [106.69566000567598]
Cross-lingual summarization (MS) aims at generating a concise summary in a different target language.
To collect large-scale CLS data, existing datasets typically involve translation in their creation.
In this paper, we first confirm that different approaches of constructing CLS datasets will lead to different degrees of translationese.
arXiv Detail & Related papers (2022-12-14T13:41:49Z) - Wukong: 100 Million Large-scale Chinese Cross-modal Pre-training Dataset
and A Foundation Framework [99.38817546900405]
This paper presents a large-scale Chinese cross-modal dataset for benchmarking different multi-modal pre-training methods.
We release a Large-Scale Chinese Cross-modal dataset named Wukong, containing 100 million Chinese image-text pairs from the web.
arXiv Detail & Related papers (2022-02-14T14:37:15Z) - Sentence Extraction-Based Machine Reading Comprehension for Vietnamese [0.2446672595462589]
We introduce the UIT-ViWikiQA, the first dataset for evaluating sentence extraction-based machine reading comprehension in Vietnamese language.
The dataset consists of comprises 23.074 question-answers based on 5.109 passages of 174 Vietnamese articles from Wikipedia.
Our experiments show that the best machine model is XLM-R$_Large, which achieves an exact match (EM) score of 85.97% and an F1-score of 88.77% on our dataset.
arXiv Detail & Related papers (2021-05-19T10:22:27Z) - Blow the Dog Whistle: A Chinese Dataset for Cant Understanding with
Common Sense and World Knowledge [49.288196234823005]
Cant is important for understanding advertising, comedies and dog-whistle politics.
We propose a large and diverse Chinese dataset for creating and understanding cant.
arXiv Detail & Related papers (2021-04-06T17:55:43Z) - A Vietnamese Dataset for Evaluating Machine Reading Comprehension [2.7528170226206443]
We present UIT-ViQuAD, a new dataset for the low-resource language as Vietnamese to evaluate machine reading comprehension models.
This dataset comprises over 23,000 human-generated question-answer pairs based on 5,109 passages of 174 Vietnamese articles from Wikipedia.
We conduct experiments on state-of-the-art MRC methods for English and Chinese as the first experimental models on UIT-ViQuAD.
arXiv Detail & Related papers (2020-09-30T15:06:56Z) - Enhancing lexical-based approach with external knowledge for Vietnamese
multiple-choice machine reading comprehension [2.5199066832791535]
We construct a dataset which consists of 2,783 pairs of multiple-choice questions and answers based on 417 Vietnamese texts.
We propose a lexical-based MRC method that utilizes semantic similarity measures and external knowledge sources to analyze questions and extract answers from the given text.
Our proposed method achieves 61.81% by accuracy, which is 5.51% higher than the best baseline model.
arXiv Detail & Related papers (2020-01-16T08:09:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.