CC-Riddle: A Question Answering Dataset of Chinese Character Riddles
- URL: http://arxiv.org/abs/2206.13778v2
- Date: Sun, 24 Sep 2023 05:15:51 GMT
- Title: CC-Riddle: A Question Answering Dataset of Chinese Character Riddles
- Authors: Fan Xu and Yunxiang Zhang and Xiaojun Wan
- Abstract summary: The Chinese character riddle is a unique form of cultural entertainment specific to the Chinese language.
We construct a Chinese Character riddle dataset named CC-Riddle.
- Score: 51.41044750575767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Chinese character riddle is a unique form of cultural entertainment
specific to the Chinese language. It typically comprises two parts: the riddle
description and the solution. The solution to the riddle is a single character,
while the riddle description primarily describes the glyph of the solution,
occasionally supplemented with its explanation and pronunciation. Solving
Chinese character riddles is a challenging task that demands understanding of
character glyph, general knowledge, and a grasp of figurative language. In this
paper, we construct a Chinese Character riddle dataset named
CC-Riddle, which covers the majority of common simplified Chinese characters.
The construction process is a combination of web crawling, language model
generation, and manual filtering. In the generation stage, we input the Chinese
phonetic alphabet, glyph and meaning of the solution character into the
generation model, which then produces multiple riddle descriptions. The
generated riddles are then manually filtered and the final CC-Riddle dataset is
composed of both human-written riddles and these filtered, generated riddles.
In order to assess the performance of language models on the task of solving
character riddles, we use retrieval-based, generative and multiple-choice QA
strategies to test three language models: BERT, ChatGPT and ChatGLM. The test
results reveal that current language models still struggle to solve Chinese
character riddles. CC-Riddle is publicly available at
https://github.com/pku0xff/CC-Riddle.
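The multiple-choice QA strategy mentioned in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `RiddleItem` record, the `multiple_choice_accuracy` helper, and the trivial `first_choice_baseline` are hypothetical names, not the dataset's actual schema or the authors' evaluation code. The two riddles are well-known textbook examples, not items taken from CC-Riddle.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class RiddleItem:
    """Hypothetical record layout; the real CC-Riddle schema may differ."""
    riddle: str               # riddle description
    choices: tuple[str, ...]  # candidate solution characters
    answer: str               # gold solution (a single character)


def multiple_choice_accuracy(
    items: Sequence[RiddleItem],
    choose: Callable[[str, Sequence[str]], str],
) -> float:
    """Fraction of items for which `choose` returns the gold answer."""
    if not items:
        raise ValueError("items must not be empty")
    correct = sum(
        1 for item in items if choose(item.riddle, item.choices) == item.answer
    )
    return correct / len(items)


def first_choice_baseline(riddle: str, choices: Sequence[str]) -> str:
    """Placeholder 'model' that always picks the first candidate."""
    return choices[0]


# Illustrative items only (classic riddles, not drawn from the dataset).
items = [
    RiddleItem("一加一", ("王", "二", "十", "土"), "王"),
    RiddleItem("一口咬掉牛尾巴", ("吃", "告", "牛", "哞"), "告"),
]

accuracy = multiple_choice_accuracy(items, first_choice_baseline)
print(f"accuracy = {accuracy:.2f}")
```

In practice `choose` would wrap a call to a language model such as those named in the abstract; the baseline here only shows the scoring loop.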
Related papers
- Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction [73.26364649572237]
Oracle Bone Inscriptions is one of the oldest existing forms of writing in the world.
A large number of Oracle Bone Inscriptions (OBI) remain undeciphered, making it one of the global challenges in paleography today.
This paper introduces a novel approach, namely Puzzle Pieces Picker (P$^3$), to decipher these enigmatic characters through radical reconstruction.
arXiv Detail & Related papers (2024-06-05T07:34:39Z)
- Shuo Wen Jie Zi: Rethinking Dictionaries and Glyphs for Chinese Language Pre-training [50.100992353488174]
We introduce CDBERT, a new learning paradigm that enhances the semantic understanding ability of Chinese PLMs with dictionary knowledge and the structure of Chinese characters.
We name the two core modules of CDBERT as Shuowen and Jiezi, where Shuowen refers to the process of retrieving the most appropriate meaning from Chinese dictionaries.
Our paradigm demonstrates consistent improvements on previous Chinese PLMs across all tasks.
arXiv Detail & Related papers (2023-05-30T05:48:36Z)
- Down and Across: Introducing Crossword-Solving as a New NLP Benchmark [11.194615436370507]
We release the specification of a corpus of crossword puzzles collected from the New York Times daily crossword spanning 25 years.
These puzzles include a diverse set of clues: historic, factual, word meaning, synonyms/antonyms, fill-in-the-blank, abbreviations, prefixes/suffixes, wordplay, and cross-lingual.
arXiv Detail & Related papers (2022-05-20T21:16:44Z)
- "Is Whole Word Masking Always Better for Chinese BERT?": Probing on Chinese Grammatical Error Correction [58.40808660657153]
We investigate whether whole word masking (WWM) leads to better context understanding ability for Chinese BERT.
We construct a dataset including labels for 19,075 tokens in 10,448 sentences.
We train three Chinese BERT models with standard character-level masking (CLM), WWM, and a combination of CLM and WWM, respectively.
arXiv Detail & Related papers (2022-03-01T08:24:56Z)
- BiRdQA: A Bilingual Dataset for Question Answering on Tricky Riddles [82.63394952538292]
We introduce BiRdQA, a bilingual multiple-choice question answering dataset with 6614 English riddles and 8751 Chinese riddles.
Existing monolingual and multilingual QA models fail to perform well on our dataset, indicating that there is a long way to go before machines can beat humans at solving tricky riddles.
arXiv Detail & Related papers (2021-09-23T00:46:47Z)
- CCPM: A Chinese Classical Poetry Matching Dataset [50.90794811956129]
We propose a novel task to assess a model's semantic understanding of poetry by poem matching.
This task requires the model to select one line of Chinese classical poetry among four candidates according to the modern Chinese translation of a line of poetry.
To construct this dataset, we first obtain a set of parallel data of Chinese classical poetry and modern Chinese translation.
arXiv Detail & Related papers (2021-06-03T16:49:03Z)
- Read, Listen, and See: Leveraging Multimodal Information Helps Chinese Spell Checking [20.74049189959078]
We propose a Chinese spell checker called ReaLiSe, by directly leveraging the multimodal information of the Chinese characters.
The ReaLiSe model tackles the Chinese spell checking (CSC) task by (1) capturing the semantic, phonetic and graphic information of the input characters, and (2) mixing the information in these modalities to predict the correct output.
Experiments on the SIGHAN benchmarks show that the proposed model outperforms strong baselines by a large margin.
arXiv Detail & Related papers (2021-05-26T02:38:11Z)
- RiddleSense: Answering Riddle Questions as Commonsense Reasoning [35.574564653690594]
RiddleSense is a novel multiple-choice question answering challenge for benchmarking higher-order commonsense reasoning models.
RiddleSense is the first large dataset for riddle-style commonsense question answering, where the distractors are crowdsourced from human annotators.
We systematically evaluate a wide range of reasoning models over it and point out that there is a large gap between the best-supervised model and human performance.
arXiv Detail & Related papers (2021-01-02T05:28:15Z)
- CalliGAN: Style and Structure-aware Chinese Calligraphy Character Generator [6.440233787863018]
Chinese calligraphy is the writing of Chinese characters as an art form performed with brushes.
Recent studies show that Chinese characters can be generated through image-to-image translation for multiple styles using a single model.
We propose a novel method that extends this approach by incorporating Chinese characters' component information into the model.
arXiv Detail & Related papers (2020-05-26T03:15:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.