CORECODE: A Common Sense Annotated Dialogue Dataset with Benchmark Tasks
for Chinese Large Language Models
- URL: http://arxiv.org/abs/2312.12853v1
- Date: Wed, 20 Dec 2023 09:06:18 GMT
- Title: CORECODE: A Common Sense Annotated Dialogue Dataset with Benchmark Tasks
for Chinese Large Language Models
- Authors: Dan Shi, Chaobin You, Jiantao Huang, Taihao Li, Deyi Xiong
- Abstract summary: CORECODE is a dataset that contains abundant commonsense knowledge manually annotated on dyadic dialogues.
We categorize commonsense knowledge in everyday conversations into three dimensions: entity, event, and social interaction.
We collect 76,787 commonsense knowledge annotations from 19,700 dialogues through crowdsourcing.
- Score: 42.5532503036805
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As an indispensable ingredient of intelligence, commonsense reasoning is
crucial for large language models (LLMs) in real-world scenarios. In this
paper, we propose CORECODE, a dataset that contains abundant commonsense
knowledge manually annotated on dyadic dialogues, to evaluate the commonsense
reasoning and commonsense conflict detection capabilities of Chinese LLMs. We
categorize commonsense knowledge in everyday conversations into three
dimensions: entity, event, and social interaction. For easy and consistent
annotation, we standardize the form of commonsense knowledge annotation in
open-domain dialogues as "domain: slot = value". A total of 9 domains and 37
slots are defined to capture diverse commonsense knowledge. With these
pre-defined domains and slots, we collect 76,787 commonsense knowledge
annotations from 19,700 dialogues through crowdsourcing. To evaluate and
enhance the commonsense reasoning capability for LLMs on the curated dataset,
we establish a series of dialogue-level reasoning and detection tasks,
including commonsense knowledge filling, commonsense knowledge generation,
commonsense conflict phrase detection, domain identification, slot
identification, and event causal inference. A wide variety of existing
open-source Chinese LLMs are evaluated with these tasks on our dataset.
Experimental results demonstrate that these models are not competent to predict
CORECODE's plentiful reasoning content, and even ChatGPT could only achieve
0.275 and 0.084 accuracy on the domain identification and slot identification
tasks under the zero-shot setting. We release the data and code of CORECODE at
https://github.com/danshi777/CORECODE to promote commonsense reasoning
evaluation and study of LLMs in the context of daily conversations.
Related papers
- What Really is Commonsense Knowledge? [58.5342212738895]
We survey existing definitions of commonsense knowledge, ground them in the three frameworks for defining concepts, and consolidate them into a unified definition of commonsense knowledge.
We then use the consolidated definition for annotations and experiments on the CommonsenseQA and CommonsenseQA 2.0 datasets.
Our study shows that there exists a large portion of non-commonsense-knowledge instances in the two datasets, and a large performance gap on these two subsets.
(2024-11-06T14:54:19Z)
- SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge [60.76719375410635]
We propose a new benchmark (SOK-Bench) consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos.
The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving.
We generate associated question-answer pairs and reasoning processes, finally followed by manual reviews for quality assurance.
(2024-05-15T21:55:31Z)
- CGoDial: A Large-Scale Benchmark for Chinese Goal-oriented Dialog Evaluation [75.60156479374416]
CGoDial is a new challenging and comprehensive Chinese benchmark for Goal-oriented Dialog evaluation.
It contains 96,763 dialog sessions and 574,949 dialog turns in total, covering three datasets with different knowledge sources.
To bridge the gap between academic benchmarks and spoken dialog scenarios, we either collect data from real conversations or add spoken features to existing datasets via crowd-sourcing.
(2022-11-21T16:21:41Z)
- ComFact: A Benchmark for Linking Contextual Commonsense Knowledge [31.19689856957576]
We propose the new task of commonsense fact linking, where models are given contexts and trained to identify situationally-relevant commonsense knowledge from KGs.
Our novel benchmark, ComFact, contains 293k in-context relevance annotations for commonsense across four stylistically diverse datasets.
(2022-10-23T09:30:39Z)
- Commonsense and Named Entity Aware Knowledge Grounded Dialogue Generation [20.283091595536835]
We present a novel open-domain dialogue generation model that effectively utilizes large-scale commonsense and named-entity-based knowledge.
Our proposed model utilizes a multi-hop attention layer to preserve the most accurate and critical parts of the dialogue history and the associated knowledge.
Empirical results on two benchmark datasets demonstrate that our model significantly outperforms the state-of-the-art methods in terms of both automatic evaluation metrics and human judgment.
(2022-05-27T12:11:40Z)
- Multi-Sentence Knowledge Selection in Open-Domain Dialogue [11.936691632841388]
We evaluate the existing state of open-domain conversation knowledge selection.
We create an augmented dataset based on the Wizard of Wikipedia (WOW) corpus.
WOW++ averages 8 relevant knowledge sentences per dialogue context.
(2022-03-01T22:07:05Z)
- Dimensions of Commonsense Knowledge [60.49243784752026]
We survey a wide range of popular commonsense sources with a special focus on their relations.
We consolidate these relations into 13 knowledge dimensions, each abstracting over more specific relations found in sources.
(2021-01-12T17:52:39Z)
- Inferential Text Generation with Multiple Knowledge Sources and Meta-Learning [117.23425857240679]
We study the problem of generating inferential texts of events for a variety of commonsense relations such as if-else.
Existing approaches typically use limited evidence from training examples and learn for each relation individually.
In this work, we use multiple knowledge sources as fuels for the model.
(2020-04-07T01:49:18Z)