SODA: Million-scale Dialogue Distillation with Social Commonsense
Contextualization
- URL: http://arxiv.org/abs/2212.10465v3
- Date: Mon, 23 Oct 2023 18:46:21 GMT
- Title: SODA: Million-scale Dialogue Distillation with Social Commonsense
Contextualization
- Authors: Hyunwoo Kim, Jack Hessel, Liwei Jiang, Peter West, Ximing Lu, Youngjae
Yu, Pei Zhou, Ronan Le Bras, Malihe Alikhani, Gunhee Kim, Maarten Sap, Yejin
Choi
- Abstract summary: We present SODA, the first publicly available, million-scale high-quality social dialogue dataset.
By contextualizing social commonsense knowledge from a knowledge graph, we are able to distill an exceptionally broad spectrum of social interactions.
Human evaluation shows that conversations in SODA are more consistent, specific, and (surprisingly) natural than those in prior human-authored datasets.
- Score: 129.1927527781751
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Data scarcity has been a long standing issue in the field of open-domain
social dialogue. To quench this thirst, we present SODA: the first publicly
available, million-scale high-quality social dialogue dataset. By
contextualizing social commonsense knowledge from a knowledge graph, we are
able to distill an exceptionally broad spectrum of social interactions from a
large language model. Human evaluation shows that conversations in SODA are
more consistent, specific, and (surprisingly) natural than those in prior
human-authored datasets.
Using SODA, we train COSMO: a generalizable conversation model that is
significantly more natural and consistent on unseen datasets than
best-performing conversation models (e.g., GODEL, BlenderBot-1, Koala, Vicuna).
Experiments reveal COSMO is sometimes even preferred to the original
human-written gold responses. Additionally, our results shed light on the
distinction between knowledge-enriched conversations and natural social
chitchats. We plan to make our data, model, and code public.
Related papers
- Topical-Chat: Towards Knowledge-Grounded Open-Domain Conversations [8.03111197961603]
Building socialbots that can have deep, engaging open-domain conversations with humans is one of the grand challenges of artificial intelligence (AI)
We introduce Topical-Chat, a knowledge-grounded human-human conversation dataset where the underlying knowledge spans 8 broad topics and conversation partners don't have explicitly defined roles.
We also train several state-of-the-art encoder-decoder conversational models on Topical-Chat and perform automated and human evaluation for benchmarking.
arXiv Detail & Related papers (2023-08-23T08:33:14Z) - Does Collaborative Human-LM Dialogue Generation Help Information
Extraction from Human Dialogues? [55.28340832822234]
Problem-solving human dialogues in real applications can be much more complex than existing Wizard-of-Oz collections.
We introduce a human-in-the-loop dialogue generation framework capable of synthesizing realistic dialogues.
arXiv Detail & Related papers (2023-07-13T20:02:50Z) - SocialDial: A Benchmark for Socially-Aware Dialogue Systems [45.3266270265532]
We present the first socially-aware dialogue corpus - SocialDial, based on Chinese social culture.
SocialDial consists of two parts: 1,563 multi-turn dialogues between two human speakers with fine-grained labels, and 4,870 synthetic conversations generated by ChatGPT.
The human corpus covers five categories of social norms, which have 14 sub-categories in total.
arXiv Detail & Related papers (2023-04-24T11:55:22Z) - PLACES: Prompting Language Models for Social Conversation Synthesis [103.94325597273316]
We use a small set of expert-written conversations as in-context examples to synthesize a social conversation dataset using prompting.
We perform several thorough evaluations of our synthetic conversations compared to human-collected conversations.
arXiv Detail & Related papers (2023-02-07T05:48:16Z) - Knowledge-Grounded Conversational Data Augmentation with Generative
Conversational Networks [76.11480953550013]
We take a step towards automatically generating conversational data using Generative Conversational Networks.
We evaluate our approach on conversations with and without knowledge on the Topical Chat dataset.
arXiv Detail & Related papers (2022-07-22T22:37:14Z) - AFEC: A Knowledge Graph Capturing Social Intelligence in Casual
Conversations [7.390960543869484]
This paper introduces AFEC, an automatically curated knowledge graph based on people's day-to-day casual conversations.
For this body of knowledge to be comprehensive and meaningful, we curated a large-scale corpus from the r/CasualConversation SubReddit.
arXiv Detail & Related papers (2022-05-22T15:19:12Z) - Pchatbot: A Large-Scale Dataset for Personalized Chatbot [49.16746174238548]
We introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively.
To adapt the raw dataset to dialogue systems, we elaborately normalize the raw dataset via processes such as anonymization.
The scale of Pchatbot is significantly larger than existing Chinese datasets, which might benefit the data-driven models.
arXiv Detail & Related papers (2020-09-28T12:49:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.