The Gutenberg Dialogue Dataset
- URL: http://arxiv.org/abs/2004.12752v2
- Date: Fri, 22 Jan 2021 17:54:25 GMT
- Title: The Gutenberg Dialogue Dataset
- Authors: Richard Csaky and Gabor Recski
- Abstract summary: Current publicly available open-domain dialogue datasets offer a trade-off between quality and size.
We build a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian.
- Score: 1.90365714903665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large datasets are essential for neural modeling of many NLP tasks. Current
publicly available open-domain dialogue datasets offer a trade-off between
quality (e.g., DailyDialog) and size (e.g., Opensubtitles). We narrow this gap
by building a high-quality dataset of 14.8M utterances in English, and smaller
datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. We
extract and process dialogues from public-domain books made available by
Project Gutenberg. We describe our dialogue extraction pipeline, analyze the
effects of the various heuristics used, and present an error analysis of
extracted dialogues. Finally, we conduct experiments showing that better
response quality can be achieved in zero-shot and finetuning settings by
training on our data than on the larger but much noisier Opensubtitles dataset.
Our open-source pipeline (https://github.com/ricsinaruto/gutenberg-dialog) can
be extended to further languages with little additional effort. Researchers can
also build their versions of existing datasets by adjusting various trade-off
parameters. We also built a web demo for interacting with our models:
https://ricsinaruto.github.io/chatbot.html.
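As a rough illustration of what quote-based dialogue extraction from book text can look like, here is a minimal hypothetical sketch. This is not the authors' actual pipeline (which handles language-specific delimiters and several additional heuristics described in the paper); the function name and regex are illustrative assumptions only.

```python
import re

def extract_utterances(text: str) -> list[str]:
    # Naive heuristic: treat each double-quoted span as one utterance.
    # A real pipeline must also handle language-specific delimiters
    # (e.g. dashes, guillemets) and decide when consecutive utterances
    # belong to the same conversation.
    return [m.group(1).strip() for m in re.finditer(r'"([^"]+)"', text)]

passage = (
    '"Where are you going?" asked Tom. '
    '"To the station," said Anna. "I leave at noon."'
)
print(extract_utterances(passage))
# → ['Where are you going?', 'To the station,', 'I leave at noon.']
```

Even this toy version shows why the paper analyzes its heuristics: quote detection alone cannot tell which utterances form a coherent exchange.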
Related papers
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, Arabic still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a data augmentation framework based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
- Controllable Dialogue Simulation with In-Context Learning [39.04491297557292]
Dialogic is a dialogue simulation method based on large language model in-context learning.
Our method can rapidly expand a small set of dialogue data with minimum or zero human involvement.
Our simulated dialogues have near-human fluency and annotation accuracy.
arXiv Detail & Related papers (2022-10-09T06:32:58Z)
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
- Contextual Semantic Parsing for Multilingual Task-Oriented Dialogues [7.8378818005171125]
Given a large-scale dialogue dataset in one language, we can automatically produce an effective semantic parser for other languages using machine translation.
We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values.
We show that the succinct representation reduces the compounding effect of translation errors.
arXiv Detail & Related papers (2021-11-04T01:08:14Z)
- Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
- Pchatbot: A Large-Scale Dataset for Personalized Chatbot [49.16746174238548]
We introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively.
To adapt the raw dataset to dialogue systems, we elaborately normalize the raw dataset via processes such as anonymization.
The scale of Pchatbot is significantly larger than that of existing Chinese datasets, which may benefit data-driven models.
arXiv Detail & Related papers (2020-09-28T12:49:07Z)
- A Large-Scale Chinese Short-Text Conversation Dataset [77.55813366932313]
We present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues).
The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built based on a set of rules.
We also release pre-training dialogue models which are trained on LCCC-base and LCCC-large respectively.
arXiv Detail & Related papers (2020-08-10T08:12:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.