The Gutenberg Dialogue Dataset
- URL: http://arxiv.org/abs/2004.12752v2
- Date: Fri, 22 Jan 2021 17:54:25 GMT
- Title: The Gutenberg Dialogue Dataset
- Authors: Richard Csaky and Gabor Recski
- Abstract summary: Current publicly available open-domain dialogue datasets offer a trade-off between quality and size.
We build a high-quality dataset of 14.8M utterances in English, and smaller datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian.
- Score: 1.90365714903665
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large datasets are essential for neural modeling of many NLP tasks. Current
publicly available open-domain dialogue datasets offer a trade-off between
quality (e.g., DailyDialog) and size (e.g., Opensubtitles). We narrow this gap
by building a high-quality dataset of 14.8M utterances in English, and smaller
datasets in German, Dutch, Spanish, Portuguese, Italian, and Hungarian. We
extract and process dialogues from public-domain books made available by
Project Gutenberg. We describe our dialogue extraction pipeline, analyze the
effects of the various heuristics used, and present an error analysis of
extracted dialogues. Finally, we conduct experiments showing that better
response quality can be achieved in zero-shot and finetuning settings by
training on our data than on the larger but much noisier Opensubtitles dataset.
Our open-source pipeline (https://github.com/ricsinaruto/gutenberg-dialog) can
be extended to further languages with little additional effort. Researchers can
also build their versions of existing datasets by adjusting various trade-off
parameters. We also built a web demo for interacting with our models:
https://ricsinaruto.github.io/chatbot.html.
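As a rough illustration of what quote-based dialogue extraction from book text can look like, here is a minimal hypothetical sketch. This is not the authors' actual pipeline (which handles language-specific delimiters and several additional heuristics described in the paper); the function name and regex are illustrative assumptions only.

```python
import re

def extract_utterances(text: str) -> list[str]:
    # Naive heuristic: treat each double-quoted span as one utterance.
    # A real pipeline must also handle language-specific delimiters
    # (e.g. dashes, guillemets) and decide when consecutive utterances
    # belong to the same conversation.
    return [m.group(1).strip() for m in re.finditer(r'"([^"]+)"', text)]

passage = (
    '"Where are you going?" asked Tom. '
    '"To the station," said Anna. "I leave at noon."'
)
print(extract_utterances(passage))
# → ['Where are you going?', 'To the station,', 'I leave at noon.']
```

Even this toy version shows why the paper analyzes its heuristics: quote detection alone cannot tell which utterances form a coherent exchange.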
Related papers
- Towards Zero-Shot Text-To-Speech for Arabic Dialects [16.10882912169842]
Zero-shot multi-speaker text-to-speech (ZS-TTS) systems have advanced for English; however, Arabic still lags behind due to insufficient resources.
We address this gap for Arabic by first adapting an existing dataset to suit the needs of speech synthesis.
We employ a set of Arabic dialect identification models to explore the impact of pre-defined dialect labels on improving the ZS-TTS model in a multi-dialect setting.
arXiv Detail & Related papers (2024-06-24T15:58:15Z)
- Deepfake audio as a data augmentation technique for training automatic speech to text transcription models [55.2480439325792]
We propose a data augmentation framework based on deepfake audio.
A dataset of English speech produced by Indian speakers was selected, ensuring the presence of a single accent.
arXiv Detail & Related papers (2023-09-22T11:33:03Z)
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
- Controllable Dialogue Simulation with In-Context Learning [39.04491297557292]
Dialogic is a dialogue simulation method based on large language model in-context learning.
Our method can rapidly expand a small set of dialogue data with minimum or zero human involvement.
Our simulated dialogues have near-human fluency and annotation accuracy.
arXiv Detail & Related papers (2022-10-09T06:32:58Z)
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
- Contextual Semantic Parsing for Multilingual Task-Oriented Dialogues [7.8378818005171125]
Given a large-scale dialogue dataset in one language, we can automatically produce an effective semantic parser for other languages using machine translation.
We propose automatic translation of dialogue datasets with alignment to ensure faithful translation of slot values.
We show that the succinct representation reduces the compounding effect of translation errors.
arXiv Detail & Related papers (2021-11-04T01:08:14Z)
- Cross-lingual Intermediate Fine-tuning improves Dialogue State Tracking [84.50302759362698]
We enhance the transfer learning process by intermediate fine-tuning of pretrained multilingual models.
We use parallel and conversational movie subtitles datasets to design cross-lingual intermediate tasks.
We achieve impressive improvements (> 20% on goal accuracy) on the parallel MultiWoZ dataset and Multilingual WoZ dataset.
arXiv Detail & Related papers (2021-09-28T11:22:38Z)
- Pchatbot: A Large-Scale Dataset for Personalized Chatbot [49.16746174238548]
We introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively.
To adapt the raw dataset to dialogue systems, we elaborately normalize the raw dataset via processes such as anonymization.
The scale of Pchatbot is significantly larger than that of existing Chinese datasets, which may benefit data-driven models.
arXiv Detail & Related papers (2020-09-28T12:49:07Z)
- A Large-Scale Chinese Short-Text Conversation Dataset [77.55813366932313]
We present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues).
The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built based on a set of rules.
We also release pre-training dialogue models which are trained on LCCC-base and LCCC-large respectively.
arXiv Detail & Related papers (2020-08-10T08:12:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.