Pchatbot: A Large-Scale Dataset for Personalized Chatbot
- URL: http://arxiv.org/abs/2009.13284v3
- Date: Mon, 31 May 2021 05:53:44 GMT
- Title: Pchatbot: A Large-Scale Dataset for Personalized Chatbot
- Authors: Hongjin Qian, Xiaohe Li, Hanxun Zhong, Yu Guo, Yueyuan Ma, Yutao Zhu,
Zhanliang Liu, Zhicheng Dou, Ji-Rong Wen
- Abstract summary: We introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively.
To adapt the raw dataset to dialogue systems, we elaborately normalize the raw dataset via processes such as anonymization.
The scale of Pchatbot is significantly larger than existing Chinese datasets, which might benefit the data-driven models.
- Score: 49.16746174238548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language dialogue systems have attracted great attention recently. As many
dialogue models are data-driven, high-quality datasets are essential to these
systems. In this paper, we introduce Pchatbot, a large-scale dialogue dataset
that contains two subsets collected from Weibo and Judicial forums
respectively. To adapt the raw dataset to dialogue systems, we elaborately
normalize the raw dataset via processes such as anonymization, deduplication,
segmentation, and filtering. The scale of Pchatbot is significantly larger than
existing Chinese datasets, which might benefit the data-driven models. Besides,
current dialogue datasets for personalized chatbots usually contain several
persona sentences or attributes. Different from existing datasets, Pchatbot
provides anonymized user IDs and timestamps for both posts and responses. This
enables the development of personalized dialogue models that directly learn
implicit user personality from the user's dialogue history. Our preliminary
experimental study benchmarks several state-of-the-art dialogue models to
provide a comparison for future work. The dataset is publicly available on
GitHub.
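As a rough illustration of how anonymized user IDs and timestamps could be used, the sketch below drops exact duplicate pairs and groups responses by user in chronological order, so a model can look up what a user said before a given moment. The `Turn` record, its field names, and both helper functions are hypothetical; they are not the dataset's actual schema or the authors' pipeline.

```python
from collections import defaultdict
from typing import NamedTuple


class Turn(NamedTuple):
    """One post/response pair; field names are illustrative assumptions."""
    post: str
    response: str
    responder_id: str  # anonymized user ID of the response author
    timestamp: int     # response time, e.g. seconds since epoch


def build_histories(turns):
    """Drop exact duplicate pairs, then group responses by user, oldest first."""
    seen = set()
    histories = defaultdict(list)
    for turn in sorted(turns, key=lambda t: t.timestamp):
        key = (turn.post, turn.response, turn.responder_id)
        if key in seen:  # skip repeated post/response pairs from the same user
            continue
        seen.add(key)
        histories[turn.responder_id].append(turn)
    return dict(histories)


def history_before(histories, user_id, timestamp):
    """Return the user's turns strictly earlier than `timestamp`."""
    return [t for t in histories.get(user_id, []) if t.timestamp < timestamp]
```

Restricting the history to turns earlier than the current response is one way to avoid leaking future utterances into a personalized model's input.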
Related papers
- PSYDIAL: Personality-based Synthetic Dialogue Generation using Large Language Models [4.283022729693451]
We present a novel end-to-end personality-based synthetic dialogue data generation pipeline, specifically designed to elicit responses from large language models via prompting.
We introduce PSYDIAL, the first Korean dialogue dataset focused on personality-based dialogues, curated using our proposed pipeline.
Experimental results indicate that while pre-trained models and those fine-tuned with a chit-chat dataset struggle to generate responses reflecting personality, models trained with PSYDIAL show significant improvements.
arXiv Detail & Related papers (2024-04-01T05:19:34Z)
- PersonalityChat: Conversation Distillation for Personalized Dialog Modeling with Facts and Traits [5.447308344436046]
PersonalityChat is a synthetic conversational dataset based upon the popular PersonaChat dataset.
We show that the personality trait labels can be used for trait-based personalization of generative dialogue models.
arXiv Detail & Related papers (2024-01-14T20:35:33Z)
- SalesBot 2.0: A Human-Like Intent-Guided Chit-Chat Dataset [28.257630375747606]
This paper aims to build SalesBot 2.0, a revised version of the published data, by leveraging the commonsense knowledge of large language models (LLMs) through proper prompting.
The newly released large-scale dataset with detailed annotations exhibits smoother transitions between topics and is more human-like in terms of naturalness and consistency.
arXiv Detail & Related papers (2023-08-28T02:48:49Z)
- SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents [72.42049370297849]
SpokenWOZ is a large-scale speech-text dataset for spoken TOD.
Cross-turn slot and reasoning slot detection are new challenges for SpokenWOZ.
arXiv Detail & Related papers (2023-05-22T13:47:51Z)
- A Model-Agnostic Data Manipulation Method for Persona-based Dialogue Generation [107.82729587882397]
It is expensive to scale up current persona-based dialogue datasets.
Each data sample in this task is more complex to learn with than conventional dialogue data.
We propose a model-agnostic data manipulation method that can be combined with any persona-based dialogue generation model.
arXiv Detail & Related papers (2022-04-21T03:49:54Z)
- Cross-Lingual Dialogue Dataset Creation via Outline-Based Generation [70.81596088969378]
The Cross-lingual Outline-based Dialogue dataset (termed COD) enables natural language understanding, dialogue state tracking, and end-to-end dialogue modelling and evaluation in 4 diverse languages.
arXiv Detail & Related papers (2022-01-31T18:11:21Z)
- A Large-Scale Chinese Short-Text Conversation Dataset [77.55813366932313]
We present a large-scale cleaned Chinese conversation dataset, LCCC, which contains a base version (6.8 million dialogues) and a large version (12.0 million dialogues).
The quality of our dataset is ensured by a rigorous data cleaning pipeline, which is built based on a set of rules.
We also release pre-training dialogue models which are trained on LCCC-base and LCCC-large respectively.
arXiv Detail & Related papers (2020-08-10T08:12:49Z)
- XPersona: Evaluating Multilingual Personalized Chatbot [76.00426517401894]
We propose a multi-lingual extension of Persona-Chat, namely XPersona.
Our dataset includes persona conversations in six different languages other than English for building and evaluating multilingual personalized agents.
arXiv Detail & Related papers (2020-03-17T07:52:08Z)