Faithful Persona-based Conversational Dataset Generation with Large
Language Models
- URL: http://arxiv.org/abs/2312.10007v1
- Date: Fri, 15 Dec 2023 18:23:50 GMT
- Title: Faithful Persona-based Conversational Dataset Generation with Large
Language Models
- Authors: Pegah Jandaghi, XiangHai Sheng, Xinyi Bai, Jay Pujara, Hakim Sidahmed
- Abstract summary: High-quality conversational datasets are essential for developing AI models that can communicate with users.
We propose a Generator-Critic architecture framework to expand the initial dataset, while improving the quality of its conversations.
We release Synthetic-Persona-Chat, consisting of 20k conversations seeded from Persona-Chat.
- Score: 10.506653172302222
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality conversational datasets are essential for developing AI models
that can communicate with users. One way to foster deeper interactions between
a chatbot and its user is through personas, aspects of the user's character
that provide insights into their personality, motivations, and behaviors.
Training Natural Language Processing (NLP) models on a diverse and
comprehensive persona-based dataset can lead to conversational models that
create a deeper connection with the user, and maintain their engagement. In
this paper, we leverage the power of Large Language Models (LLMs) to create a
large, high-quality conversational dataset from a seed dataset. We propose a
Generator-Critic architecture framework to expand the initial dataset, while
improving the quality of its conversations. The Generator is an LLM prompted to
output conversations. The Critic consists of a mixture of expert LLMs that
control the quality of the generated conversations. These experts select the
best generated conversations, which we then use to improve the Generator. We
release Synthetic-Persona-Chat, consisting of 20k conversations seeded from
Persona-Chat. We evaluate the quality of Synthetic-Persona-Chat and our
generation framework on different dimensions through extensive experiments, and
observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat
during Turing test decreases from 17.2% to 8.8% over three iterations.
Related papers
- Self-Directed Turing Test for Large Language Models [56.64615470513102]
The Turing test examines whether AIs can exhibit human-like behaviour in natural language conversations.
Traditional Turing tests adopt a rigid dialogue format where each participant sends only one message each time.
This paper proposes the Self-Directed Turing Test, which extends the original test with a burst dialogue format.
arXiv Detail & Related papers (2024-08-19T09:57:28Z) - PersonalityChat: Conversation Distillation for Personalized Dialog
Modeling with Facts and Traits [5.447308344436046]
PersonalityChat is a synthetic conversational dataset based upon the popular PersonaChat dataset.
We show that the personality trait labels can be used for trait-based personalization of generative dialogue models.
arXiv Detail & Related papers (2024-01-14T20:35:33Z) - AutoConv: Automatically Generating Information-seeking Conversations
with Large Language Models [74.10293412011455]
We propose AutoConv for synthetic conversation generation.
Specifically, we formulate the conversation generation problem as a language modeling task.
We finetune an LLM with a few human conversations to capture the characteristics of the information-seeking process.
arXiv Detail & Related papers (2023-08-12T08:52:40Z) - Enhancing Chat Language Models by Scaling High-quality Instructional
Conversations [91.98516412612739]
We first provide a systematically designed, diverse, informative, large-scale dataset of instructional conversations, UltraChat.
Our objective is to capture the breadth of interactions that a human might have with an AI assistant.
We fine-tune a LLaMA model to create a powerful conversational model, UltraLLaMA.
arXiv Detail & Related papers (2023-05-23T16:49:14Z) - PLACES: Prompting Language Models for Social Conversation Synthesis [103.94325597273316]
We use a small set of expert-written conversations as in-context examples to synthesize a social conversation dataset using prompting.
We perform several thorough evaluations of our synthetic conversations compared to human-collected conversations.
arXiv Detail & Related papers (2023-02-07T05:48:16Z) - Training Conversational Agents with Generative Conversational Networks [74.9941330874663]
We use Generative Conversational Networks to automatically generate data and train social conversational agents.
We evaluate our approach on TopicalChat with automatic metrics and human evaluators, showing that with 10% of seed data it performs close to the baseline that uses 100% of the data.
arXiv Detail & Related papers (2021-10-15T21:46:39Z) - Pchatbot: A Large-Scale Dataset for Personalized Chatbot [49.16746174238548]
We introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively.
To adapt the raw dataset to dialogue systems, we elaborately normalize the raw dataset via processes such as anonymization.
The scale of Pchatbot is significantly larger than existing Chinese datasets, which might benefit the data-driven models.
arXiv Detail & Related papers (2020-09-28T12:49:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.