LiveChat: A Large-Scale Personalized Dialogue Dataset Automatically
Constructed from Live Streaming
- URL: http://arxiv.org/abs/2306.08401v1
- Date: Wed, 14 Jun 2023 09:50:06 GMT
- Title: LiveChat: A Large-Scale Personalized Dialogue Dataset Automatically
Constructed from Live Streaming
- Authors: Jingsheng Gao, Yixin Lian, Ziyi Zhou, Yuzhuo Fu, Baoyuan Wang
- Abstract summary: We introduce the LiveChat dataset, composed of 1.33 million real-life Chinese dialogues with almost 3800 average sessions across 351 personas and fine-grained profiles for each persona.
We target two critical tasks of response modeling and addressee recognition and propose retrieval-based baselines grounded on advanced techniques.
- Score: 11.88939304751663
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-domain dialogue systems have made promising progress in recent years.
While the state-of-the-art dialogue agents are built upon large-scale
text-based social media data and large pre-trained models, there is no
guarantee these agents could also perform well in fast-growing scenarios, such
as live streaming, due to the bounded transferability of pre-trained models and
biased distributions of public datasets from Reddit and Weibo, etc. To improve
the essential capability of responding and establish a benchmark in the live
open-domain scenario, we introduce the LiveChat dataset, composed of 1.33
million real-life Chinese dialogues with almost 3800 average sessions across
351 personas and fine-grained profiles for each persona. LiveChat is
automatically constructed by processing numerous live videos on the Internet
and naturally falls within the scope of multi-party conversations, where the
issues of Who says What to Whom should be considered. Therefore, we target two
critical tasks of response modeling and addressee recognition and propose
retrieval-based baselines grounded on advanced techniques. Experimental results
have validated the positive effects of leveraging persona profiles and larger
average sessions per persona. In addition, we also benchmark the
transferability of advanced generation-based models on LiveChat and pose some
future directions for current challenges.
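As a rough illustration of what a retrieval-based response-modeling baseline does, the toy sketch below ranks candidate responses against a dialogue context, optionally concatenated with a persona profile, and scores the ranking with Recall@k. It is not the authors' method: the bag-of-words cosine scorer, the function names (`rank_responses`, `recall_at_k`), and the English example strings are all assumptions made for illustration. LiveChat itself is Chinese and would need a proper tokenizer and a learned encoder.

```python
import math
from collections import Counter


def bow(text):
    """Bag-of-words vector from whitespace tokens (toy tokenizer, assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_responses(context, candidates, persona_profile=""):
    """Rank candidates by similarity to the context plus persona text.

    Hypothetical helper: a real baseline would use a trained encoder
    instead of word overlap.
    """
    query = bow(context + " " + persona_profile)
    scored = [(cosine(query, bow(c)), c) for c in candidates]
    # sorted() is stable, so ties keep their original candidate order.
    return [c for _, c in sorted(scored, key=lambda p: p[0], reverse=True)]


def recall_at_k(ranked, gold, k):
    """1.0 if the gold response appears in the top-k ranked candidates."""
    return 1.0 if gold in ranked[:k] else 0.0


# Invented example data, not taken from LiveChat.
context = "what game are you playing tonight"
persona = "streamer who plays racing game"
candidates = [
    "thanks for the gift",
    "i am playing a racing game tonight",
    "the weather is nice",
]
gold = "i am playing a racing game tonight"

ranked = rank_responses(context, candidates, persona)
print(ranked[0])
print(recall_at_k(ranked, gold, k=1))
```

Passing an empty `persona_profile` gives a context-only ranker, which is one simple way to compare results with and without persona information.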
Related papers
- ShareChat: A Dataset of Chatbot Conversations in the Wild [11.008120181455316]
We present ShareChat, a large-scale, cross-platform corpus comprising 142,808 conversations and over 660,000 turns collected from publicly shared URLs across five major platforms.
We show ShareChat offers substantially longer context windows and greater interaction depth than prior datasets.
arXiv Detail & Related papers (2025-12-19T17:47:53Z)
- ConvFill: Model Collaboration for Responsive Conversational Voice Agents [6.166061057506208]
We propose conversational infill, a task where a lightweight on-device model generates contextually appropriate dialogue while seamlessly incorporating streaming knowledge from a powerful backend model.
We present ConvFill, a 360M parameter model trained on synthetic multi-domain conversations.
We show that conversational infill can be successfully learned, with ConvFill achieving accuracy improvements of 36-42% over standalone small models of the same size while consistently retaining sub-200ms response latencies.
arXiv Detail & Related papers (2025-11-10T18:50:30Z)
- Proactive Assistant Dialogue Generation from Streaming Egocentric Videos [48.30863954384779]
This work lays the foundation for developing real-time, proactive AI assistants capable of guiding users through diverse tasks.
First, we introduce a novel data curation pipeline that synthesizes dialogues from annotated egocentric videos.
Second, we develop a suite of automatic evaluation metrics, validated through extensive human studies.
Third, we propose an end-to-end model that processes streaming video inputs to generate contextually appropriate responses.
arXiv Detail & Related papers (2025-06-06T09:23:29Z)
- Evaluating Very Long-Term Conversational Memory of LLM Agents [95.84027826745609]
We introduce a machine-human pipeline to generate high-quality, very long-term dialogues.
We equip each agent with the capability of sharing and reacting to images.
The generated conversations are verified and edited by human annotators for long-range consistency.
arXiv Detail & Related papers (2024-02-27T18:42:31Z)
- Enhancing Chat Language Models by Scaling High-quality Instructional Conversations [91.98516412612739]
We first provide a systematically designed, diverse, informative, large-scale dataset of instructional conversations, UltraChat.
Our objective is to capture the breadth of interactions that a human might have with an AI assistant.
We fine-tune a LLaMA model to create a powerful conversational model, UltraLLaMA.
arXiv Detail & Related papers (2023-05-23T16:49:14Z)
- Deploying a Retrieval based Response Model for Task Oriented Dialogues [8.671263996400844]
Task-oriented dialogue systems need to have high conversational capability, be easily adaptable to changing situations and conform to business constraints.
This paper describes a 3-step procedure to develop a conversational model that satisfies these criteria and can efficiently scale to rank a large set of response candidates.
arXiv Detail & Related papers (2022-10-25T23:10:19Z)
- Towards Efficient Dialogue Pre-training with Transferable and Interpretable Latent Structure [77.30953347462452]
This paper proposes a novel dialogue generation model with a latent structure that is easily transferable from the general domain to downstream tasks in a lightweight and transparent way.
Thanks to the transferable latent structure, our model is able to yield better dialogue responses than four strong baselines in terms of both automatic and human evaluations.
arXiv Detail & Related papers (2022-10-22T14:46:43Z)
- Grounding in social media: An approach to building a chit-chat dialogue model [9.247397520986999]
Building open-domain dialogue systems capable of rich human-like conversational ability is one of the fundamental challenges in language generation.
Current work on knowledge-grounded dialogue generation primarily focuses on persona incorporation or searching a fact-based structured knowledge source such as Wikipedia.
Our method takes a broader and simpler approach, which aims to improve the raw conversation ability of the system by mimicking the human response behavior on social media.
arXiv Detail & Related papers (2022-06-12T09:01:57Z)
- Duplex Conversation: Towards Human-like Interaction in Spoken Dialogue System [120.70726465994781]
Duplex Conversation is a multimodal spoken dialogue system that enables telephone-based agents to interact with customers like a human.
We deploy Duplex Conversation in Alibaba intelligent customer service and share lessons learned in production.
Online A/B experiments show that the proposed system can significantly reduce response latency by 50%.
arXiv Detail & Related papers (2022-05-30T12:41:23Z)
- Training Conversational Agents with Generative Conversational Networks [74.9941330874663]
We use Generative Conversational Networks to automatically generate data and train social conversational agents.
We evaluate our approach on TopicalChat with automatic metrics and human evaluators, showing that with 10% of seed data it performs close to the baseline that uses 100% of the data.
arXiv Detail & Related papers (2021-10-15T21:46:39Z)
- An Exploratory Study on Long Dialogue Summarization: What Works and What's Next [33.1899354772074]
We study long dialogue summarization by investigating three strategies to deal with the lengthy input problem and locate relevant information.
Our experimental results on three long dialogue datasets (QMSum, MediaSum, SummScreen) show that the retrieve-then-summarize pipeline models yield the best performance.
arXiv Detail & Related papers (2021-09-10T01:38:26Z)
- Pchatbot: A Large-Scale Dataset for Personalized Chatbot [49.16746174238548]
We introduce Pchatbot, a large-scale dialogue dataset that contains two subsets collected from Weibo and Judicial forums respectively.
To adapt the raw dataset to dialogue systems, we elaborately normalize the raw dataset via processes such as anonymization.
The scale of Pchatbot is significantly larger than existing Chinese datasets, which might benefit the data-driven models.
arXiv Detail & Related papers (2020-09-28T12:49:07Z)