KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus
- URL: http://arxiv.org/abs/2503.06899v1
- Date: Mon, 10 Mar 2025 04:05:38 GMT
- Title: KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus
- Authors: Xiaoming Shi, Zeming Liu, Yiming Lei, Chenkai Zhang, Haitao Leng, Chuan Wang, Qingjie Liu, Wanxiang Che, Shaoguo Liu, Size Li, Yunhong Wang
- Abstract summary: We propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus. The KwaiChat corpus contains a total of 93,209 videos and 246,080 dialogues, across 4 dialogue types, 30 domains, 4 languages, and 13 topics. An analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance but still falls short on the task.
- Score: 69.46707346122113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based dialogue systems, such as education assistants, have compelling application value and are attracting growing interest. However, current video-based dialogue systems support only a single dialogue type, which limits their versatility across practical scenarios such as question-answering and emotional dialogue. We frame this challenge as generating video-driven multilingual mixed-type dialogues. To address it, we propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus, termed KwaiChat, containing a total of 93,209 videos and 246,080 dialogues across 4 dialogue types, 30 domains, 4 languages, and 13 topics. Additionally, we establish baseline models on KwaiChat. An extensive analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance yet still performs poorly even with in-context learning and fine-tuning, indicating that the task is non-trivial and needs further research.
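To make the task concrete, here is a minimal sketch of how a zero-shot baseline might serialize a KwaiChat-style sample into a text prompt for an LLM. The field names (video_caption, dialogue_type, and so on) are hypothetical illustrations, not the actual corpus schema.

```python
# Hypothetical sketch of a zero-shot baseline for video-driven mixed-type
# dialogue generation. Field names are illustrative assumptions, not the
# actual KwaiChat schema; the video is stood in for by a textual caption.
from dataclasses import dataclass, field
from typing import List

@dataclass
class KwaiChatSample:
    video_caption: str                    # textual surrogate for the video content
    language: str                         # one of the corpus's 4 languages
    dialogue_type: str                    # e.g. question-answering, emotional dialogue
    history: List[str] = field(default_factory=list)  # alternating turns so far

def build_prompt(sample: KwaiChatSample) -> str:
    """Flatten a video-grounded dialogue into a single text prompt."""
    turns = "\n".join(f"Turn {i + 1}: {t}" for i, t in enumerate(sample.history))
    return (
        f"Video: {sample.video_caption}\n"
        f"Language: {sample.language}\n"
        f"Dialogue type: {sample.dialogue_type}\n"
        f"{turns}\n"
        "Next turn:"
    )

sample = KwaiChatSample(
    video_caption="A chef flambés a dessert tableside.",
    language="English",
    dialogue_type="question-answering",
    history=["What technique is the chef using?"],
)
print(build_prompt(sample))  # feed to any instruction-tuned LLM
```

Captioning the video is only one possible design; baselines that consume visual features directly would replace the `Video:` line with projected frame embeddings.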
Related papers
- TV-Dialogue: Crafting Theme-Aware Video Dialogues with Immersive Interaction [25.851857218815415]
We introduce Theme-aware Video Dialogue Crafting (TVDC), a novel task aimed at generating new dialogues that align with video content and adhere to user-specified themes. TV-Dialogue is a novel multi-modal agent framework that ensures both theme alignment and visual consistency. Our findings underscore the potential of TV-Dialogue for applications such as video re-creation, film dubbing, and downstream multimodal tasks.
arXiv Detail & Related papers (2025-01-31T08:04:32Z)
- Can xLLMs Understand the Structure of Dialog? Exploring Multilingual Response Generation in Complex Scenarios [8.131774353504472]
We introduce XMP, a high-quality parallel Multilingual dataset sourced from Multi-party Podcast dialogues. Each sample in the dataset features at least three participants discussing a wide range of topics, including society, culture, politics, and entertainment. We uncover significant limitations in previously recognized multilingual capabilities of LLMs when applied to such complex dialogue scenarios.
arXiv Detail & Related papers (2025-01-20T04:33:03Z)
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z)
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel face-to-face spoken dialogue model that processes audio-visual speech from the user and generates audio-visual speech in response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z)
- Conversations as a Source for Teaching Scientific Concepts at Different Education Levels [22.315652391541285]
This paper presents a novel source for facilitating conversational teaching of scientific concepts at various difficulty levels.
We analyse this data source in various ways to show that it offers a diverse array of examples that can be used to generate contextually appropriate responses.
arXiv Detail & Related papers (2024-04-16T11:33:36Z)
- BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues [72.65163468440434]
This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting.
We prompt large language models (LLMs) to generate a full multi-turn dialogue from a ChatSEED (a seed utterance), one utterance at a time.
We find that GPT-4 can generate human-style multi-turn dialogues of impressive quality, significantly outperforming its counterparts.
arXiv Detail & Related papers (2023-10-20T16:53:51Z)
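A minimal sketch of this utterance-by-utterance protocol, assuming a generic text-generation callable in place of any particular LLM client; the prompt wording and turn count are illustrative, not the paper's exact setup.

```python
# Sketch of utterance-by-utterance dialogue generation from a seed utterance.
# `llm` is a stand-in for any text-generation call (an assumption, not a real API).
from typing import Callable, List

def extend_dialogue(seed: str, llm: Callable[[str], str], num_turns: int = 6) -> List[str]:
    dialogue = [seed]
    for _ in range(num_turns):
        context = "\n".join(f"Speaker {i % 2 + 1}: {u}" for i, u in enumerate(dialogue))
        prompt = ("Continue this two-person chat with exactly one natural, "
                  "human-sounding reply from the next speaker.\n" + context)
        dialogue.append(llm(prompt).strip())   # one new utterance per model call
    return dialogue

# Usage with a dummy model; swap in a real LLM client to reproduce the setup.
fake_llm = lambda prompt: "That sounds great, tell me more!"
print(extend_dialogue("I finally tried that ramen place downtown.", fake_llm, num_turns=2))
```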
- ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human [76.62897301298699]
ChatPLUG is a Chinese open-domain dialogue system for digital human applications, instruction-finetuned on a wide range of dialogue tasks in a unified internet-augmented format.
We show that ChatPLUG outperforms state-of-the-art Chinese dialogue systems on both automatic and human evaluation.
We deploy ChatPLUG in real-world applications such as smart speakers and instant-messaging apps, with fast inference.
arXiv Detail & Related papers (2023-04-16T18:16:35Z)
- MMChat: Multi-Modal Chat Dataset on Social Media [8.904627457711683]
MMChat is a large-scale multi-modal dialogue corpus (32.4M raw dialogues and 120.84K filtered dialogues).
Unlike previous corpora that are crowd-sourced or collected from fictitious movies, MMChat contains image-grounded dialogues collected from real conversations on social media.
We develop a benchmark model for dialogue generation that adapts the attention routing mechanism to image features.
arXiv Detail & Related papers (2021-08-16T15:27:49Z)
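As a rough illustration of grounding generation on image features, the snippet below uses generic cross-attention, with dialogue-token queries attending over image-region features. This is a stand-in, not the paper's actual attention routing mechanism, and all dimensions are assumptions.

```python
# Generic cross-attention over image regions (PyTorch). NOT the paper's
# attention routing mechanism; just a common way to fuse image features
# into dialogue generation, with hypothetical dimensions.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
img_proj = nn.Linear(2048, 512)                      # map region features to model width

text = torch.randn(1, 20, 512)                       # dialogue-history token states
image_regions = img_proj(torch.randn(1, 36, 2048))   # e.g. 36 detected image regions

# Text queries attend over image keys/values, enriching each token state
# with visual information before response decoding.
fused, _ = attn(query=text, key=image_regions, value=image_regions)
print(fused.shape)  # torch.Size([1, 20, 512])
```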
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models to improve video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
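A minimal sketch of this sequence-to-sequence formulation in PyTorch: video features are projected into the language model's embedding space and prepended to the dialogue tokens, so one pretrained decoder attends across both modalities. Dimensions and module names are assumptions, not the paper's exact architecture.

```python
# Sketch: fuse video features and dialogue tokens into one input sequence
# for a pretrained decoder. Dimensions are hypothetical (GPT-2-like sizes).
import torch
import torch.nn as nn

class VideoGroundedLMInput(nn.Module):
    def __init__(self, video_dim: int = 2048, embed_dim: int = 768, vocab: int = 50257):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video features -> token space
        self.token_embed = nn.Embedding(vocab, embed_dim)  # stand-in for the LM's embeddings

    def forward(self, video_feats: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, n_frames, video_dim); token_ids: (batch, seq_len)
        video_tokens = self.video_proj(video_feats)
        text_tokens = self.token_embed(token_ids)
        # One flat sequence; the decoder's self-attention then captures
        # dependencies across the video and dialogue modalities.
        return torch.cat([video_tokens, text_tokens], dim=1)

inputs = VideoGroundedLMInput()(torch.randn(1, 8, 2048), torch.randint(0, 50257, (1, 16)))
print(inputs.shape)  # torch.Size([1, 24, 768]), ready for a GPT-2-style decoder
```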