KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus
- URL: http://arxiv.org/abs/2503.06899v1
- Date: Mon, 10 Mar 2025 04:05:38 GMT
- Title: KwaiChat: A Large-Scale Video-Driven Multilingual Mixed-Type Dialogue Corpus
- Authors: Xiaoming Shi, Zeming Liu, Yiming Lei, Chenkai Zhang, Haitao Leng, Chuan Wang, Qingjie Liu, Wanxiang Che, Shaoguo Liu, Size Li, Yunhong Wang
- Abstract summary: We propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus. The KwaiChat corpus contains a total of 93,209 videos and 246,080 dialogues, across 4 dialogue types, 30 domains, 4 languages, and 13 topics. An analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance but still falls short on the task.
- Score: 69.46707346122113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based dialogue systems, such as education assistants, have compelling application value and are attracting growing interest. However, current video-based dialogue systems support only a single dialogue type, which limits their versatility across practical scenarios such as question-answering and emotional dialogue. We frame this challenge as generating video-driven multilingual mixed-type dialogues. To address it, we propose a novel task and create a human-to-human video-driven multilingual mixed-type dialogue corpus, termed KwaiChat, containing a total of 93,209 videos and 246,080 dialogues across 4 dialogue types, 30 domains, 4 languages, and 13 topics. Additionally, we establish baseline models on KwaiChat. An extensive analysis of 7 distinct LLMs on KwaiChat reveals that GPT-4o achieves the best performance yet still performs poorly even with in-context learning and fine-tuning, indicating that the task is non-trivial and needs further research.
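To make the task concrete, here is a minimal sketch of how a zero-shot baseline might serialize a KwaiChat-style sample into a text prompt for an LLM. The field names (video_caption, dialogue_type, and so on) are hypothetical illustrations, not the actual corpus schema.

```python
# Hypothetical sketch of a zero-shot baseline for video-driven mixed-type
# dialogue generation. Field names are illustrative assumptions, not the
# actual KwaiChat schema; the video is stood in for by a textual caption.
from dataclasses import dataclass, field
from typing import List

@dataclass
class KwaiChatSample:
    video_caption: str                    # textual surrogate for the video content
    language: str                         # one of the corpus's 4 languages
    dialogue_type: str                    # e.g. question-answering, emotional dialogue
    history: List[str] = field(default_factory=list)  # alternating turns so far

def build_prompt(sample: KwaiChatSample) -> str:
    """Flatten a video-grounded dialogue into a single text prompt."""
    turns = "\n".join(f"Turn {i + 1}: {t}" for i, t in enumerate(sample.history))
    return (
        f"Video: {sample.video_caption}\n"
        f"Language: {sample.language}\n"
        f"Dialogue type: {sample.dialogue_type}\n"
        f"{turns}\n"
        "Next turn:"
    )

sample = KwaiChatSample(
    video_caption="A chef flambés a dessert tableside.",
    language="English",
    dialogue_type="question-answering",
    history=["What technique is the chef using?"],
)
print(build_prompt(sample))  # feed to any instruction-tuned LLM
```

Captioning the video is only one possible design; baselines that consume visual features directly would replace the `Video:` line with projected frame embeddings.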
Related papers
- TV-Dialogue: Crafting Theme-Aware Video Dialogues with Immersive Interaction [25.851857218815415]
We introduce Theme-aware Video Dialogue Crafting (TVDC), a novel task aimed at generating new dialogues that align with video content and adhere to user-specified themes. TV-Dialogue is a novel multi-modal agent framework that ensures both theme alignment and visual consistency. Our findings underscore the potential of TV-Dialogue for applications such as video re-creation, film dubbing, and downstream multimodal tasks.
arXiv Detail & Related papers (2025-01-31T08:04:32Z)
- Can xLLMs Understand the Structure of Dialog? Exploring Multilingual Response Generation in Complex Scenarios [8.131774353504472]
We introduce XMP, a high-quality parallel Multilingual dataset sourced from Multi-party Podcast dialogues. Each sample in the dataset features at least three participants discussing a wide range of topics, including society, culture, politics, and entertainment. We uncover significant limitations in previously recognized multilingual capabilities of LLMs when applied to such complex dialogue scenarios.
arXiv Detail & Related papers (2025-01-20T04:33:03Z)
- WavChat: A Survey of Spoken Dialogue Models [66.82775211793547]
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o, have captured significant attention in the speech domain.
These advanced spoken dialogue models not only comprehend audio, music, and other speech-related features, but also capture stylistic and timbral characteristics in speech.
Despite the progress in spoken dialogue systems, there is a lack of comprehensive surveys that systematically organize and analyze these systems.
arXiv Detail & Related papers (2024-11-15T04:16:45Z)
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel face-to-face spoken dialogue model that processes audio-visual speech from the user and generates audio-visual speech in response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z)
- Conversations as a Source for Teaching Scientific Concepts at Different Education Levels [22.315652391541285]
This paper presents a novel source for facilitating conversational teaching of scientific concepts at various difficulty levels.
We analyse this data source in various ways to show that it offers a diverse array of examples that can be used to generate contextually appropriate responses.
arXiv Detail & Related papers (2024-04-16T11:33:36Z)
- BotChat: Evaluating LLMs' Capabilities of Having Multi-Turn Dialogues [72.65163468440434]
This report provides a preliminary evaluation of existing large language models for human-style multi-turn chatting.
We prompt large language models (LLMs) to generate a full multi-turn dialogue from a ChatSEED (a seed utterance), one utterance at a time.
We find that GPT-4 can generate human-style multi-turn dialogues of impressive quality, significantly outperforming its counterparts.
arXiv Detail & Related papers (2023-10-20T16:53:51Z)
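A minimal sketch of this utterance-by-utterance protocol, assuming a generic text-generation callable in place of any particular LLM client; the prompt wording and turn count are illustrative, not the paper's exact setup.

```python
# Sketch of utterance-by-utterance dialogue generation from a seed utterance.
# `llm` is a stand-in for any text-generation call (an assumption, not a real API).
from typing import Callable, List

def extend_dialogue(seed: str, llm: Callable[[str], str], num_turns: int = 6) -> List[str]:
    dialogue = [seed]
    for _ in range(num_turns):
        context = "\n".join(f"Speaker {i % 2 + 1}: {u}" for i, u in enumerate(dialogue))
        prompt = ("Continue this two-person chat with exactly one natural, "
                  "human-sounding reply from the next speaker.\n" + context)
        dialogue.append(llm(prompt).strip())   # one new utterance per model call
    return dialogue

# Usage with a dummy model; swap in a real LLM client to reproduce the setup.
fake_llm = lambda prompt: "That sounds great, tell me more!"
print(extend_dialogue("I finally tried that ramen place downtown.", fake_llm, num_turns=2))
```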
- ChatPLUG: Open-Domain Generative Dialogue System with Internet-Augmented Instruction Tuning for Digital Human [76.62897301298699]
ChatPLUG is a Chinese open-domain dialogue system for digital human applications, instruction-finetuned on a wide range of dialogue tasks in a unified internet-augmented format.
We show that ChatPLUG outperforms state-of-the-art Chinese dialogue systems on both automatic and human evaluation.
We deploy ChatPLUG in real-world applications such as smart speakers and instant-messaging apps, with fast inference.
arXiv Detail & Related papers (2023-04-16T18:16:35Z)
- MMChat: Multi-Modal Chat Dataset on Social Media [8.904627457711683]
MMChat is a large-scale multi-modal dialogue corpus (32.4M raw dialogues and 120.84K filtered dialogues).
Unlike previous corpora that are crowd-sourced or collected from fictitious movies, MMChat contains image-grounded dialogues collected from real conversations on social media.
We develop a benchmark model for dialogue generation that adapts the attention routing mechanism to image features.
arXiv Detail & Related papers (2021-08-16T15:27:49Z)
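As a rough illustration of grounding generation on image features, the snippet below uses generic cross-attention, with dialogue-token queries attending over image-region features. This is a stand-in, not the paper's actual attention routing mechanism, and all dimensions are assumptions.

```python
# Generic cross-attention over image regions (PyTorch). NOT the paper's
# attention routing mechanism; just a common way to fuse image features
# into dialogue generation, with hypothetical dimensions.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)
img_proj = nn.Linear(2048, 512)                      # map region features to model width

text = torch.randn(1, 20, 512)                       # dialogue-history token states
image_regions = img_proj(torch.randn(1, 36, 2048))   # e.g. 36 detected image regions

# Text queries attend over image keys/values, enriching each token state
# with visual information before response decoding.
fused, _ = attn(query=text, key=image_regions, value=image_regions)
print(fused.shape)  # torch.Size([1, 20, 512])
```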
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models to improve video-grounded dialogue.
We propose a framework that formulates video-grounded dialogue as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
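A minimal sketch of this sequence-to-sequence formulation in PyTorch: video features are projected into the language model's embedding space and prepended to the dialogue tokens, so one pretrained decoder attends across both modalities. Dimensions and module names are assumptions, not the paper's exact architecture.

```python
# Sketch: fuse video features and dialogue tokens into one input sequence
# for a pretrained decoder. Dimensions are hypothetical (GPT-2-like sizes).
import torch
import torch.nn as nn

class VideoGroundedLMInput(nn.Module):
    def __init__(self, video_dim: int = 2048, embed_dim: int = 768, vocab: int = 50257):
        super().__init__()
        self.video_proj = nn.Linear(video_dim, embed_dim)  # video features -> token space
        self.token_embed = nn.Embedding(vocab, embed_dim)  # stand-in for the LM's embeddings

    def forward(self, video_feats: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # video_feats: (batch, n_frames, video_dim); token_ids: (batch, seq_len)
        video_tokens = self.video_proj(video_feats)
        text_tokens = self.token_embed(token_ids)
        # One flat sequence; the decoder's self-attention then captures
        # dependencies across the video and dialogue modalities.
        return torch.cat([video_tokens, text_tokens], dim=1)

inputs = VideoGroundedLMInput()(torch.randn(1, 8, 2048), torch.randint(0, 50257, (1, 16)))
print(inputs.shape)  # torch.Size([1, 24, 768]), ready for a GPT-2-style decoder
```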