TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real
World
- URL: http://arxiv.org/abs/2301.05880v3
- Date: Fri, 8 Sep 2023 10:03:16 GMT
- Title: TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real
World
- Authors: Hongpeng Lin, Ludan Ruan, Wenke Xia, Peiyu Liu, Jingyuan Wen, Yixin
Xu, Di Hu, Ruihua Song, Wayne Xin Zhao, Qin Jin and Zhiwu Lu
- Abstract summary: We introduce a new video-based multi-modal dialogue dataset, called TikTalk.
We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them.
Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context.
- Score: 97.58623810402563
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To facilitate the research on intelligent and human-like chatbots with
multi-modal context, we introduce a new video-based multi-modal dialogue
dataset, called TikTalk. We collect 38K videos from a popular video-sharing
platform, along with 367K conversations posted by users beneath them. Users
engage in spontaneous conversations based on their multi-modal experiences from
watching videos, which helps recreate real-world chitchat context. Compared to
previous multi-modal dialogue datasets, the richer context types in TikTalk
lead to more diverse conversations, but also increase the difficulty in
capturing human interests from intricate multi-modal information to generate
personalized responses. Moreover, external knowledge is more frequently evoked
in our dataset. These facts reveal new challenges for multi-modal dialogue
models. We quantitatively demonstrate the characteristics of TikTalk, propose a
video-based multi-modal chitchat task, and evaluate several dialogue baselines.
Experimental results indicate that models incorporating large language models
(LLMs) can generate more diverse responses, while the model that uses knowledge
graphs to introduce external knowledge performs best overall. Furthermore, no
existing model solves all of the above challenges well. There is still large
room for future improvement, even for LLMs with visual extensions. Our dataset
is available at https://ruc-aimind.github.io/projects/TikTalk/.
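To make the proposed video-based chitchat task concrete, here is a minimal sketch of how a TikTalk-style example might be represented and flattened into a prompt for a text-only baseline. The field names and prompt format are illustrative assumptions, not the dataset's released schema.

```python
# Minimal sketch of a TikTalk-style example; field names are assumptions.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TikTalkExample:
    video_path: str                                   # video the chitchat is grounded in
    caption: str                                      # uploader's text accompanying the video
    context: List[str] = field(default_factory=list)  # preceding comment thread
    response: str = ""                                # gold response to be generated

def build_prompt(example: TikTalkExample) -> str:
    """Flatten the multi-modal context into a text prompt for a baseline model;
    a real system would replace `video_path` with visual features."""
    history = "\n".join(f"User: {turn}" for turn in example.context)
    return f"[VIDEO] {example.video_path}\nCaption: {example.caption}\n{history}\nBot:"

example = TikTalkExample(
    video_path="videos/000123.mp4",
    caption="My cat meets the robot vacuum",
    context=["Haha the little jump at 0:05!"],
    response="Same energy as my dog when the doorbell rings.",
)
print(build_prompt(example))
```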
Related papers
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z)
- Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models [59.525108086957296]
Video-ChatGPT is a multimodal model that merges a video-adapted visual encoder with an LLM.
It is capable of understanding and generating detailed conversations about videos.
We introduce a new dataset of 100,000 video-instruction pairs used to train Video-ChatGPT.
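As a rough illustration of the "visual encoder plus LLM" recipe described here, the sketch below projects frozen per-frame video features into an LLM's token-embedding space and prepends them to the text embeddings. Dimensions, pooling, and module names are assumptions for illustration, not Video-ChatGPT's actual implementation.

```python
# Sketch of projecting video features into an LLM embedding space (assumed sizes).
import torch
import torch.nn as nn

class VideoToLLMAdapter(nn.Module):
    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # Learned projection from visual-encoder features to LLM token embeddings.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, frame_features: torch.Tensor) -> torch.Tensor:
        # frame_features: (batch, num_frames, vision_dim) from a frozen visual encoder.
        pooled = frame_features.mean(dim=1, keepdim=True)  # simple temporal pooling
        return self.proj(pooled)                           # (batch, 1, llm_dim)

adapter = VideoToLLMAdapter()
video_tokens = adapter(torch.randn(2, 8, 1024))            # 2 clips, 8 sampled frames each
text_embeds = torch.randn(2, 32, 4096)                     # stand-in for LLM text embeddings
llm_input = torch.cat([video_tokens, text_embeds], dim=1)  # prepend visual tokens
print(llm_input.shape)                                     # torch.Size([2, 33, 4096])
```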
arXiv Detail & Related papers (2023-06-08T17:59:56Z)
- Affective Faces for Goal-Driven Dyadic Communication [16.72177738101024]
We introduce a video framework for modeling the association between verbal and non-verbal communication during dyadic conversation.
Our approach retrieves a video of a listener whose facial expressions would be socially appropriate given the context.
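A minimal sketch of the retrieval idea, assuming candidate listener clips are pre-embedded and the best match is selected by cosine similarity to the encoded speaker context; the embeddings here are random stand-ins.

```python
# Retrieval-by-similarity sketch; embeddings are random stand-ins.
import torch
import torch.nn.functional as F

context_embedding = torch.randn(256)   # encoded speaker/context representation
listener_bank = torch.randn(500, 256)  # precomputed embeddings of candidate listener clips

scores = F.cosine_similarity(listener_bank, context_embedding.unsqueeze(0), dim=-1)
best = scores.argmax().item()
print(f"retrieve listener clip #{best} (similarity {scores[best]:.3f})")
```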
arXiv Detail & Related papers (2023-01-26T05:00:09Z)
- Exploring Effective Information Utilization in Multi-Turn Topic-Driven Conversations [11.550422073645425]
We encode topic and dialogue history information using prompts fed through multiple channels of Fusion-in-Decoder (FiD).
Our experiments focus on a specific Chinese dataset named NaturalConv, where the conversation revolves around a piece of recent news.
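For readers unfamiliar with Fusion-in-Decoder, the sketch below shows the general pattern assumed here: each channel (topic, dialogue history, news passage) is encoded independently, and the decoder attends over the concatenated encoder outputs. Layer sizes and inputs are arbitrary stand-ins.

```python
# Fusion-in-Decoder-style sketch with arbitrary sizes; inputs are random stand-ins.
import torch
import torch.nn as nn

d_model = 256
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=2)

# Embedded channels, each carrying its own prompt prefix.
topic   = torch.randn(1, 16, d_model)
history = torch.randn(1, 48, d_model)
news    = torch.randn(1, 64, d_model)

# Encode each channel separately, then fuse in the decoder by concatenation.
memory = torch.cat([encoder(x) for x in (topic, history, news)], dim=1)
target = torch.randn(1, 10, d_model)  # embedded response prefix
output = decoder(target, memory)
print(output.shape)                   # torch.Size([1, 10, 256])
```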
arXiv Detail & Related papers (2022-09-01T06:20:39Z)
- HybriDialogue: An Information-Seeking Dialogue Dataset Grounded on Tabular and Textual Data [87.67278915655712]
We present a new dialogue dataset, HybriDialogue, which consists of crowdsourced natural conversations grounded on both Wikipedia text and tables.
The conversations are created through the decomposition of complex multihop questions into simple, realistic multiturn dialogue interactions.
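To illustrate the decomposition idea, here is a hypothetical example of one multihop question turned into grounded turns; the question, answers, and grounding sources are invented for illustration and are not taken from the dataset.

```python
# Hypothetical decomposition of a multihop question into grounded dialogue turns.
multihop_question = "Which country is the director of the highest-grossing 2010 film from?"

dialogue = [
    {"speaker": "user",   "text": "What was the highest-grossing film of 2010?",
     "grounding": {"type": "table", "source": "2010_in_film#Highest-grossing_films"}},
    {"speaker": "system", "text": "Toy Story 3 topped the 2010 box office."},
    {"speaker": "user",   "text": "Who directed it, and which country is he from?",
     "grounding": {"type": "text", "source": "Toy_Story_3"}},
    {"speaker": "system", "text": "It was directed by Lee Unkrich, who is from the United States."},
]

for turn in dialogue:
    print(f"{turn['speaker']}: {turn['text']}")
```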
arXiv Detail & Related papers (2022-04-28T00:52:16Z)
- Building Goal-Oriented Dialogue Systems with Situated Visual Context [12.014793558784955]
With the surge of virtual assistants with screens, the next generation of agents is required to understand screen context.
We propose a novel multimodal conversational framework, where the dialogue agent's next action and their arguments are derived jointly conditioned both on the conversational and the visual context.
Our model can recognize visual features such as color and shape as well as the metadata based features such as price or star rating associated with a visual entity.
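A minimal sketch of jointly predicting the agent's next action and its on-screen argument from fused conversational and visual features; the action inventory, feature sizes, and fusion scheme are illustrative assumptions.

```python
# Joint action + argument prediction over fused dialogue and screen features (assumed sizes).
import torch
import torch.nn as nn

ACTIONS = ["show_details", "add_to_cart", "compare", "no_op"]

class SituatedPolicy(nn.Module):
    def __init__(self, text_dim: int = 256, vis_dim: int = 128, n_entities: int = 10):
        super().__init__()
        self.fuse = nn.Linear(text_dim + vis_dim, 256)
        self.action_head = nn.Linear(256, len(ACTIONS))  # which action to take next
        self.arg_head = nn.Linear(256, n_entities)       # which on-screen entity it targets

    def forward(self, dialogue_vec: torch.Tensor, screen_vec: torch.Tensor):
        h = torch.relu(self.fuse(torch.cat([dialogue_vec, screen_vec], dim=-1)))
        return self.action_head(h), self.arg_head(h)

policy = SituatedPolicy()
action_logits, arg_logits = policy(torch.randn(1, 256), torch.randn(1, 128))
print(ACTIONS[action_logits.argmax(-1).item()], "-> entity", arg_logits.argmax(-1).item())
```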
arXiv Detail & Related papers (2021-11-22T23:30:52Z)
- OpenViDial: A Large-Scale, Open-Domain Dialogue Dataset with Visual Contexts [35.57757367869986]
We release OpenViDial, a large-scale multi-modal dialogue dataset.
OpenViDial contains a total number of 1.1 million dialogue turns.
We propose a family of encoder-decoder models leveraging both textual and visual contexts.
arXiv Detail & Related papers (2020-12-30T03:02:50Z)
- KdConv: A Chinese Multi-domain Dialogue Dataset Towards Multi-turn Knowledge-driven Conversation [66.99734491847076]
We propose a Chinese multi-domain knowledge-driven conversation dataset, KdConv, which grounds the topics in multi-turn conversations to knowledge graphs.
Our corpus contains 4.5K conversations from three domains (film, music, and travel), and 86K utterances with an average turn number of 19.0.
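An illustrative sketch of how KdConv-style utterances can be grounded to knowledge-graph triples; the triples and utterances below are invented examples in the film domain, not drawn from the corpus.

```python
# Hypothetical knowledge-grounded turns linked to KG triples (film domain).
knowledge_graph = [
    ("In the Mood for Love", "director", "Wong Kar-wai"),
    ("In the Mood for Love", "release_year", "2000"),
]

conversation = [
    {"speaker": "A", "text": "Have you seen In the Mood for Love?",
     "triples": [knowledge_graph[0]]},
    {"speaker": "B", "text": "Yes! Wong Kar-wai directed it, back in 2000 I think.",
     "triples": [knowledge_graph[0], knowledge_graph[1]]},
]

for turn in conversation:
    facts = "; ".join(f"{s} -{r}-> {o}" for s, r, o in turn["triples"])
    print(f"{turn['speaker']}: {turn['text']}   [grounded in: {facts}]")
```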
arXiv Detail & Related papers (2020-04-08T16:25:39Z)
- Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.
We show that previous joint-modality (history and image) models over-rely on, and are more prone to memorizing, the dialogue history.
We present methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters.
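A rough sketch of logit-level fusion of the two modules (an image-only scorer and an image+history scorer), with dropout applied to the history branch to discourage over-reliance on it; the shapes and the dropout placement are assumptions, not the paper's exact consensus dropout fusion.

```python
# Logit-level fusion of two answer scorers with dropout on the history branch (assumed setup).
import torch
import torch.nn as nn

num_candidates = 100                                   # candidate answers to rank
image_only_logits = torch.randn(4, num_candidates)     # image-only module
image_history_logits = torch.randn(4, num_candidates)  # image + dialogue-history module

dropout = nn.Dropout(p=0.3)                            # active in training mode
fused = 0.5 * (image_only_logits + dropout(image_history_logits))
print(fused.argmax(dim=-1))                            # consensus prediction per example
```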
arXiv Detail & Related papers (2020-01-17T14:57:12Z)