Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
- URL: http://arxiv.org/abs/2412.17295v1
- Date: Mon, 23 Dec 2024 05:32:48 GMT
- Title: Friends-MMC: A Dataset for Multi-modal Multi-party Conversation Understanding
- Authors: Yueqian Wang, Xiaojun Meng, Yuxuan Wang, Jianxin Liang, Qun Liu, Dongyan Zhao
- Abstract summary: Multi-modal multi-party conversation (MMC) is a less studied yet important topic of research.
MMC requires stronger character-centered understanding abilities as there are many interlocutors appearing in both the visual and textual context.
We present Friends-MMC, an MMC dataset that contains 24,000+ unique utterances paired with video context.
- Score: 44.870165050047355
- Abstract: Multi-modal multi-party conversation (MMC) is a less studied yet important topic of research because it fits real-world scenarios well and thus potentially has wider applications. Compared with traditional multi-modal conversations, MMC requires stronger character-centered understanding abilities, as there are many interlocutors appearing in both the visual and textual context. To facilitate the study of this problem, we present Friends-MMC, an MMC dataset that contains 24,000+ unique utterances paired with video context. To explore the character-centered understanding of the dialogue, we also annotate the speaker of each utterance and the names and bounding boxes of the faces that appear in the video. Based on the Friends-MMC dataset, we further study two fundamental MMC tasks: conversation speaker identification and conversation response prediction, both of which are multi-party in nature and use video or images as visual context. For conversation speaker identification, we demonstrate the inefficiencies of existing methods such as pre-trained models, and propose a simple yet effective baseline method that leverages an optimization solver to utilize the context of the two modalities to achieve better performance. For conversation response prediction, we fine-tune generative dialogue models on Friends-MMC and analyze the benefits of speaker information. The code and dataset are publicly available at https://github.com/yellow-binary-tree/Friends-MMC. We call for more attention to modeling speaker information when understanding conversations.
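The abstract describes speaker identification as combining evidence from two modalities with an optimization solver, but gives no further detail here. As a rough illustration only, and not the authors' method, the idea can be sketched as a joint assignment problem. Everything below is an assumption made for this sketch: the function name `identify_speakers`, the two input score matrices, the weight `alpha`, and the exhaustive search that stands in for a real solver.

```python
from itertools import product


def identify_speakers(visual_scores, same_speaker_prob, alpha=1.0):
    """Assign one speaker per utterance by exhaustive search (toy sizes only).

    visual_scores[i][s]: hypothetical visual evidence that speaker s says turn i.
    same_speaker_prob[i][j]: hypothetical textual estimate that turns i and j
        share a speaker.
    alpha: hypothetical weight of the textual term relative to the visual term.
    """
    n_turns = len(visual_scores)
    speakers = range(len(visual_scores[0]))
    best, best_score = None, float("-inf")
    # Enumerate every possible assignment of speakers to turns.
    for assignment in product(speakers, repeat=n_turns):
        # Visual term: reward picking the speaker the frames point to.
        score = sum(visual_scores[i][assignment[i]] for i in range(n_turns))
        # Textual term: reward agreeing with the same-speaker estimates.
        for i in range(n_turns):
            for j in range(i + 1, n_turns):
                p = same_speaker_prob[i][j]
                same = assignment[i] == assignment[j]
                score += alpha * (p if same else 1.0 - p)
        if score > best_score:
            best, best_score = assignment, score
    return list(best)


if __name__ == "__main__":
    # Turn 1 is visually ambiguous; the textual term ties it to turn 0.
    visual = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
    same = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]
    print(identify_speakers(visual, same))
```

The search is exponential in the number of turns, so it only works for very short dialogues. It is meant to show how textual context can resolve a visually ambiguous turn, not to be a practical solver.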
Related papers
- Multi-Turn Multi-Modal Question Clarification for Enhanced Conversational Understanding [11.004677535859342]
We introduce the Multi-turn Multi-modal Clarifying Questions (MMCQ) task.
MMCQ combines text and visual modalities to refine user queries in a multi-turn conversation.
We show that multi-turn multi-modal clarification outperforms uni-modal and single-turn approaches, improving MRR by 12.88%.
arXiv Detail & Related papers (2025-02-17T04:58:14Z)
- Data-Centric Improvements for Enhancing Multi-Modal Understanding in Spoken Conversation Modeling [13.628984890958314]
We introduce a data-centric customization approach for efficiently enhancing multimodal understanding in conversational speech modeling.
Our approach achieves state-of-the-art performance on the Spoken-SQuAD benchmark, using only 10% of the training data with open-weight models.
We also introduce ASK-QA, the first dataset for multi-turn spoken dialogue with ambiguous user requests and dynamic evaluation inputs.
arXiv Detail & Related papers (2024-12-20T15:43:09Z)
- Let's Go Real Talk: Spoken Dialogue Model for Face-to-Face Conversation [55.043492250775294]
We introduce a novel Face-to-Face spoken dialogue model.
It processes audio-visual speech from user input and generates audio-visual speech as the response.
We also introduce MultiDialog, the first large-scale multimodal spoken dialogue corpus.
arXiv Detail & Related papers (2024-06-12T04:48:36Z)
- Conversational Speech Recognition by Learning Audio-textual Cross-modal Contextual Representation [27.926862030684926]
We introduce a novel conversational ASR system, extending the Conformer encoder-decoder model with cross-modal conversational representation.
Our approach combines pre-trained speech and text models through a specialized encoder and a modal-level mask input.
By introducing both cross-modal and conversational representations into the decoder, our model retains context over longer sentences without information loss.
arXiv Detail & Related papers (2023-10-22T11:57:33Z)
- Coreference-aware Double-channel Attention Network for Multi-party Dialogue Reading Comprehension [7.353227696624305]
We tackle Multi-party Dialogue Reading Comprehension (MDRC).
MDRC is an extractive reading comprehension task grounded in a batch of dialogues among multiple interlocutors.
We propose a coreference-aware attention modeling method to strengthen the reasoning ability.
arXiv Detail & Related papers (2023-05-15T05:01:29Z)
- TikTalk: A Video-Based Dialogue Dataset for Multi-Modal Chitchat in Real World [97.58623810402563]
We introduce a new video-based multi-modal dialogue dataset, called TikTalk.
We collect 38K videos from a popular video-sharing platform, along with 367K conversations posted by users beneath them.
Users engage in spontaneous conversations based on their multi-modal experiences from watching videos, which helps recreate real-world chitchat context.
arXiv Detail & Related papers (2023-01-14T10:18:22Z)
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenges of this task lie in (1) the difficulty of integrating video data into pre-trained language models (PLMs)
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
- Dialogue History Matters! Personalized Response Selection in Multi-turn Retrieval-based Chatbots [62.295373408415365]
We propose a personalized hybrid matching network (PHMN) for context-response matching.
Our contributions are two-fold: 1) our model extracts personalized wording behaviors from user-specific dialogue history as extra matching information.
We evaluate our model on two large datasets with user identification, i.e., the personalized Ubuntu dialogue corpus (P-Ubuntu) and the personalized Weibo dataset (P-Weibo).
arXiv Detail & Related papers (2021-03-17T09:42:11Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.