MPCHAT: Towards Multimodal Persona-Grounded Conversation
- URL: http://arxiv.org/abs/2305.17388v1
- Date: Sat, 27 May 2023 06:46:42 GMT
- Title: MPCHAT: Towards Multimodal Persona-Grounded Conversation
- Authors: Jaewoo Ahn, Yeda Song, Sangdoo Yun, Gunhee Kim
- Abstract summary: We extend persona-based dialogue to the multimodal domain and make two main contributions.
First, we present the first multimodal persona-based dialogue dataset named MPCHAT.
Second, we empirically show that incorporating multimodal persona, as measured by three proposed multimodal persona-grounded dialogue tasks, leads to statistically significant performance improvements.
- Score: 54.800425322314105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order to build self-consistent personalized dialogue agents, previous
research has mostly focused on textual persona that delivers personal facts or
personalities. However, to fully describe the multi-faceted nature of persona,
image modality can help better reveal the speaker's personal characteristics
and experiences in episodic memory (Rubin et al., 2003; Conway, 2009). In this
work, we extend persona-based dialogue to the multimodal domain and make two
main contributions. First, we present the first multimodal persona-based
dialogue dataset named MPCHAT, which extends persona with both text and images
to contain episodic memories. Second, we empirically show that incorporating
multimodal persona, as measured by three proposed multimodal persona-grounded
dialogue tasks (i.e., next response prediction, grounding persona prediction,
and speaker identification), leads to statistically significant performance
improvements across all tasks. Thus, our work highlights that multimodal
persona is crucial for improving multimodal dialogue comprehension, and our
MPCHAT serves as a high-quality resource for this research.
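As a rough illustration, the three evaluation tasks can be framed as ranking candidates (responses, persona elements, or speakers) given a dialogue context and a multimodal persona. Below is a minimal PyTorch sketch of the next-response-prediction variant, assuming pre-computed CLIP-style embeddings as inputs; the `MultimodalPersonaRanker` module, its dimensions, and the mean-pooling fusion are illustrative assumptions, not the authors' actual model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalPersonaRanker(nn.Module):
    """Schematic retrieval model: score candidate responses against a
    dialogue context fused with a multimodal (text + image) persona.
    Inputs are assumed to be pre-computed embeddings from CLIP-style
    text/image encoders; the fusion below is purely illustrative."""

    def __init__(self, text_dim=512, image_dim=512, hidden=512):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden)
        self.image_proj = nn.Linear(image_dim, hidden)
        # fuse dialogue context + persona text + persona images into one query
        self.fuse = nn.Linear(3 * hidden, hidden)

    def forward(self, context_emb, persona_text_emb, persona_image_emb, response_embs):
        # context_emb: (B, D), persona_*: (B, P, D), response_embs: (B, C, D)
        ctx = self.text_proj(context_emb)                          # (B, H)
        p_txt = self.text_proj(persona_text_emb).mean(dim=1)       # pool persona sentences
        p_img = self.image_proj(persona_image_emb).mean(dim=1)     # pool persona images
        query = self.fuse(torch.cat([ctx, p_txt, p_img], dim=-1))  # (B, H)
        cand = self.text_proj(response_embs)                       # (B, C, H)
        # cosine similarity between the fused query and each candidate response
        return F.cosine_similarity(query.unsqueeze(1), cand, dim=-1)  # (B, C)

# Toy usage: 2 dialogues, 3 persona sentences/images each, 4 candidate responses.
B, P, C, D = 2, 3, 4, 512
ranker = MultimodalPersonaRanker()
scores = ranker(torch.randn(B, D), torch.randn(B, P, D),
                torch.randn(B, P, D), torch.randn(B, C, D))
print(scores.argmax(dim=-1))  # index of the top-ranked response per dialogue
```

In the same spirit, grounding persona prediction and speaker identification could reuse a fused query of this kind to rank persona elements or speakers instead of responses.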
Related papers
- M3TCM: Multi-modal Multi-task Context Model for Utterance Classification in Motivational Interviews [1.8100046713740954]
We present M3TCM, a Multi-modal, Multi-task Context Model for utterance classification.
Our approach is the first to employ multi-task learning to model both the joint and individual components of therapist and client behaviour.
With this novel approach, we outperform the state of the art for utterance classification on the recently introduced AnnoMI dataset, with relative improvements of 20% for client and 15% for therapist utterance classification.
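A minimal sketch of this joint-plus-individual idea, assuming utterance embeddings from some upstream (possibly multimodal) encoder: a shared trunk captures the joint component while two role-specific heads capture therapist and client behaviour. The module names, sizes, and label counts below are placeholders, not those of M3TCM.

```python
import torch
import torch.nn as nn

class MultiTaskUtteranceClassifier(nn.Module):
    """Shared trunk (the joint component) plus role-specific heads
    (the individual components) for therapist and client utterances.
    Sizes and label counts are placeholders, not those of M3TCM."""

    def __init__(self, input_dim=768, hidden=256, n_therapist_labels=4, n_client_labels=3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(input_dim, hidden), nn.ReLU())
        self.therapist_head = nn.Linear(hidden, n_therapist_labels)
        self.client_head = nn.Linear(hidden, n_client_labels)

    def forward(self, utterance_emb):
        h = self.shared(utterance_emb)
        return self.therapist_head(h), self.client_head(h)

# Multi-task training step: each utterance contributes the loss of the head
# matching its speaker role, and both heads backpropagate through the shared trunk.
model = MultiTaskUtteranceClassifier()
emb = torch.randn(8, 768)                        # batch of utterance embeddings
is_therapist = torch.randint(0, 2, (8,)).bool()  # True = therapist turn, False = client turn
labels_t = torch.randint(0, 4, (8,))             # dummy therapist behaviour codes
labels_c = torch.randint(0, 3, (8,))             # dummy client behaviour codes
logits_t, logits_c = model(emb)
ce = nn.CrossEntropyLoss(reduction="none")
loss = torch.where(is_therapist, ce(logits_t, labels_t), ce(logits_c, labels_c)).mean()
loss.backward()
```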
arXiv Detail & Related papers (2024-04-04T09:17:22Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to the multi-modal representation space.
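A schematic of this style of parameter-efficient prompt tuning is sketched below: only the context generator, the soft prompts it produces, and the expert heads are trained, while the backbone encoder stays frozen. The backbone here is a small randomly initialized Transformer standing in for the CLIP text tower, and every name and size is an assumption rather than DialCLIP's actual configuration.

```python
import torch
import torch.nn as nn

class PromptTunedRetriever(nn.Module):
    """Schematic prompt tuning for dialog retrieval: only the context
    generator, the soft prompts it emits, and the expert heads are
    trainable; the backbone encoder stays frozen. The backbone is a
    placeholder Transformer, not the CLIP text tower used by DialCLIP."""

    def __init__(self, vocab=30522, dim=512, n_prompts=8, n_experts=2):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        for p in list(self.tok_emb.parameters()) + list(self.backbone.parameters()):
            p.requires_grad = False                          # keep the backbone frozen

        # Trainable pieces: context generator -> soft prompts, plus expert heads.
        self.context_gen = nn.GRU(dim, dim, batch_first=True)
        self.to_prompts = nn.Linear(dim, n_prompts * dim)
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_experts))
        self.n_prompts, self.dim = n_prompts, dim

    def encode(self, token_ids, expert_id):
        x = self.tok_emb(token_ids)                          # (B, T, D)
        _, ctx = self.context_gen(x)                         # (1, B, D) context summary
        prompts = self.to_prompts(ctx[-1]).view(-1, self.n_prompts, self.dim)
        h = self.backbone(torch.cat([prompts, x], dim=1))    # prepend soft prompts
        pooled = h[:, : self.n_prompts].mean(dim=1)          # pool over prompt positions
        return self.experts[expert_id](pooled)               # map into a retrieval space

model = PromptTunedRetriever()
ctx_vec = model.encode(torch.randint(0, 30522, (2, 16)), expert_id=0)  # dialog contexts
rsp_vec = model.encode(torch.randint(0, 30522, (2, 16)), expert_id=1)  # candidate responses
scores = ctx_vec @ rsp_vec.t()                                         # retrieval similarities
```

Only the prompt-side parameters receive gradient updates, which is what makes the approach parameter-efficient compared with full fine-tuning.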
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- Enhancing Personalized Dialogue Generation with Contrastive Latent Variables: Combining Sparse and Dense Persona [16.90863217077699]
Existing personalized dialogue agents model persona profiles from three types of resources: sparse persona descriptions, dense persona descriptions, and dialogue histories.
We combine the advantages of the three resources to obtain a richer and more accurate persona.
Experimental results on Chinese and English datasets demonstrate our model's superiority in personalization.
arXiv Detail & Related papers (2023-05-19T07:24:27Z)
- Speaker Profiling in Multiparty Conversations [31.518453682472575]
This research paper explores the task of Speaker Profiling in Conversations (SPC).
The primary objective of SPC is to produce a summary of persona characteristics for each individual speaker present in a dialogue.
To address the task of SPC, we have curated a new dataset named SPICE, which comes with specific labels.
arXiv Detail & Related papers (2023-04-18T08:04:46Z)
- M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database [139.08528216461502]
We propose a Multi-modal Multi-scene Multi-label Emotional Dialogue dataset, M3ED.
M3ED contains 990 dyadic emotional dialogues from 56 different TV series, a total of 9,082 turns and 24,449 utterances.
To the best of our knowledge, M3ED is the first multimodal emotional dialogue dataset in Chinese.
arXiv Detail & Related papers (2022-05-09T06:52:51Z)
- MPC-BERT: A Pre-Trained Language Model for Multi-Party Conversation Understanding [58.95156916558384]
We present MPC-BERT, a pre-trained model for MPC understanding.
We evaluate MPC-BERT on three downstream tasks including addressee recognition, speaker identification and response selection.
arXiv Detail & Related papers (2021-06-03T01:49:12Z)
- Dialogue History Matters! Personalized Response Selection in Multi-turn Retrieval-based Chatbots [62.295373408415365]
We propose a personalized hybrid matching network (PHMN) for context-response matching.
Among its contributions, our model extracts personalized wording behaviors from user-specific dialogue history as extra matching information.
We evaluate our model on two large datasets with user identification, i.e., the personalized Ubuntu dialogue corpus (P-Ubuntu) and the personalized Weibo dataset (P-Weibo).
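As a toy illustration of "personalized wording behaviors" used as extra matching information, the snippet below scores a candidate response by how much it reuses tokens the user favored in past utterances. PHMN learns this signal with attention over the dialogue history; the simple counting here is only an illustrative stand-in.

```python
from collections import Counter

def wording_overlap_score(user_history, candidate_response):
    """Toy personalization signal: weight candidate-response tokens by how
    often this user has produced them in past utterances. PHMN learns a
    comparable signal with attention; the counting here is only illustrative."""
    freq = Counter(tok for utt in user_history for tok in utt.lower().split())
    total = sum(freq.values()) or 1
    cand = candidate_response.lower().split()
    return sum(freq[tok] / total for tok in cand) / max(len(cand), 1)

history = ["gotta grab a coffee first", "gotta run, talk later"]
print(wording_overlap_score(history, "gotta check that out"))        # favors habitual wording
print(wording_overlap_score(history, "perhaps we should confer"))    # near zero overlap
```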
arXiv Detail & Related papers (2021-03-17T09:42:11Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent the dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
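One simple way to make representations utterance-aware and speaker-aware is to add turn-index and speaker-role embeddings to the token embeddings before they enter the encoder, as sketched below. This is a generic illustration of the idea, not the paper's exact decoupling or masking mechanism.

```python
import torch
import torch.nn as nn

class SpeakerAwareEmbedding(nn.Module):
    """Add utterance-index and speaker-role embeddings to token embeddings so
    a downstream encoder can tell who said what and in which turn. A generic
    sketch of speaker/utterance awareness, not the paper's exact mechanism."""

    def __init__(self, vocab=30522, dim=256, max_turns=32, n_roles=2):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.turn = nn.Embedding(max_turns, dim)   # which utterance a token belongs to
        self.role = nn.Embedding(n_roles, dim)     # which speaker produced it

    def forward(self, token_ids, turn_ids, role_ids):
        return self.tok(token_ids) + self.turn(turn_ids) + self.role(role_ids)

emb = SpeakerAwareEmbedding()
tokens = torch.randint(0, 30522, (1, 10))
turns = torch.tensor([[0, 0, 0, 1, 1, 1, 2, 2, 2, 2]])  # token -> utterance index
roles = torch.tensor([[0, 0, 0, 1, 1, 1, 0, 0, 0, 0]])  # token -> speaker role
x = emb(tokens, turns, roles)                            # (1, 10, 256) speaker-aware inputs
```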
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Detecting depression in dyadic conversations with multimodal narratives and visualizations [1.4824891788575418]
In this paper, we develop a system that supports humans in analyzing conversations.
We demonstrate the ability of our system to take in a wide range of multimodal information and automatically generate a prediction score for the depression state of the individual.
arXiv Detail & Related papers (2020-01-13T10:47:13Z)