MPCHAT: Towards Multimodal Persona-Grounded Conversation
- URL: http://arxiv.org/abs/2305.17388v1
- Date: Sat, 27 May 2023 06:46:42 GMT
- Title: MPCHAT: Towards Multimodal Persona-Grounded Conversation
- Authors: Jaewoo Ahn, Yeda Song, Sangdoo Yun, Gunhee Kim
- Abstract summary: We extend persona-based dialogue to the multimodal domain and make two main contributions.
First, we present the first multimodal persona-based dialogue dataset named MPCHAT.
Second, we empirically show that incorporating multimodal persona, as measured by three proposed multimodal persona-grounded dialogue tasks, leads to statistically significant performance improvements.
- Score: 54.800425322314105
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In order to build self-consistent personalized dialogue agents, previous
research has mostly focused on textual persona that delivers personal facts or
personalities. However, to fully describe the multi-faceted nature of persona,
image modality can help better reveal the speaker's personal characteristics
and experiences in episodic memory (Rubin et al., 2003; Conway, 2009). In this
work, we extend persona-based dialogue to the multimodal domain and make two
main contributions. First, we present the first multimodal persona-based
dialogue dataset named MPCHAT, which extends persona with both text and images
to contain episodic memories. Second, we empirically show that incorporating
multimodal persona, as measured by three proposed multimodal persona-grounded
dialogue tasks (i.e., next response prediction, grounding persona prediction,
and speaker identification), leads to statistically significant performance
improvements across all tasks. Thus, our work highlights that multimodal
persona is crucial for improving multimodal dialogue comprehension, and our
MPCHAT serves as a high-quality resource for this research.
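The three evaluation tasks above (next response prediction, grounding persona prediction, speaker identification) are all retrieval-style tasks: given a dialogue context plus a multimodal persona, a model scores candidates and picks the best one. A minimal sketch of that candidate-scoring setup, using toy vectors and a simple averaging fusion as a stand-in for the paper's actual multimodal encoder (all function names here are hypothetical):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def rank_candidates(context_vec, persona_vecs, candidate_vecs):
    """Score each candidate against the dialogue context fused with the
    (multimodal) persona entries, returning indices sorted best-first.
    Fusion here is an element-wise average, a placeholder for a learned
    multimodal encoder."""
    fused = list(context_vec)
    for p in persona_vecs:
        fused = [f + x for f, x in zip(fused, p)]
    fused = [f / (1 + len(persona_vecs)) for f in fused]
    scores = [cosine(fused, c) for c in candidate_vecs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Toy 3-d embeddings: candidate 1 aligns with both context and persona.
ctx = [1.0, 0.0, 0.0]
persona = [[0.0, 1.0, 0.0]]          # e.g. one persona image-text pair
cands = [[0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
print(rank_candidates(ctx, persona, cands))  # -> [1, 0]
```

The same scoring loop covers all three tasks; only what plays the role of "candidate" changes (responses, persona entries, or speakers).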
Related papers
- Post Persona Alignment for Multi-Session Dialogue Generation [25.115319934091282]
Post Persona Alignment (PPA) is a novel two-stage framework for personalized dialogue generation. It first generates a general response based solely on dialogue context, then retrieves relevant persona memories using the response as a query, and finally refines the response to align with the speaker's persona. Experiments on multi-session LLM-generated dialogue data demonstrate that PPA significantly outperforms prior approaches in consistency, diversity, and persona relevance.
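The PPA pipeline described above has three steps: draft a context-only response, retrieve persona memories using that draft as the query, then refine the draft against what was retrieved. A minimal sketch of that flow, with stubbed generation functions and keyword-overlap retrieval in place of real LLM calls (all names here are hypothetical, not the paper's implementation):

```python
def draft_response(context):
    """Stage 1: context-only draft (stub for an LLM call)."""
    return "I usually relax by " + context.split()[-1]

def retrieve_memories(query, memories, k=1):
    """Stage 2: rank persona memories by word overlap with the draft,
    a stand-in for dense retrieval over a persona memory store."""
    def overlap(m):
        return len(set(query.lower().split()) & set(m.lower().split()))
    return sorted(memories, key=overlap, reverse=True)[:k]

def refine(draft, retrieved):
    """Stage 3: align the draft with the top retrieved persona memory
    (stub for a second LLM call)."""
    return draft + ", since " + retrieved[0]

def ppa(context, persona_memories):
    d = draft_response(context)
    mems = retrieve_memories(d, persona_memories)
    return refine(d, mems)

memories = ["I love hiking on weekends",
            "I relax by reading mystery novels"]
print(ppa("how do you unwind after reading", memories))
```

The point of the ordering is that retrieval is conditioned on what the model actually wants to say, not on the raw context alone.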
arXiv Detail & Related papers (2025-06-13T15:04:01Z)
- When Harry Meets Superman: The Role of The Interlocutor in Persona-Based Dialogue Generation [18.650805984660707]
Endowing dialogue agents with persona information has proven to significantly improve the consistency and diversity of their generations. While much focus has been placed on aligning dialogues with provided personas, the adaptation to the interlocutor's profile remains largely underexplored. We investigate three key aspects: (1) a model's ability to align responses with both the provided persona and the interlocutor's; (2) its robustness when dealing with familiar versus unfamiliar interlocutors and topics; and (3) the impact of additional fine-tuning on specific persona-based dialogues.
arXiv Detail & Related papers (2025-05-30T14:04:30Z)
- Multimodal Conversation Structure Understanding [12.29827265137757]
Large language models' ability to understand fine-grained conversational structure remains underexplored. We present a human-annotated dataset with 4,398 annotations for speakers and reply-to relationships, 5,755 addressees, and 3,142 side-participants. We evaluate popular audio-visual LLMs and vision-language models on our dataset, and our experimental results suggest that multimodal conversational structure understanding remains challenging.
arXiv Detail & Related papers (2025-05-23T06:41:54Z)
- M3TCM: Multi-modal Multi-task Context Model for Utterance Classification in Motivational Interviews [1.8100046713740954]
We present M3TCM, a Multi-modal, Multi-task Context Model for utterance classification.
Our approach for the first time employs multi-task learning to effectively model both joint and individual components of therapist and client behaviour.
With our novel approach, we outperform the state of the art for utterance classification on the recently introduced AnnoMI dataset, with a relative improvement of 20% for client and 15% for therapist utterance classification.
arXiv Detail & Related papers (2024-04-04T09:17:22Z)
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
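Prompt tuning of the sort DialCLIP describes prepends a small number of context-derived vectors to the frozen model's input sequence, so that only the prompt generator is trained. A minimal stdlib sketch of that mechanic, with a trivial chunk-averaging generator standing in for the learned one (hypothetical names and shapes, not the actual DialCLIP code):

```python
def context_generator(context_vecs, n_prompts=2):
    """Distill dialogue-context features into n_prompts prompt vectors.
    Here each prompt is the mean of one contiguous chunk of the context,
    a stand-in for a trained generator network."""
    chunk = max(1, len(context_vecs) // n_prompts)
    prompts = []
    for i in range(0, len(context_vecs), chunk):
        group = context_vecs[i:i + chunk]
        dim = len(group[0])
        prompts.append([sum(v[d] for v in group) / len(group)
                        for d in range(dim)])
    return prompts[:n_prompts]

def prepend_prompts(prompts, token_embeddings):
    """The frozen model consumes [prompts] + [original token embeddings];
    only the prompt-generator parameters would receive gradients."""
    return prompts + token_embeddings

# Toy 2-d features: four context vectors distilled into two prompts.
ctx = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
tokens = [[0.5, 0.5]]
seq = prepend_prompts(context_generator(ctx), tokens)
print(seq)  # -> [[2.0, 0.0], [0.0, 3.0], [0.5, 0.5]]
```

This keeps the pre-trained vision-language backbone untouched, which is what makes the approach parameter-efficient.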
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- Enhancing Personalized Dialogue Generation with Contrastive Latent Variables: Combining Sparse and Dense Persona [16.90863217077699]
Existing personalized dialogue agents model persona profiles from three resources: sparse or dense persona descriptions and dialogue histories.
We combine the advantages of the three resources to obtain a richer and more accurate persona.
Experimental results on Chinese and English datasets demonstrate our model's superiority in personalization.
arXiv Detail & Related papers (2023-05-19T07:24:27Z) - Speaker Profiling in Multiparty Conversations [31.518453682472575]
This research paper explores the task of Speaker Profiling in Conversations (SPC).
The primary objective of SPC is to produce a summary of persona characteristics for each individual speaker present in a dialogue.
To address the task of SPC, we have curated a new dataset named SPICE, which comes with specific labels.
arXiv Detail & Related papers (2023-04-18T08:04:46Z) - M3ED: Multi-modal Multi-scene Multi-label Emotional Dialogue Database [139.08528216461502]
We propose a Multi-modal Multi-scene Multi-label Emotional Dialogue dataset, M3ED.
M3ED contains 990 dyadic emotional dialogues from 56 different TV series, a total of 9,082 turns and 24,449 utterances.
To the best of our knowledge, M3ED is the first multimodal emotional dialogue dataset in Chinese.
arXiv Detail & Related papers (2022-05-09T06:52:51Z) - MPC-BERT: A Pre-Trained Language Model for Multi-Party Conversation
Understanding [58.95156916558384]
We present MPC-BERT, a pre-trained model for MPC understanding.
We evaluate MPC-BERT on three downstream tasks including addressee recognition, speaker identification and response selection.
arXiv Detail & Related papers (2021-06-03T01:49:12Z) - Dialogue History Matters! Personalized Response Selectionin Multi-turn
Retrieval-based Chatbots [62.295373408415365]
We propose a personalized hybrid matching network (PHMN) for context-response matching.
Our contributions are two-fold: 1) our model extracts personalized wording behaviors from user-specific dialogue history as extra matching information.
We evaluate our model on two large datasets with user identification, i.e., the personalized Ubuntu dialogue corpus (P-Ubuntu) and the personalized Weibo dataset (P-Weibo).
arXiv Detail & Related papers (2021-03-17T09:42:11Z) - Filling the Gap of Utterance-aware and Speaker-aware Representation for
Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In existing retrieval-based multi-turn dialogue modeling, pre-trained language models (PrLMs) used as encoders represent dialogues only coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z) - Detecting depression in dyadic conversations with multimodal narratives
and visualizations [1.4824891788575418]
In this paper, we develop a system that supports humans to analyze conversations.
We demonstrate the ability of our system to take in a wide range of multimodal information and automatically generate a prediction score for the depression state of the individual.
arXiv Detail & Related papers (2020-01-13T10:47:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.