On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval
- URL: http://arxiv.org/abs/2506.11499v1
- Date: Fri, 13 Jun 2025 06:50:02 GMT
- Title: On the Effectiveness of Integration Methods for Multimodal Dialogue Response Retrieval
- Authors: Seongbo Jang, Seonghyeon Lee, Dongha Lee, Hwanjo Yu
- Abstract summary: This work explores how a dialogue system can output responses in various modalities such as text and image. We propose three integration methods based on a two-step approach and an end-to-end approach.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multimodal chatbots have become one of the major topics for dialogue systems in both the research community and industry. Recently, researchers have shed light on the multimodality of responses as well as dialogue contexts. This work explores how a dialogue system can output responses in various modalities such as text and image. To this end, we first formulate a multimodal dialogue response retrieval task for retrieval-based systems as the combination of three subtasks. We then propose three integration methods based on a two-step approach and an end-to-end approach, and compare the merits and demerits of each method. Experimental results on two datasets demonstrate that the end-to-end approach achieves performance comparable to the two-step approach without requiring its intermediate step. In addition, a parameter sharing strategy not only reduces the number of parameters but also boosts performance by transferring knowledge across the subtasks and the modalities.
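For concreteness, below is a minimal PyTorch sketch of one way an end-to-end retriever with parameter sharing across modalities might be organized; the module names, feature dimensions, and dot-product scoring are illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch (not the paper's architecture) of an end-to-end multimodal
# response retriever: a shared projection scores both text and image response
# candidates against the dialogue context in one embedding space.
import torch
import torch.nn as nn


class EndToEndMultimodalRetriever(nn.Module):
    def __init__(self, ctx_dim=768, txt_dim=768, img_dim=512, shared_dim=256):
        super().__init__()
        # Shared projection head: one example of a parameter-sharing strategy
        # across modalities (hypothetical, for illustration only).
        self.shared_head = nn.Linear(shared_dim, shared_dim)
        self.ctx_proj = nn.Sequential(nn.Linear(ctx_dim, shared_dim), self.shared_head)
        self.txt_proj = nn.Sequential(nn.Linear(txt_dim, shared_dim), self.shared_head)
        self.img_proj = nn.Sequential(nn.Linear(img_dim, shared_dim), self.shared_head)

    def score(self, ctx_feat, cand_feat, modality):
        """Dot-product relevance between a dialogue context and candidates."""
        ctx = self.ctx_proj(ctx_feat)
        cand = self.txt_proj(cand_feat) if modality == "text" else self.img_proj(cand_feat)
        return (ctx * cand).sum(-1)


# Toy usage with random pre-computed features.
model = EndToEndMultimodalRetriever()
ctx = torch.randn(1, 768)          # encoded dialogue context
text_cands = torch.randn(5, 768)   # encoded text response candidates
image_cands = torch.randn(5, 512)  # encoded image response candidates
text_scores = model.score(ctx, text_cands, "text")
image_scores = model.score(ctx, image_cands, "image")
# Ranking text and image candidates jointly removes the explicit
# modality-prediction step that a two-step pipeline would need.
print(text_scores.shape, image_scores.shape)
```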
Related papers
- DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever [83.33209603041013]
We propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval.
Our approach introduces a multi-modal context generator to learn context features which are distilled into prompts within the pre-trained vision-language model CLIP.
To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to multi-modal representation space.
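A hedged sketch of the general idea, a frozen CLIP backbone with small trainable expert heads that map its features into a shared retrieval space, is given below. The actual method additionally tunes prompts inside CLIP via a multi-modal context generator; every name and hyperparameter here is an assumption for illustration.

```python
# Sketch only: frozen CLIP plus trainable "expert" heads for dialog retrieval.
import torch
import torch.nn as nn
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
for p in clip.parameters():
    p.requires_grad = False  # CLIP stays frozen; only the experts are trained


class Expert(nn.Module):
    """Small trainable head mapping CLIP features to the retrieval space."""
    def __init__(self, in_dim=512, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, x):
        return nn.functional.normalize(self.net(x), dim=-1)


context_expert = Expert()  # maps the encoded dialogue context
text_expert = Expert()     # maps text response candidates
# an image expert over clip.get_image_features(...) would handle image candidates

inputs = processor(text=["do you have this shirt in blue?",
                         "yes, here is the blue version"],
                   return_tensors="pt", padding=True)
feats = clip.get_text_features(**inputs)  # (2, 512) CLIP text features
ctx, cand = context_expert(feats[:1]), text_expert(feats[1:])
score = (ctx * cand).sum(-1)              # cosine-style relevance score
print(score)
```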
arXiv Detail & Related papers (2024-01-02T07:40:12Z)
- Self-Explanation Prompting Improves Dialogue Understanding in Large Language Models [52.24756457516834]
We propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of Large Language Models (LLMs).
This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks.
Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts.
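As a rough illustration of the strategy (not the paper's exact wording), a self-explanation prompt could be assembled like this:

```python
# Illustrative prompt template in the spirit of "Self-Explanation" prompting:
# the model first explains each utterance, then performs the downstream task.
def build_self_explanation_prompt(dialogue_turns, task_instruction):
    turns = "\n".join(f"{speaker}: {text}" for speaker, text in dialogue_turns)
    return (
        "Dialogue:\n"
        f"{turns}\n\n"
        "First, explain what each utterance means and what the speaker intends.\n"
        f"Then, {task_instruction}"
    )


prompt = build_self_explanation_prompt(
    [("User", "I need to change my flight."),
     ("Agent", "Sure, what is your booking reference?")],
    task_instruction="identify the user's intent and any required slot values.",
)
print(prompt)
```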
arXiv Detail & Related papers (2023-09-22T15:41:34Z)
- FCC: Fusing Conversation History and Candidate Provenance for Contextual Response Ranking in Dialogue Systems [53.89014188309486]
We present a flexible neural framework that can integrate contextual information from multiple channels.
We evaluate our model on the MSDialog dataset widely used for evaluating conversational response ranking tasks.
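A minimal sketch of multi-channel fusion for response ranking, under the simplifying assumption of concatenating pre-encoded channels (the authors' actual fusion mechanism may differ), might look like:

```python
# Sketch: fuse two context channels (conversation history and candidate
# provenance) with the candidate representation before scoring.
import torch
import torch.nn as nn


class ChannelFusionRanker(nn.Module):
    def __init__(self, hist_dim=768, prov_dim=768, cand_dim=768, hidden=256):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(hist_dim + prov_dim + cand_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # single relevance logit per candidate
        )

    def forward(self, history, provenance, candidate):
        return self.fuse(torch.cat([history, provenance, candidate], dim=-1)).squeeze(-1)


ranker = ChannelFusionRanker()
history = torch.randn(4, 768)     # encoded conversation history, repeated per candidate
provenance = torch.randn(4, 768)  # encoded provenance of each candidate (e.g. its source thread)
candidates = torch.randn(4, 768)  # encoded response candidates
print(ranker(history, provenance, candidates))  # one score per candidate
```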
arXiv Detail & Related papers (2023-03-31T23:58:28Z)
- Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenges of this task include the difficulty of integrating video data into pre-trained language models (PLMs).
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
- DialogUSR: Complex Dialogue Utterance Splitting and Reformulation for Multiple Intent Detection [27.787807111516706]
Instead of training a dedicated multi-intent detection model, we propose DialogUSR.
DialogUSR splits a multi-intent user query into several single-intent sub-queries.
It then recovers all the coreferred and omitted information in the sub-queries.
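An invented input/output example of this split-then-reformulate behavior (the text is not taken from the DialogUSR dataset) is sketched below:

```python
# Illustrative example of query splitting and reformulation.
multi_intent_query = "Book me a flight to Tokyo next Friday and also tell me the weather there"

# Step 1: split into single-intent sub-queries.
sub_queries = [
    "Book me a flight to Tokyo next Friday",
    "tell me the weather there",
]

# Step 2: recover coreferred / omitted information
# ("there" resolved to "in Tokyo", the omitted date carried over).
reformulated = [
    "Book me a flight to Tokyo next Friday",
    "Tell me the weather in Tokyo next Friday",
]

for q in reformulated:
    print(q)  # each can now be handled by an ordinary single-intent pipeline
```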
arXiv Detail & Related papers (2022-10-20T13:56:35Z)
- Duplex Conversation: Towards Human-like Interaction in Spoken Dialogue System [120.70726465994781]
A multimodal spoken dialogue system enables telephone-based agents to interact with customers like a human.
We deploy Duplex Conversation in Alibaba intelligent customer service and share lessons learned in production.
Online A/B experiments show that the proposed system can significantly reduce response latency by 50%.
arXiv Detail & Related papers (2022-05-30T12:41:23Z)
- Two-Level Supervised Contrastive Learning for Response Selection in Multi-Turn Dialogue [18.668723854662584]
This paper applies contrastive learning to response selection by using the supervised contrastive loss.
We develop a new method for supervised contrastive learning, referred to as two-level supervised contrastive learning.
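For reference, the standard supervised contrastive loss that such a method builds on can be written as follows; the two-level extension itself is not reproduced here, and the shapes and temperature are illustrative assumptions.

```python
# Standard supervised contrastive loss (Khosla et al., 2020), sketch only.
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """embeddings: (N, d) vectors; labels: (N,) class ids (positives share a label)."""
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.t() / temperature                  # pairwise cosine similarities
    self_mask = torch.eye(len(z), dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, -1e9)         # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_counts = pos_mask.sum(1).clamp(min=1)
    # Average log-probability of each anchor's positives, then average over anchors.
    loss = -(log_prob.masked_fill(~pos_mask, 0.0)).sum(1) / pos_counts
    return loss[pos_mask.any(1)].mean()


# Toy usage: a dialogue context and its correct response share a label.
emb = torch.randn(8, 64)
lbl = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])
print(supervised_contrastive_loss(emb, lbl))
```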
arXiv Detail & Related papers (2022-03-01T23:43:36Z)
- Retrieve & Memorize: Dialog Policy Learning with Multi-Action Memory [13.469140432108151]
We propose a retrieve-and-memorize framework to enhance the learning of system actions.
We use a memory-augmented multi-decoder network to generate the system actions conditioned on the candidate actions.
Our method achieves competitive performance among several state-of-the-art models in the context-to-response generation task.
arXiv Detail & Related papers (2021-06-04T07:53:56Z)
- Dialogue-Based Relation Extraction [53.2896545819799]
We present the first human-annotated dialogue-based relation extraction (RE) dataset DialogRE.
We argue that speaker-related information plays a critical role in the proposed task, based on an analysis of similarities and differences between dialogue-based and traditional RE tasks.
Experimental results demonstrate that a speaker-aware extension on the best-performing model leads to gains in both the standard and conversational evaluation settings.
arXiv Detail & Related papers (2020-04-17T03:51:57Z)