Related papers: Modality-Balanced Models for Visual Dialogue

URL: http://arxiv.org/abs/2001.06354v1
Date: Fri, 17 Jan 2020 14:57:12 GMT
Title: Modality-Balanced Models for Visual Dialogue
Authors: Hyounghun Kim, Hao Tan, Mohit Bansal
Abstract summary: The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue. We show that previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history. We present methods for integrating an image-only model and an image-history joint model, via ensemble and consensus dropout fusion with shared parameters.
Score: 102.35406085738325
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue. However, via manual analysis, we find that a large number of conversational questions can be answered by only looking at the image without any access to the context history, while others still need the conversation context to predict the correct answers. We demonstrate that due to this reason, previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history (e.g., by extracting certain keywords or patterns in the context information), whereas image-only models are more generalizable (because they cannot memorize or extract keywords from history) and perform substantially better at the primary normalized discounted cumulative gain (NDCG) task metric which allows multiple correct answers. Hence, this observation encourages us to explicitly maintain two models, i.e., an image-only model and an image-history joint model, and combine their complementary abilities for a more balanced multimodal model. We present multiple methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters. Empirically, our models achieve strong results on the Visual Dialog challenge 2019 (rank 3 on NDCG and high balance across metrics), and substantially outperform the winner of the Visual Dialog challenge 2018 on most metrics.
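
As a rough illustration of the ideas in the abstract, the sketch below combines candidate-answer scores from two models by averaging their log-probabilities and then scores the resulting ranking with NDCG. It is not the authors' implementation: the function names, the equal-weight averaging, and the choice of k as the number of candidates with nonzero relevance are all assumptions made for this example.

```python
import math


def softmax(scores):
    """Convert raw candidate scores into probabilities (numerically stable)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ensemble_log_probs(image_only_scores, joint_scores):
    """Average the two models' log-probabilities per answer candidate.

    Equal weighting is an assumption of this sketch, not a detail
    taken from the paper.
    """
    p_img = softmax(image_only_scores)
    p_joint = softmax(joint_scores)
    return [0.5 * (math.log(a) + math.log(b)) for a, b in zip(p_img, p_joint)]


def ndcg(scores, relevance):
    """NDCG over the top-k ranked candidates.

    k is taken to be the number of candidates with nonzero relevance
    (an assumption of this sketch). Because several candidates can carry
    relevance, the metric allows multiple correct answers.
    """
    k = sum(1 for r in relevance if r > 0)
    if k == 0:
        return 0.0
    # Candidate indices sorted by descending model score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    dcg = sum(relevance[i] / math.log2(rank + 2)
              for rank, i in enumerate(ranked[:k]))
    ideal = sorted(relevance, reverse=True)[:k]
    idcg = sum(r / math.log2(rank + 2) for rank, r in enumerate(ideal))
    return dcg / idcg


# Hypothetical scores for four answer candidates from each model.
image_only = [2.0, 1.0, 0.1, 0.0]
joint = [0.5, 2.5, 0.2, 0.1]
relevance = [1.0, 0.5, 0.0, 0.0]

combined = ensemble_log_probs(image_only, joint)
print("NDCG of ensemble ranking:", ndcg(combined, relevance))
```

Averaging in log space means a candidate must be reasonably likely under both models to rank highly, which is one simple way to balance an image-only model against a history-aware one.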

Related papers

Vision-Speech Models: Teaching Speech Models to Converse about Images [67.62394024470528]
We introduce MoshiVis, augmenting a recent dialogue speech LLM, Moshi, with visual inputs through lightweight adaptation modules. An additional dynamic gating mechanism enables the model to more easily switch between the visual inputs and unrelated conversation topics. We evaluate the model on downstream visual understanding tasks with both audio and text prompts, and report qualitative samples of interactions with MoshiVis.
arXiv Detail & Related papers (2025-03-19T18:40:45Z)
BI-MDRG: Bridging Image History in Multimodal Dialogue Response Generation [21.052101309555464]
Multimodal Dialogue Response Generation (MDRG) is a recently proposed task where the model needs to generate responses in texts, images, or a blend of both. Previous work relies on the text modality as an intermediary step for both the image input and output of the model rather than adopting an end-to-end approach. We propose BI-MDRG that bridges the response generation path such that the image history information is utilized for enhanced relevance of text responses to the image content.
arXiv Detail & Related papers (2024-08-12T05:22:42Z)
Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models [60.81438804824749]
Multimodal instruction-following models extend capabilities by integrating both text and images. Existing models such as MiniGPT-4 and LLaVA face challenges in maintaining dialogue coherence in scenarios involving multiple images. We introduce SparklesDialogue, the first machine-generated dialogue dataset tailored for word-level interleaved multi-image and text interactions. We then present SparklesChat, a multimodal instruction-following model for open-ended dialogues across multiple images.
arXiv Detail & Related papers (2023-08-31T05:15:27Z)
IMAD: IMage-Augmented multi-modal Dialogue [0.043847653914745384]
This paper presents a novel perspective on multi-modal dialogue systems, which interprets the image in the context of the dialogue. We propose a two-stage approach to automatically construct a multi-modal dialogue dataset. In the first stage, we utilize text-to-image similarity and sentence similarity to identify which utterances could be replaced with an image. In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model.
arXiv Detail & Related papers (2023-05-17T18:38:10Z)
Collaborative Reasoning on Multi-Modal Semantic Graphs for Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video. The primary challenges of this task lie in (1) the difficulty of integrating video data into pre-trained language models (PLMs). We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z)
DialogZoo: Large-Scale Dialog-Oriented Task Learning [52.18193690394549]
We aim to build a unified foundation model which can solve massive diverse dialogue tasks. To achieve this goal, we first collect a large-scale well-labeled dialogue dataset from 73 publicly available datasets.
arXiv Detail & Related papers (2022-05-25T11:17:16Z)
On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs. Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z)
Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog [12.034554338597067]
We propose a novel model by Reasoning with Multi-structure Commonsense Knowledge (RMK). In our model, the external knowledge is represented with sentence-level facts and graph-level facts. On top of these multi-structure representations, our model can capture relevant knowledge and incorporate it into the vision and semantic features.
arXiv Detail & Related papers (2022-04-10T13:12:10Z)
Modeling Coreference Relations in Visual Dialog [18.926582410644375]
The occurrence of coreference relations in the dialog makes it a more challenging task than visual question answering. We propose two soft constraints that can improve the model's ability to resolve coreferences in dialog in an unsupervised way.
arXiv Detail & Related papers (2022-03-06T15:22:24Z)
Multimodal Dialogue Response Generation [27.611204319057393]
We present a multimodal dialogue generation model, which takes the dialogue history as input, then generates a textual sequence or an image as response. We consider multimodal dialogue generation under a natural assumption that only limited training examples are available. In such a low-resource setting, we devise a novel conversational agent, Divter, in order to isolate parameters that depend on multimodal dialogues from the entire model.
arXiv Detail & Related papers (2021-10-16T08:52:26Z)
Dialogue History Matters! Personalized Response Selection in Multi-turn Retrieval-based Chatbots [62.295373408415365]
We propose a personalized hybrid matching network (PHMN) for context-response matching. Our contributions are two-fold: 1) our model extracts personalized wording behaviors from user-specific dialogue history as extra matching information. We evaluate our model on two large datasets with user identification, i.e., the personalized Ubuntu dialogue corpus (P-Ubuntu) and the personalized Weibo dataset (P-Weibo).
arXiv Detail & Related papers (2021-03-17T09:42:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.