Modality-Balanced Models for Visual Dialogue
- URL: http://arxiv.org/abs/2001.06354v1
- Date: Fri, 17 Jan 2020 14:57:12 GMT
- Title: Modality-Balanced Models for Visual Dialogue
- Authors: Hyounghun Kim, Hao Tan, Mohit Bansal
- Abstract summary: The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.
We show that previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history.
We present methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters.
- Score: 102.35406085738325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Visual Dialog task requires a model to exploit both image and
conversational context information to generate the next response to the
dialogue. However, via manual analysis, we find that a large number of
conversational questions can be answered by only looking at the image without
any access to the context history, while others still need the conversation
context to predict the correct answers. We demonstrate that due to this reason,
previous joint-modality (history and image) models over-rely on and are more
prone to memorizing the dialogue history (e.g., by extracting certain keywords
or patterns in the context information), whereas image-only models are more
generalizable (because they cannot memorize or extract keywords from history)
and perform substantially better at the primary normalized discounted
cumulative gain (NDCG) task metric which allows multiple correct answers.
Hence, this observation encourages us to explicitly maintain two models, i.e.,
an image-only model and an image-history joint model, and combine their
complementary abilities for a more balanced multimodal model. We present
multiple methods for this integration of the two models, via ensemble and
consensus dropout fusion with shared parameters. Empirically, our models
achieve strong results on the Visual Dialog challenge 2019 (rank 3 on NDCG and
high balance across metrics), and substantially outperform the winner of the
Visual Dialog challenge 2018 on most metrics.
Related papers
- IMAD: IMage-Augmented multi-modal Dialogue [0.043847653914745384]
This paper presents a novel perspective on multi-modal dialogue systems, which interprets the image in the context of the dialogue.
We propose a two-stage approach to automatically construct a multi-modal dialogue dataset.
In the first stage, we utilize text-to-image similarity and sentence similarity to identify which utterances could be replaced with an image.
In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model.
arXiv Detail & Related papers (2023-05-17T18:38:10Z) - Collaborative Reasoning on Multi-Modal Semantic Graphs for
Video-Grounded Dialogue Generation [53.87485260058957]
We study video-grounded dialogue generation, where a response is generated based on the dialogue context and the associated video.
The primary challenges of this task lie in (1) the difficulty of integrating video data into pre-trained language models (PLMs)
We propose a multi-agent reinforcement learning method to collaboratively perform reasoning on different modalities.
arXiv Detail & Related papers (2022-10-22T14:45:29Z) - DialogZoo: Large-Scale Dialog-Oriented Task Learning [52.18193690394549]
We aim to build a unified foundation model which can solve massive diverse dialogue tasks.
To achieve this goal, we first collect a large-scale well-labeled dialogue dataset from 73 publicly available datasets.
arXiv Detail & Related papers (2022-05-25T11:17:16Z) - On Advances in Text Generation from Images Beyond Captioning: A Case
Study in Self-Rationalization [89.94078728495423]
We show that recent advances in each modality, CLIP image representations and scaling of language models, do not consistently improve multimodal self-rationalization of tasks with multimodal inputs.
Our findings call for a backbone modelling approach that can be built on to advance text generation from images and text beyond image captioning.
arXiv Detail & Related papers (2022-05-24T00:52:40Z) - Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog [12.034554338597067]
We propose a novel model by Reasoning with Multi-structure Commonsense Knowledge (RMK)
In our model, the external knowledge is represented with sentence-level facts and graph-level facts.
On top of these multi-structure representations, our model can capture relevant knowledge and incorporate them into the vision and semantic features.
arXiv Detail & Related papers (2022-04-10T13:12:10Z) - Modeling Coreference Relations in Visual Dialog [18.926582410644375]
The occurrences of coreference relations in the dialog makes it a more challenging task than visual question-answering.
We propose two soft constraints that can improve the model's ability of resolving coreferences in dialog in an unsupervised way.
arXiv Detail & Related papers (2022-03-06T15:22:24Z) - Multimodal Dialogue Response Generation [27.611204319057393]
We present a multimodal dialogue generation model, which takes the dialogue history as input, then generates a textual sequence or an image as response.
We consider multimodal dialogue generation under a natural assumption that only limited training examples are available.
In such a low-resource setting, we devise a novel conversational agent, Divter, in order to isolate parameters that depend on multimodal dialogues from the entire model.
arXiv Detail & Related papers (2021-10-16T08:52:26Z) - Dialogue History Matters! Personalized Response Selectionin Multi-turn
Retrieval-based Chatbots [62.295373408415365]
We propose a personalized hybrid matching network (PHMN) for context-response matching.
Our contributions are two-fold: 1) our model extracts personalized wording behaviors from user-specific dialogue history as extra matching information.
We evaluate our model on two large datasets with user identification, i.e., personalized dialogue Corpus Ubuntu (P- Ubuntu) and personalized Weibo dataset (P-Weibo)
arXiv Detail & Related papers (2021-03-17T09:42:11Z) - The Adapter-Bot: All-In-One Controllable Conversational Model [66.48164003532484]
We propose a dialogue model that uses a fixed backbone model such as DialGPT and triggers on-demand dialogue skills via different adapters.
Depending on the skills, the model is able to process multiple knowledge types, such as text, tables, and emphatic responses.
We evaluate our model using automatic evaluation by comparing it with existing state-of-the-art conversational models.
arXiv Detail & Related papers (2020-08-28T10:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.