Modeling Coreference Relations in Visual Dialog
- URL: http://arxiv.org/abs/2203.02986v1
- Date: Sun, 6 Mar 2022 15:22:24 GMT
- Title: Modeling Coreference Relations in Visual Dialog
- Authors: Mingxiao Li, Marie-Francine Moens
- Abstract summary: The occurrence of coreference relations in the dialog makes it a more challenging task than visual question answering.
We propose two soft constraints that can improve the model's ability to resolve coreferences in dialog in an unsupervised way.
- Score: 18.926582410644375
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Visual dialog is a vision-language task where an agent needs to answer a
series of questions grounded in an image, based on its understanding of the
dialog history and the image. The occurrence of coreference relations in the
dialog makes it a more challenging task than visual question answering. Most
previous works have focused on learning better multi-modal representations or
on exploring different ways of fusing visual and language features, while the
coreferences in the dialog have largely been ignored. In this paper, based on
linguistic knowledge and discourse features of human dialog, we propose two soft
constraints that can improve the model's ability to resolve coreferences in
dialog in an unsupervised way. Experimental results on the VisDial v1.0 dataset
show that our model, which integrates two novel and linguistically inspired
soft constraints in a deep transformer neural architecture, obtains new
state-of-the-art performance in terms of recall at 1 and other evaluation
metrics compared to existing models, and this without pretraining on other
vision-language datasets. Our qualitative results further demonstrate the
effectiveness of the proposed method.
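As an illustration of how a soft coreference constraint could be wired into a transformer-based visual dialog model, the sketch below adds an auxiliary penalty that rewards a pronoun token for placing attention mass on candidate antecedent tokens from earlier dialog turns. This is a minimal sketch under our own assumptions, not the authors' implementation; the function name, the masks, and the lambda_coref weight are all hypothetical.

```python
import torch

def coref_attention_penalty(attn, pronoun_mask, antecedent_mask, eps=1e-8):
    """Auxiliary soft-constraint penalty (illustrative only).

    attn:            (B, H, S, S) softmax attention weights from one transformer layer
    pronoun_mask:    (B, S) float tensor, 1.0 at pronoun token positions
    antecedent_mask: (B, S) float tensor, 1.0 at candidate antecedent positions
    Returns a scalar: negative log of the attention mass that pronoun tokens
    place on candidate antecedents, averaged over heads and pronouns.
    """
    # attention mass each query token places on antecedent key positions
    mass = (attn * antecedent_mask[:, None, None, :]).sum(dim=-1)  # (B, H, S)
    mass = mass.mean(dim=1)                                        # (B, S), averaged over heads
    # penalize low antecedent mass, counted only at pronoun positions
    penalty = -(torch.log(mass + eps) * pronoun_mask).sum()
    return penalty / pronoun_mask.sum().clamp(min=1.0)

# Toy usage with random tensors (illustration only):
B, H, S = 2, 8, 16
attn = torch.softmax(torch.randn(B, H, S, S), dim=-1)
pronoun_mask = torch.zeros(B, S)
pronoun_mask[:, 5] = 1.0        # pretend token 5 is a pronoun such as "he"
antecedent_mask = torch.zeros(B, S)
antecedent_mask[:, 2] = 1.0     # pretend token 2 is the candidate antecedent "the man"
aux_loss = coref_attention_penalty(attn, pronoun_mask, antecedent_mask)
# total_loss = task_loss + lambda_coref * aux_loss   # lambda_coref: hypothetical weight
```

Because a penalty of this kind only reshapes attention distributions and never requires gold coreference labels, it can be added to the training objective in the same unsupervised spirit as the constraints described in the abstract.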
Related papers
- Visualizing Dialogues: Enhancing Image Selection through Dialogue Understanding with Large Language Models [25.070424546200293]
We present a novel approach leveraging the robust reasoning capabilities of large language models (LLMs) to generate precise dialogue-associated visual descriptors.
Experiments conducted on benchmark data validate the effectiveness of our proposed approach in deriving concise and accurate visual descriptors.
Our findings demonstrate the method's generalizability across diverse visual cues, various LLMs, and different datasets.
arXiv Detail & Related papers (2024-07-04T03:50:30Z)
- IMAD: IMage-Augmented multi-modal Dialogue [0.043847653914745384]
This paper presents a novel perspective on multi-modal dialogue systems, which interprets the image in the context of the dialogue.
We propose a two-stage approach to automatically construct a multi-modal dialogue dataset.
In the first stage, we utilize text-to-image similarity and sentence similarity to identify which utterances could be replaced with an image.
In the second stage, we replace those utterances by selecting a subset of relevant images and filtering them with a visual question answering model.
arXiv Detail & Related papers (2023-05-17T18:38:10Z)
- DialogZoo: Large-Scale Dialog-Oriented Task Learning [52.18193690394549]
We aim to build a unified foundation model which can solve massive diverse dialogue tasks.
To achieve this goal, we first collect a large-scale well-labeled dialogue dataset from 73 publicly available datasets.
arXiv Detail & Related papers (2022-05-25T11:17:16Z)
- Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog [12.034554338597067]
We propose a novel model, Reasoning with Multi-structure Commonsense Knowledge (RMK).
In our model, the external knowledge is represented with sentence-level facts and graph-level facts.
On top of these multi-structure representations, our model can capture relevant knowledge and incorporate it into the vision and semantic features.
arXiv Detail & Related papers (2022-04-10T13:12:10Z)
- VU-BERT: A Unified framework for Visual Dialog [34.4815433301286]
We propose a unified framework for image-text joint embedding, named VU-BERT, and apply patch projection to obtain vision embedding in visual dialog tasks.
The model is trained over two tasks: masked language modeling and next utterance retrieval.
arXiv Detail & Related papers (2022-02-22T10:20:14Z)
- Representation Learning for Conversational Data using Discourse Mutual Information Maximization [9.017156603976915]
We argue that the structure-unaware word-by-word generation is not suitable for effective conversation modeling.
We propose a structure-aware Mutual Information based loss-function DMI for training dialog-representation models.
Our models show the most promising performance on the dialog evaluation task DailyDialog++, in both random and adversarial negative scenarios.
arXiv Detail & Related papers (2021-12-04T13:17:07Z)
- Dialogue History Matters! Personalized Response Selection in Multi-turn Retrieval-based Chatbots [62.295373408415365]
We propose a personalized hybrid matching network (PHMN) for context-response matching.
Our contributions are two-fold: 1) our model extracts personalized wording behaviors from user-specific dialogue history as extra matching information.
We evaluate our model on two large datasets with user identification, i.e., the personalized Ubuntu dialogue corpus (P-Ubuntu) and the personalized Weibo dataset (P-Weibo).
arXiv Detail & Related papers (2021-03-17T09:42:11Z)
- Ranking Enhanced Dialogue Generation [77.8321855074999]
How to effectively utilize the dialogue history is a crucial problem in multi-turn dialogue generation.
Previous works usually employ various neural network architectures to model the history.
This paper proposes a Ranking Enhanced Dialogue generation framework.
arXiv Detail & Related papers (2020-08-13T01:49:56Z)
- Video-Grounded Dialogues with Pretrained Generation Language Models [88.15419265622748]
We leverage the power of pre-trained language models for improving video-grounded dialogue.
We propose a framework by formulating video-grounded dialogue tasks as a sequence-to-sequence task.
Our framework allows fine-tuning language models to capture dependencies across multiple modalities.
arXiv Detail & Related papers (2020-06-27T08:24:26Z)
- VD-BERT: A Unified Vision and Dialog Transformer with BERT [161.0016161052714]
We propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer.
We adapt BERT for the effective fusion of vision and dialog contents via visually grounded training.
Our model yields a new state of the art, achieving the top position in both single-model and ensemble settings.
arXiv Detail & Related papers (2020-04-28T04:08:46Z)
- Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.
We show that previous joint-modality (history and image) models over-rely on the dialogue history and are more prone to memorizing it.
We present methods for integrating the two models via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.