History for Visual Dialog: Do we really need it?
- URL: http://arxiv.org/abs/2005.07493v1
- Date: Fri, 8 May 2020 14:58:09 GMT
- Title: History for Visual Dialog: Do we really need it?
- Authors: Shubham Agarwal, Trung Bui, Joon-Young Lee, Ioannis Konstas, Verena
Rieser
- Abstract summary: We show that co-attention models which explicitly encode dialog history outperform models that don't.
We also expose shortcomings of the crowd-sourcing dataset collection procedure.
- Score: 55.642625058602924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Dialog involves "understanding" the dialog history (what has been
discussed previously) and the current question (what is asked), in addition to
grounding information in the image, to generate the correct response. In this
paper, we show that co-attention models which explicitly encode dialog history
outperform models that don't, achieving state-of-the-art performance (72% NDCG
on val set). However, we also expose shortcomings of the crowd-sourcing dataset
collection procedure by showing that history is indeed only required for a
small amount of the data and that the current evaluation metric encourages
generic replies. To that end, we propose a challenging subset (VisDialConv) of
the VisDial val set and provide a benchmark of 63% NDCG.
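For context on the numbers above: NDCG on VisDial v1.0 is computed over dense relevance annotations, where each of the 100 candidate answers per question receives a relevance score. The snippet below is a minimal sketch of the metric, not the official evaluation script (which additionally cuts the list off at the number of relevant candidates); the function name and toy data are illustrative only.

```python
import numpy as np

def ndcg(relevance, ranking, k=None):
    """Normalized discounted cumulative gain for one ranked answer list.

    relevance -- ground-truth relevance score per candidate (floats in [0, 1])
    ranking   -- candidate indices, best first, as predicted by the model
    k         -- cutoff; defaults to the full list
    """
    relevance = np.asarray(relevance, dtype=float)
    k = len(ranking) if k is None else k
    discounts = 1.0 / np.log2(np.arange(2, k + 2))           # 1 / log2(rank + 1)
    dcg = np.sum(relevance[np.asarray(ranking[:k])] * discounts)
    idcg = np.sum(np.sort(relevance)[::-1][:k] * discounts)  # best possible DCG
    return float(dcg / idcg) if idcg > 0 else 0.0

# Toy example with 5 candidates: the model ranks the most relevant answer first.
print(ndcg(relevance=[0.0, 0.5, 1.0, 0.0, 0.0], ranking=[2, 1, 0, 3, 4]))  # 1.0
```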
Related papers
- Enhancing Visual Dialog State Tracking through Iterative Object-Entity Alignment in Multi-Round Conversations [3.784841749866846]
We introduce a Multi-round Dialogue State Tracking model (MDST).
MDST captures each round of dialog history, constructing internal dialogue state representations defined as 2-tuples of vision-language representations.
Experimental results on the VisDial v1.0 dataset demonstrate that MDST achieves new state-of-the-art performance in the generative setting.
arXiv Detail & Related papers (2024-08-13T08:36:15Z)
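As a rough illustration of the "2-tuples of vision-language representations" state described above (not the authors' MDST implementation; the class, pooling rule, and dimensions are made up for the sketch), a per-round dialogue state could be kept as paired vision and language vectors:

```python
import numpy as np

class DialogueState:
    """Hypothetical sketch: one (vision, language) pair per dialog round."""

    def __init__(self):
        self.rounds = []  # list of (vision_repr, language_repr) 2-tuples

    def update(self, vision_repr, language_repr):
        """Append the current round's paired representations."""
        self.rounds.append((np.asarray(vision_repr), np.asarray(language_repr)))

    def summary(self):
        """Pool the per-round pairs into a single state vector (mean pooling)."""
        v = np.mean([v for v, _ in self.rounds], axis=0)
        l = np.mean([l for _, l in self.rounds], axis=0)
        return np.concatenate([v, l])

state = DialogueState()
for _ in range(3):  # three toy rounds with 4-dim random features
    state.update(np.random.rand(4), np.random.rand(4))
print(state.summary().shape)  # (8,)
```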
- InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models [123.1441379479263]
We build a visual dialogue dataset, named InfoVisDial, which provides rich informative answers in each round.
For effective data collection, the key idea is to bridge large-scale multimodal models (e.g., GIT) and language models (e.g., GPT-3).
arXiv Detail & Related papers (2023-12-21T00:44:45Z)
- q2d: Turning Questions into Dialogs to Teach Models How to Search [11.421839177607147]
We propose q2d: an automatic data generation pipeline that generates information-seeking dialogs from questions.
Unlike previous approaches, which relied on human-written dialogs with search queries, our method allows us to automatically generate query-based grounded dialogs with better control and scale.
arXiv Detail & Related papers (2023-04-27T16:39:15Z)
- Weakly Supervised Data Augmentation Through Prompting for Dialogue Understanding [103.94325597273316]
We present a novel approach that iterates on augmentation quality by applying weakly-supervised filters.
We evaluate our methods on the emotion and act classification tasks in DailyDialog and the intent classification task in Facebook Multilingual Task-Oriented Dialogue.
For DailyDialog specifically, using 10% of the ground-truth data, we outperform the current state-of-the-art model, which uses 100% of the data.
arXiv Detail & Related papers (2022-10-25T17:01:30Z)
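A rough sketch of the weakly-supervised filtering loop described in the entry above (not the paper's pipeline; the classifier choice, confidence threshold, and toy data are assumptions), in which a filter trained on the accepted data decides which augmented examples to keep:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy "gold" data: 40 labelled examples with 8-dim features, 2 classes.
X_gold = rng.normal(size=(40, 8))
y_gold = (X_gold[:, 0] > 0).astype(int)

# Toy "augmented" pool: noisy copies standing in for prompted generations.
X_aug = X_gold + rng.normal(scale=0.5, size=X_gold.shape)
y_aug = y_gold.copy()

X_train, y_train = X_gold, y_gold
for step in range(3):
    # Weak filter: a classifier trained on the data accepted so far.
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    conf = clf.predict_proba(X_aug)[np.arange(len(y_aug)), y_aug]
    keep = conf > 0.7                      # assumed confidence threshold
    X_train = np.concatenate([X_gold, X_aug[keep]])
    y_train = np.concatenate([y_gold, y_aug[keep]])
    print(f"iteration {step}: kept {keep.sum()} augmented examples")
```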
- What Did You Say? Task-Oriented Dialog Datasets Are Not Conversational!? [4.022057598291766]
We outline a taxonomy of conversational and contextual effects, which we use to examine MultiWOZ, SGD and SMCalFlow.
We find that less than 4% of MultiWOZ's turns and 10% of SGD's turns are conversational, while SMCalFlow is not conversational at all in its current release.
arXiv Detail & Related papers (2022-03-07T14:26:23Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally and then refine the object-object connections globally.
Experiments show that the proposed method significantly improves dialogue quality by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)
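As a minimal, hedged sketch of the graph-convolution idea behind ORD (not the authors' HierGCN; the normalisation, dimensions, and names are assumptions), one message-passing step over detected object nodes could look like this:

```python
import numpy as np

rng = np.random.default_rng(0)

def gcn_layer(node_feats, adjacency, weight):
    """One graph-convolution step: aggregate neighbour features, then project.

    node_feats -- (num_objects, d_in) features of detected objects
    adjacency  -- (num_objects, num_objects) relationship matrix (1 = related)
    weight     -- (d_in, d_out) learnable projection
    """
    # Add self-loops and row-normalise so each node averages over its neighbours.
    a_hat = adjacency + np.eye(adjacency.shape[0])
    a_hat = a_hat / a_hat.sum(axis=1, keepdims=True)
    return np.maximum(a_hat @ node_feats @ weight, 0.0)  # ReLU

# Toy scene: 4 objects with 16-dim features; objects 0-1 and 2-3 are related.
feats = rng.normal(size=(4, 16))
adj = np.array([[0, 1, 0, 0],
                [1, 0, 0, 0],
                [0, 0, 0, 1],
                [0, 0, 1, 0]], dtype=float)
w = rng.normal(size=(16, 16))
print(gcn_layer(feats, adj, w).shape)  # (4, 16)
```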
- VD-BERT: A Unified Vision and Dialog Transformer with BERT [161.0016161052714]
We propose VD-BERT, a simple yet effective framework for a unified vision-dialog Transformer.
We adapt BERT for the effective fusion of vision and dialog contents via visually grounded training.
Our model yields a new state of the art, achieving the top position in both single-model and ensemble settings.
arXiv Detail & Related papers (2020-04-28T04:08:46Z)
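A rough sketch of the single-stream input that a VD-BERT-style model feeds to the Transformer (not the released code; the projection, dimensions, and segment-id convention are assumptions made for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
HIDDEN = 32  # assumed hidden size for the sketch

# Detected image regions (e.g. from an object detector) projected into the
# same embedding space as the text tokens.
region_feats = rng.normal(size=(36, 2048))
img_proj = rng.normal(size=(2048, HIDDEN)) * 0.01
img_embeds = region_feats @ img_proj                    # (36, HIDDEN)

# Dialog side: caption + history + current question as token embeddings.
num_text_tokens = 20
text_embeds = rng.normal(size=(num_text_tokens, HIDDEN))

# Single-stream input: [image regions ; text tokens], with segment ids telling
# the Transformer which positions are visual (0) and which are textual (1).
inputs = np.concatenate([img_embeds, text_embeds], axis=0)
segment_ids = np.concatenate([np.zeros(36, int), np.ones(num_text_tokens, int)])
print(inputs.shape, segment_ids.shape)                  # (56, 32) (56,)
```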
- Modality-Balanced Models for Visual Dialogue [102.35406085738325]
The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response in the dialogue.
We show that previous joint-modality (history and image) models over-rely on and are more prone to memorizing the dialogue history.
We present methods for this integration of the two models, via ensemble and consensus dropout fusion with shared parameters.
arXiv Detail & Related papers (2020-01-17T14:57:12Z)
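A minimal sketch of logit-level fusion between an image-only model and an image+history model, in the spirit of the ensemble/consensus-dropout idea in the entry above (not the authors' code; the dropout rate and the exact fusion rule are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_CANDIDATES = 100  # VisDial ranks 100 candidate answers per question

def fuse_logits(img_only_logits, joint_logits, drop_rate=0.3, train=True):
    """Average the two models' candidate logits; during training, randomly
    drop the history-aware branch so the fused model cannot lean on it alone."""
    if train and rng.random() < drop_rate:
        joint_logits = np.zeros_like(joint_logits)  # assumed 'consensus dropout'
    return (img_only_logits + joint_logits) / 2.0

img_only = rng.normal(size=NUM_CANDIDATES)   # stand-in logits from model A
joint = rng.normal(size=NUM_CANDIDATES)      # stand-in logits from model B
ranking = np.argsort(-fuse_logits(img_only, joint, train=False))
print(ranking[:5])  # indices of the five highest-scoring candidate answers
```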
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.