Improving Cross-Modal Understanding in Visual Dialog via Contrastive
Learning
- URL: http://arxiv.org/abs/2204.07302v1
- Date: Fri, 15 Apr 2022 02:36:52 GMT
- Title: Improving Cross-Modal Understanding in Visual Dialog via Contrastive
Learning
- Authors: Feilong Chen, Xiuyi Chen, Shuang Xu, Bo Xu
- Abstract summary: We analyze the cross-modal understanding in visual dialog based on the vision-language pre-training model VD-BERT.
We propose a novel approach to improve the cross-modal understanding for visual dialog, named ICMU.
- Score: 24.673262969986993
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Visual Dialog is a challenging vision-language task since the visual dialog
agent needs to answer a series of questions after reasoning over both the image
content and dialog history. Though existing methods try to address cross-modal
understanding in visual dialog, they still fall short in ranking candidate
answers based on their understanding of the visual and textual
contexts. In this paper, we analyze the cross-modal understanding in visual
dialog based on the vision-language pre-training model VD-BERT and propose a
novel approach to improve the cross-modal understanding for visual dialog,
named ICMU. ICMU enhances cross-modal understanding by distinguishing different
pulled inputs (i.e., pulled images, questions, or answers) via four-way
contrastive learning. In addition, ICMU exploits single-turn visual question
answering to strengthen the visual dialog model's cross-modal understanding when
handling multi-turn visually-grounded conversations. Experiments show that the
proposed approach improves the visual dialog model's cross-modal understanding
and brings satisfactory gains on the VisDial dataset.
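To make the four-way contrastive objective concrete, the following is a minimal sketch, assuming a generic cross-modal encoder (e.g. a VD-BERT-style transformer) that returns a pooled vector for an (image, question, answer) input; the class layout, the FourWayContrastiveHead name, and the build_pulled_batch helper are illustrative assumptions rather than the paper's released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class FourWayContrastiveHead(nn.Module):
    # Hypothetical 4-way classifier over pooled cross-modal features.
    # Classes: 0 = matched triple, 1 = image pulled, 2 = question pulled, 3 = answer pulled.
    def __init__(self, hidden_size: int, num_classes: int = 4):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, pooled: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, hidden_size) pooled output of the cross-modal encoder
        # labels: (batch,) integer class label per example
        logits = self.classifier(pooled)
        return F.cross_entropy(logits, labels)

def build_pulled_batch(images, questions, answers):
    # Pair each matched (image, question, answer) triple with three "pulled"
    # variants in which exactly one element is swapped with another sample's.
    examples = []
    n = len(images)
    for i in range(n):
        j = (i + 1) % n  # hypothetical donor of the pulled element
        examples.append(((images[i], questions[i], answers[i]), 0))  # matched
        examples.append(((images[j], questions[i], answers[i]), 1))  # image pulled
        examples.append(((images[i], questions[j], answers[i]), 2))  # question pulled
        examples.append(((images[i], questions[i], answers[j]), 3))  # answer pulled
    return examples

In training, each example would be encoded together with its dialog history and the head's cross-entropy loss added to the usual answer-ranking objectives; the second ICMU component, single-turn VQA data, could be folded in by treating each question-answer pair as a one-turn dialog.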
Related papers
- Uncovering Hidden Connections: Iterative Search and Reasoning for Video-grounded Dialog [83.63849872250651]
Video-grounded dialog requires profound understanding of both dialog history and video content for accurate response generation.
We present an iterative search and reasoning framework, which consists of a textual encoder, a visual encoder, and a generator.
arXiv Detail & Related papers (2023-10-11T07:37:13Z)
- VD-PCR: Improving Visual Dialog with Pronoun Coreference Resolution [79.05412803762528]
The visual dialog task requires an AI agent to interact with humans in multi-round dialogs based on a visual environment.
We propose VD-PCR, a novel framework to improve Visual Dialog understanding with Pronoun Coreference Resolution.
With the proposed implicit and explicit methods, VD-PCR achieves state-of-the-art experimental results on the VisDial dataset.
arXiv Detail & Related papers (2022-05-29T15:29:50Z)
- Modeling Coreference Relations in Visual Dialog [18.926582410644375]
The occurrences of coreference relations in the dialog make it a more challenging task than visual question answering.
We propose two soft constraints that can improve the model's ability of resolving coreferences in dialog in an unsupervised way.
arXiv Detail & Related papers (2022-03-06T15:22:24Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Modeling Explicit Concerning States for Reinforcement Learning in Visual Dialogue [43.42833961578857]
We propose Explicit Concerning States (ECS) to represent which visual contents are of concern at each round and which have been of concern throughout the Visual Dialogue.
ECS is modeled from multimodal information and is represented explicitly.
Based on ECS, we formulate two intuitive and interpretable rewards to encourage the Visual Dialogue agents to converse on diverse and informative visual information.
arXiv Detail & Related papers (2021-07-12T08:15:35Z)
- Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues [73.04906599884868]
We propose a novel framework of Reasoning Paths in Dialogue Context (PDC).
The PDC model discovers information flows among dialogue turns through a semantic graph constructed from the lexical components of each question and answer.
Our model sequentially processes both visual and textual information along this reasoning path, and the propagated features are used to generate the answer.
arXiv Detail & Related papers (2021-03-01T07:39:26Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain object nodes and neighbour relationships locally and then refine the object-object connections globally.
Experiments have proved that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)
- Multi-View Attention Network for Visual Dialog [5.731758300670842]
It is necessary for an agent to 1) determine the semantic intent of the question and 2) align question-relevant textual and visual contents.
We propose the Multi-View Attention Network (MVAN), which leverages multiple views of heterogeneous inputs.
MVAN effectively captures the question-relevant information from the dialog history with two complementary modules.
arXiv Detail & Related papers (2020-04-29T08:46:38Z)
- VD-BERT: A Unified Vision and Dialog Transformer with BERT [161.0016161052714]
We propose VD-BERT, a simple yet effective framework of unified vision-dialog Transformer.
We adapt BERT for the effective fusion of vision and dialog contents via visually grounded training.
Our model yields new state of the art, achieving the top position in both single-model and ensemble settings.
arXiv Detail & Related papers (2020-04-28T04:08:46Z)
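Since ICMU is analyzed and built on top of VD-BERT, a rough sketch of how a unified vision-dialog input sequence is typically assembled may be useful; the exact special tokens, the segment layout, and the build_unified_input helper below are assumptions made for illustration, not VD-BERT's actual code.

def build_unified_input(num_regions, caption_tokens, dialog_turns, question_tokens):
    # Assemble one token sequence that interleaves image-region placeholders with
    # the caption, the (Q, A) history, and the current question, in the spirit of
    # a unified vision-dialog transformer (the layout here is assumed).
    tokens = ["[CLS]"]
    # Region features are normally injected as embeddings; placeholder tokens
    # simply mark their positions in this sketch.
    tokens += ["[IMG_%d]" % i for i in range(num_regions)]
    tokens += ["[SEP]"] + list(caption_tokens)
    for q, a in dialog_turns:                      # earlier turns as history
        tokens += ["[SEP]"] + list(q) + list(a)
    tokens += ["[SEP]"] + list(question_tokens) + ["[SEP]"]
    # Segment ids: 0 for the vision span, 1 for everything textual.
    segment_ids = [0] * (1 + num_regions) + [1] * (len(tokens) - 1 - num_regions)
    return tokens, segment_ids

A candidate answer can then be appended after the final [SEP] and scored with a ranking head, which is also where a contrastive objective like the one sketched above could plug in.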