Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog
- URL: http://arxiv.org/abs/2204.04680v1
- Date: Sun, 10 Apr 2022 13:12:10 GMT
- Title: Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog
- Authors: Shunyu Zhang, Xiaoze Jiang, Zequn Yang, Tao Wan, Zengchang Qin
- Abstract summary: We propose a novel model, Reasoning with Multi-structure Commonsense Knowledge (RMK).
In our model, the external knowledge is represented with sentence-level facts and graph-level facts.
On top of these multi-structure representations, our model can capture relevant knowledge and incorporate it into the vision and semantic features.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Dialog requires an agent to engage in a conversation with humans
grounded in an image. Many studies on Visual Dialog focus on the understanding
of the dialog history or the content of an image, while a considerable number of
questions that require commonsense knowledge are ignored. Handling these scenarios depends
on logical reasoning that requires commonsense priors. How to capture relevant
commonsense knowledge complementary to the history and the image remains a key
challenge. In this paper, we propose a novel model by Reasoning with
Multi-structure Commonsense Knowledge (RMK). In our model, the external
knowledge is represented with sentence-level facts and graph-level facts, to
properly suit the scenario of the composite of dialog history and image. On top
of these multi-structure representations, our model can capture relevant
knowledge and incorporate it into the vision and semantic features, via
graph-based interaction and transformer-based fusion. Experimental results and
analysis on the VisDial v1.0 and VisDialCK datasets show that our proposed model
outperforms competing methods.
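
The abstract only sketches the architecture at a high level. As a rough, hypothetical illustration (not the authors' released code), the PyTorch snippet below fuses graph-level facts and sentence-level facts with query features using one round of graph message passing followed by transformer-style cross-attention; all module names, dimensions, and the single-layer design are assumptions made for the example.

```python
# Minimal sketch of multi-structure knowledge fusion (assumed design, not the
# authors' implementation): graph-based interaction over graph-level facts,
# then transformer-based cross-attention fusion with sentence-level facts.
import torch
import torch.nn as nn


class GraphInteraction(nn.Module):
    """One round of message passing over graph-level fact nodes."""
    def __init__(self, dim: int):
        super().__init__()
        self.message = nn.Linear(dim, dim)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (num_nodes, dim); adj: (num_nodes, num_nodes), row-normalized
        msgs = adj @ self.message(nodes)   # aggregate messages from neighbors
        return self.update(msgs, nodes)    # update node states


class MultiStructureFusion(nn.Module):
    """Cross-attention fusion of query features with both forms of facts."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.graph = GraphInteraction(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, sent_facts, graph_nodes, adj):
        # query:       (B, Lq, dim)  e.g. fused image/dialog features
        # sent_facts:  (B, Ls, dim)  encoded sentence-level facts
        # graph_nodes: (N, dim), adj: (N, N)  graph-level facts (shared graph)
        nodes = self.graph(graph_nodes, adj)               # graph-based interaction
        facts = torch.cat(
            [sent_facts, nodes.unsqueeze(0).expand(query.size(0), -1, -1)], dim=1)
        fused, _ = self.attn(query, facts, facts)          # transformer-based fusion
        return self.norm(query + fused)                    # residual + layer norm


if __name__ == "__main__":
    B, Lq, Ls, N, D = 2, 10, 5, 7, 64
    model = MultiStructureFusion(D)
    out = model(torch.randn(B, Lq, D), torch.randn(B, Ls, D),
                torch.randn(N, D), torch.softmax(torch.randn(N, N), dim=-1))
    print(out.shape)  # torch.Size([2, 10, 64])
```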
Related papers
- ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue (2023-05-23)
  We provide a new paradigm of constructing multimodal dialogues by splitting visual knowledge into finer granularity. To boost the accuracy and diversity of the augmented visual information, we retrieve it from the Internet or a large image dataset. By leveraging text and vision knowledge, ReSee can produce informative responses with real-world visual concepts.
- Modeling Coreference Relations in Visual Dialog (2022-03-06)
  The occurrence of coreference relations in the dialog makes it a more challenging task than visual question answering. We propose two soft constraints that improve the model's ability to resolve coreferences in dialog in an unsupervised way.
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning (2021-12-16)
  Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning. We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs into commonsense reasoning.
- Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning (2021-05-10)
  This paper develops a plug-and-play hierarchical-topic-guided image paragraph generation framework. To capture the correlations between the image and text at multiple levels of abstraction, we design a variational inference network. To guide the paragraph generation, the learned hierarchical topics and visual features are integrated into the language model.
- Learning Reasoning Paths over Semantic Graphs for Video-grounded Dialogues (2021-03-01)
  We propose a novel framework of Reasoning Paths in Dialogue Context (PDC). The PDC model discovers information flows among dialogue turns through a semantic graph constructed from the lexical components in each question and answer. Our model sequentially processes both visual and textual information along this reasoning path, and the propagated features are used to generate the answer.
- Ranking Enhanced Dialogue Generation (2020-08-13)
  How to effectively utilize the dialogue history is a crucial problem in multi-turn dialogue generation. Previous works usually employ various neural network architectures to model the history. This paper proposes a Ranking Enhanced Dialogue generation framework.
- KBGN: Knowledge-Bridge Graph Network for Adaptive Vision-Text Reasoning in Visual Dialogue (2020-08-11)
  We propose a novel Knowledge-Bridge Graph Network (KBGN) model to bridge the cross-modal semantic relations between vision and text knowledge. We show that our model outperforms existing models with state-of-the-art results.
- ORD: Object Relationship Discovery for Visual Dialogue Generation (2020-06-15)
  We propose an object relationship discovery (ORD) framework to preserve object interactions for visual dialogue generation. A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally. Experiments show that the proposed method can significantly improve dialogue quality by utilising the contextual information of visual relationships.
- Modality-Balanced Models for Visual Dialogue (2020-01-17)
  The Visual Dialog task requires a model to exploit both image and conversational context to generate the next response in the dialogue. We show that previous joint-modality (history and image) models over-rely on, and are more prone to memorizing, the dialogue history. We present methods for integrating the two models via ensemble and consensus dropout fusion with shared parameters.