Combo of Thinking and Observing for Outside-Knowledge VQA
- URL: http://arxiv.org/abs/2305.06407v1
- Date: Wed, 10 May 2023 18:32:32 GMT
- Title: Combo of Thinking and Observing for Outside-Knowledge VQA
- Authors: Qingyi Si, Yuchen Mo, Zheng Lin, Huishan Ji, Weiping Wang
- Abstract summary: Outside-knowledge visual question answering is a challenging task that requires both the acquisition and the use of open-ended real-world knowledge.
In this paper, we constrain the cross-modality space to be aligned with the natural-language space.
We propose a novel framework consisting of a multimodal encoder, a textual encoder and an answer decoder.
- Score: 13.838435454270014
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Outside-knowledge visual question answering is a challenging task that
requires both the acquisition and the use of open-ended real-world knowledge.
Some existing solutions draw external knowledge into the cross-modality space,
which overlooks the much vaster textual knowledge available in natural-language
space; others transform the image into text that is then fused with textual
knowledge in the natural-language space, but completely abandon the use of
visual features. In this paper, we instead constrain the cross-modality space
to be aligned with the natural-language space, so that visual features are
preserved directly while the model still benefits from the vast knowledge in
natural-language space. To this end, we propose a novel framework consisting of
a multimodal encoder, a textual encoder and an answer decoder. This structure
allows us to introduce more types of knowledge, including explicit and implicit
multimodal and textual knowledge. Extensive experiments validate the
superiority of the proposed method, which outperforms the state of the art by
6.17% in accuracy. We also conduct comprehensive ablations of each component
and systematically study the roles of different types of knowledge. Code and
knowledge data can be found at
https://github.com/PhoebusSi/Thinking-while-Observing.
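For intuition only, here is a minimal PyTorch-style sketch of the encoder/decoder split described above: one multimodal encoder that keeps visual features, one textual encoder for the question plus retrieved natural-language knowledge, and a single answer decoder over both. The module choices, dimensions and the concatenation-based fusion are illustrative assumptions, not the authors' implementation (see the repository linked above for the real code).

```python
import torch
import torch.nn as nn

class ComboVQASketch(nn.Module):
    """Hypothetical sketch: a multimodal encoder and a textual encoder feed one answer decoder."""

    def __init__(self, d_model=768, vocab_size=32000):
        super().__init__()
        # "Observing" branch: keeps detected visual region features instead of discarding them.
        self.visual_proj = nn.Linear(2048, d_model)
        self.multimodal_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        # "Thinking" branch: encodes the question together with retrieved textual knowledge.
        self.textual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        # Answer decoder: cross-attends over the memories of both branches.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, question_emb, knowledge_emb, answer_emb):
        mm_mem = self.multimodal_encoder(
            torch.cat([self.visual_proj(visual_feats), question_emb], dim=1))
        txt_mem = self.textual_encoder(torch.cat([question_emb, knowledge_emb], dim=1))
        memory = torch.cat([mm_mem, txt_mem], dim=1)  # both branches live in one shared space
        return self.lm_head(self.decoder(answer_emb, memory))

# Toy tensors, just to show the data flow and shapes.
model = ComboVQASketch()
logits = model(torch.randn(1, 36, 2048),   # detected region features
               torch.randn(1, 20, 768),    # question token embeddings
               torch.randn(1, 100, 768),   # retrieved textual-knowledge embeddings
               torch.randn(1, 8, 768))     # (shifted) answer token embeddings
print(logits.shape)                        # torch.Size([1, 8, 32000])
```

Concatenating the two encoder memories is only one way to let the decoder "think" over textual knowledge while "observing" visual features; the paper's actual fusion may differ.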
Related papers
- Open Visual Knowledge Extraction via Relation-Oriented Multimodality
Model Prompting [89.95541601837719]
We take a first step toward a new paradigm of open visual knowledge extraction.
OpenVik consists of an open relational region detector, which detects regions potentially containing relational knowledge,
and a visual knowledge generator, which produces format-free knowledge by prompting a large multimodality model with the detected region of interest.
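A rough sketch of this two-stage design (detect candidate regions, then prompt a multimodal model per region) is given below; the detector and generator interfaces are hypothetical placeholders, not OpenVik's actual API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Region:
    box: tuple  # (x1, y1, x2, y2) of a region that may contain relational knowledge

def detect_relational_regions(image) -> List[Region]:
    """Placeholder for the open relational region detector."""
    raise NotImplementedError

def generate_knowledge(image, region: Region, multimodal_lm) -> str:
    """Prompt a large multimodality model with the cropped region of interest."""
    crop = image.crop(region.box)  # assumes a PIL-style image
    prompt = "Describe the relation shown in this region:"
    return multimodal_lm.generate(images=[crop], prompt=prompt)  # hypothetical interface

def open_visual_knowledge_extraction(image, multimodal_lm) -> List[str]:
    # Stage 1: detect candidate regions; stage 2: generate format-free knowledge per region.
    return [generate_knowledge(image, region, multimodal_lm)
            for region in detect_relational_regions(image)]
```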
arXiv Detail & Related papers (2023-10-28T20:09:29Z) - Multimodal Dialog Systems with Dual Knowledge-enhanced Generative Pretrained Language Model [63.461030694700014]
We propose a novel dual knowledge-enhanced generative pretrained language model for multimodal task-oriented dialog systems (DKMD).
The proposed DKMD consists of three key components: dual knowledge selection, dual knowledge-enhanced context learning, and knowledge-enhanced response generation.
Experiments on a public dataset verify the superiority of the proposed DKMD over state-of-the-art competitors.
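How the three DKMD components might compose end to end, as a hedged illustration only (the callables and their signatures are assumptions, not the released code):

```python
def dkmd_respond(dialog_context, image, text_kb, visual_kb,
                 select_knowledge, encode_context, generate_response):
    """Illustrative composition of DKMD's three components (interfaces assumed)."""
    # 1. Dual knowledge selection: pick textual and visual knowledge relevant to the context.
    text_k, visual_k = select_knowledge(dialog_context, image, text_kb, visual_kb)
    # 2. Dual knowledge-enhanced context learning: fuse both knowledge sources into the context encoding.
    context_repr = encode_context(dialog_context, image, text_k, visual_k)
    # 3. Knowledge-enhanced response generation: decode a response conditioned on the enriched context.
    return generate_response(context_repr, text_k, visual_k)
```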
arXiv Detail & Related papers (2022-07-16T13:02:54Z) - Imagination-Augmented Natural Language Understanding [71.51687221130925]
We introduce an Imagination-Augmented Cross-modal Encoder (iACE) to solve natural language understanding tasks.
iACE enables visual imagination with external knowledge transferred from powerful generative and pre-trained vision-and-language models.
Experiments on GLUE and SWAG show that iACE achieves consistent improvement over visually-supervised pre-trained models.
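A hedged sketch of this imagine-then-fuse idea: the input text is encoded as usual, a generative model "imagines" a corresponding image, a vision-and-language encoder embeds it, and both views are fused for classification. The concrete modules and the simple concatenation fusion below are assumptions, not the iACE implementation.

```python
import torch
import torch.nn as nn

class IACESketch(nn.Module):
    """Hedged sketch of the imagine-then-fuse idea; module choices and the
    concatenation fusion are assumptions, not the iACE implementation."""

    def __init__(self, text_encoder, imaginator, visual_encoder,
                 d_text=768, d_vis=512, n_classes=2):
        super().__init__()
        self.text_encoder = text_encoder      # e.g. a pre-trained language model
        self.imaginator = imaginator          # e.g. a text-to-image generative model
        self.visual_encoder = visual_encoder  # e.g. a pre-trained vision-and-language encoder
        self.classifier = nn.Linear(d_text + d_vis, n_classes)

    def forward(self, text):
        t = self.text_encoder(text)        # textual view of the input
        imagined = self.imaginator(text)   # "visual imagination" for the input text
        v = self.visual_encoder(imagined)  # grounded visual view
        return self.classifier(torch.cat([t, v], dim=-1))
```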
arXiv Detail & Related papers (2022-04-18T19:39:36Z) - A Thousand Words Are Worth More Than a Picture: Natural Language-Centric
Outside-Knowledge Visual Question Answering [47.1063091195119]
We call for a paradigm shift for the OK-VQA task: transform the image into plain text.
A Transform-Retrieve-Generate framework (TRiG) is proposed, which works in a plug-and-play fashion with alternative image-to-text models.
Experimental results show that our TRiG framework outperforms all state-of-the-art supervised methods by at least 11.1% absolute margin.
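The Transform-Retrieve-Generate flow can be pictured as three pluggable stages; the sketch below only illustrates the data flow, with every callable assumed rather than taken from the TRiG code.

```python
def trig_answer(image, question, image_to_text, retrieve, generate, k=20):
    """Illustrative TRiG-style flow; all interfaces are assumptions, not the authors' code."""
    # Transform: verbalize the image (captions, labels, OCR text, ...) into plain text.
    image_text = image_to_text(image)
    # Retrieve: fetch top-k knowledge passages for the question plus the image description.
    passages = retrieve(question + " " + image_text, top_k=k)
    # Generate: answer with a text-to-text model conditioned on question, image text and passages.
    return generate(question=question, context=[image_text] + passages)
```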
arXiv Detail & Related papers (2022-01-14T04:12:46Z) - Knowledge Graph Augmented Network Towards Multiview Representation
Learning for Aspect-based Sentiment Analysis [96.53859361560505]
We propose a knowledge graph augmented network (KGAN) to incorporate external knowledge with explicitly syntactic and contextual information.
KGAN captures the sentiment feature representations from multiple perspectives, i.e., context-, syntax- and knowledge-based.
Experiments on three popular ABSA benchmarks demonstrate the effectiveness and robustness of our KGAN.
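As a hedged illustration of the multi-view idea, the sketch below pools a context view, a syntax view and a knowledge view and fuses them for sentiment classification; the dimensions, encoders and mean-pooling fusion are assumptions, not the KGAN architecture.

```python
import torch
import torch.nn as nn

class KGANSketch(nn.Module):
    """Hedged multi-view sketch in the spirit of KGAN; dimensions, encoders and
    the mean-pooling fusion are assumptions, not the paper's architecture."""

    def __init__(self, d=300, n_polarities=3):
        super().__init__()
        self.context_view = nn.LSTM(d, d, batch_first=True)  # context-based view of the sentence
        self.syntax_proj = nn.Linear(d, d)                    # syntax-based view (e.g. from a dependency parse)
        self.knowledge_proj = nn.Linear(d, d)                 # knowledge-based view (aspect features from a KG)
        self.classifier = nn.Linear(3 * d, n_polarities)

    def forward(self, token_embs, syntax_feats, kg_feats):
        ctx, _ = self.context_view(token_embs)
        views = torch.cat([ctx.mean(dim=1),
                           self.syntax_proj(syntax_feats).mean(dim=1),
                           self.knowledge_proj(kg_feats).mean(dim=1)], dim=-1)
        return self.classifier(views)  # aspect sentiment polarity logits
```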
arXiv Detail & Related papers (2022-01-13T08:25:53Z) - External Knowledge Augmented Text Visual Question Answering [0.6445605125467573]
We propose a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision language understanding tasks.
We generate results comparable to the state-of-the-art on two publicly available datasets.
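The extract-filter-encode recipe could be wired up roughly as follows; every callable in this sketch is a hypothetical placeholder rather than the framework's actual interface.

```python
def knowledge_augmented_forward(image, question, extract, filter_relevant,
                                encode, multimodal_transformer):
    """Illustrative extract -> filter -> encode flow on top of a multimodal
    transformer; all callables are assumed placeholders."""
    # Extract candidate facts for entities and scene text found in the image and question.
    candidates = extract(image, question)
    # Filter down to the facts actually relevant to this question.
    facts = filter_relevant(candidates, question)
    # Encode the retained facts and feed them as extra inputs alongside image and text.
    knowledge_tokens = encode(facts)
    return multimodal_transformer(image=image, text=question, extra_tokens=knowledge_tokens)
```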
arXiv Detail & Related papers (2021-08-22T13:21:58Z) - Learning Zero-Shot Multifaceted Visually Grounded Word Embeddingsvia
Multi-Task Training [8.271859911016719]
Language grounding aims at linking the symbolic representation of language (e.g., words) to the rich perceptual knowledge of the outside world.
We argue that this approach sacrifices the abstract knowledge obtained from linguistic co-occurrence statistics in the process of acquiring perceptual information.
arXiv Detail & Related papers (2021-04-15T14:49:11Z) - Contextualized Knowledge-aware Attentive Neural Network: Enhancing
Answer Selection with Knowledge [77.77684299758494]
We extensively investigate approaches to enhancing the answer selection model with external knowledge from a knowledge graph (KG).
First, we present a context-knowledge interaction learning framework, Knowledge-aware Neural Network (KNN), which learns the QA sentence representations by considering a tight interaction with the external knowledge from KG and the textual information.
To handle the diversity and complexity of KG information, we propose a Contextualized Knowledge-aware Attentive Neural Network (CKANN), which improves knowledge representation learning with structure information via a customized Graph Convolutional Network (GCN) and comprehensively learns context-based and knowledge-based sentence representations.
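A hedged sketch of this kind of combination, a graph convolution over a KG subgraph plus attention from the QA text, is given below; the single GCN layer, shapes and fusion are illustrative assumptions, not CKANN's model.

```python
import torch
import torch.nn as nn

class CKANNSketch(nn.Module):
    """Hedged sketch: one graph convolution over a KG subgraph plus attention
    from the QA text; shapes, the single GCN layer and the fusion are
    illustrative assumptions, not CKANN's model."""

    def __init__(self, d=300):
        super().__init__()
        self.gcn = nn.Linear(d, d)                   # one graph-convolution layer over entity nodes
        self.text_encoder = nn.LSTM(d, d, batch_first=True)
        self.attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.scorer = nn.Linear(2 * d, 1)            # relevance score for a candidate answer

    def forward(self, qa_tokens, node_feats, adj):
        # Structure-aware knowledge representation: aggregate each node's neighbours.
        nodes = torch.relu(self.gcn(adj @ node_feats))
        text, _ = self.text_encoder(qa_tokens)
        # Knowledge-aware attention: the QA text attends over the KG node representations.
        fused, _ = self.attn(query=text, key=nodes, value=nodes)
        pooled = torch.cat([text.mean(dim=1), fused.mean(dim=1)], dim=-1)
        return self.scorer(pooled)
```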
arXiv Detail & Related papers (2021-04-12T05:52:20Z) - Improving Disentangled Text Representation Learning with
Information-Theoretic Guidance [99.68851329919858]
The discrete nature of natural language makes disentangling textual representations more challenging.
Inspired by information theory, we propose a novel method that effectively manifests disentangled representations of text.
Experiments on both conditional text generation and text-style transfer demonstrate the high quality of our disentangled representation.
arXiv Detail & Related papers (2020-06-01T03:36:01Z)