External Knowledge Augmented Text Visual Question Answering
- URL: http://arxiv.org/abs/2108.09717v1
- Date: Sun, 22 Aug 2021 13:21:58 GMT
- Title: External Knowledge Augmented Text Visual Question Answering
- Authors: Arka Ujjal Dey, Ernest Valveny, Gaurav Harit
- Abstract summary: We propose a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision language understanding tasks.
We generate results comparable to the state-of-the-art on two publicly available datasets.
- Score: 0.6445605125467573
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The open-ended question answering task of Text-VQA requires reading and
reasoning about local, often previously unseen, scene-text content of an image
to generate answers. In this work, we propose the generalized use of external
knowledge to augment our understanding of the said scene-text. We design a
framework to extract, filter, and encode knowledge atop a standard multimodal
transformer for vision language understanding tasks. Through empirical
evidence, we demonstrate how knowledge can highlight instance-only cues and
thus help deal with training data bias, improve answer entity type correctness,
and detect multiword named entities. We generate results comparable to the
state-of-the-art on two publicly available datasets, under the constraints of
similar upstream OCR systems and training data.
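The extract-filter-encode pipeline summarized in the abstract can be pictured with a short sketch. Everything below (the toy knowledge base, the similarity-based filter, the embedding sizes) is an illustrative assumption, not the authors' released implementation.
```python
# Minimal sketch of the extract -> filter -> encode idea, not the paper's code.
# The knowledge base, embeddings, and thresholds are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 256  # shared embedding width (assumed)

def extract(ocr_tokens, kb):
    """Extract: collect candidate facts keyed by recognised scene-text tokens."""
    return [fact for tok in ocr_tokens for fact in kb.get(tok.lower(), [])]

def filter_facts(fact_emb, question_emb, top_k=3):
    """Filter: keep the facts most similar to the question embedding."""
    sim = F.cosine_similarity(fact_emb, question_emb.unsqueeze(0), dim=-1)
    keep = sim.topk(min(top_k, sim.numel())).indices
    return fact_emb[keep]

class KnowledgeAugmentedVQA(nn.Module):
    """Encode: append knowledge embeddings to the multimodal input sequence."""
    def __init__(self, d=D, layers=2):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.knowledge_proj = nn.Linear(d, d)

    def forward(self, question_emb, visual_emb, ocr_emb, knowledge_emb):
        k = self.knowledge_proj(knowledge_emb)
        seq = torch.cat([question_emb, visual_emb, ocr_emb, k], dim=0)
        return self.encoder(seq.unsqueeze(0))  # (1, seq_len, d)

# Toy usage with random embeddings standing in for real question/visual/OCR encoders.
kb = {"coca-cola": ["Coca-Cola is a soft drink brand."]}
facts = extract(["Coca-Cola", "25c"], kb)
fact_emb = torch.randn(max(len(facts), 1), D)
question_emb = torch.randn(4, D)  # question token embeddings
kept = filter_facts(fact_emb, question_emb.mean(dim=0))
model = KnowledgeAugmentedVQA()
out = model(question_emb, torch.randn(6, D), torch.randn(2, D), kept)
print(out.shape)
```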
Related papers
- VLLMs Provide Better Context for Emotion Understanding Through Common Sense Reasoning [66.23296689828152]
We leverage the capabilities of Vision-and-Large-Language Models to enhance in-context emotion classification.
In the first stage, we propose prompting VLLMs to generate descriptions in natural language of the subject's apparent emotion.
In the second stage, the descriptions are used as contextual information and, along with the image input, are used to train a transformer-based architecture.
arXiv Detail & Related papers (2024-04-10T15:09:15Z)
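A rough sketch of the two-stage recipe in the emotion-understanding entry above: stage one asks a vision-language model for a free-text description of the apparent emotion, stage two fuses that description with image features in a small transformer classifier. The placeholder description function, prompt, and feature sizes are assumptions for illustration.
```python
# Illustrative two-stage sketch (not the paper's implementation): stage 1 produces a
# natural-language emotion description, stage 2 fuses it with image features.
import torch
import torch.nn as nn

def describe_emotion(image) -> str:
    """Stage 1 placeholder: in practice a VLLM is prompted, e.g.
    'Describe the apparent emotion of the person in this image.'"""
    return "the subject appears tense and is looking away from the crowd"

class ContextFusionClassifier(nn.Module):
    """Stage 2: transformer over [text embedding ; image embedding] -> emotion logits."""
    def __init__(self, d=128, num_emotions=7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)
        self.head = nn.Linear(d, num_emotions)

    def forward(self, text_emb, image_emb):
        seq = torch.cat([text_emb, image_emb], dim=1)    # (B, T_text + T_img, d)
        return self.head(self.encoder(seq).mean(dim=1))  # (B, num_emotions)

description = describe_emotion(image=None)
text_emb, image_emb = torch.randn(1, 12, 128), torch.randn(1, 4, 128)  # stand-ins
logits = ContextFusionClassifier()(text_emb, image_emb)
print(description, logits.shape)
```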
- Contextual Knowledge Pursuit for Faithful Visual Synthesis [33.191847768674826]
In large language models (LLMs), a prevalent strategy to reduce hallucinations is to retrieve factual knowledge from an external database.
This paper proposes Conparametric Knowledge Pursuit (CKPT), a framework that leverages the complementary strengths of external and parametric knowledge to help generators produce reliable visual content.
arXiv Detail & Related papers (2023-11-29T18:51:46Z)
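The retrieval strategy mentioned in the Contextual Knowledge Pursuit entry above reduces, at its simplest, to prompt augmentation: fetch relevant facts and prepend them to the generator's request. The tiny fact store and overlap scoring below are assumptions, not the paper's components.
```python
# Minimal retrieval-augmentation sketch (illustrative, not the CKPT implementation).
# Facts are scored by word overlap with the request and prepended to the generator prompt.
FACT_STORE = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Penguins are flightless birds found mostly in the Southern Hemisphere.",
]

def retrieve(query: str, store, top_k: int = 1):
    """Score each fact by word overlap with the query and keep the best ones."""
    q = set(query.lower().split())
    scored = sorted(store, key=lambda f: len(q & set(f.lower().split())), reverse=True)
    return scored[:top_k]

def build_prompt(request: str) -> str:
    """Ground the generator by prepending retrieved facts to the request."""
    facts = retrieve(request, FACT_STORE)
    return "Facts:\n" + "\n".join(facts) + f"\n\nGenerate an image of: {request}"

print(build_prompt("a penguin standing on ice"))
```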
- DIVKNOWQA: Assessing the Reasoning Ability of LLMs via Open-Domain Question Answering over Knowledge Base and Text [73.68051228972024]
Large Language Models (LLMs) have exhibited impressive generation capabilities, but they suffer from hallucinations when relying on their internal knowledge.
Retrieval-augmented LLMs have emerged as a potential solution to ground LLMs in external knowledge.
arXiv Detail & Related papers (2023-10-31T04:37:57Z)
- Combo of Thinking and Observing for Outside-Knowledge VQA [13.838435454270014]
Outside-knowledge visual question answering is a challenging task that requires both the acquisition and the use of open-ended real-world knowledge.
In this paper, we are inspired to constrain the cross-modality space to the natural-language space.
We propose a novel framework consisting of a multimodal encoder, a textual encoder and an answer decoder.
arXiv Detail & Related papers (2023-05-10T18:32:32Z)
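The three-module layout named in the outside-knowledge VQA entry above (multimodal encoder, textual encoder, answer decoder) might be wired as in the sketch below; the dimensions, depths, and fusion choice are illustrative assumptions.
```python
# Illustrative three-module sketch (not the paper's architecture or hyper-parameters).
import torch
import torch.nn as nn

class OKVQAModel(nn.Module):
    def __init__(self, d=128, vocab=1000):
        super().__init__()
        self.multimodal_encoder = nn.TransformerEncoder(   # image + question
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=1)
        self.textual_encoder = nn.TransformerEncoder(      # retrieved knowledge text
            nn.TransformerEncoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=1)
        self.answer_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=d, nhead=4, batch_first=True), num_layers=1)
        self.vocab_head = nn.Linear(d, vocab)

    def forward(self, vision_question, knowledge_text, answer_so_far):
        # Both encoders map into one shared (natural-language-like) space,
        # and the decoder attends over their concatenation.
        memory = torch.cat([self.multimodal_encoder(vision_question),
                            self.textual_encoder(knowledge_text)], dim=1)
        return self.vocab_head(self.answer_decoder(answer_so_far, memory))

m = OKVQAModel()
logits = m(torch.randn(1, 20, 128), torch.randn(1, 30, 128), torch.randn(1, 5, 128))
print(logits.shape)  # (1, 5, vocab)
```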
- TegTok: Augmenting Text Generation via Task-specific and Open-world Knowledge [83.55215993730326]
We propose augmenting TExt Generation via Task-specific and Open-world Knowledge (TegTok) in a unified framework.
Our model selects knowledge entries from two types of knowledge sources through dense retrieval and then injects them into the input encoding and output decoding stages respectively.
arXiv Detail & Related papers (2022-03-16T10:37:59Z)
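A toy rendering of TegTok's dual-source idea above: dense retrieval over a task-specific store and an open-world store, with one result injected on the input-encoding side and the other on the output-decoding side. The stores and the stand-in encoder are assumptions, not the paper's retriever.
```python
# Toy dual-source dense retrieval sketch (illustrative only, not TegTok itself).
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in dense encoder: a hash-seeded random vector, stable within one run."""
    return np.random.default_rng(abs(hash(text)) % (2**32)).standard_normal(64)

def retrieve(query: str, store) -> str:
    """Return the entry whose embedding has the highest dot product with the query."""
    q = embed(query)
    return max(store, key=lambda e: float(embed(e) @ q))

task_specific = ["FAQ: resets require holding the power button for 10 seconds."]
open_world = ["Lithium-ion batteries degrade faster at high temperature."]

query = "why does my phone battery drain so fast"
input_side = retrieve(query, task_specific)   # injected when encoding the input
output_side = retrieve(query, open_world)     # injected when decoding the output
print(input_side, "|", output_side)
```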
- Open Domain Question Answering over Virtual Documents: A Unified Approach for Data and Text [62.489652395307914]
We use the data-to-text method as a means of encoding structured knowledge for knowledge-intensive applications, i.e., open-domain question answering (QA).
Specifically, we propose a verbalizer-retriever-reader framework for open-domain QA over data and text where verbalized tables from Wikipedia and triples from Wikidata are used as augmented knowledge sources.
We show that our Unified Data and Text QA, UDT-QA, can effectively benefit from the expanded knowledge index, leading to large gains over text-only baselines.
arXiv Detail & Related papers (2021-10-16T00:11:21Z)
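The verbalizer step of the UDT-QA framework above, which turns structured records into plain sentences a text retriever can index, might look like the sketch below; the paper uses a learned data-to-text model, so these hand-written templates are only an assumption for illustration.
```python
# Illustrative verbalizer sketch (UDT-QA's verbalizer is a learned model, not templates).
def verbalize_triple(subj: str, rel: str, obj: str) -> str:
    """Turn a KB triple into a sentence a text retriever can index."""
    return f"{subj} {rel.replace('_', ' ')} {obj}."

def verbalize_table_row(caption: str, row: dict) -> str:
    """Linearise one table row as 'caption: col is val; ...'."""
    cells = "; ".join(f"{col} is {val}" for col, val in row.items())
    return f"{caption}: {cells}."

docs = [
    verbalize_triple("Douglas Adams", "educated_at", "St John's College"),
    verbalize_table_row("Olympic hosts", {"Year": "2012", "City": "London"}),
]
# The verbalized strings join the plain-text passages in one retrieval index,
# so a single retriever-reader pipeline can answer over both data and text.
print(docs)
```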
- Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z)
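One simple, model-agnostic way to realise the knowledge injection described in the entry above is to append projected KB entity embeddings as extra input tokens to an arbitrary vision-and-language transformer; the wrapper below is an assumed illustration, not necessarily the paper's mechanism.
```python
# Assumed illustration of model-agnostic KB injection: extra knowledge tokens appended
# to the input of an arbitrary vision-and-language transformer.
import torch
import torch.nn as nn

class KBAugmentedWrapper(nn.Module):
    def __init__(self, vl_transformer: nn.Module, kb_dim: int, model_dim: int):
        super().__init__()
        self.vl_transformer = vl_transformer          # any encoder taking (B, T, d)
        self.project = nn.Linear(kb_dim, model_dim)   # the only new parameters added

    def forward(self, vl_tokens, kb_vectors):
        kb_tokens = self.project(kb_vectors)          # (B, K, d)
        return self.vl_transformer(torch.cat([vl_tokens, kb_tokens], dim=1))

# Toy usage: a one-layer transformer stands in for the base vision-and-language model.
base = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True), num_layers=1)
wrapped = KBAugmentedWrapper(base, kb_dim=300, model_dim=128)
out = wrapped(torch.randn(2, 10, 128), torch.randn(2, 5, 300))
print(out.shape)  # (2, 15, 128)
```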
- Improving Machine Reading Comprehension with Contextualized Commonsense Knowledge [62.46091695615262]
We aim to extract commonsense knowledge to improve machine reading comprehension.
We propose to represent relations implicitly by situating structured knowledge in a context.
We employ a teacher-student paradigm to inject multiple types of contextualized knowledge into a student machine reader.
arXiv Detail & Related papers (2020-09-12T17:20:01Z)
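The teacher-student injection in the reading-comprehension entry above can be reduced to a familiar distillation step: the student reader matches a knowledge-aware teacher's soft predictions alongside the usual answer loss. The loss weights and toy shapes below are illustrative assumptions.
```python
# Minimal distillation sketch (illustrative; the paper injects several kinds of
# contextualized knowledge, which this toy loss does not capture).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Mix the hard answer loss with a soft loss against the teacher's distribution."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    return alpha * hard + (1 - alpha) * soft

student = torch.randn(4, 4, requires_grad=True)  # 4 questions, 4 answer options
teacher = torch.randn(4, 4)                      # knowledge-aware teacher reader
labels = torch.tensor([0, 2, 1, 3])
loss = distillation_loss(student, teacher, labels)
loss.backward()
print(float(loss))
```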
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.