Reasoning over Vision and Language: Exploring the Benefits of
  Supplemental Knowledge
- URL: http://arxiv.org/abs/2101.06013v1
- Date: Fri, 15 Jan 2021 08:37:55 GMT
- Title: Reasoning over Vision and Language: Exploring the Benefits of
  Supplemental Knowledge
- Authors: Violetta Shevchenko, Damien Teney, Anthony Dick, Anton van den Hengel
- Abstract summary: This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
- Score: 59.87823082513752
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract:   The limits of applicability of vision-and-language models are defined by the
coverage of their training data. Tasks like visual question answering (VQA)
often require commonsense and factual information beyond what can be learned
from task-specific datasets. This paper investigates the injection of knowledge
from general-purpose knowledge bases (KBs) into vision-and-language
transformers. We use an auxiliary training objective that encourages the
learned representations to align with graph embeddings of matching entities in
a KB. We empirically study the relevance of various KBs to multiple tasks and
benchmarks. The technique brings clear benefits to knowledge-demanding question
answering tasks (OK-VQA, FVQA) by capturing semantic and relational knowledge
absent from existing models. More surprisingly, the technique also benefits
visual reasoning tasks (NLVR2, SNLI-VE). We perform probing experiments and
show that the injection of additional knowledge regularizes the space of
embeddings, which improves the representation of lexical and semantic
similarities. The technique is model-agnostic and can expand the applicability
of any vision-and-language transformer with minimal computational overhead.
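
The auxiliary objective described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact form of the loss, so the cosine-distance formulation, the weighting factor, and every name below (`alignment_loss`, `total_loss`, `matches`, `weight`) are assumptions made for illustration, not the authors' implementation. The sketch also assumes that model representations and KB graph embeddings already share a dimension; in practice a learned projection would likely be needed.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def alignment_loss(token_reprs, kb_embeddings, matches):
    """Mean cosine distance between model representations and KB embeddings.

    token_reprs:   list of vectors, one per input token (hypothetical model output)
    kb_embeddings: dict mapping entity id -> graph embedding vector
    matches:       list of (token_index, entity_id) pairs for tokens that
                   were matched to a KB entity (assumed to be given)
    """
    if not matches:
        return 0.0
    total = 0.0
    for token_index, entity_id in matches:
        similarity = cosine_similarity(token_reprs[token_index], kb_embeddings[entity_id])
        total += 1.0 - similarity
    return total / len(matches)


def total_loss(task_loss, token_reprs, kb_embeddings, matches, weight=0.1):
    """Task loss plus the weighted auxiliary alignment term (weight is an assumed hyperparameter)."""
    return task_loss + weight * alignment_loss(token_reprs, kb_embeddings, matches)


# Toy example: token 0 points the same way as "dog", token 1 is orthogonal to "cat".
token_reprs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kb_embeddings = {"dog": [2.0, 0.0], "cat": [1.0, 0.0]}
matches = [(0, "dog"), (1, "cat")]

aux = alignment_loss(token_reprs, kb_embeddings, matches)
combined = total_loss(2.0, token_reprs, kb_embeddings, matches, weight=0.1)
```

Because the auxiliary term only adds a per-token comparison on top of the existing task loss, a formulation like this leaves the underlying model unchanged, which is consistent with the abstract's claim that the technique is model-agnostic and adds little overhead.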
 
      
Related papers
        - VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework [8.629074194407611]
 Visual reasoning refers to the task of solving questions about visual information.
We propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework) for visual reasoning tasks.
 (2025-02-02T07:54:55Z)
- XCoOp: Explainable Prompt Learning for Computer-Aided Diagnosis via Concept-guided Context Optimization [4.634780391920529]
 We propose a novel explainable prompt learning framework that leverages medical knowledge by aligning the semantics of images, learnable prompts, and clinical concept-driven prompts.
Our framework addresses the lack of valuable concept annotations by eliciting knowledge from large language models.
Our method simultaneously achieves superior diagnostic performance, flexibility, and interpretability, shedding light on the effectiveness of foundation models in facilitating XAI.
 (2024-03-14T14:02:01Z)
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [64.24227572048075]
 We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence in which external knowledge is usually incorporated into recognizing novel categories of objects.
 (2023-08-22T04:24:45Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph
  Propagation [68.13453771001522]
 We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
 (2023-06-14T13:07:48Z)
- VIPHY: Probing "Visible" Physical Commonsense Knowledge [22.00069189468524]
 Vision-language models (VLMs) have shown remarkable performance on visual reasoning tasks.
We evaluate their ability to acquire "visible" physical knowledge.
Our results indicate a severe gap between model and human performance.
 (2022-09-15T02:06:25Z)
- REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual
  Question Answering [75.53187719777812]
 This paper revisits visual representation in knowledge-based visual question answering (VQA).
We propose a new knowledge-based VQA method REVIVE, which tries to utilize the explicit information of object regions.
We achieve new state-of-the-art performance, i.e., 58.0% accuracy, surpassing previous state-of-the-art method by a large margin.
 (2022-06-02T17:59:56Z)
- K-LITE: Learning Transferable Visual Models with External Knowledge [242.3887854728843]
 K-LITE (Knowledge-augmented Language-Image Training and Evaluation) is a strategy to leverage external knowledge to build transferable visual systems.
In training, it enriches entities in natural language with WordNet and Wiktionary knowledge.
In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts.
 (2022-04-20T04:47:01Z)
- Leveraging Visual Knowledge in Language Tasks: An Empirical Study on
  Intermediate Pre-training for Cross-modal Knowledge Transfer [61.34424171458634]
 We study whether integrating visual knowledge into a language model can fill the gap.
Our experiments show that visual knowledge transfer can improve performance in both low-resource and fully supervised settings.
 (2022-03-14T22:02:40Z)
- Improving and Diagnosing Knowledge-Based Visual Question Answering via
  Entity Enhanced Knowledge Injection [14.678153928301493]
 Knowledge-Based Visual Question Answering (KBVQA) is a bi-modal task requiring external world knowledge in order to correctly answer a text question and associated image.
Recent single text work has shown knowledge injection into pre-trained language models, specifically entity enhanced knowledge graph embeddings, can improve performance on downstream entity-centric tasks.
 (2021-12-13T18:45:42Z)
- External Knowledge Augmented Text Visual Question Answering [0.6445605125467573]
 We propose a framework to extract, filter, and encode knowledge atop a standard multimodal transformer for vision language understanding tasks.
We generate results comparable to the state-of-the-art on two publicly available datasets.
 (2021-08-22T13:21:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
       
     