Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models
- URL: http://arxiv.org/abs/2306.02115v2
- Date: Wed, 26 Jul 2023 02:20:30 GMT
- Title: Table and Image Generation for Investigating Knowledge of Entities in Pre-trained Vision and Language Models
- Authors: Hidetaka Kamigaito, Katsuhiko Hayashi, Taro Watanabe
- Abstract summary: We propose a task to verify how knowledge about entities acquired from natural language is retained in Vision & Language (V&L) models.
This task consists of two parts: the first is to generate a table containing knowledge about an entity and its related image, and the second is to generate an image from an entity with a caption and a table containing related knowledge of the entity.
We created the Wikipedia Table and Image Generation (WikiTIG) dataset from about 200,000 infoboxes in English Wikipedia articles to perform the proposed tasks.
- Score: 31.865208971014336
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a table and image generation task to verify how the
knowledge about entities acquired from natural language is retained in Vision &
Language (V&L) models. This task consists of two parts: the first is to
generate a table containing knowledge about an entity and its related image,
and the second is to generate an image from an entity with a caption and a
table containing related knowledge of the entity. In both tasks, the model must
have knowledge of the entities in order to perform the generation properly. We created the
Wikipedia Table and Image Generation (WikiTIG) dataset from about 200,000
infoboxes in English Wikipedia articles to perform the proposed tasks. We
evaluated the performance on the tasks with respect to the above research
question using the V&L model OFA, which has achieved state-of-the-art results
in multiple tasks. Experimental results show that OFA forgets part of its
entity knowledge during the complementary pre-training intended to improve its
performance on image-related tasks.
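To make the two sub-tasks concrete, the sketch below shows how a single WikiTIG-style record built from an infobox might be represented and turned into model inputs. The field names, the table linearization, and the prompt wording are illustrative assumptions, not the format defined by the paper.

# A minimal sketch of a WikiTIG-style record and the two generation directions.
# All field names, the linearization scheme, and the prompt wording are
# assumptions made for illustration; the paper does not specify them here.
from dataclasses import dataclass


@dataclass
class WikiTIGExample:
    """Hypothetical record derived from one Wikipedia infobox."""
    entity: str               # article title the infobox belongs to
    infobox: dict[str, str]   # attribute -> value pairs from the infobox
    caption: str              # caption of the infobox image
    image_path: str           # path to the infobox image file


def linearize_table(infobox: dict[str, str]) -> str:
    """Flatten the infobox into one string for a seq2seq V&L model such as OFA."""
    return " | ".join(f"{key} : {value}" for key, value in infobox.items())


def table_generation_input(example: WikiTIGExample) -> str:
    """Part 1: the model receives only the entity and must produce the table
    (and its related image)."""
    return f'what is the infobox of "{example.entity}"?'


def image_generation_input(example: WikiTIGExample) -> str:
    """Part 2: the model receives the entity, the caption, and the table,
    and must produce an image."""
    return (f'generate an image of "{example.entity}". '
            f"caption: {example.caption}. "
            f"table: {linearize_table(example.infobox)}")


if __name__ == "__main__":
    # Made-up record, purely for illustration.
    example = WikiTIGExample(
        entity="Eiffel Tower",
        infobox={"Location": "Paris, France", "Height": "330 m", "Opened": "1889"},
        caption="The Eiffel Tower seen from the Champ de Mars",
        image_path="images/eiffel_tower.jpg",
    )
    print(table_generation_input(example))
    print(image_generation_input(example))

In practice a V&L model such as OFA would also receive image patches alongside the text prompt; the strings above only illustrate how entity knowledge from an infobox could be packed into the textual side of the input.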
Related papers
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists [66.85125112199898]
We develop a unified language interface for computer vision tasks that abstracts away task-specific design choices.
Our model, dubbed InstructCV, performs competitively compared to other generalist and task-specific vision models.
arXiv Detail & Related papers (2023-09-30T14:26:43Z)
- "Let's not Quote out of Context": Unified Vision-Language Pretraining for Context Assisted Image Captioning [40.01197694624958]
We propose a new unified Vision-Language (VL) model based on the One For All (OFA) model.
Our approach aims to overcome the context-independent (image and text are treated independently) nature of the existing approaches.
Our system achieves state-of-the-art results with an improvement of up to 8.34 CIDEr score on the benchmark news image captioning datasets.
arXiv Detail & Related papers (2023-06-01T17:34:25Z)
- Named Entity and Relation Extraction with Multi-Modal Retrieval [51.660650522630526]
Multi-modal named entity recognition (NER) and relation extraction (RE) aim to leverage relevant image information to improve the performance of NER and RE.
We propose a novel Multi-modal Retrieval-based framework (MoRe).
MoRe contains a text retrieval module and an image-based retrieval module, which retrieve related knowledge of the input text and image in the knowledge corpus respectively.
arXiv Detail & Related papers (2022-12-03T13:11:32Z)
- Visual Named Entity Linking: A New Dataset and A Baseline [61.38231023490981]
We consider a purely Visual-based Named Entity Linking (VNEL) task, where the input only consists of an image.
We propose three different sub-tasks, i.e., visual to visual entity linking (V2VEL), visual to textual entity linking (V2TEL), and visual to visual-textual entity linking (V2VTEL).
We present a high-quality human-annotated visual person linking dataset, named WIKIPerson.
arXiv Detail & Related papers (2022-11-09T13:27:50Z)
- Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph [96.95815946327079]
It is difficult to learn the association between named entities and visual cues due to the long-tail distribution of named entities.
We propose a novel approach that constructs a multi-modal knowledge graph to associate the visual objects with named entities.
arXiv Detail & Related papers (2021-07-26T05:50:41Z)
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks learning over raw text with the guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)