Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting
- URL: http://arxiv.org/abs/2310.18804v1
- Date: Sat, 28 Oct 2023 20:09:29 GMT
- Title: Open Visual Knowledge Extraction via Relation-Oriented Multimodality Model Prompting
- Authors: Hejie Cui, Xinyu Fang, Zihan Zhang, Ran Xu, Xuan Kan, Xin Liu, Yue Yu, Manling Li, Yangqiu Song, Carl Yang
- Abstract summary: We take a first exploration toward a new paradigm of open visual knowledge extraction.
OpenVik consists of an open relational region detector, which detects regions potentially containing relational knowledge, and a visual knowledge generator, which produces format-free knowledge by prompting a large multimodality model with the detected region of interest.
- Score: 89.95541601837719
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Images contain rich relational knowledge that can help machines understand
the world. Existing methods on visual knowledge extraction often rely on the
pre-defined format (e.g., sub-verb-obj tuples) or vocabulary (e.g., relation
types), restricting the expressiveness of the extracted knowledge. In this
work, we take a first exploration toward a new paradigm of open visual knowledge
extraction. To achieve this, we present OpenVik which consists of an open
relational region detector to detect regions potentially containing relational
knowledge and a visual knowledge generator that generates format-free knowledge
by prompting a large multimodality model with the detected region of
interest. We also explore two data enhancement techniques for diversifying the
generated format-free visual knowledge. Extensive knowledge quality evaluations
highlight the correctness and uniqueness of the extracted open visual knowledge
by OpenVik. Moreover, integrating our extracted knowledge across various visual
reasoning applications shows consistent improvements, indicating the real-world
applicability of OpenVik.
Related papers
- Knowledge Graph Extension by Entity Type Recognition [2.8231106019727195]
We propose a novel knowledge graph extension framework based on entity type recognition.
The framework aims to achieve high-quality knowledge extraction by aligning the schemas and entities across different knowledge graphs.
arXiv Detail & Related papers (2024-05-03T19:55:03Z)
- Visual Knowledge in the Big Model Era: Retrospect and Prospect [63.282425615863]
Visual knowledge is a new form of knowledge representation that can encapsulate visual concepts and their relations in a succinct, comprehensive, and interpretable manner.
As the knowledge about the visual world has been identified as an indispensable component of human cognition and intelligence, visual knowledge is poised to have a pivotal role in establishing machine intelligence.
arXiv Detail & Related papers (2024-04-05T07:31:24Z)
- Percept, Chat, and then Adapt: Multimodal Knowledge Transfer of Foundation Models for Open-World Video Recognition [36.56176821492121]
We propose a generic knowledge transfer pipeline to boost open-world video recognition.
We name it PCA, based on three stages of Percept, Chat, and Adapt.
Our approach achieves state-of-the-art performance on all three datasets.
arXiv Detail & Related papers (2024-02-29T08:29:03Z)
- A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering [53.70661720114377]
Multimodal large models (MLMs) have significantly advanced the field of visual understanding, offering remarkable capabilities in the realm of visual question answering (VQA).
Yet, the true challenge lies in the domain of knowledge-intensive VQA tasks, which necessitate deep comprehension of the visual information in conjunction with a vast repository of learned knowledge.
To uncover such capabilities, we provide an in-depth evaluation from three perspectives: 1) Commonsense Knowledge, which assesses how well models can understand visual cues and connect them to general knowledge; 2) Fine-grained World Knowledge, which tests the model's skill in reasoning out specific knowledge from images.
arXiv Detail & Related papers (2023-11-13T18:22:32Z)
- Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings (a minimal sketch of such region-to-embedding matching appears after this list).
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z)
- ReSee: Responding through Seeing Fine-grained Visual Knowledge in Open-domain Dialogue [34.223466503256766]
We provide a new paradigm of constructing multimodal dialogues by splitting visual knowledge into finer granularity.
To boost the accuracy and diversity of the augmented visual information, we retrieve it from the Internet or a large image dataset.
By leveraging text and vision knowledge, ReSee can produce informative responses with real-world visual concepts.
arXiv Detail & Related papers (2023-05-23T02:08:56Z)
- Combo of Thinking and Observing for Outside-Knowledge VQA [13.838435454270014]
Outside-knowledge visual question answering is a challenging task that requires both the acquisition and the use of open-ended real-world knowledge.
In this paper, we constrain the cross-modality space to coincide with the natural-language space.
We propose a novel framework consisting of a multimodal encoder, a textual encoder and an answer decoder.
arXiv Detail & Related papers (2023-05-10T18:32:32Z)
- UNTER: A Unified Knowledge Interface for Enhancing Pre-trained Language Models [100.4659557650775]
We propose a UNified knowledge inTERface, UNTER, to provide a unified perspective to exploit both structured knowledge and unstructured knowledge.
With both forms of knowledge injected, UNTER gains continuous improvements on a series of knowledge-driven NLP tasks.
arXiv Detail & Related papers (2023-05-02T17:33:28Z)
- Knowledge Graph Augmented Network Towards Multiview Representation Learning for Aspect-based Sentiment Analysis [96.53859361560505]
We propose a knowledge graph augmented network (KGAN) to incorporate external knowledge together with explicit syntactic and contextual information.
KGAN captures the sentiment feature representations from multiple perspectives, i.e., context-, syntax- and knowledge-based.
Experiments on three popular ABSA benchmarks demonstrate the effectiveness and robustness of our KGAN.
arXiv Detail & Related papers (2022-01-13T08:25:53Z)