Knowledge-augmented Few-shot Visual Relation Detection
- URL: http://arxiv.org/abs/2303.05342v1
- Date: Thu, 9 Mar 2023 15:38:40 GMT
- Title: Knowledge-augmented Few-shot Visual Relation Detection
- Authors: Tianyu Yu, Yangning Li, Jiaoyan Chen, Yinghui Li, Hai-Tao Zheng, Xi
Chen, Qingbin Liu, Wenqiang Liu, Dongxiao Huang, Bei Wu, Yexin Wang
- Abstract summary: Visual Relation Detection (VRD) aims to detect relationships between objects for image understanding.
Most existing VRD methods rely on thousands of training samples of each relationship to achieve satisfactory performance.
We devise a knowledge-augmented, few-shot VRD framework leveraging both textual knowledge and visual relation knowledge.
- Score: 25.457693302327637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual Relation Detection (VRD) aims to detect relationships between objects
for image understanding. Most existing VRD methods rely on thousands of
training samples of each relationship to achieve satisfactory performance. Some
recent papers tackle this problem by few-shot learning with elaborately
designed pipelines and pre-trained word vectors. However, the performance of
existing few-shot VRD models is severely hampered by poor generalization: they
struggle to handle the vast semantic diversity of visual relationships. In
contrast, humans can learn new relationships from just a few examples by
drawing on prior knowledge. Inspired by this, we devise a
knowledge-augmented, few-shot VRD framework leveraging both textual knowledge
and visual relation knowledge to improve the generalization ability of few-shot
VRD. The textual knowledge and visual relation knowledge are acquired from a
pre-trained language model and an automatically constructed visual relation
knowledge graph, respectively. We extensively validate the effectiveness of our
framework. Experiments conducted on three benchmarks from the commonly used
Visual Genome dataset show that our framework surpasses existing
state-of-the-art models by a large margin.
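As a rough illustration of the idea, a minimal sketch follows, assuming a prototypical-network-style few-shot classifier and language-model text embeddings already projected into the visual feature space; every name and the blending scheme are hypothetical, not the authors' implementation:

```python
# A minimal sketch, assuming a prototypical-network-style few-shot classifier
# and text embeddings projected into the visual feature space; this illustrates
# the general idea of textual-knowledge augmentation, not the paper's code.
import torch
import torch.nn.functional as F

def relation_prototypes(support_feats: torch.Tensor,
                        support_labels: torch.Tensor,
                        num_classes: int) -> torch.Tensor:
    """Average the visual features of each relation's support examples."""
    return torch.stack([
        support_feats[support_labels == c].mean(dim=0)
        for c in range(num_classes)
    ])  # shape: (num_classes, dim)

def knowledge_augmented_scores(query_feats: torch.Tensor,
                               protos: torch.Tensor,
                               text_embeds: torch.Tensor,
                               alpha: float = 0.5) -> torch.Tensor:
    """Blend similarity to visual prototypes with similarity to relation-name
    embeddings from a pre-trained language model (the 'textual knowledge').
    query_feats: (num_queries, dim); protos, text_embeds: (num_classes, dim)."""
    vis_sim = F.cosine_similarity(query_feats.unsqueeze(1), protos.unsqueeze(0), dim=-1)
    txt_sim = F.cosine_similarity(query_feats.unsqueeze(1), text_embeds.unsqueeze(0), dim=-1)
    return alpha * vis_sim + (1 - alpha) * txt_sim  # (num_queries, num_classes)
```

A query relation would then be assigned to the class with the highest blended score; in the paper's framework, the automatically constructed visual relation knowledge graph would contribute additional priors beyond this textual component.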
Related papers
- RelVAE: Generative Pretraining for few-shot Visual Relationship
Detection [2.2230760534775915]
We present the first pretraining method for few-shot predicate classification that does not require any annotated relations.
We construct few-shot training splits and report quantitative experiments on the VG200 and VRD datasets.
arXiv Detail & Related papers (2023-11-27T19:08:08Z) - Visual Commonsense based Heterogeneous Graph Contrastive Learning [79.22206720896664]
We propose a heterogeneous graph contrastive learning method to better accomplish the visual reasoning task.
Our method is designed in a plug-and-play manner, so it can be quickly and easily combined with a wide range of representative methods.
arXiv Detail & Related papers (2023-11-11T12:01:18Z) - Sample-Efficient Learning of Novel Visual Concepts [7.398195748292981]
State-of-the-art deep learning models struggle to recognize novel objects in a few-shot setting.
We show that incorporating a symbolic knowledge graph into a state-of-the-art recognition model enables a new approach for effective few-shot classification.
arXiv Detail & Related papers (2023-06-15T20:24:30Z) - Recognizing Unseen Objects via Multimodal Intensive Knowledge Graph
Propagation [68.13453771001522]
We propose a multimodal intensive ZSL framework that matches regions of images with corresponding semantic embeddings.
We conduct extensive experiments and evaluate our model on large-scale real-world data.
arXiv Detail & Related papers (2023-06-14T13:07:48Z) - Zero-shot Visual Relation Detection via Composite Visual Cues from Large
Language Models [44.60439935450292]
We propose RECODE, a novel method for zero-shot visual relation detection.
It decomposes each predicate category into subject, object, and spatial components.
Different visual cues enhance the discriminability of similar relation categories from different perspectives (a toy sketch of this cue-combination idea appears after this list).
arXiv Detail & Related papers (2023-05-21T14:40:48Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - One-shot Scene Graph Generation [130.57405850346836]
We propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task.
Our method outperforms existing state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-02-22T11:32:59Z) - Reasoning over Vision and Language: Exploring the Benefits of
Supplemental Knowledge [59.87823082513752]
This paper investigates the injection of knowledge from general-purpose knowledge bases (KBs) into vision-and-language transformers.
We empirically study the relevance of various KBs to multiple tasks and benchmarks.
The technique is model-agnostic and can expand the applicability of any vision-and-language transformer with minimal computational overhead.
arXiv Detail & Related papers (2021-01-15T08:37:55Z) - Visual Relationship Detection with Visual-Linguistic Knowledge from
Multimodal Representations [103.00383924074585]
Visual relationship detection aims to reason over relationships among salient objects in images.
We propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT).
RVL-BERT performs spatial reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training.
arXiv Detail & Related papers (2020-09-10T16:15:09Z)
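As referenced in the RECODE entry above, the toy sketch below shows one way subject, object, and spatial cue similarities could be combined into a single predicate score; all names, weights, and the CLIP-style feature assumption are hypothetical, not taken from the paper:

```python
# A toy sketch (all names hypothetical, not the RECODE implementation): score a
# candidate predicate by combining similarities between a region feature and
# text embeddings of its subject, object, and spatial cue descriptions.
import torch
import torch.nn.functional as F

def composite_cue_score(region_feat: torch.Tensor,
                        cue_text_embeds: dict,
                        cue_weights: dict) -> torch.Tensor:
    """region_feat: (dim,); cue_text_embeds maps cue name -> (dim,) embedding;
    cue_weights maps the same cue names -> float weights."""
    score = torch.zeros(())
    for cue, embed in cue_text_embeds.items():
        score = score + cue_weights[cue] * F.cosine_similarity(region_feat, embed, dim=0)
    return score  # scalar tensor; higher means a better predicate match
```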