One-shot Scene Graph Generation
- URL: http://arxiv.org/abs/2202.10824v1
- Date: Tue, 22 Feb 2022 11:32:59 GMT
- Title: One-shot Scene Graph Generation
- Authors: Yuyu Guo, Jingkuan Song, Lianli Gao, Heng Tao Shen
- Abstract summary: We propose Multiple Structured Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot scene graph generation task.
Our method significantly outperforms existing state-of-the-art methods by a large margin.
- Score: 130.57405850346836
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a structured representation of the image content, the visual scene graph
(visual relationship) acts as a bridge between computer vision and natural
language processing. Existing models on the scene graph generation task
notoriously require tens or hundreds of labeled samples. By contrast, human
beings can learn visual relationships from a few or even one example. Inspired
by this, we design a task named One-Shot Scene Graph Generation, where each
relationship triplet (e.g., "dog-has-head") comes from only one labeled
example. The key insight is that rather than learning from scratch, one can
utilize rich prior knowledge. In this paper, we propose Multiple Structured
Knowledge (Relational Knowledge and Commonsense Knowledge) for the one-shot
scene graph generation task. Specifically, the Relational Knowledge represents
the prior knowledge of relationships between entities extracted from the visual
content, e.g., the visual relationships "standing in", "sitting in", and "lying
in" may exist between "dog" and "yard", while the Commonsense Knowledge encodes
"sense-making" knowledge like "dog can guard yard". By organizing these two
kinds of knowledge in a graph structure, Graph Convolution Networks (GCNs) are
used to extract knowledge-embedded semantic features of the entities. Besides,
instead of extracting isolated visual features from each entity generated by
Faster R-CNN, we utilize an Instance Relation Transformer encoder to fully
explore their context information. Based on a constructed one-shot dataset, the
experimental results show that our method significantly outperforms existing
state-of-the-art methods by a large margin. Ablation studies also verify the
effectiveness of the Instance Relation Transformer encoder and the Multiple
Structured Knowledge.
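The abstract names two technical components: a GCN that propagates prior knowledge over a graph of entities, and an Instance Relation Transformer encoder that contextualizes per-instance detector features. Below is a minimal PyTorch sketch of both ideas; it is not the authors' implementation, and all module names, dimensions, and hyperparameters (KnowledgeGCN, InstanceRelationEncoder, d_model=512, etc.) are illustrative assumptions.
```python
# Minimal sketch (not the authors' released code) of the two components named
# in the abstract, assuming PyTorch. Names and sizes are illustrative only.
import torch
import torch.nn as nn


class KnowledgeGCN(nn.Module):
    """Propagates entity embeddings over a knowledge graph (relational or
    commonsense) using a symmetrically normalized adjacency matrix."""

    def __init__(self, dim: int, num_layers: int = 2):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x:   (num_entities, dim) initial semantic embeddings (e.g., word vectors)
        # adj: (num_entities, num_entities) adjacency of the knowledge graph
        a_hat = adj + torch.eye(adj.size(0), device=adj.device)   # add self-loops
        d_inv_sqrt = a_hat.sum(-1).clamp(min=1e-6).pow(-0.5)
        norm_adj = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        for layer in self.layers:
            x = torch.relu(norm_adj @ layer(x))   # aggregate neighbor knowledge
        return x                                  # knowledge-embedded features


class InstanceRelationEncoder(nn.Module):
    """Transformer encoder over per-instance visual features, so each detected
    object attends to all others instead of being encoded in isolation."""

    def __init__(self, d_model: int = 512, nhead: int = 8, num_layers: int = 3):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, inst_feats: torch.Tensor) -> torch.Tensor:
        # inst_feats: (batch, num_instances, d_model) RoI features from a detector
        return self.encoder(inst_feats)           # context-aware instance features
```
In the paper, features of these two kinds would then be fused to classify the predicate of each subject-object pair; that fusion step is omitted here because its exact form is not specified in the abstract.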
Related papers
- Knowledge-augmented Few-shot Visual Relation Detection [25.457693302327637]
Visual Relation Detection (VRD) aims to detect relationships between objects for image understanding.
Most existing VRD methods rely on thousands of training samples of each relationship to achieve satisfactory performance.
We devise a knowledge-augmented, few-shot VRD framework leveraging both textual knowledge and visual relation knowledge.
arXiv Detail & Related papers (2023-03-09T15:38:40Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph [96.95815946327079]
It is difficult to learn the association between named entities and visual cues due to the long-tail distribution of named entities.
We propose a novel approach that constructs a multi-modal knowledge graph to associate the visual objects with named entities.
arXiv Detail & Related papers (2021-07-26T05:50:41Z)
- Entity Context Graph: Learning Entity Representations from Semi-Structured Textual Sources on the Web [44.92858943475407]
We propose an approach that processes entity centric textual knowledge sources to learn entity embeddings.
We show that the embeddings learned from our approach are high quality, comparable to known knowledge graph-based embeddings, and can be used to improve them further.
arXiv Detail & Related papers (2021-03-29T20:52:14Z)
- Learning Graph Embeddings for Compositional Zero-shot Learning [73.80007492964951]
In compositional zero-shot learning, the goal is to recognize unseen compositions of observed visual primitives (states and objects).
We propose a novel graph formulation called Compositional Graph Embedding (CGE) that learns image features and latent representations of visual primitives in an end-to-end manner.
By learning a joint compatibility that encodes semantics between concepts, our model allows for generalization to unseen compositions without relying on an external knowledge base like WordNet.
arXiv Detail & Related papers (2021-02-03T10:11:03Z)
- Learning to Represent Image and Text with Denotation Graph [32.417311523031195]
We propose learning representations from a set of implied, visually grounded expressions between image and text.
We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations.
arXiv Detail & Related papers (2020-10-06T18:00:58Z)
- Exploiting Structured Knowledge in Text via Graph-Guided Representation Learning [73.0598186896953]
We present two self-supervised tasks learning over raw text with the guidance from knowledge graphs.
Building upon entity-level masked language models, our first contribution is an entity masking scheme.
In contrast to existing paradigms, our approach uses knowledge graphs implicitly, only during pre-training.
arXiv Detail & Related papers (2020-04-29T14:22:42Z)
- Bridging Knowledge Graphs to Generate Scene Graphs [49.69377653925448]
We propose a novel graph-based neural network that iteratively propagates information between the two graphs, as well as within each of them.
Our Graph Bridging Network, GB-Net, successively infers edges and nodes, allowing it to simultaneously exploit and refine the rich, heterogeneous structure of the interconnected scene and commonsense graphs.
arXiv Detail & Related papers (2020-01-07T23:35:52Z)