Tensor Composition Net for Visual Relationship Prediction
- URL: http://arxiv.org/abs/2012.05473v1
- Date: Thu, 10 Dec 2020 06:27:20 GMT
- Title: Tensor Composition Net for Visual Relationship Prediction
- Authors: Yuting Qiang, Yongxin Yang, Yanwen Guo and Timothy M. Hospedales
- Abstract summary: We present a novel Tensor Composition Network (TCN) to predict visual relationships in images.
The key idea of our TCN is to exploit the low rank property of the visual relationship tensor.
We show our TCN's image-level visual relationship prediction provides a simple and efficient mechanism for relation-based image retrieval.
- Score: 115.14829858763399
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a novel Tensor Composition Network (TCN) to predict visual
relationships in images. Visual relationships in subject-predicate-object form
provide a more powerful query modality than simple image tags. However, Visual
Relationship Prediction (VRP) also provides a more challenging test of image
understanding than conventional image tagging, and is difficult to learn due to
a large label space and incomplete annotation. The key idea of our TCN is to
exploit the low rank property of the visual relationship tensor, so as to
leverage correlations within and across objects and relationships, and make a
structured prediction of all objects and their relations in an image. To show
the effectiveness of our method, we first empirically compare our model with
multi-label image classification (MLIC) alternatives on VRP, and show that our model
outperforms state-of-the-art MLIC methods. We then show that, thanks to our
tensor (de)composition layer, our model can predict visual relationships which
have not been seen in the training dataset. We finally show that our TCN's image-level
visual relationship prediction provides a simple and efficient mechanism for
relation-based image retrieval.
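As a toy illustration of the low-rank composition idea above, the sketch below composes a relationship score tensor from three factor matrices and ranks triples from it. All sizes, the random factors, and the `triple_score` helper are illustrative assumptions, not the paper's implementation; in the actual TCN the factors would be predicted from image features by the learned tensor (de)composition layer, not drawn at random.

```python
import numpy as np

# Hypothetical sizes (assumptions, not the paper's settings).
n_objects, n_predicates, rank = 100, 70, 16

rng = np.random.default_rng(0)
# Low-rank factors; in the actual model these would be produced from
# image features by the (de)composition layer, not sampled randomly.
U = rng.normal(size=(n_objects, rank))     # subject factors
W = rng.normal(size=(n_predicates, rank))  # predicate factors
V = rng.normal(size=(n_objects, rank))     # object factors

# CP-style composition: T[s, p, o] = sum_k U[s, k] * W[p, k] * V[o, k]
T = np.einsum('sk,pk,ok->spo', U, W, V)

# Image-level structured prediction: rank all (subject, predicate, object) triples.
top = np.unravel_index(np.argsort(T, axis=None)[::-1][:5], T.shape)
for s, p, o in zip(*top):
    print(f"triple ({s}, {p}, {o}) score = {T[s, p, o]:.3f}")

def triple_score(U_i, W_i, V_i, s, p, o):
    """Score one query triple against one image's factors; ranking
    images by this score gives a simple relation-based retrieval."""
    return float(U_i[s] @ (W_i[p] * V_i[o]))
```

Note that for retrieval the full tensor never needs to be materialised per query: scoring one triple against one image's factors costs only O(rank), which is what makes image-level relationship prediction cheap enough to serve as a retrieval index.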
Related papers
- Composing Object Relations and Attributes for Image-Text Matching [70.47747937665987]
This work introduces a dual-encoder image-text matching model, leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges.
Our model efficiently encodes object-attribute and object-object semantic relations, resulting in a robust and fast-performing system.
arXiv Detail & Related papers (2024-06-17T17:56:01Z)
- Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection [14.22646492640906]
We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection.
Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly.
Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds.
arXiv Detail & Related papers (2024-03-21T10:15:57Z)
- Correlational Image Modeling for Self-Supervised Visual Pre-Training [81.82907503764775]
Correlational Image Modeling is a novel and surprisingly effective approach to self-supervised visual pre-training.
Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task.
arXiv Detail & Related papers (2023-03-22T15:48:23Z)
- Detecting Objects with Context-Likelihood Graphs and Graph Refinement [45.70356990655389]
The goal of this paper is to detect objects by exploiting their interrelations. Contrary to existing methods, which learn objects and relations separately, our key idea is to learn the object-relation distribution jointly.
We propose a novel way of creating a graphical representation of an image from inter-object relations and initial class predictions, which we call a context-likelihood graph.
We then learn the joint distribution with an energy-based modeling technique, which allows us to sample and refine the context-likelihood graph iteratively for a given image.
arXiv Detail & Related papers (2022-12-23T15:27:21Z)
- Relational Embedding for Few-Shot Classification [32.12002195421671]
We propose to address the problem of few-shot classification by meta-learning "what to observe" and "where to attend" from a relational perspective.
Our method leverages patterns within and between images via self-correlational representation (SCR) and cross-correlational attention (CCA).
Our Relational Embedding Network (RENet) combines the two relational modules to learn relational embedding in an end-to-end manner.
arXiv Detail & Related papers (2021-08-22T08:44:55Z)
- Unified Graph Structured Models for Video Understanding [93.72081456202672]
We propose a message passing graph neural network that explicitly models spatio-temporal relations.
We show how our method is able to more effectively model relationships between relevant entities in the scene.
arXiv Detail & Related papers (2021-03-29T14:37:35Z)
- Group-Wise Semantic Mining for Weakly Supervised Semantic Segmentation [49.90178055521207]
This work addresses weakly supervised semantic segmentation (WSSS), with the goal of bridging the gap between image-level annotations and pixel-level segmentation.
We formulate WSSS as a novel group-wise learning task that explicitly models semantic dependencies in a group of images to estimate more reliable pseudo ground-truths.
In particular, we devise a graph neural network (GNN) for group-wise semantic mining, wherein input images are represented as graph nodes.
arXiv Detail & Related papers (2020-12-09T12:40:13Z)
- Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z)
- Explanation-based Weakly-supervised Learning of Visual Relations with Graph Networks [7.199745314783952]
This paper introduces a novel weakly-supervised method for visual relationship detection that relies on minimal image-level predicate labels.
A graph neural network is trained to classify predicates in images from a graph representation of detected objects, implicitly encoding an inductive bias for pairwise relations.
We present results comparable to recent fully- and weakly-supervised methods on three diverse and challenging datasets.
arXiv Detail & Related papers (2020-06-16T23:14:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences arising from its use.