Few-shot Visual Relationship Co-localization
- URL: http://arxiv.org/abs/2108.11618v1
- Date: Thu, 26 Aug 2021 07:19:57 GMT
- Title: Few-shot Visual Relationship Co-localization
- Authors: Revant Teotia, Vaibhav Mishra, Mayank Maheshwari, Anand Mishra
- Abstract summary: Given a small bag of images, each containing a common but latent predicate, we are interested in localizing visual subject-object pairs connected via the common predicate in each of the images.
We propose an optimization framework to select a common visual relationship in each image of the bag.
We extensively evaluate our proposed framework on variations of bag sizes obtained from two challenging public datasets.
- Score: 1.4130726713527195
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, given a small bag of images, each containing a common but
latent predicate, we are interested in localizing visual subject-object pairs
connected via the common predicate in each of the images. We refer to this
novel problem as visual relationship co-localization, or VRC for short.
VRC is a challenging task, even more so than the well-studied object
co-localization task. The task becomes even more challenging when only a few
images are available, as the model has to learn to co-localize visual
subject-object pairs connected via unseen predicates. To solve VRC, we propose an optimization
framework to select a common visual relationship in each image of the bag. The
goal of the optimization framework is to find the optimal solution by learning
visual relationship similarity across images in a few-shot setting. To obtain
robust visual relationship representation, we utilize a simple yet effective
technique that learns relationship embedding as a translation vector from
visual subject to visual object in a shared space. Further, to learn visual
relationship similarity, we utilize a proven meta-learning technique commonly
used for few-shot classification tasks. Finally, to tackle the combinatorial
complexity challenge arising from an exponential number of feasible solutions,
we use a greedy approximate inference algorithm that selects a near-optimal
solution.
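To make the translation-based embedding concrete, here is a minimal PyTorch sketch of the TransE-style idea described above: project subject and object region features into a shared space and take their difference as the relationship vector. The class name, projection layer, and dimensions are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class TranslationRelEmbedding(nn.Module):
    """TransE-style relationship embedding: predicate = object - subject.

    Subject and object region features are projected into a shared space,
    and the relationship is represented by the translation vector between
    them. The single linear projection and the dimensions are assumptions
    made for illustration.
    """

    def __init__(self, feat_dim: int = 2048, embed_dim: int = 256):
        super().__init__()
        # One shared projection applied to both subject and object features.
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, subj_feat: torch.Tensor, obj_feat: torch.Tensor) -> torch.Tensor:
        # r = o - s: the vector that translates the subject onto the object.
        return self.proj(obj_feat) - self.proj(subj_feat)


# Hypothetical usage with pooled detector features for candidate pairs.
embedder = TranslationRelEmbedding()
subj = torch.randn(4, 2048)  # subject-box features for 4 candidate pairs
obj = torch.randn(4, 2048)   # object-box features for the same pairs
rel = embedder(subj, obj)    # (4, 256) relationship embeddings
```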
We extensively evaluate our proposed framework on variations of bag sizes
obtained from two challenging public datasets, namely VrR-VG and VG-150, and
achieve impressive visual co-localization performance.
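To see why inference is combinatorial: with roughly n candidate subject-object pairs per image and m images in the bag, there are on the order of n^m joint selections. The sketch below illustrates one greedy approximation in the spirit the abstract describes, fixing the selection image by image; it substitutes plain cosine similarity for the learned few-shot similarity, and every function and variable name is hypothetical.

```python
import torch
import torch.nn.functional as F

def greedy_bag_selection(candidates: list[torch.Tensor]) -> list[int]:
    """Greedily pick one relationship embedding per image in a bag.

    candidates[i] is an (n_i, d) tensor of candidate relationship embeddings
    for image i. Instead of scoring all prod(n_i) joint selections, choices
    are fixed image by image, each time taking the candidate most similar to
    the selections made so far. Cosine similarity stands in for the learned
    similarity; this is an illustrative sketch, not the paper's exact algorithm.
    """
    assert len(candidates) >= 2, "need a bag of at least two images"

    # Seed with the most similar candidate pair across the first two images.
    sim = F.cosine_similarity(
        candidates[0].unsqueeze(1), candidates[1].unsqueeze(0), dim=-1
    )  # shape (n_0, n_1)
    i0, i1 = divmod(int(sim.argmax()), sim.shape[1])
    chosen = [i0, i1]
    pool = torch.stack([candidates[0][i0], candidates[1][i1]])

    for cands in candidates[2:]:
        # Score each candidate by mean similarity to everything chosen so far.
        scores = F.cosine_similarity(
            cands.unsqueeze(1), pool.unsqueeze(0), dim=-1
        ).mean(dim=1)
        best = int(scores.argmax())
        chosen.append(best)
        pool = torch.cat([pool, cands[best : best + 1]])
    return chosen


# Hypothetical usage: a bag of 4 images with varying candidate counts.
bag = [torch.randn(n, 256) for n in (5, 3, 7, 4)]
print(greedy_bag_selection(bag))  # one selected candidate index per image
```

Because each image's choice is fixed once, the cost grows linearly with the bag size rather than exponentially.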
Related papers
- Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection [14.22646492640906]
We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection.
Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly.
Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds.
arXiv Detail & Related papers (2024-03-21T10:15:57Z)
- Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z)
- Learning-based Relational Object Matching Across Views [63.63338392484501]
We propose a learning-based approach which combines local keypoints with novel object-level features for matching object detections between RGB images.
We train our object-level matching features based on appearance and inter-frame and cross-frame spatial relations between objects in an associative graph neural network.
arXiv Detail & Related papers (2023-05-03T19:36:51Z)
- Global-and-Local Collaborative Learning for Co-Salient Object Detection [162.62642867056385]
The goal of co-salient object detection (CoSOD) is to discover salient objects that commonly appear in a query group containing two or more relevant images.
We propose a global-and-local collaborative learning architecture, which includes a global correspondence modeling (GCM) module and a local correspondence modeling (LCM) module.
The proposed GLNet is evaluated on three prevailing CoSOD benchmark datasets, demonstrating that our model, trained on a small dataset (about 3k images), still outperforms eleven state-of-the-art competitors trained on much larger datasets (about 8k-200k images).
arXiv Detail & Related papers (2022-04-19T14:32:41Z)
- Two-stage Visual Cues Enhancement Network for Referring Image Segmentation [89.49412325699537]
Referring Image Segmentation (RIS) aims at segmenting the target object from an image referred to by a given natural language expression.
In this paper, we tackle this problem by devising a Two-stage Visual cues enhancement Network (TV-Net).
Through the two-stage enhancement, our proposed TV-Net achieves better performance in learning fine-grained matching behaviors between the natural language expression and the image.
arXiv Detail & Related papers (2021-10-09T02:53:39Z)
- RTIC: Residual Learning for Text and Image Composition using Graph Convolutional Network [19.017377597937617]
We study the compositional learning of images and texts for image retrieval.
We introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods.
arXiv Detail & Related papers (2021-04-07T09:41:52Z)
- Analysis on Image Set Visual Question Answering [0.3359875577705538]
We tackle the challenge of Visual Question Answering (VQA) in a multi-image setting.
Traditional VQA tasks have focused on a single-image setting where the target answer is generated from a single image.
In this report, we work with four approaches in a bid to improve performance on the task.
arXiv Detail & Related papers (2021-03-31T20:47:32Z)
- Tensor Composition Net for Visual Relationship Prediction [115.14829858763399]
We present a novel Tensor Composition Network (TCN) to predict visual relationships in images.
The key idea of our TCN is to exploit the low rank property of the visual relationship tensor.
We show our TCN's image-level visual relationship prediction provides a simple and efficient mechanism for relation-based image retrieval.
arXiv Detail & Related papers (2020-12-10T06:27:20Z)
- Multi-Modal Retrieval using Graph Neural Networks [1.8911962184174562]
We learn a joint vision and concept embedding in the same high-dimensional space.
We model the visual and concept relationships as a graph structure.
We also introduce a novel inference time control, based on selective neighborhood connectivity.
arXiv Detail & Related papers (2020-10-04T19:34:20Z)
- ORD: Object Relationship Discovery for Visual Dialogue Generation [60.471670447176656]
We propose an object relationship discovery (ORD) framework to preserve the object interactions for visual dialogue generation.
A hierarchical graph convolutional network (HierGCN) is proposed to retain the object nodes and neighbour relationships locally, and then refine the object-object connections globally.
Experiments show that the proposed method can significantly improve the quality of dialogue by utilising the contextual information of visual relationships.
arXiv Detail & Related papers (2020-06-15T12:25:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.