Scene Graph Embeddings Using Relative Similarity Supervision
- URL: http://arxiv.org/abs/2104.02381v1
- Date: Tue, 6 Apr 2021 09:13:05 GMT
- Title: Scene Graph Embeddings Using Relative Similarity Supervision
- Authors: Paridhi Maheshwari, Ritwick Chaudhry, Vishwa Vinay
- Abstract summary: We employ a graph convolutional network to exploit structure in scene graphs and produce image embeddings useful for semantic image retrieval.
We propose a novel loss function that operates on pairs of similar and dissimilar images and imposes relative ordering between them in embedding space.
We demonstrate that this Ranking loss, coupled with an intuitive triple sampling strategy, leads to robust representations that outperform well-known contrastive losses on the retrieval task.
- Score: 4.137464623395376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scene graphs are a powerful structured representation of the underlying
content of images, and embeddings derived from them have been shown to be
useful in multiple downstream tasks. In this work, we employ a graph
convolutional network to exploit structure in scene graphs and produce image
embeddings useful for semantic image retrieval. Different from
classification-centric supervision traditionally available for learning image
representations, we address the task of learning from relative similarity
labels in a ranking context. Rooted within the contrastive learning paradigm,
we propose a novel loss function that operates on pairs of similar and
dissimilar images and imposes relative ordering between them in embedding
space. We demonstrate that this Ranking loss, coupled with an intuitive triple
sampling strategy, leads to robust representations that outperform well-known
contrastive losses on the retrieval task. In addition, we provide qualitative
evidence of how retrieved results that utilize structured scene information
capture the global context of the scene, different from visual similarity
search.
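The abstract describes a loss that imposes a relative ordering between similar and dissimilar images in embedding space, applied to triples selected by a sampling strategy. As a rough illustration only, the snippet below sketches one margin-based way to express such a relative-ordering constraint over (anchor, similar, dissimilar) triples in PyTorch; the function name, margin value, and use of cosine similarity are assumptions for the sketch, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ranking_loss(anchor, positive, negative, margin=0.2):
    """Illustrative margin-based ranking loss on embedding triples.

    Encourages the anchor to be closer (in cosine similarity) to the
    "similar" image than to the "dissimilar" one by at least `margin`.
    This is only a sketch of the relative-ordering idea; the paper's
    exact loss, margin, and sampling details may differ.
    """
    sim_pos = F.cosine_similarity(anchor, positive, dim=-1)
    sim_neg = F.cosine_similarity(anchor, negative, dim=-1)
    # Penalize triples where the similar image does not outscore the
    # dissimilar one by at least the margin.
    return F.relu(margin - (sim_pos - sim_neg)).mean()

if __name__ == "__main__":
    # Toy usage with random stand-ins for GCN-derived scene graph embeddings
    # (hypothetical batch size and dimensionality).
    anchor = torch.randn(8, 128)
    positive = torch.randn(8, 128)   # images labeled as more similar to the anchor
    negative = torch.randn(8, 128)   # images labeled as less similar
    print(ranking_loss(anchor, positive, negative).item())
```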
Related papers
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph
Parsing [66.70054075041487]
Existing parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z) - Coarse-to-Fine Contrastive Learning in Image-Text-Graph Space for
Improved Vision-Language Compositionality [50.48859793121308]
Contrastively trained vision-language models have achieved remarkable progress in vision and language representation learning.
Recent research has highlighted severe limitations in their ability to perform compositional reasoning over objects, attributes, and relations.
arXiv Detail & Related papers (2023-05-23T08:28:38Z) - Contextual Similarity Aggregation with Self-attention for Visual
Re-ranking [96.55393026011811]
We propose a visual re-ranking method by contextual similarity aggregation with self-attention.
We conduct comprehensive experiments on four benchmark datasets to demonstrate the generality and effectiveness of our proposed visual re-ranking method.
arXiv Detail & Related papers (2021-10-26T06:20:31Z) - RTIC: Residual Learning for Text and Image Composition using Graph
Convolutional Network [19.017377597937617]
We study the compositional learning of images and texts for image retrieval.
We introduce a novel method that combines the graph convolutional network (GCN) with existing composition methods.
arXiv Detail & Related papers (2021-04-07T09:41:52Z) - Image-to-Image Retrieval by Learning Similarity between Scene Graphs [5.284353899197193]
We propose a novel approach for image-to-image retrieval using scene graph similarity measured by graph neural networks.
In our approach, graph neural networks are trained to predict the proxy image relevance measure, computed from human-annotated captions.
arXiv Detail & Related papers (2020-12-29T10:45:20Z) - Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z) - Improving Image Captioning with Better Use of Captions [65.39641077768488]
We present a novel image captioning architecture to better explore semantics available in captions and leverage that to enhance both image representation and caption generation.
Our models first construct caption-guided visual relationship graphs that introduce beneficial inductive bias using weakly supervised multi-instance learning.
During generation, the model further incorporates visual relationships using multi-task learning for jointly predicting word and object/predicate tag sequences.
arXiv Detail & Related papers (2020-06-21T14:10:47Z) - Learning to Compare Relation: Semantic Alignment for Few-Shot Learning [48.463122399494175]
We present a novel semantic alignment model to compare relations, which is robust to content misalignment.
We conduct extensive experiments on several few-shot learning datasets.
arXiv Detail & Related papers (2020-02-29T08:37:02Z) - Deep Multimodal Image-Text Embeddings for Automatic Cross-Media
Retrieval [0.0]
We introduce an end-to-end deep multimodal convolutional-recurrent network for learning both vision and language representations simultaneously.
The model learns which pairs are a match (positive) and which ones are a mismatch (negative) using a hinge-based triplet ranking loss; a minimal sketch of this common formulation appears after this list.
arXiv Detail & Related papers (2020-02-23T23:58:04Z)
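For contrast with the relative-ordering loss sketched above, the last related paper mentions a hinge-based triplet ranking objective for cross-modal (image-text) retrieval. The sketch below shows the widely used bidirectional form of that loss, assuming in-batch negatives; the function name and margin are illustrative and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def hinge_triplet_ranking_loss(image_emb, text_emb, margin=0.2):
    """Sketch of a hinge-based triplet ranking loss for image-text matching.

    Matching (image, caption) pairs sit on the diagonal of the batch
    similarity matrix; every other pair in the batch acts as a negative,
    in both retrieval directions. The exact variant used in the cited
    paper (margin value, hardest-negative mining) is an assumption here.
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    scores = image_emb @ text_emb.t()              # (B, B) cosine similarities
    pos = scores.diag().view(-1, 1)                # similarity of matching pairs
    cost_i2t = F.relu(margin + scores - pos)       # caption negatives per image
    cost_t2i = F.relu(margin + scores - pos.t())   # image negatives per caption
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    cost_i2t = cost_i2t.masked_fill(mask, 0.0)     # ignore the positive pair itself
    cost_t2i = cost_t2i.masked_fill(mask, 0.0)
    return cost_i2t.mean() + cost_t2i.mean()
```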
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including its completeness and accuracy) and is not responsible for any consequences of its use.