A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval
- URL: http://arxiv.org/abs/2106.02400v1
- Date: Fri, 4 Jun 2021 10:33:14 GMT
- Title: A Deep Local and Global Scene-Graph Matching for Image-Text Retrieval
- Authors: Manh-Duy Nguyen, Binh T. Nguyen, Cathal Gurrin
- Abstract summary: Scene-graph representation is a suitable method for the image-text matching challenge.
We introduce the Local and Global Scene Graph Matching (LGSGM) model that enhances the state-of-the-art method.
Our enhancement, which combines both levels, improves the baseline method's recall by more than 10% on the Flickr30k dataset.
- Score: 4.159666152160874
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Conventional approaches to image-text retrieval mainly focus on indexing
visual objects appearing in pictures but ignore the interactions between these
objects. Such object occurrences and interactions are equally useful and
important in this field, as they are usually mentioned in the text. Scene-graph
representation is a suitable method for the image-text matching challenge and
has obtained good results due to its ability to capture inter-relationship
information. Both images and text are represented at the scene-graph level,
formulating the retrieval challenge as a scene-graph matching challenge. In this
paper, we introduce the Local and Global Scene Graph Matching (LGSGM) model
that enhances the state-of-the-art method by integrating an extra graph
convolution network to capture the general information of a graph.
Specifically, for a pair of scene graphs of an image and its caption, two
separate models are used to learn the features of each graph's nodes and edges.
Then a Siamese-structure graph convolution model is employed to embed graphs
into vector forms. We finally combine the graph-level and the vector-level
representations to calculate the similarity of this image-text pair. The empirical experiments
show that our enhancement, combining both levels, can improve the performance
of the baseline method by increasing recall by more than 10% on the Flickr30k
dataset.
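The pipeline described above lends itself to a small illustration. The following PyTorch sketch is a minimal, hypothetical example (not the authors' released code) of combining a global graph-level similarity, obtained by passing both scene graphs through a shared, Siamese-style graph-convolution module with mean pooling, with a local node-level matching score. The `SimpleGraphConv` layer, the `alpha` weight, and the toy adjacency matrices are illustrative assumptions rather than the paper's actual architecture or hyperparameters.

```python
# Hypothetical sketch (not the authors' implementation): combine a local
# node-level matching score with a global graph-level score, assuming node
# features and adjacency matrices for an image scene graph and a caption graph.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphConv(nn.Module):
    """One graph-convolution layer followed by mean pooling. The weights are
    shared across both graphs, giving a Siamese structure."""
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)

    def forward(self, node_feats, adj):
        # node_feats: (num_nodes, dim); adj: (num_nodes, num_nodes) with self-loops
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        agg = adj @ node_feats / deg          # mean aggregation over neighbours
        h = F.relu(self.linear(agg))          # per-node hidden states
        return h.mean(dim=0)                  # graph-level embedding via mean pooling

def local_score(img_nodes, txt_nodes):
    """Local (node-level) matching: for each caption node, take the best-matching
    image node by cosine similarity, then average the maxima."""
    sim = F.normalize(txt_nodes, dim=-1) @ F.normalize(img_nodes, dim=-1).T
    return sim.max(dim=1).values.mean()

def pair_similarity(img_nodes, img_adj, txt_nodes, txt_adj, gcn, alpha=0.5):
    """Weighted combination of the global (graph-embedding) and local scores.
    `alpha` is an illustrative hyperparameter, not taken from the paper."""
    g_img = gcn(img_nodes, img_adj)
    g_txt = gcn(txt_nodes, txt_adj)
    global_score = F.cosine_similarity(g_img, g_txt, dim=0)
    return alpha * global_score + (1 - alpha) * local_score(img_nodes, txt_nodes)

# Toy usage with random features: 5 image-graph nodes, 4 caption-graph nodes, dim 8.
dim = 8
gcn = SimpleGraphConv(dim)
img_nodes, txt_nodes = torch.randn(5, dim), torch.randn(4, dim)
img_adj, txt_adj = torch.eye(5), torch.eye(4)   # self-loops only, for brevity
print(pair_similarity(img_nodes, img_adj, txt_nodes, txt_adj, gcn).item())
```

In practice the node and edge features would come from the two separate image and caption encoders mentioned in the abstract, and the local/global combination would be trained end to end with a ranking loss; the sketch only shows how the two levels of similarity can be fused into a single score.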
Related papers
- G-Retriever: Retrieval-Augmented Generation for Textual Graph Understanding and Question Answering [61.93058781222079]
We develop a flexible question-answering framework targeting real-world textual graphs.
We introduce the first retrieval-augmented generation (RAG) approach for general textual graphs.
G-Retriever performs RAG over a graph by formulating this task as a Prize-Collecting Steiner Tree optimization problem.
arXiv Detail & Related papers (2024-02-12T13:13:04Z)
- Image Synthesis with Graph Conditioning: CLIP-Guided Diffusion Models for Scene Graphs [0.0]
We introduce a novel approach to generate images from scene graphs.
We leverage pre-trained text-to-image diffusion models and CLIP guidance to translate graph knowledge into images.
Elaborate experiments reveal that our method outperforms existing methods on standard benchmarks.
arXiv Detail & Related papers (2024-01-25T11:46:31Z)
- Clustering-based Image-Text Graph Matching for Domain Generalization [13.277406473107721]
Domain-invariant visual representations are important to train a model that can generalize well to unseen target task domains.
Recent works demonstrate that text descriptions contain high-level class-discriminative information.
We advocate for the use of local alignment between image regions and corresponding textual descriptions.
arXiv Detail & Related papers (2023-10-04T10:03:07Z)
- FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing [66.70054075041487]
Existing scene-graph parsers that convert image captions into scene graphs often suffer from two types of errors.
First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness.
Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations.
arXiv Detail & Related papers (2023-05-27T15:38:31Z)
- CGMN: A Contrastive Graph Matching Network for Self-Supervised Graph Similarity Learning [65.1042892570989]
We propose a contrastive graph matching network (CGMN) for self-supervised graph similarity learning.
We employ two strategies, namely cross-view interaction and cross-graph interaction, for effective node representation learning.
We transform node representations into graph-level representations via pooling operations for graph similarity computation.
arXiv Detail & Related papers (2022-05-30T13:20:26Z)
- SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning [61.57887011165744]
Multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning.
We propose a Scene Graph Enhanced Image-Text Learning framework to incorporate visual scene graphs in commonsense reasoning.
arXiv Detail & Related papers (2021-12-16T03:16:30Z)
- Learning to Generate Scene Graph from Natural Language Supervision [52.18175340725455]
We propose one of the first methods that learn from image-sentence pairs to extract a graphical representation of localized objects and their relationships within an image, known as scene graph.
We leverage an off-the-shelf object detector to identify and localize object instances, match labels of detected regions to concepts parsed from captions, and thus create "pseudo" labels for learning scene graph.
arXiv Detail & Related papers (2021-09-06T03:38:52Z)
- Bridging Knowledge Graphs to Generate Scene Graphs [49.69377653925448]
We propose a novel graph-based neural network that iteratively propagates information between the two graphs, as well as within each of them.
Our Graph Bridging Network, GB-Net, successively infers edges and nodes, allowing it to simultaneously exploit and refine the rich, heterogeneous structure of the interconnected scene and commonsense graphs.
arXiv Detail & Related papers (2020-01-07T23:35:52Z)