Similarity Reasoning and Filtration for Image-Text Matching
- URL: http://arxiv.org/abs/2101.01368v1
- Date: Tue, 5 Jan 2021 06:29:35 GMT
- Title: Similarity Reasoning and Filtration for Image-Text Matching
- Authors: Haiwen Diao, Ying Zhang, Lin Ma, Huchuan Lu
- Abstract summary: We propose a novel Similarity Graph Reasoning and Attention Filtration (SGRAF) network for image-text matching.
A Similarity Graph Reasoning (SGR) module, relying on one graph convolutional neural network, is introduced to infer relation-aware similarities with both the local and global alignments.
We demonstrate the superiority of the proposed method by achieving state-of-the-art performance on the Flickr30K and MSCOCO datasets.
- Score: 85.68854427456249
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text matching plays a critical role in bridging vision and
language, and great progress has been made by exploiting the global alignment
between image and sentence, or local alignments between regions and words.
However, how to make the most of these alignments to infer more accurate
matching scores is still underexplored. In this paper, we propose a novel
Similarity Graph Reasoning and Attention Filtration (SGRAF) network for
image-text matching. Specifically, vector-based similarity representations
are first learned to characterize the local and global alignments in a more
comprehensive manner, and then the Similarity Graph Reasoning (SGR) module,
relying on one graph convolutional neural network, is introduced to infer
relation-aware similarities with both the local and global alignments. The
Similarity Attention Filtration (SAF) module is further developed to integrate
these alignments effectively by selectively attending to the significant and
representative alignments while casting aside the interference of
non-meaningful alignments. We demonstrate the superiority of the proposed
method by achieving state-of-the-art performance on the Flickr30K and MSCOCO
datasets, and the good interpretability of the SGR and SAF modules with extensive
qualitative experiments and analyses.
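The pipeline described in the abstract (vector-valued similarities, graph reasoning over them, then attention-based filtration) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the squared-difference similarity, the softmax edge weights, the sigmoid gate, and the identity weight matrices in the toy usage are all assumptions chosen for illustration, and a real model would learn these weights.

```python
import math


def matvec(W, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_vector(x, y, W):
    """Assumed similarity form: L2-normalized projection of (x - y)^2."""
    diff = [(a - b) ** 2 for a, b in zip(x, y)]
    s = matvec(W, diff)
    norm = math.sqrt(sum(v * v for v in s)) or 1.0  # avoid division by zero
    return [v / norm for v in s]


def graph_reasoning_step(nodes, W_in, W_out, W_r):
    """One message-passing round over a fully connected graph of similarity vectors."""
    queries = [matvec(W_in, s) for s in nodes]
    keys = [matvec(W_out, s) for s in nodes]
    messages = [matvec(W_r, s) for s in nodes]
    dim = len(messages[0])
    updated = []
    for q in queries:
        # Edge weights from this node to every node, normalized to sum to 1.
        edges = softmax([sum(a * b for a, b in zip(q, k)) for k in keys])
        agg = [sum(e * m[i] for e, m in zip(edges, messages)) for i in range(dim)]
        updated.append([max(0.0, v) for v in agg])  # ReLU
    return updated


def attention_filtration(nodes, w_f):
    """Gate each similarity vector with a sigmoid score, then take the weighted mean."""
    gates = [1.0 / (1.0 + math.exp(-sum(w * v for w, v in zip(w_f, s)))) for s in nodes]
    total = sum(gates)
    dim = len(nodes[0])
    pooled = [sum(g * s[i] for g, s in zip(gates, nodes)) / total for i in range(dim)]
    return pooled, gates


# Toy usage with hypothetical 2-D features and identity weights (assumptions).
identity = [[1.0, 0.0], [0.0, 1.0]]
region_word_pairs = [([1.0, 0.0], [0.0, 0.0]), ([0.5, 0.5], [0.5, 0.0])]
nodes = [similarity_vector(x, y, identity) for x, y in region_word_pairs]
nodes = graph_reasoning_step(nodes, identity, identity, identity)
pooled, gates = attention_filtration(nodes, [0.0, 0.0])
```

In this sketch, low-gate nodes contribute little to `pooled`, which mirrors the idea of suppressing non-meaningful alignments; a final learned layer mapping `pooled` to a scalar matching score is omitted.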
Related papers
- Clustering-based Image-Text Graph Matching for Domain Generalization [13.277406473107721]
Domain-invariant visual representations are important to train a model that can generalize well to unseen target task domains.
Recent works demonstrate that text descriptions contain high-level class-discriminative information.
We advocate for the use of local alignment between image regions and corresponding textual descriptions.
arXiv Detail & Related papers (2023-10-04T10:03:07Z)
- HGAN: Hierarchical Graph Alignment Network for Image-Text Retrieval [13.061063817876336]
We propose a novel Hierarchical Graph Alignment Network (HGAN) for image-text retrieval.
First, to capture the comprehensive multimodal features, we construct the feature graphs for the image and text modality respectively.
Then, a multi-granularity shared space is established with a designed Multi-granularity Feature Aggregation and Rearrangement (MFAR) module.
Finally, the ultimate image and text features are further refined through three-level similarity functions to achieve the hierarchical alignment.
arXiv Detail & Related papers (2022-12-16T05:08:52Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
- Global-and-Local Collaborative Learning for Co-Salient Object Detection [162.62642867056385]
The goal of co-salient object detection (CoSOD) is to discover salient objects that commonly appear in a query group containing two or more relevant images.
We propose a global-and-local collaborative learning architecture, which includes a global correspondence modeling (GCM) module and a local correspondence modeling (LCM) module.
The proposed GLNet is evaluated on three prevailing CoSOD benchmark datasets, demonstrating that our model trained on a small dataset (about 3k images) still outperforms eleven state-of-the-art competitors trained on some large datasets (about 8k-200k images).
arXiv Detail & Related papers (2022-04-19T14:32:41Z)
- DenseGAP: Graph-Structured Dense Correspondence Learning with Anchor Points [15.953570826460869]
Establishing dense correspondence between two images is a fundamental computer vision problem.
We introduce DenseGAP, a new solution for efficient Dense correspondence learning with a Graph-structured neural network conditioned on Anchor Points.
Our method advances the state-of-the-art of correspondence learning on most benchmarks.
arXiv Detail & Related papers (2021-12-13T18:59:30Z)
- Step-Wise Hierarchical Alignment Network for Image-Text Matching [29.07229472373576]
We propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process.
Specifically, we first achieve local-to-local alignment at the fragment level, followed by global-to-local and global-to-global alignment at the context level, performed sequentially.
arXiv Detail & Related papers (2021-06-11T17:05:56Z)
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
- Multi-Level Graph Convolutional Network with Automatic Graph Learning for Hyperspectral Image Classification [63.56018768401328]
We propose a Multi-level Graph Convolutional Network (GCN) with Automatic Graph Learning method (MGCN-AGL) for HSI classification.
By employing an attention mechanism to characterize the importance among spatially neighboring regions, the most relevant information can be adaptively incorporated to make decisions.
Our MGCN-AGL encodes the long range dependencies among image regions based on the expressive representations that have been produced at local level.
arXiv Detail & Related papers (2020-09-19T09:26:20Z)
- GINet: Graph Interaction Network for Scene Parsing [58.394591509215005]
We propose a Graph Interaction unit (GI unit) and a Semantic Context Loss (SC-loss) to promote context reasoning over image regions.
The proposed GINet outperforms the state-of-the-art approaches on the popular benchmarks, including Pascal-Context and COCO Stuff.
arXiv Detail & Related papers (2020-09-14T02:52:45Z)
- High-Order Information Matters: Learning Relation and Topology for Occluded Person Re-Identification [84.43394420267794]
We propose a novel framework by learning high-order relation and topology information for discriminative features and robust alignment.
Our framework significantly outperforms the state of the art by 6.5% mAP on the Occluded-Duke dataset.
arXiv Detail & Related papers (2020-03-18T12:18:35Z)