Hierarchical Matching and Reasoning for Multi-Query Image Retrieval
- URL: http://arxiv.org/abs/2306.14460v1
- Date: Mon, 26 Jun 2023 07:03:56 GMT
- Title: Hierarchical Matching and Reasoning for Multi-Query Image Retrieval
- Authors: Zhong Ji, Zhihao Li, Yan Zhang, Haoran Wang, Yanwei Pang, Xuelong Li
- Abstract summary: We propose a novel Hierarchical Matching and Reasoning Network (HMRN) for Multi-Query Image Retrieval (MQIR).
It disentangles MQIR into three hierarchical semantic representations, which are responsible for capturing fine-grained local details, contextual global scopes, and high-level inherent correlations.
Our HMRN substantially surpasses the current state-of-the-art methods.
- Score: 113.44470784756308
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As a promising field, Multi-Query Image Retrieval (MQIR) aims at searching
for the semantically relevant image given multiple region-specific text
queries. Existing works mainly focus on a single-level similarity between image
regions and text queries, which neglects the hierarchical guidance of
multi-level similarities and results in incomplete alignments. Besides, the
high-level semantic correlations that intrinsically connect different
region-query pairs are rarely considered. To address the above limitations, we
propose a novel Hierarchical Matching and Reasoning Network (HMRN) for MQIR. It
disentangles MQIR into three hierarchical semantic representations, which are
responsible for capturing fine-grained local details, contextual global scopes,
and high-level inherent correlations. HMRN comprises two modules: a Scalar-based
Matching (SM) module and a Vector-based Reasoning (VR) module. Specifically, the
SM module characterizes the multi-level alignment similarity, which consists of
a fine-grained local-level similarity and a context-aware global-level
similarity. Afterwards, the VR module is developed to excavate the potential
semantic correlations among multiple region-query pairs, which further explores
the high-level reasoning similarity. Finally, these three-level similarities
are aggregated into a joint similarity space to form the ultimate similarity.
Extensive experiments on the benchmark dataset demonstrate that our HMRN
substantially surpasses the current state-of-the-art methods. For instance,
compared with the existing best method Drill-down, the metric R@1 in the last
round is improved by 23.4%. Our source codes will be released at
https://github.com/LZH-053/HMRN.
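The three-level aggregation described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the feature vectors, the pairwise stand-in for the VR module's reasoning similarity, and the equal aggregation weights are all assumptions for illustration only.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mean_vec(vecs):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def joint_similarity(region_feats, query_feats, weights=(1/3, 1/3, 1/3)):
    """Toy three-level similarity for one candidate image.

    region_feats / query_feats: one feature vector per region-query pair.
    weights: hypothetical weights for aggregating the three levels.
    """
    # Local level: average per-pair alignment similarity (SM module, fine-grained).
    local = sum(cosine(r, q) for r, q in zip(region_feats, query_feats)) / len(region_feats)
    # Global level: similarity of pooled, context-aware representations (SM module).
    glob = cosine(mean_vec(region_feats), mean_vec(query_feats))
    # Reasoning level: a stand-in for correlations among region-query pairs
    # (VR module); here, the mean pairwise similarity of concatenated pairs.
    pairs = [r + q for r, q in zip(region_feats, query_feats)]
    n = len(pairs)
    if n > 1:
        reasoning = sum(cosine(pairs[i], pairs[j])
                        for i in range(n) for j in range(i + 1, n)) / (n * (n - 1) / 2)
    else:
        reasoning = local
    # Aggregate the three levels into one joint similarity score.
    w1, w2, w3 = weights
    return w1 * local + w2 * glob + w3 * reasoning
```

In this sketch the ultimate similarity is a fixed weighted sum; the paper aggregates the three similarities in a learned joint space, so treat the weights here purely as placeholders.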
Related papers
- MGMapNet: Multi-Granularity Representation Learning for End-to-End Vectorized HD Map Construction [75.93907511203317]
We propose MGMapNet (Multi-Granularity Map Network) to model map elements with a multi-granularity representation.
The proposed MGMapNet achieves state-of-the-art performance, surpassing MapTRv2 by 5.3 mAP on nuScenes and 4.4 mAP on Argoverse2 respectively.
arXiv Detail & Related papers (2024-10-10T09:05:23Z)
- Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a three-stage strategy: distributing language features, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z)
- Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation [36.992698016947486]
Open-vocabulary semantic segmentation (OVS) aims to segment images of arbitrary categories specified by class labels or captions.
Previous best-performing methods suffer from false matches between image features and category labels.
We propose a novel relation-aware intra-modal matching framework for OVS based on visual foundation models.
arXiv Detail & Related papers (2024-03-30T06:29:59Z)
- Learnable Pillar-based Re-ranking for Image-Text Retrieval [119.9979224297237]
Image-text retrieval aims to bridge the modality gap and retrieve cross-modal content based on semantic similarities.
Re-ranking, a popular post-processing practice, has revealed the superiority of capturing neighbor relations in single-modality retrieval tasks.
We propose a novel learnable pillar-based re-ranking paradigm for image-text retrieval.
arXiv Detail & Related papers (2023-04-25T04:33:27Z)
- Global-and-Local Collaborative Learning for Co-Salient Object Detection [162.62642867056385]
The goal of co-salient object detection (CoSOD) is to discover salient objects that commonly appear in a query group containing two or more relevant images.
We propose a global-and-local collaborative learning architecture, which includes global correspondence modeling (GCM) and local correspondence modeling (LCM).
The proposed GLNet is evaluated on three prevailing CoSOD benchmark datasets, demonstrating that our model trained on a small dataset (about 3k images) still outperforms eleven state-of-the-art competitors trained on larger datasets (about 8k-200k images).
arXiv Detail & Related papers (2022-04-19T14:32:41Z)
- Multi-similarity based Hyperrelation Network for few-shot segmentation [2.306100133614193]
Few-shot semantic segmentation aims at recognizing the object regions of unseen categories with only a few examples as supervision.
We propose an effective Multi-similarity Hyperrelation Network (MSHNet) to tackle the few-shot semantic segmentation problem.
arXiv Detail & Related papers (2022-03-17T18:16:52Z)
- Multi-Scale Feature Aggregation by Cross-Scale Pixel-to-Region Relation Operation for Semantic Segmentation [44.792859259093085]
We aim to enable the low-level feature to aggregate the complementary context from adjacent high-level feature maps by a cross-scale pixel-to-region operation.
We employ an efficient feature pyramid network to obtain multi-scale features.
Experiment results show that the RSP head performs competitively on both semantic segmentation and panoptic segmentation with high efficiency.
arXiv Detail & Related papers (2021-06-03T10:49:48Z)
- Associating Multi-Scale Receptive Fields for Fine-grained Recognition [5.079292308180334]
We propose a novel cross-layer non-local (CNL) module to associate multi-scale receptive fields by two operations.
CNL computes correlations between features of a query layer and all response layers.
Our model builds spatial dependencies among multi-level layers and learns more discriminative features.
arXiv Detail & Related papers (2020-05-19T01:16:31Z)
- Universal-RCNN: Universal Object Detector via Transferable Graph R-CNN [117.80737222754306]
We present a novel universal object detector called Universal-RCNN.
We first generate a global semantic pool by integrating the high-level semantic representations of all categories.
An Intra-Domain Reasoning Module learns and propagates the sparse graph representation within one dataset guided by a spatial-aware GCN.
arXiv Detail & Related papers (2020-02-18T07:57:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.