DOLG: Single-Stage Image Retrieval with Deep Orthogonal Fusion of Local
and Global Features
- URL: http://arxiv.org/abs/2108.02927v1
- Date: Fri, 6 Aug 2021 03:14:09 GMT
- Authors: Min Yang, Dongliang He, Miao Fan, Baorong Shi, Xuetong Xue, Fu Li,
Errui Ding, Jizhou Huang
- Abstract summary: We propose a Deep Orthogonal Local and Global (DOLG) information fusion framework for end-to-end image retrieval.
It first attentively extracts representative local information with multi-atrous convolutions and self-attention.
The whole framework is end-to-end differentiable and can be trained with image-level labels.
- Score: 42.62089148690047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image Retrieval is a fundamental task of obtaining images similar to the
query image from a database. A common image retrieval practice is to first
retrieve candidate images via similarity search using global image features and
then re-rank the candidates by leveraging their local features. Previous
learning-based studies mainly focus on either global or local image
representation learning to tackle the retrieval task. In this paper, we abandon
the two-stage paradigm and seek to design an effective single-stage solution by
integrating local and global information inside images into compact image
representations. Specifically, we propose a Deep Orthogonal Local and Global
(DOLG) information fusion framework for end-to-end image retrieval. It first
attentively extracts representative local information with multi-atrous
convolutions and self-attention. Components orthogonal to the global
image representation are then extracted from the local information. Finally,
the orthogonal components are concatenated with the global representation as a
complement, and aggregation is performed to generate the final
representation. The whole framework is end-to-end differentiable and can be
trained with image-level labels. Extensive experimental results validate the
effectiveness of our solution and show that our model achieves state-of-the-art
image retrieval performance on the Revisited Oxford and Paris datasets.
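The orthogonal fusion step described in the abstract can be illustrated with a small sketch. It assumes the standard vector projection: for a local feature f_l and global feature f_g, the orthogonal component is f_l - (f_l . f_g / |f_g|^2) f_g. The function names, the toy 3-dimensional features, and the use of average pooling as the aggregation step are assumptions made for illustration; the abstract does not specify these details, and this is not the authors' implementation.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonal_component(local, global_vec):
    """Subtract from `local` its projection onto `global_vec`."""
    scale = dot(local, global_vec) / dot(global_vec, global_vec)
    return [l - scale * g for l, g in zip(local, global_vec)]


def fuse(local_feats, global_vec):
    """Hypothetical fusion: concatenate [global, orthogonal] for each
    local feature, then average over locations (assumed aggregation)."""
    fused = [
        list(global_vec) + orthogonal_component(f, global_vec)
        for f in local_feats
    ]
    n = len(fused)
    return [sum(column) / n for column in zip(*fused)]


# Toy data: one global vector and two local feature vectors.
global_vec = [1.0, 0.0, 0.0]
local_feats = [[2.0, 1.0, 0.0], [0.5, 0.0, 3.0]]

descriptor = fuse(local_feats, global_vec)
print(descriptor)
```

Each orthogonal component has zero inner product with the global vector, so the concatenated descriptor carries only the local information that the global representation does not already contain.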
Related papers
- Contrasting Deepfakes Diffusion via Contrastive Learning and Global-Local Similarities [88.398085358514]
Contrastive Deepfake Embeddings (CoDE) is a novel embedding space specifically designed for deepfake detection.
CoDE is trained via contrastive learning by additionally enforcing global-local similarities.
(2024-07-29T18:00:10Z)
- Beyond LLaVA-HD: Diving into High-Resolution Large Multimodal Models [44.437693135170576]
We propose a new framework, LMM with Sophisticated Tasks, Local image compression, and Mixture of global Experts (SliME).
We extract contextual information from the global view using a mixture of adapters, based on the observation that different adapters excel at different tasks.
The proposed method achieves leading performance across various benchmarks with only 2 million training data.
(2024-06-12T17:59:49Z)
- mTREE: Multi-Level Text-Guided Representation End-to-End Learning for Whole Slide Image Analysis [16.472295458683696]
Multi-modal learning adeptly integrates visual and textual data, but its application to histopathology image and text analysis remains challenging.
We introduce Multi-Level Text-Guided Representation End-to-End Learning (mTREE).
This novel text-guided approach effectively captures multi-scale Whole Slide Images (WSIs) by utilizing information from accompanying textual pathology information.
(2024-05-28T04:47:44Z)
- Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval [11.696941841000985]
Two-stage methods following the retrieve-and-rerank paradigm have achieved excellent performance, but their separate local and global modules are inefficient for real-world applications.
We propose a mechanism which attentively selects prominent local descriptors and infuses fine-grained semantic relations into the global representation.
Our method achieves state-of-the-art single-stage image retrieval performance on benchmarks such as Revisited Oxford and Revisited Paris.
(2023-08-08T03:06:10Z)
- PRIOR: Prototype Representation Joint Learning from Medical Images and Reports [19.336988866061294]
We present a prototype representation learning framework incorporating both global and local alignment between medical images and reports.
In contrast to standard global multi-modality alignment methods, we employ a local alignment module for fine-grained representation.
A sentence-wise prototype memory bank is constructed, enabling the network to focus on low-level localized visual and high-level clinical linguistic features.
(2023-07-24T07:49:01Z)
- Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search [61.24539128142504]
Text-based person search (TBPS) is a challenging task that aims to search pedestrian images with the same identity from an image gallery given a query text.
Most existing methods rely on explicitly generated local parts to model fine-grained correspondence between modalities.
We propose an efficient joint Multi-level Alignment Network (MANet) for TBPS, which can learn aligned image/text feature representations between modalities at multiple levels.
(2022-08-30T16:14:18Z)
- Local and Global GANs with Semantic-Aware Upsampling for Image Generation [201.39323496042527]
We consider generating images using local context.
We propose a class-specific generative network using semantic maps as guidance.
Lastly, we propose a novel semantic-aware upsampling method.
(2022-02-28T19:24:25Z)
- Boosting Few-shot Semantic Segmentation with Transformers [81.43459055197435]
TRansformer-based Few-shot Semantic segmentation method (TRFS)
Our model consists of two modules: Global Enhancement Module (GEM) and Local Enhancement Module (LEM).
(2021-08-04T20:09:21Z)
- Inter-Image Communication for Weakly Supervised Localization [77.2171924626778]
Weakly supervised localization aims at finding target object regions using only image-level supervision.
We propose to leverage pixel-level similarities across different objects for learning more accurate object locations.
Our method achieves the Top-1 localization error rate of 45.17% on the ILSVRC validation set.
(2020-08-12T04:14:11Z)
- Unifying Deep Local and Global Features for Image Search [9.614694312155798]
We unify global and local image features into a single deep model, enabling accurate retrieval with efficient feature extraction.
Our model achieves state-of-the-art image retrieval on the Revisited Oxford and Paris datasets, and state-of-the-art single-model instance-level recognition on the Google Landmarks dataset v2.
(2020-01-14T19:59:51Z)