DALG: Deep Attentive Local and Global Modeling for Image Retrieval
- URL: http://arxiv.org/abs/2207.00287v1
- Date: Fri, 1 Jul 2022 09:32:15 GMT
- Title: DALG: Deep Attentive Local and Global Modeling for Image Retrieval
- Authors: Yuxin Song, Ruolin Zhu, Min Yang and Dongliang He
- Abstract summary: We propose a fully attention-based framework for robust representation learning, motivated by the success of the Transformer.
Besides applying the Transformer for global feature extraction, we devise a local branch composed of window-based multi-head attention and spatial attention.
With our Deep Attentive Local and Global modeling framework (DALG), extensive experimental results show that efficiency can be significantly improved.
- Score: 26.773211032906854
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deeply learned representations have achieved superior image retrieval
performance in a retrieve-then-rerank manner. The recent state-of-the-art
single-stage model, which heuristically fuses local and global features,
achieves a promising trade-off between efficiency and effectiveness. However,
we notice that the efficiency of existing solutions is still restricted by
their multi-scale inference paradigm. In this paper, we follow the single-stage
art and obtain a further complexity-effectiveness balance by successfully
dispensing with multi-scale testing. To achieve this goal, we abandon the
widely used convolutional network, given its limitations in exploring diverse
visual patterns, and resort to a fully attention-based framework for robust
representation learning, motivated by the success of the Transformer. Besides
applying the Transformer for global feature extraction, we devise a local
branch composed of window-based multi-head attention and spatial attention to
fully exploit local image patterns. Furthermore, we propose to combine the
hierarchical local and global features via a cross-attention module, instead of
the heuristic fusion used in prior art. With our Deep Attentive Local and
Global modeling framework (DALG), extensive experimental results show that
efficiency can be significantly improved while maintaining results competitive
with the state of the art.
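As a rough illustration of the cross-attention fusion idea, here is a minimal single-head sketch in NumPy: queries come from the global branch and attend over local tokens. Learned projections, multiple heads, and the hierarchical structure are omitted; the function names and shapes are illustrative assumptions, not the paper's exact module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(global_feat, local_feats, d_k):
    """Fuse a global descriptor with local tokens via cross-attention.

    global_feat: (1, d) query from the global branch
    local_feats: (n, d) keys/values from the local branch
    """
    q = global_feat                       # query: the global representation
    k = v = local_feats                   # keys/values: local tokens
    scores = q @ k.T / np.sqrt(d_k)       # (1, n) scaled dot-product scores
    weights = softmax(scores, axis=-1)    # attention distribution over tokens
    return weights @ v                    # (1, d) fused representation
```

The design intuition is that the global descriptor selects which local patterns to absorb, rather than the two streams being concatenated or averaged heuristically.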
Related papers
- Localization, balance and affinity: a stronger multifaceted collaborative salient object detector in remote sensing images [24.06927394483275]
We propose a stronger multifaceted collaborative salient object detector for optical remote sensing images (ORSIs), termed LBA-MCNet.
The network focuses on accurately locating targets, balancing detailed features, and modeling image-level global context information.
arXiv Detail & Related papers (2024-10-31T14:50:48Z)
- Beyond Local Views: Global State Inference with Diffusion Models for Cooperative Multi-Agent Reinforcement Learning [36.25611963252774]
State Inference with Diffusion Models (SIDIFF) is inspired by image outpainting.
SIDIFF reconstructs the original global state based solely on local observations.
It can be effortlessly incorporated into current multi-agent reinforcement learning algorithms.
arXiv Detail & Related papers (2024-08-18T14:49:53Z)
- Decentralized Transformers with Centralized Aggregation are Sample-Efficient Multi-Agent World Models [106.94827590977337]
We propose a novel world model for Multi-Agent RL (MARL) that learns decentralized local dynamics for scalability.
We also introduce a Perceiver Transformer as an effective solution to enable centralized representation aggregation.
Results on the StarCraft Multi-Agent Challenge (SMAC) show that it outperforms strong model-free approaches and existing model-based methods in both sample efficiency and overall performance.
arXiv Detail & Related papers (2024-06-22T12:40:03Z)
- Multi-view Aggregation Network for Dichotomous Image Segmentation [76.75904424539543]
Dichotomous Image Segmentation (DIS) has recently emerged, targeting high-precision object segmentation from high-resolution natural images.
Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement.
Inspired by this, we model DIS as a multi-view object perception problem and propose a parsimonious multi-view aggregation network (MVANet).
Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed.
arXiv Detail & Related papers (2024-04-11T03:00:00Z)
- Optimization Efficient Open-World Visual Region Recognition [55.76437190434433]
RegionSpot integrates position-aware localization knowledge from a localization foundation model with semantic information from a ViL model.
Experiments in open-world object recognition show that our RegionSpot achieves significant performance gain over prior alternatives.
arXiv Detail & Related papers (2023-11-02T16:31:49Z)
- Multi-Scale and Multi-Layer Contrastive Learning for Domain Generalization [5.124256074746721]
We argue that the generalization ability of deep convolutional neural networks can be improved by taking advantage of multi-layer and multi-scaled representations of the network.
We introduce a framework that aims at improving domain generalization of image classifiers by combining both low-level and high-level features at multiple scales.
We show that our model is able to surpass the performance of previous DG methods and consistently produces competitive, state-of-the-art results on all datasets.
arXiv Detail & Related papers (2023-08-28T08:54:27Z)
- Multi-modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion [59.19469551774703]
Infrared and visible image fusion aims to integrate comprehensive information from multiple sources to achieve superior performances on various practical tasks.
We propose a dynamic image fusion framework with a multi-modal gated mixture of local-to-global experts.
Our model consists of a Mixture of Local Experts (MoLE) and a Mixture of Global Experts (MoGE) guided by a multi-modal gate.
arXiv Detail & Related papers (2023-02-02T20:06:58Z)
- Mutual Guidance and Residual Integration for Image Enhancement [43.282397174228116]
We propose a novel mutual guidance network (MGN) to perform effective bidirectional global-local information exchange.
In our design, we adopt a two-branch framework where one branch focuses more on modeling global relations while the other is committed to processing local information.
As a result, both the global and local branches can enjoy the merits of mutual information aggregation.
arXiv Detail & Related papers (2022-11-25T06:12:39Z)
- Locality Matters: A Scalable Value Decomposition Approach for Cooperative Multi-Agent Reinforcement Learning [52.7873574425376]
Cooperative multi-agent reinforcement learning (MARL) faces significant scalability issues due to state and action spaces that are exponentially large in the number of agents.
We propose a novel, value-based multi-agent algorithm called LOMAQ, which incorporates local rewards within the Centralized Training Decentralized Execution paradigm.
arXiv Detail & Related papers (2021-09-22T10:08:15Z)
- Video Salient Object Detection via Adaptive Local-Global Refinement [7.723369608197167]
Video salient object detection (VSOD) is an important task in many vision applications.
We propose an adaptive local-global refinement framework for VSOD.
We show that our weighting methodology can further exploit the feature correlations, thus driving the network to learn more discriminative feature representation.
arXiv Detail & Related papers (2021-04-29T14:14:11Z)
- Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields.
To better train this efficient generator, in addition to the frequently used VGG feature matching loss, we design a novel self-guided regression loss.
We also employ a discriminator with local and global branches to ensure local-global contents consistency.
arXiv Detail & Related papers (2020-02-07T03:45:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.