Zero-shot sketch-based remote sensing image retrieval based on multi-level and attention-guided tokenization
- URL: http://arxiv.org/abs/2402.02141v3
- Date: Thu, 16 May 2024 03:00:22 GMT
- Title: Zero-shot sketch-based remote sensing image retrieval based on multi-level and attention-guided tokenization
- Authors: Bo Yang, Chen Wang, Xiaoshuang Ma, Beiping Song, Zhuang Liu, Fangde Sun
- Abstract summary: This study introduces a novel zero-shot, sketch-based retrieval method for remote sensing images.
It employs multi-level feature extraction, self-attention-guided tokenization and filtering, and cross-modality attention update.
Our method significantly outperforms existing sketch-based remote sensing image retrieval techniques.
- Score: 8.678089483952474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Effectively and efficiently retrieving images from remote sensing databases is a critical challenge in the realm of remote sensing big data. Utilizing hand-drawn sketches as retrieval inputs offers intuitive and user-friendly advantages, yet the potential of multi-level feature integration from sketches remains underexplored, leading to suboptimal retrieval performance. To address this gap, our study introduces a novel zero-shot, sketch-based retrieval method for remote sensing images, leveraging multi-level feature extraction, self-attention-guided tokenization and filtering, and cross-modality attention update. This approach employs only vision information and does not require semantic knowledge concerning the sketch and image. It starts by employing multi-level self-attention guided feature extraction to tokenize the query sketches, as well as self-attention feature extraction to tokenize the candidate images. It then employs cross-attention mechanisms to establish token correspondence between these two modalities, facilitating the computation of sketch-to-image similarity. Our method significantly outperforms existing sketch-based remote sensing image retrieval techniques, as evidenced by tests on multiple datasets. Notably, it also exhibits robust zero-shot learning capabilities and strong generalizability in handling unseen categories and novel remote sensing data. The method's scalability can be further enhanced by the pre-calculation of retrieval tokens for all candidate images in a database. This research underscores the significant potential of multi-level, attention-guided tokenization in cross-modal remote sensing image retrieval. For broader accessibility and research facilitation, we have made the code and dataset used in this study publicly available online. Code and dataset are available at https://github.com/Snowstormfly/Cross-modal-retrieval-MLAGT.
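As a rough illustration of the pipeline described in the abstract, the following PyTorch sketch shows how multi-level, attention-filtered sketch tokens might be matched against candidate-image tokens with cross-attention to yield a sketch-to-image similarity score. All class names, dimensions, and the token-filtering heuristic are assumptions made for illustration; this is not the authors' reference implementation, which is available at the repository linked above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalMatcher(nn.Module):
    """Hypothetical sketch: multi-level token filtering + cross-attention matching."""

    def __init__(self, dim=384, keep=32, heads=6):
        super().__init__()
        self.keep = keep  # tokens retained per level after attention-guided filtering
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def filter_tokens(self, tokens, cls_attn):
        # keep the patch tokens that the [CLS] token attends to most strongly
        idx = cls_attn.topk(self.keep, dim=-1).indices                     # (B, keep)
        return torch.gather(tokens, 1,
                            idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

    def forward(self, sketch_tokens, sketch_cls_attn, image_tokens):
        # sketch_tokens / sketch_cls_attn: per-level lists of (B, N, dim) / (B, N)
        # image_tokens: (B, M, dim); these can be pre-computed for every database image
        query = torch.cat([self.filter_tokens(t, a)
                           for t, a in zip(sketch_tokens, sketch_cls_attn)], dim=1)
        # cross-attention: filtered sketch tokens attend to candidate-image tokens
        updated, _ = self.cross_attn(query, image_tokens, image_tokens)
        q = F.normalize(self.proj(updated).mean(dim=1), dim=-1)
        k = F.normalize(image_tokens.mean(dim=1), dim=-1)
        return (q * k).sum(dim=-1)  # cosine similarity per sketch-image pair
```

Because the candidate-image tokens do not depend on the query, they can be computed once and cached for the whole database, which is the pre-calculation property the abstract highlights for scalability.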
Related papers
- Revolutionizing Text-to-Image Retrieval as Autoregressive Token-to-Voken Generation [90.71613903956451]
Text-to-image retrieval is a fundamental task in multimedia processing.
We propose an autoregressive voken generation method, named AVG.
We show that AVG achieves superior results in both effectiveness and efficiency.
arXiv Detail & Related papers (2024-07-24T13:39:51Z)
- Knowledge-aware Text-Image Retrieval for Remote Sensing Images [6.4527372338977]
Cross-modal text-image retrieval often suffers from information asymmetry between texts and images.
By mining relevant information from an external knowledge graph, we propose a Knowledge-aware Text-Image Retrieval method.
We show that the proposed knowledge-aware method leads to varied and consistent retrievals, outperforming state-of-the-art retrieval methods.
arXiv Detail & Related papers (2024-05-06T11:27:27Z)
- Object-Centric Open-Vocabulary Image-Retrieval with Aggregated Features [12.14013374452918]
We present a simple yet effective approach to object-centric open-vocabulary image retrieval.
Our approach aggregates dense embeddings extracted from CLIP into a compact representation.
We demonstrate the effectiveness of our scheme on this task by achieving significantly better results than global feature approaches on three datasets.
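A minimal sketch of the aggregation idea, assuming dense CLIP patch embeddings are compressed into a handful of prototype vectors (here via a few naive k-means iterations) and scored against an open-vocabulary text query embedding; the clustering choice, the number of prototypes, and the scoring rule are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def aggregate_patch_embeddings(patch_emb: torch.Tensor, k: int = 8) -> torch.Tensor:
    """Compress dense CLIP patch embeddings (N, D) into k prototype vectors (k, D)."""
    # illustrative aggregation: a few naive k-means iterations in embedding space
    protos = patch_emb[torch.randperm(patch_emb.size(0))[:k]].clone()
    for _ in range(10):
        assign = torch.cdist(patch_emb, protos).argmin(dim=1)  # nearest prototype id
        for j in range(k):
            members = patch_emb[assign == j]
            if len(members) > 0:
                protos[j] = members.mean(dim=0)
    return F.normalize(protos, dim=-1)

def query_score(protos: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """Similarity of an open-vocabulary text query (D,) to the best-matching prototype."""
    return (protos @ F.normalize(text_emb, dim=-1)).max()
```

Each gallery image is then represented by a few prototypes rather than hundreds of patch tokens, keeping open-vocabulary retrieval over a large collection cheap.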
arXiv Detail & Related papers (2023-09-26T15:13:09Z)
- Self-Correlation and Cross-Correlation Learning for Few-Shot Remote Sensing Image Semantic Segmentation [27.59330408178435]
Few-shot remote sensing semantic segmentation aims to segment target objects in a query image given only a few annotated support images.
We propose a Self-Correlation and Cross-Correlation Learning Network for the few-shot remote sensing image semantic segmentation.
Our model enhances the generalization by considering both self-correlation and cross-correlation between support and query images.
arXiv Detail & Related papers (2023-09-11T21:53:34Z)
- Learning Transferable Pedestrian Representation from Multimodal Information Supervision [174.5150760804929]
VAL-PAT is a novel framework that learns transferable representations to enhance various pedestrian analysis tasks with multimodal information.
We first perform pre-training on LUPerson-TA dataset, where each image contains text and attribute annotations.
We then transfer the learned representations to various downstream tasks, including person reID, person attribute recognition and text-based person search.
arXiv Detail & Related papers (2023-04-12T01:20:58Z)
- Unleash the Potential of Image Branch for Cross-modal 3D Object Detection [67.94357336206136]
We present a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects.
First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation.
Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch.
arXiv Detail & Related papers (2023-01-22T08:26:58Z)
- Cross-Modal Fusion Distillation for Fine-Grained Sketch-Based Image Retrieval [55.21569389894215]
We propose a cross-attention framework for Vision Transformers (XModalViT) that fuses modality-specific information instead of discarding it.
Our framework first maps paired datapoints from the individual photo and sketch modalities to fused representations that unify information from both modalities.
We then decouple the input space of the aforementioned modality fusion network into independent encoders of the individual modalities via contrastive and relational cross-modal knowledge distillation.
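A loose sketch of this fuse-then-distil idea under stated assumptions: paired photo and sketch token sequences are combined with bidirectional cross-attention into a fused teacher embedding, and independent single-modality student encoders are pulled toward it. The module names, dimensions, and loss below are hypothetical, not the XModalViT reference code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionTeacher(nn.Module):
    """Hypothetical cross-attention fusion of paired photo/sketch token sequences."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.p2s = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.s2p = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, photo_tok, sketch_tok):
        p, _ = self.p2s(photo_tok, sketch_tok, sketch_tok)  # photo attends to sketch
        s, _ = self.s2p(sketch_tok, photo_tok, photo_tok)   # sketch attends to photo
        fused = torch.cat([p.mean(dim=1), s.mean(dim=1)], dim=-1)  # (B, 2*dim)
        return F.normalize(fused, dim=-1)

def distill_loss(student_emb, teacher_emb):
    # cosine distance pulling a single-modality student embedding (B, 2*dim)
    # toward the detached fused teacher embedding
    return 1.0 - F.cosine_similarity(student_emb, teacher_emb.detach(), dim=-1).mean()
```

At retrieval time only the distilled single-modality encoders are needed, so a sketch or a photo can be embedded without access to its paired counterpart.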
arXiv Detail & Related papers (2022-10-19T11:50:14Z)
- LEAD: Self-Supervised Landmark Estimation by Aligning Distributions of Feature Similarity [49.84167231111667]
Existing works in self-supervised landmark detection are based on learning dense (pixel-level) feature representations from an image.
We introduce an approach to enhance the learning of dense equivariant representations in a self-supervised fashion.
We show that having such a prior in the feature extractor helps in landmark detection, even with a drastically limited number of annotations.
arXiv Detail & Related papers (2022-04-06T17:48:18Z)
- Correlation-Aware Deep Tracking [83.51092789908677]
We propose a novel target-dependent feature network inspired by the self-/cross-attention scheme.
Our network deeply embeds cross-image feature correlation in multiple layers of the feature network.
Our model can be flexibly pre-trained on abundant unpaired images, leading to notably faster convergence than the existing methods.
arXiv Detail & Related papers (2022-03-03T11:53:54Z)
- Exploiting the relationship between visual and textual features in social networks for image classification with zero-shot deep learning [0.0]
In this work, we propose a classifier ensemble based on the transferable learning capabilities of the CLIP neural network architecture.
Our experiments, based on image classification tasks using the labels of the Places dataset, first consider only the visual part.
We find that taking into account the texts associated with the images can help improve accuracy, depending on the goal.
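A minimal zero-shot classification sketch in the spirit of this entry, using OpenAI's public CLIP package; the prompt template, label list, and image path are placeholders standing in for the Places categories and social-network images, and the textual-feature ensemble described in the paper is not reproduced here.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# placeholder labels standing in for the Places dataset categories
labels = ["beach", "forest", "stadium", "kitchen"]
text = clip.tokenize([f"a photo of a {c}" for c in labels]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(dict(zip(labels, probs.squeeze(0).tolist())))
```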
arXiv Detail & Related papers (2021-07-08T10:54:59Z)
- AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation [3.6790362352712873]
We propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures.
Our experiments demonstrate that the method is able to represent images in a low-dimensional space.
Unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets.
arXiv Detail & Related papers (2021-06-11T09:02:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.