Asymmetric Cross-Scale Alignment for Text-Based Person Search
- URL: http://arxiv.org/abs/2212.11958v1
- Date: Sat, 26 Nov 2022 08:34:35 GMT
- Title: Asymmetric Cross-Scale Alignment for Text-Based Person Search
- Authors: Zhong Ji, Junhua Hu, Deyin Liu, Lin Yuanbo Wu, Ye Zhao
- Abstract summary: Text-based person search (TBPS) is of significant importance in intelligent surveillance, which aims to retrieve pedestrian images with high semantic relevance to a given text description.
To implement this task, one needs to extract multi-scale features from both image and text domains, and then perform the cross-modal alignment.
We present a transformer-based model to extract multi-scale representations, and perform Asymmetric Cross-Scale Alignment (ACSA) to precisely align the two modalities.
- Score: 15.618984100653348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-based person search (TBPS) is of significant importance in intelligent
surveillance, which aims to retrieve pedestrian images with high semantic
relevance to a given text description. This retrieval task is characterized
by both modal heterogeneity and fine-grained matching. To implement this
task, one needs to extract multi-scale features from both image and text
domains, and then perform the cross-modal alignment. However, most existing
approaches only consider alignment confined to individual scales, e.g.,
the image-sentence or region-phrase scale. Such a strategy presumes
alignment at matching scales during feature extraction, while overlooking
cross-scale alignment, e.g., image-phrase. In this paper, we present a transformer-based
model to extract multi-scale representations, and perform Asymmetric
Cross-Scale Alignment (ACSA) to precisely align the two modalities.
Specifically, ACSA consists of a global-level alignment module and an
asymmetric cross-attention module, where the former aligns an image and texts
on a global scale, and the latter applies the cross-attention mechanism to
dynamically align the cross-modal entities in region/image-phrase scales.
Extensive experiments on two benchmark datasets, CUHK-PEDES and RSTPReid,
demonstrate the effectiveness of our approach. Code is available at
https://github.com/mul-hjh/ACSA.
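The abstract names the two ACSA modules but gives no implementation detail. As a rough illustration of the mechanism it describes, here is a minimal PyTorch sketch of a one-directional (asymmetric) cross-attention between text phrases and image regions, plus an InfoNCE-style global image-sentence loss; all names, shapes, and the loss form are our assumptions, not the authors' code (see the repository above for that).

```python
# Minimal sketch of the two alignment modules the abstract describes,
# assuming region features from an image encoder and phrase features
# from a text encoder. All names and shapes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AsymmetricCrossAttention(nn.Module):
    """Phrases (queries) attend to regions (keys/values): attention runs
    in one direction only, hence 'asymmetric' cross-scale alignment."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, phrases: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # phrases: (B, P, D) phrase-level text features
        # regions: (B, R, D) region-level image features
        aligned, _ = self.attn(query=phrases, key=regions, value=regions)
        return aligned  # (B, P, D) region-aware phrase features

def global_alignment_loss(img_global, txt_global, temperature=0.07):
    """Symmetric InfoNCE-style image-sentence alignment at the global scale."""
    img = F.normalize(img_global, dim=-1)  # (B, D)
    txt = F.normalize(txt_global, dim=-1)  # (B, D)
    logits = img @ txt.t() / temperature   # (B, B) pairwise similarities
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

# Usage with dummy features
B, R, P, D = 4, 16, 6, 256
regions, phrases = torch.randn(B, R, D), torch.randn(B, P, D)
aligned_phrases = AsymmetricCrossAttention(D)(phrases, regions)
loss = global_alignment_loss(regions.mean(1), phrases.mean(1))
```

The asymmetry is the point: phrases query entities at a different visual scale (regions or whole images) without a mirrored region-to-phrase pass, which is how a cross-scale pairing like image-phrase can be aligned at all.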
Related papers
- Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval [37.775529830620016]
Remote Sensing Image-Text Retrieval (RSITR) is pivotal for knowledge services and data mining in the remote sensing (RS) domain.
Current multi-scale RSITR approaches typically align multi-scale fused image features with text features, but overlook aligning image-text pairs at distinct scales separately.
We introduce a novel Multi-Scale Alignment (MSA) method to overcome this limitation.
arXiv Detail & Related papers (2024-05-29T10:19:11Z)
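As a toy illustration of the distinction the MSA summary draws, the sketch below aligns text against a fused multi-scale image feature once, versus against each scale separately; the cosine loss and all function names are our assumptions, not details from the paper.

```python
# Toy contrast between fused-feature alignment and per-scale alignment,
# as the MSA summary describes it. Loss form is assumed.
import torch
import torch.nn.functional as F

def alignment_loss(img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
    # Negative mean cosine similarity between matched image-text pairs.
    return 1.0 - F.cosine_similarity(img, txt, dim=-1).mean()

def fused_alignment(img_scales: list, txt: torch.Tensor) -> torch.Tensor:
    # Fuse all scales first (here: a simple mean), then align once.
    return alignment_loss(torch.stack(img_scales).mean(dim=0), txt)

def per_scale_alignment(img_scales: list, txt: torch.Tensor) -> torch.Tensor:
    # Align the text against every scale separately, then average the losses.
    return torch.stack([alignment_loss(s, txt) for s in img_scales]).mean()

# Usage: three scales of (batch, dim) image features and one text feature.
scales = [torch.randn(4, 128) for _ in range(3)]
text = torch.randn(4, 128)
print(fused_alignment(scales, text), per_scale_alignment(scales, text))
```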
- Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S²RM to achieve high-quality cross-modality fusion.
It follows a three-step strategy: distributing language features, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z)
- Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation [63.15257949821558]
Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing.
Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery.
We introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS.
arXiv Detail & Related papers (2023-12-19T08:14:14Z)
- Step-Wise Hierarchical Alignment Network for Image-Text Matching [29.07229472373576]
We propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into a multi-step cross-modal reasoning process.
Specifically, we first achieve local-to-local alignment at the fragment level, followed sequentially by global-to-local and global-to-global alignment at the context level.
arXiv Detail & Related papers (2021-06-11T17:05:56Z)
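A minimal sketch of the three alignment steps the SHAN summary names, assuming simple cosine scoring and a plain sum to combine them (both our simplifications, not the paper's formulation):

```python
# Rough sketch of the three alignment steps named in the SHAN summary:
# local-to-local (fragment level), then global-to-local and
# global-to-global (context level). Combining by a plain sum is assumed.
import torch
import torch.nn.functional as F

def shan_style_similarity(regions: torch.Tensor, words: torch.Tensor,
                          img_g: torch.Tensor, txt_g: torch.Tensor) -> torch.Tensor:
    # regions: (R, D) image fragments, words: (W, D) text fragments,
    # img_g / txt_g: (D,) global image / sentence features.
    # Step 1: local-to-local -- each word against its best-matching region.
    l2l = F.cosine_similarity(words.unsqueeze(1), regions.unsqueeze(0),
                              dim=-1).max(dim=1).values.mean()
    # Step 2: global-to-local -- global image vs words, global text vs regions.
    g2l = 0.5 * (F.cosine_similarity(img_g.unsqueeze(0), words, dim=-1).mean()
                 + F.cosine_similarity(txt_g.unsqueeze(0), regions, dim=-1).mean())
    # Step 3: global-to-global -- whole image vs whole sentence.
    g2g = F.cosine_similarity(img_g, txt_g, dim=0)
    return l2l + g2l + g2g

score = shan_style_similarity(torch.randn(36, 256), torch.randn(12, 256),
                              torch.randn(256), torch.randn(256))
```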
- Similarity Reasoning and Filtration for Image-Text Matching [85.68854427456249]
We propose a novel Similarity Graph Reasoning and Attention Filtration network for image-text matching.
The Similarity Graph Reasoning (SGR) module, relying on a graph convolutional neural network, is introduced to infer relation-aware similarities from both local and global alignments.
We demonstrate the superiority of the proposed method by achieving state-of-the-art performance on the Flickr30K and MSCOCO datasets.
arXiv Detail & Related papers (2021-01-05T06:29:35Z)
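As a loose sketch of graph-based similarity reasoning in the spirit of the SGR summary: local and global alignment scores become graph nodes, and a small graph layer propagates information between them before a scalar readout. Every architectural detail below is our guess, not the paper's.

```python
# Loose sketch of similarity-graph reasoning: alignment scores are
# graph nodes refined by one step of learned graph propagation.
import torch
import torch.nn as nn

class SimilarityGraphReasoning(nn.Module):
    def __init__(self, num_nodes: int, dim: int = 64):
        super().__init__()
        self.embed = nn.Linear(1, dim)   # lift scalar scores to node features
        self.adj = nn.Parameter(torch.eye(num_nodes))  # learnable adjacency
        self.gcn = nn.Linear(dim, dim)
        self.readout = nn.Linear(dim, 1)

    def forward(self, scores: torch.Tensor) -> torch.Tensor:
        # scores: (B, N) local + global alignment scores per image-text pair
        x = self.embed(scores.unsqueeze(-1))    # (B, N, dim) node features
        a = torch.softmax(self.adj, dim=-1)     # row-normalized adjacency
        x = torch.relu(self.gcn(a @ x))         # one step of graph propagation
        return self.readout(x.mean(dim=1)).squeeze(-1)  # (B,) refined similarity

sgr = SimilarityGraphReasoning(num_nodes=10)
sim = sgr(torch.randn(4, 10))  # relation-aware similarities for 4 pairs
```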
- Seed the Views: Hierarchical Semantic Alignment for Contrastive Representation Learning [116.91819311885166]
We propose a hierarchical semantic alignment strategy that expands the views generated by a single image to cross-sample and multi-level representations.
Our method, termed CsMl, is able to integrate multi-level visual representations across samples in a robust way.
arXiv Detail & Related papers (2020-12-04T17:26:24Z)
- Fine-grained Visual Textual Alignment for Cross-Modal Retrieval using Transformer Encoders [14.634046503477979]
We present a novel approach called Transformer Reasoning and Alignment Network (TERAN).
TERAN enforces a fine-grained match between the underlying components of images and sentences.
On the MS-COCO 1K test set, we obtain improvements of 5.7% and 3.5% on the image and sentence retrieval tasks, respectively.
arXiv Detail & Related papers (2020-08-12T11:02:40Z)
- Graph Optimal Transport for Cross-Domain Alignment [121.80313648519203]
Cross-domain alignment is fundamental to computer vision and natural language processing.
We propose Graph Optimal Transport (GOT), a principled framework that builds on recent advances in Optimal Transport (OT).
Experiments show that GOT consistently outperforms baselines across a wide range of tasks.
arXiv Detail & Related papers (2020-06-26T01:14:23Z)
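For background on the optimal-transport machinery GOT builds on, here is a minimal entropic-OT sketch that computes a soft matching (transport plan) between two entity sets with Sinkhorn iterations; this is generic OT, not GOT itself.

```python
# Generic entropic optimal transport via Sinkhorn iterations: the kind
# of soft cross-domain matching that GOT builds on. Not GOT itself.
import torch

def sinkhorn_plan(x: torch.Tensor, y: torch.Tensor,
                  eps: float = 0.1, iters: int = 50):
    # x: (N, D) source entities, y: (M, D) target entities.
    cost = torch.cdist(x, y)              # (N, M) pairwise distances
    K = torch.exp(-cost / eps)            # Gibbs kernel
    u = torch.full((x.size(0),), 1.0 / x.size(0))  # uniform marginals
    v = torch.full((y.size(0),), 1.0 / y.size(0))
    a, b = u.clone(), v.clone()
    for _ in range(iters):                # alternate marginal rescaling
        a = u / (K @ b)
        b = v / (K.t() @ a)
    plan = a.unsqueeze(1) * K * b.unsqueeze(0)  # (N, M) transport plan
    return plan, (plan * cost).sum()            # plan and OT distance

plan, ot_dist = sinkhorn_plan(torch.randn(5, 16), torch.randn(7, 16))
```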
- RANSAC-Flow: generic two-stage image alignment [53.11926395028508]
We show that a simple unsupervised two-stage approach performs surprisingly well, achieving competitive results across a range of tasks and datasets.
arXiv Detail & Related papers (2020-04-03T12:37:58Z)