Dynamic Token Selection for Aerial-Ground Person Re-Identification
- URL: http://arxiv.org/abs/2412.00433v2
- Date: Wed, 25 Dec 2024 10:13:58 GMT
- Title: Dynamic Token Selection for Aerial-Ground Person Re-Identification
- Authors: Yuhai Wang, Maryam Pishgar
- Abstract summary: We propose a novel Dynamic Token Selective Transformer (DTST) tailored for AGPReID.
We segment the input image into multiple tokens, with each token representing a unique region or feature within the image.
Using a Top-k strategy, we extract the k most significant tokens that contain vital information essential for identity recognition.
- Score: 0.36832029288386137
- Abstract: Aerial-Ground Person Re-identification (AGPReID) holds significant practical value but faces unique challenges due to pronounced variations in viewing angles, lighting conditions, and background interference. Traditional methods, often involving a global analysis of the entire image, frequently lead to inefficiencies and susceptibility to irrelevant data. In this paper, we propose a novel Dynamic Token Selective Transformer (DTST) tailored for AGPReID, which dynamically selects pivotal tokens to concentrate on pertinent regions. Specifically, we segment the input image into multiple tokens, with each token representing a unique region or feature within the image. Using a Top-k strategy, we extract the k most significant tokens that contain vital information essential for identity recognition. Subsequently, an attention mechanism is employed to discern interrelations among diverse tokens, thereby enhancing the representation of identity features. Extensive experiments on benchmark datasets showcase the superiority of our method over existing works. Notably, on the CARGO dataset, our method achieves a 1.18% mAP improvement over the second-best method. In addition, we comprehensively analyze the impact of the number of tokens, token insertion positions, and the number of attention heads on model performance.
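The Top-k token selection followed by attention over the kept tokens can be illustrated with a minimal PyTorch sketch. The scoring head, tensor shapes, and attention configuration below are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class TopKTokenSelector(nn.Module):
    """Score each patch token and keep only the k most informative ones."""
    def __init__(self, dim: int, k: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # hypothetical per-token scoring head
        self.k = k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim), e.g. ViT patch embeddings
        scores = self.score(tokens).squeeze(-1)       # (batch, num_tokens)
        idx = scores.topk(self.k, dim=-1).indices     # indices of the top-k tokens
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                  # (batch, k, dim)

tokens = torch.randn(2, 196, 768)          # 14x14 patches from a hypothetical ViT
selected = TopKTokenSelector(768, k=64)(tokens)
attn = nn.MultiheadAttention(768, num_heads=8, batch_first=True)
out, _ = attn(selected, selected, selected)  # relate the selected tokens to each other
```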
Related papers
- Unified Local and Global Attention Interaction Modeling for Vision Transformers [1.9571946424055506]
We present a novel method that extends the self-attention mechanism of a vision transformer (ViT) for more accurate object detection across diverse datasets.
ViTs show strong capability for image understanding tasks such as object detection, segmentation, and classification.
We introduce two modifications to the traditional self-attention framework: a novel aggressive convolution pooling strategy for local feature mixing and a new conceptual attention transformation.
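Convolutional pooling over patch tokens for local feature mixing could look roughly like the sketch below; the 14x14 grid and layer sizes are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

tokens = torch.randn(1, 196, 768)                   # hypothetical 14x14 ViT patch tokens
grid = tokens.transpose(1, 2).reshape(1, 768, 14, 14)
pool = nn.Conv2d(768, 768, kernel_size=3, stride=2, padding=1)  # strided conv pooling
mixed = pool(grid)                                  # (1, 768, 7, 7): locally mixed features
local_tokens = mixed.flatten(2).transpose(1, 2)     # back to token form: (1, 49, 768)
```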
arXiv Detail & Related papers (2024-12-25T04:53:19Z) - Omni-ID: Holistic Identity Representation Designed for Generative Tasks [75.29174595706533]
Omni-ID encodes holistic information about an individual's appearance across diverse expressions.
It consolidates information from a varied number of unstructured input images into a structured representation.
It demonstrates substantial improvements over conventional representations across various generative tasks.
arXiv Detail & Related papers (2024-12-12T19:21:20Z) - Disentangled Representations for Short-Term and Long-Term Person Re-Identification [33.76874948187976]
We propose a new generative adversarial network, dubbed identity shuffle GAN (IS-GAN).
It disentangles identity-related and unrelated features from person images through an identity-shuffling technique.
Experimental results validate the effectiveness of IS-GAN, showing state-of-the-art performance on standard reID benchmarks.
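A heavily simplified sketch of the identity-shuffling idea, with toy convolutional stand-ins for IS-GAN's encoders and generator (all layer choices here are illustrative assumptions):

```python
import torch
import torch.nn as nn

E_id = nn.Conv2d(3, 16, 3, padding=1)     # toy identity-related encoder
E_unrel = nn.Conv2d(3, 16, 3, padding=1)  # toy identity-unrelated encoder
G = nn.Conv2d(32, 3, 3, padding=1)        # toy generator / decoder

img_a = torch.randn(1, 3, 256, 128)       # two hypothetical person crops
img_b = torch.randn(1, 3, 256, 128)
# shuffle: combine a's identity features with b's identity-unrelated features
generated = G(torch.cat([E_id(img_a), E_unrel(img_b)], dim=1))
```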
arXiv Detail & Related papers (2024-09-09T02:09:49Z) - PartFormer: Awakening Latent Diverse Representation from Vision Transformer for Object Re-Identification [73.64560354556498]
Vision Transformer (ViT) tends to overfit on the most distinct regions of the training data, limiting its generalizability and its attention to holistic object features.
We present PartFormer, an innovative adaptation of ViT designed to overcome the limitations in object Re-ID tasks.
Our framework outperforms the state of the art by 2.4% mAP on the most challenging MSMT17 dataset.
arXiv Detail & Related papers (2024-08-29T16:31:05Z) - Learning Spectral-Decomposed Tokens for Domain Generalized Semantic Segmentation [38.0401463751139]
We present a novel Spectral-dEcomposed Token (SET) learning framework to advance the frontier.
Particularly, the frozen VFM features are first decomposed into the phase and amplitude components in the frequency space.
We develop an attention optimization method to bridge the gap between style-affected representation and static tokens during inference.
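The phase/amplitude decomposition can be sketched with torch.fft; the feature shapes are assumptions, and the paper's actual token learning on top of these components is omitted:

```python
import torch

feats = torch.randn(1, 196, 768)             # hypothetical frozen VFM patch features
spec = torch.fft.fft(feats, dim=-1)          # move features into the frequency space
amplitude, phase = spec.abs(), spec.angle()  # amplitude and phase components
# recombine (identity here; SET would process the components before this step)
recon = torch.fft.ifft(amplitude * torch.exp(1j * phase), dim=-1).real
```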
arXiv Detail & Related papers (2024-07-26T07:50:48Z) - TCFormer: Visual Recognition via Token Clustering Transformer [79.24723479088097]
We propose the Token Clustering Transformer (TCFormer), which generates dynamic vision tokens based on semantic meaning.
Our dynamic tokens possess two crucial characteristics: (1) representing image regions with similar semantic meanings using the same vision token, even if those regions are not adjacent, and (2) concentrating on regions with valuable details and representing them using fine tokens.
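A toy sketch of grouping tokens by feature similarity and merging each group into one dynamic token, using plain k-means as an illustrative stand-in for TCFormer's clustering scheme:

```python
import torch

tokens = torch.randn(196, 768)                     # hypothetical ViT patch tokens
k = 32
centers = tokens[torch.randperm(196)[:k]].clone()  # random initial cluster centers
for _ in range(10):                                # a few k-means iterations
    assign = torch.cdist(tokens, centers).argmin(dim=1)
    for c in range(k):
        members = tokens[assign == c]
        if len(members) > 0:
            centers[c] = members.mean(dim=0)
# each center is now one dynamic token summarizing a (possibly non-adjacent) region
```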
arXiv Detail & Related papers (2024-07-16T02:26:18Z) - Selective Domain-Invariant Feature for Generalizable Deepfake Detection [21.671221284842847]
We propose a novel framework which reduces the sensitivity to face forgery by fusing content features and styles.
Both qualitative and quantitative results on existing and newly proposed benchmarks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-03-19T13:09:19Z) - Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID.
Our framework can generate more discriminative features for multi-modal object ReID.
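Diversity-aware token selection could be sketched with greedy farthest-point sampling over token features; this is an illustrative stand-in, not EDITOR's actual selection rule:

```python
import torch

tokens = torch.randn(196, 768)       # hypothetical patch tokens
k, chosen = 16, [0]                  # start from an arbitrary token
dists = torch.cdist(tokens, tokens[0:1]).squeeze(1)
for _ in range(k - 1):
    nxt = dists.argmax().item()      # token farthest from everything chosen so far
    chosen.append(nxt)
    dists = torch.minimum(dists, torch.cdist(tokens, tokens[nxt:nxt + 1]).squeeze(1))
diverse = tokens[chosen]             # (k, 768) diverse token subset
```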
arXiv Detail & Related papers (2024-03-15T12:44:35Z) - Multi-Stage Spatio-Temporal Aggregation Transformer for Video Person Re-identification [78.08536797239893]
We propose a novel Multi-Stage Spatial-Temporal Aggregation Transformer (MSTAT) with two newly designed proxy embedding modules.
MSTAT consists of three stages to encode the attribute-associated, the identity-associated, and the attribute-identity-associated information from the video clips.
We show that MSTAT can achieve state-of-the-art accuracies on various standard benchmarks.
arXiv Detail & Related papers (2023-01-02T05:17:31Z) - Dynamic Prototype Mask for Occluded Person Re-Identification [88.7782299372656]
Existing methods mainly address occlusion by employing body clues provided by an extra network to distinguish the visible parts.
We propose a novel Dynamic Prototype Mask (DPM) based on two pieces of self-evident prior knowledge.
Under this condition, the occluded representation can be aligned well in a selected subspace spontaneously.
arXiv Detail & Related papers (2022-07-19T03:31:13Z) - City-Scale Visual Place Recognition with Deep Local Features Based on Multi-Scale Ordered VLAD Pooling [5.274399407597545]
We present a fully-automated system for place recognition at a city-scale based on content-based image retrieval.
First, we present a comprehensive analysis of visual place recognition and sketch out the unique challenges of the task.
Next, we propose a simple pooling approach on top of convolutional neural network activations to embed spatial information into the image representation vector.
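One simple way to embed spatial information into a pooled descriptor is multi-scale grid pooling over CNN activations; the sketch below uses assumed shapes, and the paper's ordered VLAD pooling is more involved:

```python
import torch
import torch.nn.functional as F

acts = torch.randn(1, 512, 28, 28)  # hypothetical CNN activation map
# pool at several grid resolutions so coarse spatial layout survives pooling
levels = [F.adaptive_avg_pool2d(acts, g).flatten(1) for g in (1, 2, 4)]
descriptor = torch.cat(levels, dim=1)  # (1, 512 * (1 + 4 + 16)) image vector
```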
arXiv Detail & Related papers (2020-09-19T15:21:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.