Related papers: PartFormer: Awakening Latent Diverse Representation from Vision Transformer for Object Re-Identification

PartFormer: Awakening Latent Diverse Representation from Vision Transformer for Object Re-Identification

URL: http://arxiv.org/abs/2408.16684v1
Date: Thu, 29 Aug 2024 16:31:05 GMT
Title: PartFormer: Awakening Latent Diverse Representation from Vision Transformer for Object Re-Identification
Authors: Lei Tan, Pingyang Dai, Jie Chen, Liujuan Cao, Yongjian Wu, Rongrong Ji,
Abstract summary: Vision Transformer (ViT) tends to overfit on most distinct regions of training data, limiting its generalizability and attention to holistic object features. We present PartFormer, an innovative adaptation of ViT designed to overcome the limitations in object Re-ID tasks. Our framework significantly outperforms state-of-the-art by 2.4% mAP scores on the most challenging MSMT17 dataset.
Score: 73.64560354556498
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Extracting robust feature representation is critical for object re-identification to accurately identify objects across non-overlapping cameras. Although having a strong representation ability, the Vision Transformer (ViT) tends to overfit on most distinct regions of training data, limiting its generalizability and attention to holistic object features. Meanwhile, due to the structural difference between CNN and ViT, fine-grained strategies that effectively address this issue in CNN do not continue to be successful in ViT. To address this issue, by observing the latent diverse representation hidden behind the multi-head attention, we present PartFormer, an innovative adaptation of ViT designed to overcome the granularity limitations in object Re-ID tasks. The PartFormer integrates a Head Disentangling Block (HDB) that awakens the diverse representation of multi-head self-attention without the typical loss of feature richness induced by concatenation and FFN layers post-attention. To avoid the homogenization of attention heads and promote robust part-based feature learning, two head diversity constraints are imposed: attention diversity constraint and correlation diversity constraint. These constraints enable the model to exploit diverse and discriminative feature representations from different attention heads. Comprehensive experiments on various object Re-ID benchmarks demonstrate the superiority of the PartFormer. Specifically, our framework significantly outperforms state-of-the-art by 2.4\% mAP scores on the most challenging MSMT17 dataset.

Related papers

Optimizing ID Consistency in Multimodal Large Models: Facial Restoration via Alignment, Entanglement, and Disentanglement [54.199726425201895]
Multimodal editing large models have demonstrated powerful editing capabilities across diverse tasks.<n>Current facial ID preservation methods struggle to achieve consistent restoration of both facial identity and edited element IP.<n>We propose EditedID, an Alignment-Disentanglement-Entanglement framework for robust identity-specific facial restoration.
arXiv Detail & Related papers (2026-02-21T08:24:42Z)
Unsupervised Dynamic Feature Selection for Robust Latent Spaces in Vision Tasks [5.167904179040144]
This paper presents a novel approach for enhancing latent representations using unsupervised Dynamic Feature Selection (DFS)<n>For each instance, the proposed method identifies and removes misleading or redundant information in images, ensuring that only the most relevant features contribute to the latent space.<n>Experiments conducted on image datasets demonstrate that models equipped with unsupervised DFS achieve significant improvements in generalization performance across various tasks.
arXiv Detail & Related papers (2025-10-02T07:46:59Z)
Attribute Guidance With Inherent Pseudo-label For Occluded Person Re-identification [16.586742421279137]
Attribute-Guide ReID (AG-ReID) is a novel framework to extract fine-grained semantic attributes without additional data or annotations.<n>Our framework operates through a two-stage process: first generating attribute pseudo-labels that capture subtle visual characteristics, then introducing a dual-guidance mechanism.<n>Extensive experiments demonstrate that AG-ReID achieves state-of-the-art results on multiple widely-used Re-ID datasets.
arXiv Detail & Related papers (2025-08-07T03:13:24Z)
PAT++: a cautionary tale about generative visual augmentation for Object Re-identification [0.0]
We assess the effectiveness of identity-preserving image generation for object re-identification.<n>Our results show consistent performance degradation, driven by domain shifts and failure to retain identity-defining features.<n>These findings challenge assumptions about the transferability of generative models to fine-grained recognition tasks.
arXiv Detail & Related papers (2025-07-19T15:01:05Z)
SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification [61.753607285860944]
We propose a novel two-stage feature learning framework named SD-ReID for AG-ReID. In the first stage, we train a simple ViT-based model to extract coarse-grained representations and controllable conditions. In the second stage, we fine-tune the SD model to learn complementary representations guided by the controllable conditions.
arXiv Detail & Related papers (2025-04-13T12:44:50Z)
TSDW: A Tri-Stream Dynamic Weight Network for Cloth-Changing Person Re-Identification [10.51699935302901]
Cloth-Changing Person Re-identification aims to solve the challenge of identifying individuals across different temporal-spatial scenarios. Existing ReID research primarily relies on face recognition, semantic recognition, and clothing-irrelevant feature identification. We propose a Tri-Stream Dynamic Weight Network (TSDW) that requires only images.
arXiv Detail & Related papers (2025-03-01T13:04:49Z)
LVLM-empowered Multi-modal Representation Learning for Visual Place Recognition [17.388776062997813]
We try to build a discriminative global representations by fusing image data and text descriptions of the the visual scene. The motivation is twofold: (1) Current Large Vision-Language Models (LVLMs) demonstrate extraordinary emergent capability in visual instruction following, and thus provide an efficient and flexible manner in generating text descriptions of images. Although promising, leveraging LVLMs to build multi-modal VPR solutions remains challenging in efficient multi-modal fusion.
arXiv Detail & Related papers (2024-07-09T10:15:31Z)
Dynamic Identity-Guided Attention Network for Visible-Infrared Person Re-identification [17.285526655788274]
Visible-infrared person re-identification (VI-ReID) aims to match people with the same identity between visible and infrared modalities. Existing methods generally try to bridge the cross-modal differences at image or feature level. We introduce a dynamic identity-guided attention network (DIAN) to mine identity-guided and modality-consistent embeddings.
arXiv Detail & Related papers (2024-05-21T12:04:56Z)
Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification [64.36210786350568]
We propose a novel learning framework named textbfEDITOR to select diverse tokens from vision Transformers for multi-modal object ReID. Our framework can generate more discriminative features for multi-modal object ReID.
arXiv Detail & Related papers (2024-03-15T12:44:35Z)
DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present Deformable Attention Transformer ( DAT++), a vision backbone efficient and effective for visual recognition. DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z)
Learning Cross-modality Information Bottleneck Representation for Heterogeneous Person Re-Identification [61.49219876388174]
Visible-Infrared person re-identification (VI-ReID) is an important and challenging task in intelligent video surveillance. Existing methods mainly focus on learning a shared feature space to reduce the modality discrepancy between visible and infrared modalities. We present a novel mutual information and modality consensus network, namely CMInfoNet, to extract modality-invariant identity features.
arXiv Detail & Related papers (2023-08-29T06:55:42Z)
Feature Disentanglement Learning with Switching and Aggregation for Video-based Person Re-Identification [9.068045610800667]
In video person re-identification (Re-ID), the network must consistently extract features of the target person from successive frames. Existing methods tend to focus only on how to use temporal information, which often leads to networks being fooled by similar appearances and same backgrounds. We propose a Disentanglement and Switching and Aggregation Network (DSANet), which segregates the features representing identity and features based on camera characteristics, and pays more attention to ID information.
arXiv Detail & Related papers (2022-12-16T04:27:56Z)
Scalable Video Object Segmentation with Identification Mechanism [125.4229430216776]
This paper explores the challenges of achieving scalable and effective multi-object modeling for semi-supervised Video Object (VOS) We present two innovative approaches, Associating Objects with Transformers (AOT) and Associating Objects with Scalable Transformers (AOST) Our approaches surpass the state-of-the-art competitors and display exceptional efficiency and scalability consistently across all six benchmarks.
arXiv Detail & Related papers (2022-03-22T03:33:27Z)
Semantic Consistency and Identity Mapping Multi-Component Generative Adversarial Network for Person Re-Identification [39.605062525247135]
We propose a semantic consistency and identity mapping multi-component generative adversarial network (SC-IMGAN) which provides style adaptation from one to many domains. Our proposed method outperforms state-of-the-art techniques on six challenging person Re-ID datasets.
arXiv Detail & Related papers (2021-04-28T14:12:29Z)
Multi-Granularity Reference-Aided Attentive Feature Aggregation for Video-based Person Re-identification [98.7585431239291]
Video-based person re-identification aims at matching the same person across video clips. In this paper, we propose an attentive feature aggregation module, namely Multi-Granularity Reference-Attentive Feature aggregation module MG-RAFA. Our framework achieves the state-of-the-art ablation performance on three benchmark datasets.
arXiv Detail & Related papers (2020-03-27T03:49:21Z)

This list is automatically generated from the titles and abstracts of the papers in this site.