KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos
- URL: http://arxiv.org/abs/2507.07393v3
- Date: Thu, 17 Jul 2025 02:04:22 GMT
- Title: KeyRe-ID: Keypoint-Guided Person Re-Identification using Part-Aware Representation in Videos
- Authors: Jinseong Kim, Jeonghoon Song, Gyeongseon Baek, Byeongjoon Noh
- Abstract summary: We propose a keypoint-guided video-based person re-identification framework consisting of global and local branches that leverage human keypoints for enhanced learning. Experiments on the MARS and iLIDS-VID benchmarks demonstrate state-of-the-art performance, achieving 91.73% mAP and 97.32% Rank-1 accuracy on MARS. The code for this work will be publicly available on GitHub upon publication.
- Score: 0.07499722271664144
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose \textbf{KeyRe-ID}, a keypoint-guided video-based person re-identification framework consisting of global and local branches that leverage human keypoints for enhanced spatiotemporal representation learning. The global branch captures holistic identity semantics through Transformer-based temporal aggregation, while the local branch dynamically segments body regions based on keypoints to generate fine-grained, part-aware features. Extensive experiments on MARS and iLIDS-VID benchmarks demonstrate state-of-the-art performance, achieving 91.73\% mAP and 97.32\% Rank-1 accuracy on MARS, and 96.00\% Rank-1 and 100.0\% Rank-5 accuracy on iLIDS-VID. The code for this work will be publicly available on GitHub upon publication.
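The local branch described above segments body regions from keypoints to produce part-aware features. A minimal sketch of that idea is given below; the function name `part_aware_pool`, the (y, x) keypoint layout, and the stripe-shaped regions are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def part_aware_pool(feat, keypoints, part_bounds):
    """Average-pool a frame's feature map inside keypoint-derived body regions.

    feat: (H, W, C) feature map for one frame.
    keypoints: (K, 2) array of (y, x) joint locations, e.g. from a pose estimator.
    part_bounds: list of (start, end) index pairs into `keypoints`, one per body part.
    Returns a (len(part_bounds), C) matrix of part descriptors.
    """
    H, W, C = feat.shape
    parts = []
    for start, end in part_bounds:
        ys = keypoints[start:end, 0]
        # Dynamic horizontal stripe spanning this part's joints: because the
        # bounds come from detected keypoints, the stripe adapts to the pose
        # instead of using fixed uniform strips.
        top = max(int(ys.min()), 0)
        bot = min(int(ys.max()) + 1, H)
        region = feat[top:bot]                      # (h, W, C) stripe
        parts.append(region.reshape(-1, C).mean(axis=0))
    return np.stack(parts)
```

In a video setting, such per-frame part descriptors would then be aggregated over time (the paper uses Transformer-based temporal aggregation in the global branch).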
Related papers
- Other Tokens Matter: Exploring Global and Local Features of Vision Transformers for Object Re-Identification [63.147482497821166]
We first explore the influence of global and local features of ViT and then propose a novel Global-Local Transformer (GLTrans) for high-performance object Re-ID.
Our proposed method achieves superior performance on four object Re-ID benchmarks.
arXiv Detail & Related papers (2024-04-23T12:42:07Z) - Keypoint-Augmented Self-Supervised Learning for Medical Image
Segmentation with Limited Annotation [21.203307064937142]
We present a keypoint-augmented fusion layer that extracts representations preserving both short- and long-range self-attention.
In particular, we augment the CNN feature map at multiple scales by incorporating an additional input that learns long-range spatial self-attention.
Our method further outperforms existing SSL methods by producing more robust self-attention.
arXiv Detail & Related papers (2023-10-02T22:31:30Z) - Segment Anything Meets Point Tracking [116.44931239508578]
This paper presents a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking.
We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark.
Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions.
arXiv Detail & Related papers (2023-07-03T17:58:01Z) - R-MAE: Regions Meet Masked Autoencoders [113.73147144125385]
We explore regions as a potential visual analogue of words for self-supervised image representation learning.
Inspired by Masked Autoencoding (MAE), a generative pre-training baseline, we propose masked region autoencoding to learn from groups of pixels or regions.
arXiv Detail & Related papers (2023-06-08T17:56:46Z) - OAMatcher: An Overlapping Areas-based Network for Accurate Local Feature
Matching [9.006654114778073]
We propose OAMatcher, a detector-free method that imitates human behavior to generate dense and accurate matches.
OAMatcher predicts overlapping areas to promote effective and clean global context aggregation.
Comprehensive experiments demonstrate that OAMatcher outperforms the state-of-the-art methods on several benchmarks.
arXiv Detail & Related papers (2023-02-12T03:32:45Z) - Distilling Facial Knowledge With Teacher-Tasks:
Semantic-Segmentation-Features For Pose-Invariant Face-Recognition [1.1811442086145123]
The proposed Seg-Distilled-ID network jointly learns identification and semantic-segmentation tasks, where the segmentation task is then "distilled" into the identification task.
Performance is benchmarked against three state-of-the-art encoders on a publicly available data-set.
Experimental evaluations show that the Seg-Distilled-ID network yields notable benefits, achieving 99.9% test accuracy compared to 81.6% on ResNet-101, 96.1% on VGG-19, and 96.3% on InceptionV3.
arXiv Detail & Related papers (2022-09-02T15:24:22Z) - Pyramid Region-based Slot Attention Network for Temporal Action Proposal
Generation [17.01865793062819]
Temporal action proposal generation can largely benefit from proper temporal and semantic context exploitation.
We present a novel Pyramid Region-based Slot Attention Network (PRSA-Net) to learn a unified visual representation with rich temporal and semantic context.
arXiv Detail & Related papers (2022-06-21T03:40:58Z) - Gait Recognition in the Wild: A Large-scale Benchmark and NAS-based
Baseline [95.88825497452716]
Gait benchmarks empower the research community to train and evaluate high-performance gait recognition systems.
GREW is the first large-scale dataset for gait recognition in the wild.
SPOSGait is the first NAS-based gait recognition model.
arXiv Detail & Related papers (2022-05-05T14:57:39Z) - Global-Local Dynamic Feature Alignment Network for Person
Re-Identification [5.202841879001503]
We propose a simple and efficient Local Sliding Alignment (LSA) strategy to dynamically align the local features of two images by setting a sliding window on the local stripes of the pedestrian.
LSA can effectively suppress spatial misalignment and does not need to introduce extra supervision information.
We introduce LSA into the local branch of GLDFA-Net to guide the computation of distance metrics, which can further improve the accuracy of the testing phase.
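The LSA strategy above (matching each local stripe against a small sliding window of the other pedestrian's stripes) can be sketched as follows; the function name, the window size, and the use of plain Euclidean distance are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def sliding_align_dist(stripes_a, stripes_b, window=2):
    """Distance between two pedestrians' local stripe features with sliding alignment.

    stripes_a, stripes_b: (S, C) local stripe descriptors (one row per stripe).
    window: maximum vertical offset a stripe may slide to find its counterpart.
    Matching each stripe to the closest stripe within the window tolerates
    vertical misalignment without any extra supervision.
    """
    S = stripes_a.shape[0]
    total = 0.0
    for i in range(S):
        lo, hi = max(0, i - window), min(S, i + window + 1)
        cand = stripes_b[lo:hi]                          # stripes reachable by sliding
        d = np.linalg.norm(cand - stripes_a[i], axis=1)  # per-candidate distance
        total += d.min()                                 # keep the best local match
    return total / S
```

Used inside the local branch at test time, this window-limited minimum replaces a rigid stripe-to-stripe comparison, which is how spatial misalignment gets suppressed.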
arXiv Detail & Related papers (2021-09-13T07:53:36Z) - Watching You: Global-guided Reciprocal Learning for Video-based Person
Re-identification [82.6971648465279]
We propose a novel Global-guided Reciprocal Learning framework for video-based person Re-ID.
Our approach can achieve better performance than other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-07T12:27:42Z) - Effective Action Recognition with Embedded Key Point Shifts [19.010874017607247]
We propose a novel temporal feature extraction module, named the Key Point Shifts Embedding Module (KPSEM).
Key points are adaptively extracted as feature points with maximum feature values at split regions, while key point shifts are the spatial displacements of corresponding key points.
Our method achieves competitive performance through embedding key point shifts with trivial computational cost.
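A rough sketch of the mechanism described above, with key points taken as region-wise argmax locations and shifts as their frame-to-frame displacement; the 2x2 grid and single-channel feature maps are simplifying assumptions, not the KPSEM implementation:

```python
import numpy as np

def keypoint_shifts(feat_t, feat_t1, grid=(2, 2)):
    """Key point shifts between two consecutive frames.

    feat_t, feat_t1: (H, W) single-channel feature maps at times t and t+1.
    The map is split into grid[0] x grid[1] regions; in each region the key
    point is the location of the maximum feature value, and the shift is the
    displacement of that key point between the two frames.
    Returns a (grid[0] * grid[1], 2) array of (dy, dx) shifts.
    """
    H, W = feat_t.shape
    gh, gw = H // grid[0], W // grid[1]
    shifts = []
    for r in range(grid[0]):
        for c in range(grid[1]):
            ys, xs = r * gh, c * gw
            patch_t = feat_t[ys:ys + gh, xs:xs + gw]
            patch_t1 = feat_t1[ys:ys + gh, xs:xs + gw]
            # Key point = argmax location within the region at each time step.
            p0 = np.unravel_index(patch_t.argmax(), patch_t.shape)
            p1 = np.unravel_index(patch_t1.argmax(), patch_t1.shape)
            shifts.append((p1[0] - p0[0], p1[1] - p0[1]))
    return np.array(shifts)
```

The resulting shift vectors are cheap to compute (a few argmax operations per frame pair), which matches the abstract's claim of trivial computational cost.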
arXiv Detail & Related papers (2020-08-26T05:19:04Z) - SceneEncoder: Scene-Aware Semantic Segmentation of Point Clouds with A
Learnable Scene Descriptor [51.298760338410624]
We propose a SceneEncoder module to impose a scene-aware guidance to enhance the effect of global information.
The module predicts a scene descriptor, which learns to represent the categories of objects existing in the scene.
We also design a region similarity loss to propagate distinguishing features to their own neighboring points with the same label.
arXiv Detail & Related papers (2020-01-24T16:53:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.