SAS-VPReID: A Scale-Adaptive Framework with Shape Priors for Video-based Person Re-Identification at Extreme Far Distances
- URL: http://arxiv.org/abs/2601.05535v1
- Date: Fri, 09 Jan 2026 05:22:58 GMT
- Title: SAS-VPReID: A Scale-Adaptive Framework with Shape Priors for Video-based Person Re-Identification at Extreme Far Distances
- Authors: Qiwei Yang, Pingping Zhang, Yuhao Wang, Zijing Gong,
- Abstract summary: Video-based Person Re-IDentification (VPReID) aims to retrieve the same person from videos captured by non-overlapping cameras. At extreme far distances, VPReID is highly challenging due to severe resolution degradation, drastic viewpoint variation and inevitable appearance noise. We propose a Scale-Adaptive framework with Shape Priors for VPReID, named SAS-VPReID.
- Score: 30.963383617202755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based Person Re-IDentification (VPReID) aims to retrieve the same person from videos captured by non-overlapping cameras. At extreme far distances, VPReID is highly challenging due to severe resolution degradation, drastic viewpoint variation and inevitable appearance noise. To address these issues, we propose a Scale-Adaptive framework with Shape Priors for VPReID, named SAS-VPReID. The framework is built upon three complementary modules. First, we deploy a Memory-Enhanced Visual Backbone (MEVB) to extract discriminative feature representations, which leverages the CLIP vision encoder and multi-proxy memory. Second, we propose a Multi-Granularity Temporal Modeling (MGTM) to construct sequences at multiple temporal granularities and adaptively emphasize motion cues across scales. Third, we incorporate Prior-Regularized Shape Dynamics (PRSD) to capture body structure dynamics. With these modules, our framework can obtain more discriminative feature representations. Experiments on the VReID-XFD benchmark demonstrate the effectiveness of each module, and our final framework ranks first on the VReID-XFD challenge leaderboard. The source code is available at https://github.com/YangQiWei3/SAS-VPReID.
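The multi-granularity idea behind MGTM can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the authors' code: the function names, the choice of strides, and the softmax-based scale weighting are all assumptions; the paper only states that sequences are built at multiple temporal granularities and that motion cues are adaptively emphasized across scales.

```python
import numpy as np

def multi_granularity_pool(frames, strides=(1, 2, 4)):
    """Pool a tracklet's per-frame features at several temporal granularities.

    frames: (T, C) array of per-frame feature vectors.
    Returns an (S, C) array, one temporally averaged feature per stride.
    """
    # Larger strides subsample the tracklet more coarsely, so each row
    # summarizes motion at a different temporal scale.
    return np.stack([frames[::s].mean(axis=0) for s in strides])

def scale_adaptive_fuse(pooled, scores):
    """Fuse per-granularity features with softmax weights.

    pooled: (S, C) per-granularity features.
    scores: (S,) relevance scores (in a real model these would be learned).
    Returns a fused (C,) representation emphasizing the most useful scale.
    """
    w = np.exp(scores - scores.max())
    w = w / w.sum()                            # softmax over temporal scales
    return (w[:, None] * pooled).sum(axis=0)
```

With uniform scores the fusion reduces to a plain average over granularities; a learned scoring head would instead shift weight toward the temporal scale whose motion cues are most discriminative for the current tracklet.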
Related papers
- DreamID-V:Bridging the Image-to-Video Gap for High-Fidelity Face Swapping via Diffusion Transformer [21.788582116033684]
Video Face Swapping (VFS) requires seamlessly injecting a source identity into a target video. Existing methods struggle to maintain identity similarity and attribute preservation while preserving temporal consistency. We propose a comprehensive framework to seamlessly transfer the superiority of Image Face Swapping to the video domain.
arXiv Detail & Related papers (2026-01-04T08:07:11Z) - X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification [79.37768038337971]
We propose a novel cross-modality feature learning framework named X-ReID for VVI-ReID. Specifically, we first propose a Cross-modality Prototype Collaboration (CPC). Then, a Multi-granularity Information Interaction (MII) is designed, incorporating short-term interactions from adjacent frames, long-term cross-frame information fusion, and cross-modality feature alignment.
arXiv Detail & Related papers (2025-11-22T07:57:15Z) - BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation [70.27358326228399]
We propose BasicAVSR for arbitrary-scale video super-resolution (AVSR). AVSR aims to enhance the resolution of video frames, potentially with various scaling factors. We show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed.
arXiv Detail & Related papers (2025-10-30T05:08:45Z) - SD-ReID: View-aware Stable Diffusion for Aerial-Ground Person Re-Identification [74.36139886192495]
We propose a novel generative framework named SD-ReID for AG-ReID. We first train a ViT-based model to extract person representations along with controllable conditions, including identity and view conditions. We then fine-tune the Stable Diffusion (SD) model to enhance person representations guided by these controllable conditions.
arXiv Detail & Related papers (2025-04-13T12:44:50Z) - AG-VPReID: A Challenging Large-Scale Benchmark for Aerial-Ground Video-based Person Re-Identification [39.350429734981184]
We introduce AG-VPReID, a new large-scale dataset for aerial-ground video-based person re-identification (ReID). This dataset comprises 6,632 subjects, 32,321 tracklets, and over 9.6 million frames captured by drones (at altitudes ranging from 15 to 120 m), CCTV, and wearable cameras. We propose AG-VPReID-Net, an end-to-end framework composed of three complementary streams.
arXiv Detail & Related papers (2025-03-11T07:38:01Z) - The Devil is in Temporal Token: High Quality Video Reasoning Segmentation [68.33080352141653]
Methods for video reasoning segmentation rely heavily on a single special token to represent the object in the video. We propose VRS-HQ, an end-to-end video reasoning segmentation approach. Our results highlight the strong temporal reasoning and segmentation capabilities of our method.
arXiv Detail & Related papers (2025-01-15T03:17:24Z) - Arbitrary-Scale Video Super-Resolution with Structural and Textural Priors [80.92195378575671]
We describe a strong baseline for arbitrary-scale video super-resolution (AVSR).
We then introduce ST-AVSR by equipping our baseline with a multi-scale structural and textural prior computed from the pre-trained VGG network.
Comprehensive experiments show that ST-AVSR significantly improves super-resolution quality, generalization ability, and inference speed over the state of the art.
arXiv Detail & Related papers (2024-07-13T15:27:39Z) - HAT: Hierarchical Aggregation Transformers for Person Re-identification [87.02828084991062]
We take advantage of both CNNs and Transformers for image-based person Re-ID with high performance.
This work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID.
arXiv Detail & Related papers (2021-07-13T09:34:54Z) - Reference-Aided Part-Aligned Feature Disentangling for Video Person Re-Identification [18.13546384207381]
We propose a Reference-Aided Part-Aligned (RAPA) framework to disentangle robust features of different parts.
By using both modules, the informative parts of pedestrians in videos are well aligned and a more discriminative feature representation is generated.
arXiv Detail & Related papers (2021-03-21T06:53:57Z)