Related papers: Self-Supervised Learning of Whole and Component-Based Semantic Representations for Person Re-Identification

URL: http://arxiv.org/abs/2311.17074v4
Date: Thu, 14 Mar 2024 05:04:06 GMT
Title: Self-Supervised Learning of Whole and Component-Based Semantic Representations for Person Re-Identification
Authors: Siyuan Huang, Yifan Zhou, Ram Prabhakar, Xijun Liu, Yuxiang Guo, Hongrui Yi, Cheng Peng, Rama Chellappa, Chun Pong Lau
Abstract summary: Person Re-Identification (ReID) is a challenging problem, focusing on identifying individuals across diverse settings. We propose a Local Semantic Extraction (LSE) module inspired by Interactive Segmentation Models to capture fine-grained, biometric, and flexible local semantics, enhancing ReID accuracy. We also introduce Semantic ReID (SemReID), a pre-training method that leverages LSE to learn effective semantics for seamless transfer across various ReID domains and modalities.
Score: 46.47881384542614
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Person Re-Identification (ReID) is a challenging problem, focusing on identifying individuals across diverse settings. However, previous ReID methods primarily concentrated on a single domain or modality, such as Clothes-Changing ReID (CC-ReID) and video ReID. Real-world ReID is not constrained by factors like clothes or input types. Recent approaches emphasize learning semantics through pre-training to enhance ReID performance but are hindered by coarse granularity, on-clothes focus, and pre-defined areas. To address these limitations, we propose a Local Semantic Extraction (LSE) module inspired by Interactive Segmentation Models. The LSE module captures fine-grained, biometric, and flexible local semantics, enhancing ReID accuracy. Additionally, we introduce Semantic ReID (SemReID), a pre-training method that leverages LSE to learn effective semantics for seamless transfer across various ReID domains and modalities. Extensive evaluations across nine ReID datasets demonstrate SemReID's robust performance across multiple domains, including clothes-changing ReID, video ReID, unconstrained ReID, and short-term ReID. Our findings highlight the importance of effective semantics in ReID, as SemReID can achieve strong performance without domain-specific designs.

Related papers

CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders [6.159948396712944]
CrossVideoMAE learns both video-level and frame-level rich temporal representations and semantic attributes. Our method integrates mutual temporal information from videos with spatial information from sampled frames. This is critical for acquiring rich, label-free guiding signals from both video and frame image modalities in a self-supervised manner.
arXiv Detail & Related papers (2025-02-08T06:15:39Z)
PooDLe: Pooled and dense self-supervised learning from naturalistic videos [32.656425302538835]
We propose a novel approach that combines an invariance-based SSL objective on pooled representations with a dense SSL objective. We validate our approach on the BDD100K driving video dataset and the Walking Tours first-person video dataset.
arXiv Detail & Related papers (2024-08-20T21:40:48Z)
DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task. We first apply attention masking in each denoising step to make the generation more disentangled across different objects. In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding. CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches. With the simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
arXiv Detail & Related papers (2023-09-28T18:04:43Z)
Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM. It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features. S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
Spatial-Temporal Attention Network for Open-Set Fine-Grained Image Recognition [14.450381668547259]
A vision transformer with the spatial self-attention mechanism could not learn accurate attention maps for distinguishing different categories of fine-grained images. We propose a spatial-temporal attention network for learning fine-grained feature representations, called STAN. The proposed STAN-OSFGR outperforms 9 state-of-the-art open-set recognition methods significantly in most cases.
arXiv Detail & Related papers (2022-11-25T07:46:42Z)
Learning Using Privileged Information for Zero-Shot Action Recognition [15.9032110752123]
This paper presents a novel method that uses object semantics as privileged information to narrow the semantic gap. Experiments on the Olympic Sports, HMDB51 and UCF101 datasets have shown that the proposed method outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-06-17T08:46:09Z)
Stochastic Coherence Over Attention Trajectory For Continuous Learning In Video Streams [64.82800502603138]
This paper proposes a novel neural-network-based approach to progressively and autonomously develop pixel-wise representations in a video stream. The proposed method is based on a human-like attention mechanism that allows the agent to learn by observing what is moving in the attended locations. Our experiments leverage 3D virtual environments and they show that the proposed agents can learn to distinguish objects just by observing the video stream.
arXiv Detail & Related papers (2022-04-26T09:52:31Z)
Spatiotemporal Inconsistency Learning for DeepFake Video Detection [51.747219106855624]
We present a novel temporal modeling paradigm in TIM that exploits the temporal difference over adjacent frames along both the horizontal and vertical directions. The ISM simultaneously uses the spatial information from SIM and the temporal information from TIM to establish a more comprehensive spatial-temporal representation.
arXiv Detail & Related papers (2021-09-04T13:05:37Z)
Self-Supervised Representation Learning from Flow Equivariance [97.13056332559526]
We present a new self-supervised learning representation framework that can be directly deployed on a video stream of complex scenes. Our representations, learned from high-resolution raw video, can be readily used for downstream tasks on static images.
arXiv Detail & Related papers (2021-01-16T23:44:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.