Video-based Person Re-identification with Spatial and Temporal Memory
Networks
- URL: http://arxiv.org/abs/2108.09039v1
- Date: Fri, 20 Aug 2021 08:01:32 GMT
- Title: Video-based Person Re-identification with Spatial and Temporal Memory
Networks
- Authors: Chanho Eom, Geon Lee, Junghyup Lee, Bumsub Ham
- Abstract summary: Spatial and temporal distractors in person videos make this task much more challenging than image-based person reID.
We introduce novel Spatial and Temporal Memory Networks (STMN).
The spatial memory stores features for spatial distractors that frequently emerge across video frames, while the temporal memory saves attentions optimized for typical temporal patterns in person videos.
We leverage the spatial and temporal memories to refine frame-level person representations and to aggregate the refined frame-level features into a sequence-level person representation, respectively.
- Score: 29.66624606649384
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based person re-identification (reID) aims to retrieve person videos
with the same identity as a query person across multiple cameras. Spatial and
temporal distractors in person videos, such as background clutter and partial
occlusions over frames, respectively, make this task much more challenging than
image-based person reID. We observe that spatial distractors appear
consistently in a particular location, and temporal distractors show several
patterns, e.g., partial occlusions occur in the first few frames, where such
patterns provide informative cues for predicting which frames to focus on
(i.e., temporal attentions). Based on this, we introduce novel Spatial and
Temporal Memory Networks (STMN). The spatial memory stores features for spatial
distractors that frequently emerge across video frames, while the temporal
memory saves attentions which are optimized for typical temporal patterns in
person videos. We leverage the spatial and temporal memories to refine
frame-level person representations and to aggregate the refined frame-level
features into a sequence-level person representation, respectively, effectively
handling spatial and temporal distractors in person videos. We also introduce a
memory spread loss preventing our model from addressing particular items only
in the memories. Experimental results on standard benchmarks, including MARS,
DukeMTMC-VideoReID, and LS-VID, demonstrate the effectiveness of our method.
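To make the mechanism concrete, below is a minimal sketch of memory-based refinement and aggregation in the spirit of the abstract. The module name, the dimensions, the subtraction-style refinement, and the entropy form of the spread loss are all illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of spatial/temporal memory reads and a spread loss, in the
# spirit of STMN. All names, shapes, and operations are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryReIDSketch(nn.Module):
    def __init__(self, feat_dim=256, num_spatial=20, num_temporal=10):
        super().__init__()
        # Learnable memory banks: spatial items model recurring distractor
        # features; temporal items map feature context to attention scores.
        self.spatial_mem = nn.Parameter(torch.randn(num_spatial, feat_dim))
        self.temporal_keys = nn.Parameter(torch.randn(num_temporal, feat_dim))
        self.temporal_vals = nn.Parameter(torch.randn(num_temporal, 1))

    def forward(self, frame_feats):
        # frame_feats: (T, D) frame-level features of one person tracklet.
        # Spatial memory read: soft-address recurring distractor features.
        addr = F.softmax(frame_feats @ self.spatial_mem.t(), dim=-1)  # (T, M)
        distractors = addr @ self.spatial_mem                         # (T, D)
        refined = frame_feats - distractors  # suppress distractors (assumed op)
        # Temporal memory read: predict a per-frame attention score.
        t_addr = F.softmax(refined @ self.temporal_keys.t(), dim=-1)
        attn = F.softmax(t_addr @ self.temporal_vals, dim=0)          # over frames
        # Aggregate refined frame features into one sequence-level feature.
        return (attn * refined).sum(dim=0), addr

def memory_spread_loss(addr):
    # Penalize low entropy of the average addressing weights so that all
    # memory items get used, not just a few (assumed form of the loss).
    p = addr.mean(dim=0).clamp_min(1e-8)
    return (p * p.log()).sum()  # negative entropy; minimizing spreads usage
```

In this sketch, minimizing `memory_spread_loss` maximizes the entropy of the average addressing weights, discouraging the model from reading only a few memory items, which mirrors the stated purpose of the memory spread loss.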
Related papers
- Episodic Memory Representation for Long-form Video Understanding [52.33907540905242]
Large Video Language Models excel at general video understanding but struggle with long-form context window limits.
We introduce Video-EM, a training-free framework inspired by the principles of human memory.
Video-EM achieves performance gains of 4-9 percent over respective baselines while utilizing fewer frames.
arXiv Detail & Related papers (2025-08-13T04:33:07Z)
- STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding [48.12128042470839]
We propose an integrated Spatial-TempOral dynamic Prompting (STOP) model.
It consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting.
STOP consistently achieves superior performance against state-of-the-art methods.
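As a rough illustration of the two modules, here is a hedged sketch that injects spatial prompts into each frame's patch tokens and clip-level temporal prompts across frames. Token shapes, prompt counts, and the injection points are assumptions; note that STOP generates its prompts dynamically, whereas this sketch uses static parameters for brevity.

```python
# Hedged sketch of intra-frame spatial and inter-frame temporal prompting.
import torch
import torch.nn as nn

class DualPrompting(nn.Module):
    def __init__(self, dim=768, n_spatial=4, n_temporal=4):
        super().__init__()
        # Static stand-ins for the spatial (intra-frame) and temporal
        # (inter-frame) prompts; the actual model generates them dynamically.
        self.spatial_prompts = nn.Parameter(torch.zeros(n_spatial, dim))
        self.temporal_prompts = nn.Parameter(torch.zeros(n_temporal, dim))

    def forward(self, patch_tokens):
        # patch_tokens: (T, N, D) patch tokens for T frames of a video.
        T, N, D = patch_tokens.shape
        sp = self.spatial_prompts.unsqueeze(0).expand(T, -1, -1)  # (T, P, D)
        frames = torch.cat([sp, patch_tokens], dim=1)  # prompts lead each frame
        # Flatten the clip and prepend temporal prompts so an encoder can
        # attend across frames through shared prompt tokens.
        clip = frames.reshape(T * frames.shape[1], D)
        return torch.cat([self.temporal_prompts, clip], dim=0)
```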
arXiv Detail & Related papers (2025-03-20T09:16:20Z)
- CrossVideoMAE: Self-Supervised Image-Video Representation Learning with Masked Autoencoders [6.159948396712944]
CrossVideoMAE learns both video-level and frame-level rich spatio-temporal representations and semantic attributes.
Our method integrates mutual spatio-temporal information from videos with spatial information from sampled frames.
This is critical for acquiring rich, label-free guiding signals from both video and frame image modalities in a self-supervised manner.
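For context, here is a minimal sketch of the masked-reconstruction objective underlying masked autoencoders such as this one. The masking ratio, the patch layout, and the `encoder`/`decoder` callables are assumptions, and the cross-modal video-frame coupling described above is omitted.

```python
# Hedged sketch of a masked autoencoding loss on video patches.
import torch

def masked_recon_loss(patches, encoder, decoder, mask_ratio=0.9):
    # patches: (L, D) flattened spatio-temporal patches of one clip.
    # encoder/decoder are assumed callables; real models use ViT blocks.
    L = patches.shape[0]
    keep = torch.randperm(L)[: int(L * (1 - mask_ratio))]
    mask = torch.ones(L, dtype=torch.bool)
    mask[keep] = False                    # True marks patches to reconstruct
    latent = encoder(patches[keep])       # encode only the visible patches
    recon = decoder(latent, mask)         # (L, D) with masked slots filled in
    return ((recon[mask] - patches[mask]) ** 2).mean()
```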
arXiv Detail & Related papers (2025-02-08T06:15:39Z)
- VrdONE: One-stage Video Visual Relation Detection [30.983521962897477]
Video Visual Relation Detection (VidVRD) focuses on understanding how entities interact over time and space in videos.
Traditional methods for VidVRD, challenged by its complexity, typically split the task into two parts: one for identifying which relations are present and another for determining their temporal boundaries.
We propose VrdONE, a streamlined yet efficacious one-stage model for VidVRD.
arXiv Detail & Related papers (2024-08-18T08:38:20Z)
- Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation [156.4142424784322]
Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images.
We propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data.
Our proposed video IPMT model significantly outperforms previous models on two benchmark datasets.
arXiv Detail & Related papers (2023-09-20T09:16:34Z)
- Video Demoireing with Relation-Based Temporal Consistency [68.20281109859998]
Moire patterns, appearing as color distortions, severely degrade image and video quality when filming a screen with digital cameras.
We study how to remove such undesirable moire patterns in videos, namely video demoireing.
arXiv Detail & Related papers (2022-04-06T17:45:38Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
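One plausible reading of the co-attention step, as a hedged sketch: low-level features are re-weighted by their affinity with high-level features before fusion. The exact formulation in the paper may differ.

```python
# Hedged sketch of co-attention fusing low- and high-level feature maps.
import torch
import torch.nn.functional as F

def co_attention(low, high):
    # low, high: (C, H, W) shallow and deep feature maps of the same size.
    C, H, W = low.shape
    l = low.reshape(C, H * W)
    h = high.reshape(C, H * W)
    affinity = F.softmax(l.t() @ h, dim=-1)     # (HW, HW) cross affinity
    attended = (l @ affinity).reshape(C, H, W)  # low re-weighted by high
    return attended + high                      # fused map (assumed sum)
```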
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos [91.29436920371003]
We propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI).
We use temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features.
We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.
arXiv Detail & Related papers (2021-05-25T07:54:35Z)
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z)
- Learning by Aligning Videos in Time [10.075645944474287]
We present a self-supervised approach for learning video representations using temporal video alignment as a pretext task.
We leverage a novel combination of temporal alignment loss and temporal regularization terms, which can be used as supervision signals for training an encoder network.
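As an illustration of using temporal alignment as a supervision signal, here is a hedged sketch of a cycle-consistent soft-alignment loss between two videos' frame embeddings; the paper's actual combination of alignment loss and regularization terms may differ from this form.

```python
# Hedged sketch of a soft-alignment cycle-consistency loss between videos.
import torch
import torch.nn.functional as F

def cycle_alignment_loss(u, v, temperature=0.1):
    # u: (Tu, D), v: (Tv, D) L2-normalized frame embeddings of two videos.
    sim_uv = u @ v.t() / temperature
    soft_v = F.softmax(sim_uv, dim=1) @ v   # soft nearest neighbor in v
    logits = soft_v @ u.t() / temperature   # (Tu, Tu) cycle-back scores
    target = torch.arange(u.shape[0])       # each frame should map back
    return F.cross_entropy(logits, target)  # to itself (cycle consistency)
```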
arXiv Detail & Related papers (2021-03-31T17:55:52Z)
- A Flow-Guided Mutual Attention Network for Video-Based Person Re-Identification [25.217641512619178]
Person ReID is a challenging problem in many analytics and surveillance applications.
Video-based person ReID has recently gained much interest because it allows capturing discriminant spatio-temporal information.
In this paper, the motion pattern of a person is explored as an additional cue for ReID.
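A minimal sketch of how motion can serve as an additional cue via mutual attention between appearance and flow streams; the gating form and channel pooling are assumptions for illustration.

```python
# Hedged sketch of mutual attention between appearance and flow features.
import torch

def mutual_attention(app, flow):
    # app, flow: (C, H, W) appearance and optical-flow feature maps.
    # Each stream gates the other with a spatial map pooled over channels.
    att_from_flow = torch.sigmoid(flow.mean(dim=0, keepdim=True))  # (1, H, W)
    att_from_app = torch.sigmoid(app.mean(dim=0, keepdim=True))
    return app * att_from_flow, flow * att_from_app
```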
arXiv Detail & Related papers (2020-08-09T18:58:11Z)
- Co-Saliency Spatio-Temporal Interaction Network for Person Re-Identification in Videos [85.6430597108455]
We propose a novel Co-Saliency Spatio-Temporal Interaction Network (CSTNet) for person re-identification in videos.
It captures the common salient foreground regions among video frames and explores the spatial-temporal long-range context interdependency from such regions.
Multiple spatial-temporal interaction modules within CSTNet are proposed, which exploit the spatial and temporal long-range context interdependencies of such features, as well as their spatial-temporal correlation.
arXiv Detail & Related papers (2020-04-10T10:23:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.