Video-based Person Re-identification with Long Short-Term Representation
Learning
- URL: http://arxiv.org/abs/2308.03703v1
- Date: Mon, 7 Aug 2023 16:22:47 GMT
- Title: Video-based Person Re-identification with Long Short-Term Representation
Learning
- Authors: Xuehu Liu and Pingping Zhang and Huchuan Lu
- Abstract summary: Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapped cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
- Score: 101.62570747820541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based person Re-Identification (V-ReID) aims to retrieve specific
persons from raw videos captured by non-overlapped cameras. As a fundamental
task, it underpins many multimedia and computer vision applications. However, due
to the variations of persons and scenes, there are still many obstacles that
must be overcome for high performance. In this work, we notice that both the
long-term and short-term information of persons is important for robust video
representations. Thus, we propose a novel deep learning framework named Long
Short-Term Representation Learning (LSTRL) for effective V-ReID. More
specifically, to extract long-term representations, we propose a
Multi-granularity Appearance Extractor (MAE), in which four granularity
appearances are effectively captured across multiple frames. Meanwhile, to
extract short-term representations, we propose a Bi-direction Motion Estimator
(BME), in which reciprocal motion information is efficiently extracted from
consecutive frames. The MAE and BME are plug-and-play and can be easily
inserted into existing networks for efficient feature learning. As a result,
they significantly improve the feature representation ability for V-ReID.
Extensive experiments on three widely used benchmarks show that our proposed
approach delivers better performance than most state-of-the-art methods.
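The abstract positions the MAE and BME as plug-and-play modules that slot into an existing frame-level backbone, one capturing long-term appearance across frames and the other short-term motion between consecutive frames. As a rough illustration of that insertion pattern only, here is a minimal PyTorch sketch; the module internals (temporal averaging and bidirectional frame differences) are simplified stand-ins assumed for brevity, not the actual MAE or BME designs from the paper.

```python
# Hypothetical sketch only (not the authors' released code). It illustrates the
# plug-and-play pattern described in the abstract: temporal modules inserted
# between frame-level backbone stages. AppearanceExtractor and MotionEstimator
# are simplified stand-ins, not the actual MAE/BME designs.
import torch
import torch.nn as nn


class AppearanceExtractor(nn.Module):
    """Long-term cue stand-in: mixes the temporal average back into each frame."""

    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        long_term = x.mean(dim=1, keepdim=True).expand(b, t, c, h, w)
        mixed = self.mix(long_term.reshape(b * t, c, h, w)).view(b, t, c, h, w)
        return x + mixed


class MotionEstimator(nn.Module):
    """Short-term cue stand-in: forward/backward frame differences as a motion proxy."""

    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        pad = torch.zeros_like(x[:, :1])
        fwd = torch.cat([x[:, 1:] - x[:, :-1], pad], dim=1)  # frame t+1 minus frame t
        bwd = torch.cat([pad, x[:, :-1] - x[:, 1:]], dim=1)  # frame t minus frame t+1
        motion = self.proj((fwd + bwd).reshape(b * t, c, h, w)).view(b, t, c, h, w)
        return x + motion


class VideoBackbone(nn.Module):
    """Tiny frame-level CNN with the two temporal modules slotted between stages."""

    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU())
        self.appearance = AppearanceExtractor(32)  # long-term insertion point
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.motion = MotionEstimator(64)          # short-term insertion point
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:  # clip: (B, T, 3, H, W)
        b, t, c, h, w = clip.shape
        x = self.stage1(clip.reshape(b * t, c, h, w))
        x = self.appearance(x.view(b, t, *x.shape[1:]))
        b, t, c2, h2, w2 = x.shape
        x = self.stage2(x.reshape(b * t, c2, h2, w2))
        x = self.motion(x.view(b, t, *x.shape[1:]))
        feats = self.pool(x.flatten(0, 1)).flatten(1).view(b, t, -1)
        return feats.mean(dim=1)                   # clip-level descriptor for retrieval


if __name__ == "__main__":
    clip = torch.randn(2, 8, 3, 64, 32)            # 2 clips, 8 frames, 64x32 crops
    print(VideoBackbone()(clip).shape)             # torch.Size([2, 64])
```

Running the script prints the shape of the clip-level descriptor; in a real V-ReID pipeline such a vector would be compared across cameras with a metric such as cosine or Euclidean distance.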
Related papers
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- Chain-of-Spot: Interactive Reasoning Improves Large Vision-Language Models [81.71651422951074]
Chain-of-Spot (CoS) is a novel approach that enhances feature extraction by focusing on key regions of interest.
This technique allows LVLMs to access more detailed visual information without altering the original image resolution.
Our empirical findings demonstrate a significant improvement in LVLMs' ability to understand and reason about visual content.
arXiv Detail & Related papers (2024-03-19T17:59:52Z)
- Just a Glimpse: Rethinking Temporal Information for Video Continual Learning [58.7097258722291]
We propose a novel replay mechanism for effective video continual learning based on individual frames.
Under extreme memory constraints, video diversity plays a more significant role than temporal information.
Our method achieves state-of-the-art performance, outperforming the previous state-of-the-art by up to 21.49%.
arXiv Detail & Related papers (2023-05-28T19:14:25Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
We propose a Motion-Contrastive Perception Network (MCPNet) for self-supervised video representation learning.
MCPNet consists of two branches, namely, Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN significantly outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth-guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.