Video-based Person Re-identification with Long Short-Term Representation Learning
- URL: http://arxiv.org/abs/2308.03703v1
- Date: Mon, 7 Aug 2023 16:22:47 GMT
- Title: Video-based Person Re-identification with Long Short-Term Representation Learning
- Authors: Xuehu Liu and Pingping Zhang and Huchuan Lu
- Abstract summary: Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapping cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
- Score: 101.62570747820541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based person Re-Identification (V-ReID) aims to retrieve specific
persons from raw videos captured by non-overlapping cameras. As a fundamental
task, it supports many multimedia and computer vision applications. However, due
to the variations of persons and scenes, there are still many obstacles that
must be overcome for high performance. In this work, we notice that both the
long-term and short-term information of persons are important for robust video
representations. Thus, we propose a novel deep learning framework named Long
Short-Term Representation Learning (LSTRL) for effective V-ReID. More
specifically, to extract long-term representations, we propose a
Multi-granularity Appearance Extractor (MAE), in which appearances at four
granularities are effectively captured across multiple frames. Meanwhile, to
extract short-term representations, we propose a Bi-direction Motion Estimator
(BME), in which reciprocal motion information is efficiently extracted from
consecutive frames. The MAE and BME are plug-and-play and can be easily
inserted into existing networks for efficient feature learning. As a result,
they significantly improve the feature representation ability for V-ReID.
Extensive experiments on three widely used benchmarks show that our proposed
approach can deliver better performance than most state-of-the-art methods.
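The abstract's two ideas, long-term appearance pooled at several granularities and short-term motion read in both directions between consecutive frames, can be illustrated with a minimal NumPy sketch. This is not the authors' MAE or BME: the function names, the horizontal-stripe pooling, the granularity values (1, 2, 4, 8), and the rectified frame differences are all assumptions chosen only to make the concepts concrete.

```python
import numpy as np


def multi_granularity_pool(feats, granularities=(1, 2, 4, 8)):
    """Hypothetical long-term appearance descriptor.

    feats: array of shape (T, C, H, W), per-frame feature maps.
    The maps are averaged over time, then the height axis is split
    into g horizontal stripes for each granularity g, and each stripe
    is average-pooled to a C-dimensional vector.
    Returns an array of shape (sum(granularities), C).
    """
    clip = feats.mean(axis=0)  # (C, H, W): average across frames
    parts = []
    for g in granularities:
        for stripe in np.array_split(clip, g, axis=1):  # split along H
            parts.append(stripe.mean(axis=(1, 2)))  # (C,)
    return np.stack(parts)


def bidirectional_motion(feats):
    """Hypothetical short-term motion cue from consecutive frames.

    Forward motion keeps what appears going from frame t to t+1;
    backward motion keeps what appears going from frame t+1 to t.
    Both are rectified differences of shape (T-1, C, H, W).
    """
    diff = feats[1:] - feats[:-1]
    forward = np.maximum(diff, 0.0)
    backward = np.maximum(-diff, 0.0)
    return forward, backward


# Toy clip: 4 frames, 16 channels, 8x4 feature maps (assumed sizes).
rng = np.random.default_rng(0)
clip_feats = rng.random((4, 16, 8, 4))
appearance = multi_granularity_pool(clip_feats)
fwd, bwd = bidirectional_motion(clip_feats)
```

In this sketch the appearance descriptor has 1 + 2 + 4 + 8 = 15 part vectors, and the two motion tensors together recover the signed frame difference (forward minus backward).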
Related papers
- MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos [62.01402470874109]
We present MomentSeeker, a benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval tasks.
It incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval.
It covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios.
We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark.
arXiv Detail & Related papers (2025-02-18T05:50:23Z)
- SEAL: Semantic Attention Learning for Long Video Representation [31.994155533019843]
This paper introduces SEmantic Attention Learning (SEAL), a novel unified representation for long videos.
To reduce computational complexity, long videos are decomposed into three distinct types of semantic entities.
Our representation is versatile, enabling applications across various long video understanding tasks.
arXiv Detail & Related papers (2024-12-02T18:46:12Z)
- SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Long Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
arXiv Detail & Related papers (2024-10-14T12:35:12Z)
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- Self-Supervised Video Representation Learning with Motion-Contrastive Perception [13.860736711747284]
This paper proposes the Motion-Contrastive Perception Network (MCPNet).
MCPNet consists of two branches, namely, Motion Information Perception (MIP) and Contrastive Instance Perception (CIP).
Our method outperforms current state-of-the-art visual-only self-supervised approaches.
arXiv Detail & Related papers (2022-04-10T05:34:46Z)
- Exploring Motion and Appearance Information for Temporal Sentence Grounding [52.01687915910648]
We propose a Motion-Appearance Reasoning Network (MARN) to solve temporal sentence grounding.
We develop separate motion and appearance branches to learn motion-guided and appearance-guided object relations.
Our proposed MARN outperforms previous state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2022-01-03T02:44:18Z)
- Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition [86.31412529187243]
Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth-guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
arXiv Detail & Related papers (2020-10-20T03:06:20Z)