TF-CLIP: Learning Text-free CLIP for Video-based Person
Re-Identification
- URL: http://arxiv.org/abs/2312.09627v1
- Date: Fri, 15 Dec 2023 09:10:05 GMT
- Title: TF-CLIP: Learning Text-free CLIP for Video-based Person
Re-Identification
- Authors: Chenyang Yu and Xuehu Liu and Yingquan Wang and Pingping Zhang and
Huchuan Lu
- Abstract summary: We propose a novel one-stage text-free CLIP-based learning framework named TF-CLIP for video-based person ReID.
More specifically, we extract the identity-specific sequence feature as the CLIP-Memory to replace the text feature.
Our proposed method shows much better results than other state-of-the-art methods on MARS, LS-VID and iLIDS-VID.
- Score: 60.5843635938469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale language-image pre-trained models (e.g., CLIP) have shown
superior performances on many cross-modal retrieval tasks. However, the problem
of transferring the knowledge learned from such models to video-based person
re-identification (ReID) has barely been explored. In addition, there is a lack
of decent text descriptions in current ReID benchmarks. To address these
issues, in this work, we propose a novel one-stage text-free CLIP-based
learning framework named TF-CLIP for video-based person ReID. More
specifically, we extract the identity-specific sequence feature as the
CLIP-Memory to replace the text feature. Meanwhile, we design a
Sequence-Specific Prompt (SSP) module to update the CLIP-Memory online. To
capture temporal information, we further propose a Temporal Memory Diffusion
(TMD) module, which consists of two key components: Temporal Memory
Construction (TMC) and Memory Diffusion (MD). Technically, TMC allows the
frame-level memories in a sequence to communicate with each other, and to
extract temporal information based on the relations within the sequence. MD
further diffuses the temporal memories to each token in the original features
to obtain more robust sequence features. Extensive experiments demonstrate that
our proposed method shows much better results than other state-of-the-art
methods on MARS, LS-VID and iLIDS-VID. The code is available at
https://github.com/AsuradaYuci/TF-CLIP.
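
The abstract's two main ideas can be illustrated with a minimal sketch. It is not the authors' implementation (see the repository linked above for that), and everything below is an assumption made for illustration: the function names are hypothetical, the CLIP-Memory is taken to be the per-identity mean of L2-normalized sequence features, the video-to-memory objective is a plain softmax cross-entropy with a temperature of 0.07, and both TMC and MD are approximated with projection-free dot-product attention. The online update performed by the SSP module is not modeled.

```python
import math
from collections import defaultdict


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # guard against the zero vector
    return [x / norm for x in v]


def build_clip_memory(seq_features, labels):
    """Assumed CLIP-Memory: one normalized mean feature per identity,
    used in place of a text feature."""
    groups = defaultdict(list)
    for feat, pid in zip(seq_features, labels):
        groups[pid].append(l2_normalize(feat))
    memory = {}
    for pid, feats in groups.items():
        mean = [sum(col) / len(feats) for col in zip(*feats)]
        memory[pid] = l2_normalize(mean)
    return memory


def video_to_memory_loss(feature, pid, memory, temperature=0.07):
    """Cross-entropy over cosine similarities between one sequence
    feature and every memory entry (assumed form of the objective)."""
    f = l2_normalize(feature)
    logits = {k: dot(f, m) / temperature for k, m in memory.items()}
    top = max(logits.values())
    log_sum = top + math.log(sum(math.exp(v - top) for v in logits.values()))
    return log_sum - logits[pid]


def attend(queries, keys, values):
    """Scaled dot-product attention without learned projections."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [dot(q, k) / scale for k in keys]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, values)) / total
                    for j in range(len(values[0]))])
    return out


def temporal_memory_diffusion(frame_memories, tokens):
    """Assumed stand-in for TMD: frame-level memories attend to each
    other (TMC), then each token attends to the resulting temporal
    memories and adds the result residually (MD)."""
    temporal = attend(frame_memories, frame_memories, frame_memories)
    diffused = attend(tokens, temporal, temporal)
    return [[t + d for t, d in zip(tok, dif)]
            for tok, dif in zip(tokens, diffused)]


if __name__ == "__main__":
    # Toy 2-D features for two identities (made-up data).
    feats = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 0.9]]
    memory = build_clip_memory(feats, [0, 0, 1, 1])
    print(video_to_memory_loss([1.0, 0.05], 0, memory))
    print(temporal_memory_diffusion([[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]]))
```

In this toy setup, a sequence feature scores a lower loss against its own identity's memory entry than against another identity's, which is the role the text feature would otherwise play in CLIP-style training.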
Related papers
- ReWind: Understanding Long Videos with Instructed Learnable Memory [8.002949551539297]
Vision-Language Models (VLMs) are crucial for applications requiring integrated understanding of textual and visual information.
We introduce ReWind, a novel memory-based VLM designed for efficient long video understanding while preserving temporal fidelity.
We empirically demonstrate ReWind's superior performance in visual question answering (VQA) and temporal grounding tasks, surpassing previous methods on long video benchmarks.
arXiv Detail & Related papers (2024-11-23T13:23:22Z)
- Semantic Residual Prompts for Continual Learning [21.986800282078498]
We show that our method significantly outperforms both state-of-the-art CL approaches and the zero-shot CLIP test.
Our findings hold true even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model.
arXiv Detail & Related papers (2024-03-11T16:23:38Z)
- SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data [69.20254987896674]
SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription.
This paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture.
Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures of SpeechCLIP into a multi-task learning framework.
arXiv Detail & Related papers (2024-02-10T14:26:42Z)
- AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning [53.32576252950481]
Continual learning aims to enable a model to incrementally learn knowledge from sequentially arrived data.
In this paper, we propose a non-incremental learner, named AttriCLIP, to incrementally extract knowledge of new classes or tasks.
arXiv Detail & Related papers (2023-05-19T07:39:17Z)
- Enhancing Large Language Model with Self-Controlled Memory Framework [56.38025154501917]
Large Language Models (LLMs) are constrained by their inability to process lengthy inputs, resulting in the loss of critical historical information.
We propose the Self-Controlled Memory (SCM) framework to enhance the ability of LLMs to maintain long-term memory and recall relevant information.
arXiv Detail & Related papers (2023-04-26T07:25:31Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism extending CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval [31.7091206926183]
CLIP (Contrastive Language-Image Pre-training) has demonstrated the power of learning visual concepts from web-collected image-text datasets.
We propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.
arXiv Detail & Related papers (2021-04-18T13:59:50Z)
- Temporal Complementary Learning for Video Person Re-Identification [110.43147302200101]
This paper proposes a Temporal Complementary Learning Network that extracts complementary features of consecutive video frames for video person re-identification.
A saliency erasing operation drives the specific learner to mine new and complementary parts by erasing the parts activated by previous frames.
A Temporal Saliency Boosting (TSB) module is designed to propagate the salient information among video frames to enhance the salient feature.
arXiv Detail & Related papers (2020-07-18T07:59:01Z)