TF-CLIP: Learning Text-free CLIP for Video-based Person
Re-Identification
- URL: http://arxiv.org/abs/2312.09627v1
- Date: Fri, 15 Dec 2023 09:10:05 GMT
- Title: TF-CLIP: Learning Text-free CLIP for Video-based Person
Re-Identification
- Authors: Chenyang Yu and Xuehu Liu and Yingquan Wang and Pingping Zhang and
Huchuan Lu
- Abstract summary: We propose a novel one-stage text-free CLIP-based learning framework named TF-CLIP for video-based person ReID.
More specifically, we extract the identity-specific sequence feature as the CLIP-Memory to replace the text feature.
Our proposed method shows much better results than other state-of-the-art methods on MARS, LS-VID and iLIDS-VID.
- Score: 60.5843635938469
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale language-image pre-trained models (e.g., CLIP) have shown
superior performances on many cross-modal retrieval tasks. However, the problem
of transferring the knowledge learned from such models to video-based person
re-identification (ReID) has barely been explored. In addition, there is a lack
of decent text descriptions in current ReID benchmarks. To address these
issues, in this work, we propose a novel one-stage text-free CLIP-based
learning framework named TF-CLIP for video-based person ReID. More
specifically, we extract the identity-specific sequence feature as the
CLIP-Memory to replace the text feature. Meanwhile, we design a
Sequence-Specific Prompt (SSP) module to update the CLIP-Memory online. To
capture temporal information, we further propose a Temporal Memory Diffusion
(TMD) module, which consists of two key components: Temporal Memory
Construction (TMC) and Memory Diffusion (MD). Technically, TMC allows the
frame-level memories in a sequence to communicate with each other, and to
extract temporal information based on the relations within the sequence. MD
further diffuses the temporal memories to each token in the original features
to obtain more robust sequence features. Extensive experiments demonstrate that
our proposed method shows much better results than other state-of-the-art
methods on MARS, LS-VID and iLIDS-VID. The code is available at
https://github.com/AsuradaYuci/TF-CLIP.
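
The abstract describes TMC as letting frame-level memories in a sequence communicate with each other, and MD as diffusing the resulting temporal memories to every token of the original features. Below is a minimal sketch of that idea, assuming frame-level memories are one pooled vector per frame, TMC is modeled as self-attention across frames, and MD as cross-attention from tokens to the temporal memories. The class name, dimensions, and attention layout are illustrative assumptions, not the authors' implementation; see the linked repository for the official code.

```python
import torch
import torch.nn as nn

class TemporalMemorySketch(nn.Module):
    """Sketch of the TMC + MD idea described in the abstract (assumptions noted above)."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # TMC: frame-level memories exchange information along the time axis.
        self.tmc_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # MD: each token queries the temporal memories to absorb motion cues.
        self.md_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_mem = nn.LayerNorm(dim)
        self.norm_tok = nn.LayerNorm(dim)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (B, T, N, D) = batch, frames, tokens per frame, channels.
        B, T, N, D = frame_feats.shape

        # Frame-level memories: here simply the mean token of each frame (an assumption).
        frame_mem = frame_feats.mean(dim=2)                       # (B, T, D)

        # TMC: memories within a sequence communicate with each other.
        temporal_mem, _ = self.tmc_attn(frame_mem, frame_mem, frame_mem)
        temporal_mem = self.norm_mem(temporal_mem + frame_mem)    # (B, T, D)

        # MD: diffuse temporal memories to every token of the original features.
        tokens = frame_feats.reshape(B, T * N, D)                 # (B, T*N, D)
        diffused, _ = self.md_attn(tokens, temporal_mem, temporal_mem)
        tokens = self.norm_tok(tokens + diffused)

        # Sequence-level feature: average over frames and tokens.
        return tokens.reshape(B, T, N, D).mean(dim=(1, 2))        # (B, D)

if __name__ == "__main__":
    # Toy usage: 2 sequences, 8 frames, 129 tokens (CLS + 128 patches), 512-dim features.
    x = torch.randn(2, 8, 129, 512)
    print(TemporalMemorySketch()(x).shape)  # torch.Size([2, 512])
```

In this sketch the sequence feature plays the role the abstract assigns to the CLIP-Memory, i.e., an identity-specific visual representation that stands in for the missing text feature.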
Related papers
- Semantic Residual Prompts for Continual Learning [21.986800282078498]
We show that our method significantly outperforms both state-of-the-art CL approaches and the zero-shot CLIP test.
Our findings hold true even for datasets with a substantial domain gap w.r.t. the pre-training knowledge of the backbone model.
arXiv Detail & Related papers (2024-03-11T16:23:38Z)
- SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data [69.20254987896674]
SpeechCLIP is an innovative framework that bridges speech and text through images via CLIP without relying on text transcription.
This paper introduces two extensions to SpeechCLIP. First, we apply the Continuous Integrate-and-Fire (CIF) module to replace a fixed number of CLS tokens in the cascaded architecture.
Second, we propose a new hybrid architecture that merges the cascaded and parallel architectures of SpeechCLIP into a multi-task learning framework.
arXiv Detail & Related papers (2024-02-10T14:26:42Z)
- AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning [53.32576252950481]
Continual learning aims to enable a model to incrementally learn knowledge from sequentially arrived data.
In this paper, we propose a non-incremental learner, named AttriCLIP, to incrementally extract knowledge of new classes or tasks.
arXiv Detail & Related papers (2023-05-19T07:39:17Z)
- Enhancing Large Language Model with Self-Controlled Memory Framework [56.38025154501917]
Large Language Models (LLMs) are constrained by their inability to process lengthy inputs, resulting in the loss of critical historical information.
We propose the Self-Controlled Memory (SCM) framework to enhance the ability of LLMs to maintain long-term memory and recall relevant information.
arXiv Detail & Related papers (2023-04-26T07:25:31Z)
- Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring [82.84513669453744]
Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
arXiv Detail & Related papers (2023-01-26T14:12:02Z)
- Multi-scale temporal network for continuous sign language recognition [10.920363368754721]
Continuous Sign Language Recognition is a challenging research task due to the lack of accurate annotation on the temporal sequence of sign language data.
This paper proposes a multi-scale temporal network (MSTNet) to extract more accurate temporal features.
Experimental results on two publicly available datasets demonstrate that our method can effectively extract sign language features in an end-to-end manner without any prior knowledge.
arXiv Detail & Related papers (2022-04-08T06:14:22Z)
- CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval [31.7091206926183]
CLIP (Contrastive Language-Image Pre-training) has demonstrated the power of learning visual concepts from web-collected image-text datasets.
We propose a CLIP4Clip model to transfer the knowledge of the CLIP model to video-language retrieval in an end-to-end manner.
arXiv Detail & Related papers (2021-04-18T13:59:50Z)
- Temporal Complementary Learning for Video Person Re-Identification [110.43147302200101]
This paper proposes a Temporal Complementary Learning Network that extracts complementary features of consecutive video frames for video person re-identification.
A saliency erasing operation drives the specific learner to mine new and complementary parts by erasing the parts activated by previous frames.
A Temporal Saliency Boosting (TSB) module is designed to propagate the salient information among video frames to enhance the salient feature.
arXiv Detail & Related papers (2020-07-18T07:59:01Z)