Learning CLIP Guided Visual-Text Fusion Transformer for Video-based
Pedestrian Attribute Recognition
- URL: http://arxiv.org/abs/2304.10091v1
- Date: Thu, 20 Apr 2023 05:18:28 GMT
- Title: Learning CLIP Guided Visual-Text Fusion Transformer for Video-based
Pedestrian Attribute Recognition
- Authors: Jun Zhu, Jiandong Jin, Zihan Yang, Xiaohao Wu, Xiao Wang
- Abstract summary: We propose to recognize human attributes from video frames, which makes full use of temporal information.
We formulate video-based PAR as a vision-language fusion problem and adopt the pre-trained big model CLIP to extract feature embeddings of the given video frames.
- Score: 23.748227536306295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing pedestrian attribute recognition (PAR) algorithms are mainly
developed for static images. However, their performance is unreliable on images
with challenging factors such as heavy occlusion and motion blur. In this work,
we propose to recognize human attributes from video frames, which makes full use
of temporal information. Specifically, we formulate video-based PAR as a
vision-language fusion problem and adopt the pre-trained big model CLIP to
extract feature embeddings of the given video frames. To better utilize the
semantic information, we take the attribute list as another input and transform
the attribute words/phrases into corresponding sentences via split, expand, and
prompt operations. The text encoder of CLIP is then used for language embedding.
The averaged visual tokens and the text tokens are concatenated and fed into a
fusion Transformer for multi-modal interactive learning. The enhanced tokens are
finally fed into a classification head for pedestrian attribute prediction.
Extensive experiments on a large-scale video-based PAR dataset fully validate
the effectiveness of our proposed framework.
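Read procedurally, the abstract describes a compact pipeline: frozen CLIP encoders produce frame embeddings and attribute-sentence embeddings, the frame embeddings are averaged into a single visual token, and a small fusion Transformer plus classification head yields per-attribute logits. The minimal PyTorch sketch below follows that flow, assuming OpenAI's `clip` package for the encoders; the prompt template, fusion depth, and per-token classification head are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import clip  # https://github.com/openai/CLIP

class VideoAttributeFusion(nn.Module):
    def __init__(self, attribute_names, clip_name="ViT-B/32", dim=512, num_layers=2):
        super().__init__()
        self.clip_model, _ = clip.load(clip_name, device="cpu")
        for p in self.clip_model.parameters():       # CLIP encoders stay frozen here
            p.requires_grad = False
        # "split, expand, and prompt": turn each attribute word/phrase into a sentence
        # (this template is illustrative; the paper's templates may differ)
        sentences = [f"a photo of a pedestrian with the attribute {a}"
                     for a in attribute_names]
        self.register_buffer("text_tokens", clip.tokenize(sentences))  # (num_attr, 77)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, 1)                # one logit per enhanced attribute token

    @torch.no_grad()
    def _encode(self, frames):
        # frames: (batch, num_frames, 3, 224, 224) CLIP-preprocessed video clips
        b, t = frames.shape[:2]
        vis = self.clip_model.encode_image(frames.flatten(0, 1)).float()    # (b*t, dim)
        vis = vis.view(b, t, -1).mean(dim=1, keepdim=True)                  # average over frames
        txt = self.clip_model.encode_text(self.text_tokens).float()         # (num_attr, dim)
        return torch.cat([vis, txt.unsqueeze(0).expand(b, -1, -1)], dim=1)  # (b, 1+num_attr, dim)

    def forward(self, frames):
        tokens = self.fusion(self._encode(frames))      # multi-modal interactive learning
        return self.head(tokens[:, 1:, :]).squeeze(-1)  # attribute logits, (batch, num_attr)
```

For instance, with 26 attribute names and clips of 8 preprocessed frames shaped (batch, 8, 3, 224, 224), the forward pass returns a (batch, 26) tensor of attribute logits suitable for a multi-label loss such as nn.BCEWithLogitsLoss.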
Related papers
- Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are mainly developed based on static images.
We propose to understand human attributes using video frames that can fully exploit temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion [23.62010759076202]
We formulate PAR as a vision-language fusion problem and fully exploit the relations between pedestrian images and attribute labels.
Our proposed PAR algorithm adjusts only 0.75% of the learnable parameters compared with the full fine-tuning strategy (a frozen-backbone sketch of this idea appears after this list).
arXiv Detail & Related papers (2023-12-17T11:59:14Z) - Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval [24.691270610091554]
In this paper, we aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts.
We obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC.
arXiv Detail & Related papers (2023-08-15T08:54:25Z) - Visual Commonsense-aware Representation Network for Video Captioning [84.67432867555044]
We propose a simple yet effective method, called Visual Commonsense-aware Representation Network (VCRN) for video captioning.
Our method reaches state-of-the-art performance, demonstrating its effectiveness.
arXiv Detail & Related papers (2022-11-17T11:27:15Z) - Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z) - Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z) - Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which can take into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z) - CLIP Meets Video Captioners: Attribute-Aware Representation Learning
Promotes Accurate Captioning [34.46948978082648]
ImageNet Pre-training (INP) is usually used to help encode the video content, and a task-oriented network is fine-tuned from scratch to cope with caption generation.
This paper investigates the potential deficiencies of INP for video captioning and explores the key to generating accurate descriptions.
We introduce Dual Attribute Prediction, an auxiliary task requiring a video caption model to learn the correspondence between video content and attributes.
arXiv Detail & Related papers (2021-11-30T06:37:44Z)
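As referenced in the prompt vision-language fusion entry above, tuning well under 1% of the parameters typically means freezing the pre-trained backbone and training only a few prompt tokens and a lightweight head. The PyTorch sketch below illustrates that general idea; the class, parameter names, and sizes are hypothetical and not taken from any of the papers listed.

```python
import torch
import torch.nn as nn

class PromptTunedClassifier(nn.Module):
    """Freeze a pre-trained token encoder; train only prompt tokens and a head."""

    def __init__(self, encoder: nn.TransformerEncoder, embed_dim: int,
                 num_attributes: int, prompt_len: int = 8):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():            # pre-trained weights stay frozen
            p.requires_grad = False
        self.prompt = nn.Parameter(0.02 * torch.randn(prompt_len, embed_dim))
        self.head = nn.Linear(embed_dim, num_attributes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, embed_dim) features from an upstream embedder
        prompts = self.prompt.unsqueeze(0).expand(tokens.size(0), -1, -1)
        out = self.encoder(torch.cat([prompts, tokens], dim=1))
        return self.head(out.mean(dim=1))              # (batch, num_attributes)

    def trainable_fraction(self) -> float:
        trainable = sum(p.numel() for p in self.parameters() if p.requires_grad)
        total = sum(p.numel() for p in self.parameters())
        return trainable / total

# Illustrative usage: a 12-layer encoder vs. 8 prompt tokens + a linear head,
# which leaves well under 1% of the parameters trainable.
layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=12)
model = PromptTunedClassifier(encoder, embed_dim=512, num_attributes=26)
print(f"{model.trainable_fraction():.2%} of parameters are trainable")
```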