Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge
Transferring
- URL: http://arxiv.org/abs/2301.11116v1
- Date: Thu, 26 Jan 2023 14:12:02 GMT
- Title: Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge
Transferring
- Authors: Ruyang Liu and Jingjia Huang and Ge Li and Jiashi Feng and Xinglong Wu
and Thomas H. Li
- Abstract summary: Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
- Score: 82.84513669453744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text pretrained models, e.g., CLIP, have shown impressive general
multi-modal knowledge learned from large-scale image-text data pairs, thus
attracting increasing attention for their potential to improve visual
representation learning in the video domain. In this paper, based on the CLIP
model, we revisit temporal modeling in the context of image-to-video knowledge
transferring, which is the key point for extending image-text pretrained models
to the video domain. We find that current temporal modeling mechanisms are
tailored to either high-level semantic-dominant tasks (e.g., retrieval) or
low-level visual pattern-dominant tasks (e.g., recognition), and fail to work
on the two cases simultaneously. The key difficulty lies in modeling temporal
dependency while taking advantage of both high-level and low-level knowledge in
the CLIP model. To tackle this problem, we present Spatial-Temporal Auxiliary
Network (STAN) -- a simple and effective temporal modeling mechanism extending
the CLIP model to diverse video tasks. Specifically, to realize both low-level and
high-level knowledge transferring, STAN adopts a branch structure with
decomposed spatial-temporal modules that enable multi-level CLIP features to be
spatial-temporally contextualized. We evaluate our method on two representative
video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments
demonstrate the superiority of our model over the state-of-the-art methods on
various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and
Something-Something-V2. Codes will be available at
https://github.com/farewellthree/STAN
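
As a concrete illustration of the branch design described above, the following is a minimal PyTorch sketch of an auxiliary branch built from decomposed spatial-temporal blocks operating on multi-level CLIP features. The number of tapped layers, the feature width, the residual fusion, and the final pooling are illustrative assumptions rather than the authors' released implementation; the official code is at the repository linked above.

```python
# Minimal sketch of a STAN-style auxiliary branch (illustrative; not the official code).
# Assumption: patch features are tapped from a few intermediate CLIP layers, each with
# shape [B, T, N, D] = (batch, frames, patches, channels); layer count and widths are made up.
import torch
import torch.nn as nn


class DecomposedSTBlock(nn.Module):
    """Spatial attention within each frame, then temporal attention across frames."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_s = nn.LayerNorm(dim)
        self.norm_t = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, N, D = x.shape
        # Spatial step: patches of the same frame attend to each other.
        s = x.reshape(B * T, N, D)
        q = self.norm_s(s)
        s = s + self.spatial_attn(q, q, q)[0]
        # Temporal step: the same patch position attends across frames.
        t = s.reshape(B, T, N, D).permute(0, 2, 1, 3).reshape(B * N, T, D)
        q = self.norm_t(t)
        t = t + self.temporal_attn(q, q, q)[0]
        return t.reshape(B, N, T, D).permute(0, 2, 1, 3)  # back to [B, T, N, D]


class AuxiliaryBranch(nn.Module):
    """Fuses multi-level CLIP features and contextualizes them spatio-temporally."""

    def __init__(self, dim: int = 768, num_levels: int = 3):
        super().__init__()
        self.blocks = nn.ModuleList(DecomposedSTBlock(dim) for _ in range(num_levels))

    def forward(self, clip_features: list[torch.Tensor]) -> torch.Tensor:
        # clip_features: one [B, T, N, D] tensor per tapped CLIP layer (shallow -> deep).
        x = torch.zeros_like(clip_features[0])
        for feat, block in zip(clip_features, self.blocks):
            x = block(x + feat)  # add the next CLIP level, then contextualize it
        return x.mean(dim=(1, 2))  # simple pooled video embedding


if __name__ == "__main__":
    feats = [torch.randn(2, 8, 50, 768) for _ in range(3)]  # 3 levels, 8 frames, 49 patches + CLS
    print(AuxiliaryBranch()(feats).shape)  # torch.Size([2, 768])
```

Decomposing attention this way keeps each block at roughly O(T*N^2) + O(N*T^2) cost rather than O((T*N)^2) for joint space-time attention, which keeps the auxiliary branch lightweight relative to the CLIP backbone.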
Related papers
- TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations [23.188508465235717]
We propose two strategies to enhance the model's capability in video understanding tasks.
The first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE.
The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask; a minimal sketch of this style of mask appears after this list.
arXiv Detail & Related papers (2024-09-05T02:54:17Z)
- Flatten: Video Action Recognition is an Image Classification task [15.518011818978074]
A novel video representation architecture, Flatten, serves as a plug-and-play module that can be seamlessly integrated into any image-understanding network.
Experiments on commonly used datasets demonstrate that embedding Flatten provides significant performance improvements over the original models.
arXiv Detail & Related papers (2024-08-17T14:59:58Z)
- Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding [47.97650346560239]
We propose the Spatial-Temporal Auxiliary Network with Mutual-guided alignment module (Mug-STAN) to extend image-text models to diverse video tasks and video-text data.
Mug-STAN significantly improves adaptation of language-image pretrained models such as CLIP and CoCa at both video-text post-pretraining and finetuning stages.
arXiv Detail & Related papers (2023-11-25T17:01:38Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) that extracts action visual tempo from low-level backbone features within a single layer.
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates prior knowledge about frame and snippet order into graph structures, i.e., intra-/inter-snippet Temporal Contrastive Graphs (TCGs).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z)
- Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization [30.670109727802494]
This paper proposes a multi-level feature optimization framework to improve the generalization and temporal modeling ability of learned video representations.
Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can greatly improve the representation ability in video understanding.
arXiv Detail & Related papers (2021-08-04T17:16:18Z)
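
The frame-wise block causal attention mask named in the TC-LLaVA entry above can be illustrated with a small sketch: tokens attend bidirectionally within their own frame and causally to tokens of earlier frames. The construction below captures that general idea and is an assumption for illustration, not the exact mask layout used in that paper.

```python
# Hedged sketch of a frame-wise block causal attention mask (illustrative construction):
# tokens of the same frame attend to each other; across frames, attention is causal.
import torch


def frame_block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean mask of shape [L, L] with L = num_frames * tokens_per_frame,
    where True marks query-key pairs that are allowed to attend."""
    frame_ids = torch.arange(num_frames).repeat_interleave(tokens_per_frame)  # [L]
    # Query token i may attend to key token j iff j's frame is not later than i's frame.
    return frame_ids.unsqueeze(1) >= frame_ids.unsqueeze(0)


if __name__ == "__main__":
    print(frame_block_causal_mask(num_frames=3, tokens_per_frame=2).int())
    # tensor([[1, 1, 0, 0, 0, 0],
    #         [1, 1, 0, 0, 0, 0],
    #         [1, 1, 1, 1, 0, 0],
    #         [1, 1, 1, 1, 0, 0],
    #         [1, 1, 1, 1, 1, 1],
    #         [1, 1, 1, 1, 1, 1]])
```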