Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring
- URL: http://arxiv.org/abs/2301.11116v1
- Date: Thu, 26 Jan 2023 14:12:02 GMT
- Title: Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring
- Authors: Ruyang Liu, Jingjia Huang, Ge Li, Jiashi Feng, Xinglong Wu, and Thomas H. Li
- Abstract summary: Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs.
We revisit temporal modeling in the context of image-to-video knowledge transferring.
We present a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks.
- Score: 82.84513669453744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text pretrained models, e.g., CLIP, have shown impressive general
multi-modal knowledge learned from large-scale image-text data pairs, thus
attracting increasing attention for their potential to improve visual
representation learning in the video domain. In this paper, based on the CLIP
model, we revisit temporal modeling in the context of image-to-video knowledge
transferring, which is the key point for extending image-text pretrained models
to the video domain. We find that current temporal modeling mechanisms are
tailored to either high-level semantic-dominant tasks (e.g., retrieval) or
low-level visual pattern-dominant tasks (e.g., recognition), and fail to work
on the two cases simultaneously. The key difficulty lies in modeling temporal
dependency while taking advantage of both high-level and low-level knowledge in
CLIP model. To tackle this problem, we present Spatial-Temporal Auxiliary
Network (STAN) -- a simple and effective temporal modeling mechanism extending
CLIP model to diverse video tasks. Specifically, to realize both low-level and
high-level knowledge transferring, STAN adopts a branch structure with
decomposed spatial-temporal modules that enable multi-level CLIP features to be
spatial-temporally contextualized. We evaluate our method on two representative
video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments
demonstrate the superiority of our model over the state-of-the-art methods on
various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and
Something-Something-V2. Codes will be available at
https://github.com/farewellthree/STAN
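
The abstract describes "decomposed spatial-temporal modules" without giving details. The sketch below is not STAN itself; it only illustrates the general idea of decomposing space-time mixing into a spatial step (patches within one frame attend to each other) followed by a temporal step (the same patch position attends across frames). Everything in it is an assumption made for illustration: the function names `self_attention` and `spatial_then_temporal` are hypothetical, the attention has no learned projections, and the branch structure and multi-level CLIP features mentioned in the abstract are omitted.

```python
import math
from typing import List

Vector = List[float]


def self_attention(tokens: List[Vector]) -> List[Vector]:
    """Parameter-free scaled dot-product self-attention.

    Assumption: queries, keys and values are the tokens themselves
    (identity projections); a real model would learn these projections.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    mixed = []
    for query in tokens:
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in tokens]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        mixed.append(
            [sum(w * value[d] for w, value in zip(weights, tokens)) for d in range(dim)]
        )
    return mixed


def spatial_then_temporal(video: List[List[Vector]]) -> List[List[Vector]]:
    """Apply spatial attention per frame, then temporal attention per patch.

    video[t][p] is the feature vector of patch p in frame t.
    """
    # Spatial step: tokens of the same frame attend to each other.
    spatial = [self_attention(frame) for frame in video]
    num_frames = len(spatial)
    num_patches = len(spatial[0])

    # Temporal step: each patch position attends across all frames.
    output = [[[] for _ in range(num_patches)] for _ in range(num_frames)]
    for p in range(num_patches):
        track = [spatial[t][p] for t in range(num_frames)]
        for t, vector in enumerate(self_attention(track)):
            output[t][p] = vector
    return output


# Toy input: 2 frames, 2 patches per frame, 2-dimensional features.
video = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 1.0], [0.0, 0.0]],
]
out = spatial_then_temporal(video)
```

The output keeps the input layout (frames, patches, feature dimension). In this sketch, each attention call compares only the patches of one frame or the frames of one patch position, rather than all frame-patch pairs at once; that is one common motivation for decomposed designs, though the abstract does not state the authors' reasons.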
 
      
Related papers
- Multi-Head Attention Driven Dynamic Visual-Semantic Embedding for Enhanced Image-Text Matching [0.8611782340880084]
This study proposes an innovative visual semantic embedding model, Multi-Headed Consensus-Aware Visual-Semantic Embedding (MH-CVSE).
This model introduces a multi-head self-attention mechanism based on the consensus-aware visual semantic embedding model (CVSE) to capture information in multiple subspaces in parallel.
For the loss function, the MH-CVSE model adopts a strategy that dynamically adjusts the weight according to the loss value itself.
arXiv (2024-12-26T11:46:22Z)
- TC-LLaVA: Rethinking the Transfer from Image to Video Understanding with Temporal Considerations [23.188508465235717]
We propose two strategies to enhance the model's capability in video understanding tasks.
The first approach focuses on the enhancement of Rotary Position Embedding (RoPE) with Temporal-Aware Dual RoPE.
The second approach involves enhancing the Attention Mask with the Frame-wise Block Causal Attention Mask.
arXiv (2024-09-05T02:54:17Z)
- Flatten: Video Action Recognition is an Image Classification task [15.518011818978074]
A novel video representation architecture, Flatten, serves as a plug-and-play module that can be seamlessly integrated into any image-understanding network.
Experiments on commonly used datasets have demonstrated that embedding Flatten provides significant performance improvements over the original model.
arXiv (2024-08-17T14:59:58Z)
- Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding [47.97650346560239]
We propose Spatial-Temporal Auxiliary Network with Mutual-guided alignment module (Mug-STAN) to extend image-text models to diverse video tasks and video-text data.
Mug-STAN significantly improves adaptation of language-image pretrained models such as CLIP and CoCa at both video-text post-pretraining and finetuning stages.
arXiv (2023-11-25T17:01:38Z)
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv (2023-03-28T22:45:07Z)
- Fine-tuned CLIP Models are Efficient Video Learners [54.96069171726668]
Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model.
The Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
arXiv (2022-12-06T18:59:58Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) that extracts action visual tempo remarkably well from low-level backbone features at a single layer.
arXiv (2022-02-24T14:20:04Z)
- TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv (2021-12-07T09:27:56Z)
- Enhancing Self-supervised Video Representation Learning via Multi-level Feature Optimization [30.670109727802494]
This paper proposes a multi-level feature optimization framework to improve the generalization and temporal modeling ability of learned video representations.
Experiments demonstrate that multi-level feature optimization with the graph constraint and temporal modeling can greatly improve the representation ability in video understanding.
arXiv (2021-08-04T17:16:18Z)