Orthogonal Temporal Interpolation for Zero-Shot Video Recognition
- URL: http://arxiv.org/abs/2308.06897v1
- Date: Mon, 14 Aug 2023 02:26:49 GMT
- Title: Orthogonal Temporal Interpolation for Zero-Shot Video Recognition
- Authors: Yan Zhu, Junbao Zhuo, Bin Ma, Jiajia Geng, Xiaoming Wei, Xiaolin Wei,
Shuhui Wang
- Abstract summary: Zero-shot video recognition (ZSVR) is a task that aims to recognize video categories that have not been seen during the model training process.
Recent vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability for ZSVR.
- Score: 45.53856045374685
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot video recognition (ZSVR) is a task that aims to recognize video
categories that have not been seen during the model training process. Recently,
vision-language models (VLMs) pre-trained on large-scale image-text pairs have
demonstrated impressive transferability for ZSVR. To make VLMs applicable to
the video domain, existing methods often use an additional temporal learning
module after the image-level encoder to learn the temporal relationships among
video frames. Unfortunately, for videos from unseen categories, we observe an
abnormal phenomenon: the model that uses the spatial-temporal feature performs
much worse than the model that removes the temporal learning module and uses
only the spatial feature. We conjecture that improper temporal modeling on
video disrupts the spatial feature of the video. To verify our hypothesis, we
propose Feature Factorization to retain the orthogonal temporal feature of the
video and use interpolation to construct a refined spatial-temporal feature.
The model using an appropriately refined spatial-temporal feature performs
better than the one using only the spatial feature, which verifies the
effectiveness of the orthogonal temporal feature for the ZSVR task. Therefore,
an Orthogonal
Temporal Interpolation module is designed to learn a better refined
spatial-temporal video feature during training. Additionally, a Matching Loss
is introduced to improve the quality of the orthogonal temporal feature. We
propose a model called OTI for ZSVR by employing orthogonal temporal
interpolation and the matching loss based on VLMs. The ZSVR accuracies on
popular video datasets (i.e., Kinetics-600, UCF101 and HMDB51) show that OTI
outperforms the previous state-of-the-art method by a clear margin.
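
The abstract describes Feature Factorization and the interpolation only at a high level. The following is a minimal sketch, assuming the temporal feature is factorized by vector projection onto the spatial feature and the orthogonal remainder is blended back with a scalar weight alpha; the function name, the projection form and the fixed alpha are illustrative assumptions, not the paper's exact OTI module or Matching Loss.

```python
import torch
import torch.nn.functional as F

def orthogonal_temporal_interpolation(spatial_feat: torch.Tensor,
                                      temporal_feat: torch.Tensor,
                                      alpha: float = 0.5) -> torch.Tensor:
    """Illustrative sketch of Feature Factorization + interpolation.

    spatial_feat:  (B, D) video feature from the image-level encoder
                   (frames encoded and pooled, no temporal module).
    temporal_feat: (B, D) output of the temporal learning module.
    alpha:         assumed fixed blending weight (for illustration only).
    """
    # Project the temporal feature onto the spatial feature direction.
    denom = (spatial_feat * spatial_feat).sum(dim=-1, keepdim=True).clamp(min=1e-6)
    proj = (temporal_feat * spatial_feat).sum(dim=-1, keepdim=True) / denom * spatial_feat
    # Keep only the temporal component orthogonal to the spatial feature,
    # so it cannot disrupt the spatial information.
    temporal_orth = temporal_feat - proj
    # Interpolate to obtain a refined spatial-temporal feature, which would
    # then be matched against the VLM's text (category) embeddings.
    refined = spatial_feat + alpha * temporal_orth
    return F.normalize(refined, dim=-1)
```

With alpha = 0 this collapses to the spatial-only model the abstract compares against, so the weight controls how much orthogonal temporal information is injected; the paper instead learns the refined feature through its Orthogonal Temporal Interpolation module and Matching Loss during training.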
Related papers
- When Spatial meets Temporal in Action Recognition [34.53091498930863]
We introduce the Temporal Integration and Motion Enhancement (TIME) layer, a novel preprocessing technique designed to incorporate temporal information.
The TIME layer generates new video frames by rearranging the original sequence, preserving temporal order while embedding $N^2$ temporally evolving frames into a single spatial grid (a minimal sketch of this rearrangement appears after this list).
Our experiments show that the TIME layer enhances recognition accuracy, offering valuable insights for video processing tasks.
arXiv Detail & Related papers (2024-11-22T16:39:45Z)
- Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning [59.26623999209235]
We present DiST, which disentangles the learning of spatial and temporal aspects of videos.
The disentangled learning in DiST is highly efficient because it avoids the back-propagation of massive pre-trained parameters.
Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing gaps.
arXiv Detail & Related papers (2023-09-14T17:58:33Z)
- Leaping Into Memories: Space-Time Deep Feature Synthesis [93.10032043225362]
We propose LEAPS, an architecture-independent method for synthesizing videos from internal models.
We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of convolutional and attention-based architectures on Kinetics-400.
arXiv Detail & Related papers (2023-03-17T12:55:22Z)
- Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling [28.530765643908083]
We decouple spatial-temporal modeling and integrate an image- and a video-language model to learn fine-grained visual understanding.
We propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences.
Our model outperforms previous work pre-trained on orders of magnitude larger datasets.
arXiv Detail & Related papers (2022-10-08T07:03:31Z)
- TCGL: Temporal Contrastive Graph for Self-supervised Video Representation Learning [79.77010271213695]
We propose a novel video self-supervised learning framework named Temporal Contrastive Graph Learning (TCGL).
Our TCGL integrates the prior knowledge about the frame and snippet orders into graph structures, i.e., the intra-/inter-snippet Temporal Contrastive Graphs (TCG).
To generate supervisory signals for unlabeled videos, we introduce an Adaptive Snippet Order Prediction (ASOP) module.
arXiv Detail & Related papers (2021-12-07T09:27:56Z)
- Video Salient Object Detection via Contrastive Features and Attention Modules [106.33219760012048]
We propose a network with attention modules to learn contrastive features for video salient object detection.
A co-attention formulation is utilized to combine the low-level and high-level features.
We show that the proposed method requires less computation, and performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2021-11-03T17:40:32Z)
- StyleVideoGAN: A Temporal Generative Model using a Pretrained StyleGAN [70.31913835035206]
We present a novel approach to the video synthesis problem that helps to greatly improve visual quality.
We make use of a pre-trained StyleGAN network, the latent space of which allows control over the appearance of the objects it was trained for.
Our temporal architecture is then trained not on sequences of RGB frames, but on sequences of StyleGAN latent codes.
arXiv Detail & Related papers (2021-07-15T09:58:15Z)
- Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition [42.175450800733785]
We propose a rich motion representation based on spatio-temporal self-similarity (STSS).
We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it.
The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision (a minimal sketch of an STSS volume appears after this list).
arXiv Detail & Related papers (2021-02-14T07:32:55Z)
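
For the STSS representation referenced in the "Learning Self-Similarity in Space and Time as Generalized Motion for Action Recognition" entry above, the following is a hedged sketch of a spatio-temporal self-similarity volume: cosine similarity between each space-time position of a per-frame feature map and its neighbors within a small temporal offset and spatial window. The window size, temporal offsets, boundary handling and feature source are illustrative assumptions, not the SELFY paper's exact design.

```python
import torch
import torch.nn.functional as F

def stss_volume(feats: torch.Tensor, max_dt: int = 1, radius: int = 2) -> torch.Tensor:
    """Illustrative STSS: cosine similarity between each space-time position
    and its neighbors at temporal offsets |dt| <= max_dt within a
    (2*radius+1)^2 spatial window.

    feats:   (T, C, H, W) per-frame feature maps.
    returns: (T, 2*max_dt+1, (2*radius+1)**2, H, W) similarity volume.
    """
    T, C, H, W = feats.shape
    k = 2 * radius + 1
    feats = F.normalize(feats, dim=1)            # unit-norm channel vectors
    sims = []
    for dt in range(-max_dt, max_dt + 1):
        # Shift frames in time, clamping at the clip boundaries (assumption).
        idx = torch.arange(T).add(dt).clamp(0, T - 1)
        neigh = feats[idx]                                      # (T, C, H, W)
        # Gather the k*k spatial neighborhood around every position.
        neigh = F.unfold(neigh, kernel_size=k, padding=radius)  # (T, C*k*k, H*W)
        neigh = neigh.view(T, C, k * k, H, W)
        # Cosine similarity = dot product of unit-norm features.
        sims.append((feats.unsqueeze(2) * neigh).sum(dim=1))    # (T, k*k, H, W)
    return torch.stack(sims, dim=1)              # (T, 2*max_dt+1, k*k, H, W)
```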
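
The grid rearrangement described for the TIME layer in the "When Spatial meets Temporal in Action Recognition" entry above can be sketched as tiling $N^2$ consecutive frames into a single N x N image in row-major temporal order; the tiling order and the absence of any resizing or frame selection are assumptions made purely for illustration.

```python
import torch

def time_grid(frames: torch.Tensor, n: int) -> torch.Tensor:
    """Tile n*n consecutive frames into one n x n spatial mosaic.

    frames:  (n*n, C, H, W) frames in temporal order.
    returns: (C, n*H, n*W) grid image, frames laid out row by row.
    """
    t, c, h, w = frames.shape
    assert t == n * n, "expected exactly n*n frames"
    # (n, n, C, H, W) -> (C, n_rows, H, n_cols, W) -> (C, n*H, n*W)
    return frames.reshape(n, n, c, h, w).permute(2, 0, 3, 1, 4).reshape(c, n * h, n * w)
```

For example, time_grid(clip[:9], n=3) turns nine frames into one 3x3 mosaic that a 2D image encoder can consume directly.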
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.