Learning Spatiotemporal Features via Video and Text Pair Discrimination
- URL: http://arxiv.org/abs/2001.05691v3
- Date: Thu, 28 Jan 2021 01:43:34 GMT
- Title: Learning Spatiotemporal Features via Video and Text Pair Discrimination
- Authors: Tianhao Li, Limin Wang
- Abstract summary: The cross-modal pair discrimination (CPD) framework captures the correlation between a video and its associated text.
We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate its effectiveness.
- Score: 30.64670449131973
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current video representations heavily rely on learning from manually
annotated video datasets which are time-consuming and expensive to acquire. We
observe videos are naturally accompanied by abundant text information such as
YouTube titles and Instagram captions. In this paper, we leverage this
visual-textual connection to learn spatiotemporal features in an efficient
weakly-supervised manner. We present a general cross-modal pair discrimination
(CPD) framework to capture this correlation between a video and its associated
text. Specifically, we adopt noise-contrastive estimation to tackle the
computational issue imposed by the huge amount of pair instance classes and
design a practical curriculum learning strategy. We train our CPD models on
both a standard video dataset (Kinetics-210k) and an uncurated web video
dataset (Instagram-300k) to demonstrate its effectiveness. Without further fine-tuning,
the learnt models obtain competitive results for action classification on
Kinetics under the linear classification protocol. Moreover, our visual model
provides an effective initialization to fine-tune on downstream tasks, which
yields a remarkable performance gain for action recognition on UCF101 and
HMDB51, compared with the existing state-of-the-art self-supervised training
methods. In addition, our CPD model yields a new state of the art for zero-shot
action recognition on UCF101 by directly utilizing the learnt visual-textual
embeddings. The code will be made available at
https://github.com/MCG-NJU/CPD-Video.
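As a rough illustration of the pair-discrimination idea described above, the sketch below contrasts paired video and text embeddings and scores clips against class-name embeddings for zero-shot recognition. This is not the authors' released code: the encoder interfaces, the temperature value, and the use of an in-batch softmax contrastive loss (a common stand-in for noise-contrastive estimation over pair-instance classes) are all assumptions.

    # Minimal sketch of cross-modal video-text pair discrimination (PyTorch).
    # Illustrative only; see the hedges in the paragraph above.
    import torch
    import torch.nn.functional as F

    def pair_discrimination_loss(video_emb, text_emb, temperature=0.07):
        """video_emb, text_emb: (B, D) embeddings of paired clips and their text."""
        v = F.normalize(video_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        logits = v @ t.t() / temperature                   # (B, B) cosine similarities
        targets = torch.arange(v.size(0), device=v.device)
        # Matched (video_i, text_i) pairs are positives; all other pairs act as noise.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    def zero_shot_scores(video_emb, class_text_emb):
        """Score clips (B, D) against text embeddings of class names (C, D)."""
        v = F.normalize(video_emb, dim=-1)
        c = F.normalize(class_text_emb, dim=-1)
        return v @ c.t()                                   # (B, C); argmax gives the predicted class

Under this reading, zero-shot recognition on UCF101 amounts to encoding each class name with the text branch and picking the class whose embedding is closest to the clip embedding.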
Related papers
- Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z)
- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]
We introduce an object-aware decoder for improving the performance of ego-centric representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-aware video model to improve performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner (see the sketch after this list).
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize the well-pretrained language model to generate good semantic targets for efficient transfer learning (see the sketch after this list).
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
- Cross-Architecture Self-supervised Video Representation Learning [42.267775859095664]
We present a new cross-architecture contrastive learning framework for self-supervised video representation learning.
We introduce a temporal self-supervised learning module that explicitly predicts the edit distance between two video sequences (see the sketch after this list).
We evaluate our method on the tasks of video retrieval and action recognition on UCF101 and HMDB51 datasets.
arXiv Detail & Related papers (2022-05-26T12:41:19Z)
- CoCon: Cooperative-Contrastive Learning [52.342936645996765]
Self-supervised visual representation learning is key for efficient video analysis.
Recent success in learning image representations suggests contrastive learning is a promising framework to tackle this challenge.
We introduce a cooperative variant of contrastive learning to utilize complementary information across views.
arXiv Detail & Related papers (2021-04-30T05:46:02Z)
- Temporally Coherent Embeddings for Self-Supervised Video Representation Learning [2.216657815393579]
This paper presents TCE: Temporally Coherent Embeddings for self-supervised video representation learning.
The proposed method exploits inherent structure of unlabeled video data to explicitly enforce temporal coherency in the embedding space.
With a simple but effective 2D-CNN backbone and only RGB stream inputs, TCE pre-trained representations outperform all previous self-supervised 2D-CNN and 3D-CNN pre-trained models on UCF101.
arXiv Detail & Related papers (2020-03-21T12:25:50Z)
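For the BIKE entry above, the following is a minimal sketch of one plausible reading of parameter-free temporal saliency: frames are weighted by their similarity to the category's text embedding, with no learned parameters. The tensor shapes, temperature, and pooling choice are assumptions, not the paper's implementation.

    # Hedged sketch of text-guided, parameter-free temporal saliency (PyTorch).
    import torch.nn.functional as F

    def text_guided_temporal_pool(frame_emb, text_emb, tau=0.01):
        """frame_emb: (T, D) per-frame features; text_emb: (D,) category text embedding."""
        f = F.normalize(frame_emb, dim=-1)
        t = F.normalize(text_emb, dim=-1)
        saliency = F.softmax(f @ t / tau, dim=0)                 # (T,) weights, nothing learned
        return (saliency.unsqueeze(-1) * frame_emb).sum(dim=0)   # saliency-weighted video feature (D,)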
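For the "Revisiting Classifier" entry, a common way to turn a pretrained language model into semantic targets is to use class-name embeddings as the classifier weights. The sketch below assumes a generic text_encoder callable and a hypothetical prompt template; it illustrates the general idea, not the paper's code.

    # Hedged sketch: class-name text embeddings as classifier weights (PyTorch).
    import torch
    import torch.nn.functional as F

    @torch.no_grad()
    def build_semantic_classifier(text_encoder, class_names):
        """text_encoder: assumed callable mapping a list of strings to a (C, D) tensor."""
        prompts = [f"a video of a person {name}" for name in class_names]  # hypothetical template
        return F.normalize(text_encoder(prompts), dim=-1)                  # (C, D) classifier weights

    def classify(video_emb, class_weights):
        """Return (B, C) similarity logits between video embeddings and the class weights."""
        return F.normalize(video_emb, dim=-1) @ class_weights.t()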
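For the cross-architecture entry, the edit distance being predicted can be read as the Levenshtein distance between the frame-index sequences of two differently sampled clips from the same video; the sketch below only computes that target, and the sampling scheme in the example is an assumption.

    # Hedged sketch of an edit-distance self-supervision target.
    def edit_distance(a, b):
        """Classic Levenshtein distance between two sequences of frame indices."""
        dp = list(range(len(b) + 1))             # dp[j] = distance between a[:0] and b[:j]
        for i, x in enumerate(a, 1):
            prev, dp[0] = dp[0], i               # prev tracks the previous row's dp[j-1]
            for j, y in enumerate(b, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1,          # delete x
                                         dp[j - 1] + 1,      # insert y
                                         prev + (x != y))    # match or substitute
        return dp[-1]

    # Example: two clips sampled from the same video at different strides.
    # edit_distance([0, 4, 8, 12], [0, 8, 12, 16]) == 2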