Learning video embedding space with Natural Language Supervision
- URL: http://arxiv.org/abs/2303.14584v2
- Date: Sat, 8 Apr 2023 02:44:20 GMT
- Title: Learning video embedding space with Natural Language Supervision
- Authors: Phani Krishna Uppala, Abhishek Bamotra, Shriti Priya, Vaidehi Joshi
- Abstract summary: We propose a novel approach to map the video embedding space to natural language.
We propose a two-stage approach that first extracts visual features from each frame of a video using a pre-trained CNN, and then uses the CLIP model to encode the visual features for the video domain.
- Score: 1.6822770693792823
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recent success of the CLIP model has shown its potential to be applied to
a wide range of vision and language tasks. However, CLIP only establishes an
embedding-space relationship between language and images, not the video domain. In
this paper, we propose a novel approach to map the video embedding space to natural
language. We propose a two-stage approach that first extracts visual features
from each frame of a video using a pre-trained CNN, and then uses the CLIP
model to encode those visual features for the video domain, along with the
corresponding text descriptions. We evaluate our method on two benchmark
datasets, UCF101 and HMDB51, and achieve state-of-the-art performance on both.
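Below is a minimal sketch, not the authors' code, of the two-stage idea described in the abstract: a frozen pre-trained CNN extracts per-frame features, which are pooled and projected into the CLIP text-embedding space and aligned with caption embeddings via a CLIP-style contrastive loss. The backbone choice, temporal mean pooling, and the linear projection head are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"

# Stage 1: frozen pre-trained CNN as a per-frame feature extractor (assumed: ResNet-50).
cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = nn.Identity()            # keep the 2048-d pooled features
cnn.eval().to(device)

# Frozen CLIP text encoder for the language side (assumed checkpoint: ViT-B/32).
clip_model, _ = clip.load("ViT-B/32", device=device)
text_dim = clip_model.text_projection.shape[1]   # 512 for ViT-B/32

# Stage 2: trainable projection from the CNN feature space into the CLIP embedding space.
video_proj = nn.Linear(2048, text_dim).to(device)

def encode_video(frames: torch.Tensor) -> torch.Tensor:
    """frames: (B, T, 3, 224, 224) normalized clips -> (B, text_dim) video embeddings."""
    b, t = frames.shape[:2]
    with torch.no_grad():
        feats = cnn(frames.flatten(0, 1))          # (B*T, 2048) per-frame features
    feats = feats.view(b, t, -1).mean(dim=1)       # temporal mean pooling (assumption)
    return F.normalize(video_proj(feats), dim=-1)

def encode_text(captions: list) -> torch.Tensor:
    tokens = clip.tokenize(captions).to(device)
    with torch.no_grad():
        txt = clip_model.encode_text(tokens).float()
    return F.normalize(txt, dim=-1)

def clip_style_loss(video_emb: torch.Tensor, text_emb: torch.Tensor, temperature: float = 0.07):
    # Symmetric InfoNCE over the video-text similarity matrix, as in CLIP.
    logits = video_emb @ text_emb.t() / temperature
    targets = torch.arange(len(video_emb), device=device)
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2
```

Only `video_proj` is trained in this sketch; freezing both encoders keeps the setup close to re-using CLIP's language-image space for video, though the paper does not specify which components are fine-tuned.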
Related papers
- VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation [43.90887811621963]
We propose a new two-stage pre-training framework for video-to-text generation tasks such as video captioning and question answering.
A generative encoder-decoder model is first jointly pre-trained on massive image-language data to learn fundamental concepts.
As a result, our VideoOFA model achieves new state-of-the-art performance on four video captioning benchmarks.
arXiv Detail & Related papers (2023-05-04T23:27:21Z) - Bidirectional Cross-Modal Knowledge Exploration for Video Recognition
with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z) - OmniVL: One Foundation Model for Image-Language and Video-Language Tasks [117.57580168859512]
We present OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
We demonstrate, for the first time, such a paradigm benefits both image and video tasks, as opposed to the conventional one-directional transfer.
We introduce a novel unified vision-language contrastive (UniVLC) loss to leverage image-text, video-text, image-label (e.g., image classification), video-label (e.g., video action recognition) data together.
arXiv Detail & Related papers (2022-09-15T17:59:59Z) - CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language
Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism built on CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z) - Expanding Language-Image Pretrained Models for General Video Recognition [136.0948049010682]
Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data.
We present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly.
Our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols.
arXiv Detail & Related papers (2022-08-04T17:59:54Z) - MILES: Visual BERT Pre-training with Injected Language Semantics for
Video-text Retrieval [43.2299969152561]
Our method outperforms state-of-the-art methods for text-to-video retrieval on four datasets under both zero-shot and fine-tuning evaluation protocols.
arXiv Detail & Related papers (2022-04-26T16:06:31Z) - Video-Text Pre-training with Learned Regions [59.30893505895156]
Video-Text pre-training aims at learning transferable representations from large-scale video-text pairs.
We propose a module for video-text learning, RegionLearner, which takes into account the structure of objects during pre-training on large-scale video-text pairs.
arXiv Detail & Related papers (2021-12-02T13:06:53Z) - CLIP4Caption: CLIP for Video Caption [9.470254059503862]
We propose a CLIP4Caption framework that improves video captioning based on a CLIP-enhanced video-text matching network (VTM).
This framework takes full advantage of the information from both vision and language and enforces the model to learn strongly text-correlated video features for text generation.
arXiv Detail & Related papers (2021-10-13T10:17:06Z) - CLIP2Video: Mastering Video-Text Retrieval via Image CLIP [13.270902407320005]
We present CLIP2Video network to transfer the image-language training model to video-text retrieval in an end-to-end manner.
We conduct thorough ablation studies, and achieve state-of-the-art performance on text-to-video and video-to-text retrieval benchmarks.
arXiv Detail & Related papers (2021-06-21T13:30:33Z) - Learning Video Representations from Textual Web Supervision [97.78883761035557]
We propose to use text as supervision for learning video representations.
We collect 70M video clips shared publicly on the Internet and train a model to pair each video with its associated text.
We find that this approach is an effective method of pre-training video representations.
arXiv Detail & Related papers (2020-07-29T16:19:50Z)