Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision
and Language Models
- URL: http://arxiv.org/abs/2207.07646v1
- Date: Fri, 15 Jul 2022 17:59:11 GMT
- Title: Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision
and Language Models
- Authors: Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin
Cui
- Abstract summary: We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification.
In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram.
- Score: 67.31684040281465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Utilizing vision and language models (VLMs) pre-trained on large-scale
image-text pairs is becoming a promising paradigm for open-vocabulary visual
recognition. In this work, we extend this paradigm by leveraging motion and
audio that naturally exist in video. We present \textbf{MOV}, a simple yet
effective method for \textbf{M}ultimodal \textbf{O}pen-\textbf{V}ocabulary
video classification. In MOV, we directly use the vision encoder from
pre-trained VLMs with minimal modifications to encode video, optical flow and
audio spectrogram. We design a cross-modal fusion mechanism to aggregate
complementary multimodal information. Experiments on Kinetics-700 and VGGSound
show that introducing flow or audio modality brings large performance gains
over the pre-trained VLM and existing methods. Specifically, MOV greatly
improves the accuracy on base classes, while generalizing better on novel
classes. MOV achieves state-of-the-art results on UCF and HMDB zero-shot video
classification benchmarks, significantly outperforming both traditional
zero-shot methods and recent methods based on VLMs. Code and models will be
released.
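To make the recipe above concrete, the following is a minimal, hypothetical sketch (not the authors' released implementation) of open-vocabulary video classification with a frozen CLIP vision encoder: RGB frames, optical flow rendered as RGB images, and an audio spectrogram tiled to three channels are each encoded by the same image encoder, the per-modality embeddings are averaged as a stand-in for MOV's learned cross-modal fusion, and classification is performed by cosine similarity against text embeddings of arbitrary class names. The preprocessing, fusion, and prompt template here are illustrative assumptions.

```python
# Hypothetical sketch of multimodal open-vocabulary classification with a frozen
# CLIP vision encoder. Inputs are random tensors standing in for preprocessed
# RGB frames, optical flow rendered as RGB, and a 3-channel audio spectrogram.
import torch
import clip  # OpenAI CLIP package: https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

def encode_modality(images: torch.Tensor) -> torch.Tensor:
    """Encode a stack of (N, 3, 224, 224) inputs and average into one embedding."""
    with torch.no_grad():
        feats = model.encode_image(images.to(device))
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)

frames = torch.randn(8, 3, 224, 224)        # sampled video frames
flow = torch.randn(8, 3, 224, 224)          # optical flow visualized as images
spectrogram = torch.randn(1, 3, 224, 224)   # audio spectrogram tiled to 3 channels

# Naive fusion: average the normalized per-modality embeddings
# (a placeholder for MOV's learned cross-modal fusion mechanism).
video_feat = torch.stack([encode_modality(m) for m in (frames, flow, spectrogram)]).mean(dim=0)
video_feat = video_feat / video_feat.norm()

# Open vocabulary: any class names can be supplied at test time.
class_names = ["playing violin", "skateboarding", "dog barking"]
text_tokens = clip.tokenize([f"a video of {c}" for c in class_names]).to(device)
with torch.no_grad():
    text_feats = model.encode_text(text_tokens)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

scores = (video_feat @ text_feats.T).softmax(dim=-1)
print({c: round(float(s), 3) for c, s in zip(class_names, scores)})
```

The averaging above is only meant to show where the pre-trained vision and text encoders plug in; in the paper the fusion is a learned cross-modal mechanism operating on properly preprocessed inputs.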
Related papers
- Harnessing Large Language Models for Training-free Video Anomaly Detection [34.76811491190446]
Video anomaly detection (VAD) aims to temporally locate abnormal events in a video.
Training-based methods tend to be domain-specific, which makes them costly for practical deployment.
We propose LAnguage-based VAD (LAVAD), a method tackling VAD in a novel, training-free paradigm.
arXiv Detail & Related papers (2024-04-01T09:34:55Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of video dynamics.
In this paper, we address this challenge in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - E-ViLM: Efficient Video-Language Model via Masked Video Modeling with
Semantic Vector-Quantized Tokenizer [5.7254320553764]
E-ViLM learns expressive representations from a Video-Language corpus and generalizes well to a wide range of Video-Language tasks.
Our model reaches 39.3% Top-1 accuracy on the MSRVTT benchmark, retaining 91.4% of the accuracy of the state-of-the-art larger VL architecture.
arXiv Detail & Related papers (2023-11-28T22:57:17Z)
- Eliciting In-Context Learning in Vision-Language Models for Videos Through Curated Data Distributional Properties [13.938281516499119]
We implement Emergent In-context Learning on Videos (EILEV), a novel training paradigm that induces in-context learning over video and text.
Our results, analysis, and EILEV-trained models yield numerous insights about the emergence of in-context learning over video and text.
arXiv Detail & Related papers (2023-11-28T18:53:06Z)
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
- Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models [149.1331903899298]
We propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge.
We present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner.
Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model.
arXiv Detail & Related papers (2022-12-31T11:36:53Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone (a schematic sketch of this layout follows the list below).
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
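As referenced in the UniVL entry above, here is a schematic, hypothetical skeleton of such a four-component layout built from standard PyTorch Transformer modules: two single-modal encoders (text and video), a cross encoder over their concatenated outputs, and a decoder for generation. Layer counts, dimensions, and the random inputs are illustrative assumptions, not the paper's configuration.

```python
# Hypothetical skeleton of a UniVL-style four-component model: two single-modal
# encoders, a cross encoder, and a decoder, all on a Transformer backbone.
import torch
import torch.nn as nn

class UniVLSkeleton(nn.Module):
    def __init__(self, d_model: int = 512, nhead: int = 8, vocab_size: int = 30522):
        super().__init__()
        enc = lambda: nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=2)
        self.text_encoder = enc()    # single-modal text encoder
        self.video_encoder = enc()   # single-modal video encoder
        self.cross_encoder = enc()   # joint cross-modal encoder
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)  # generation head

    def forward(self, text_emb, video_emb, target_emb):
        t = self.text_encoder(text_emb)                         # (B, Lt, D)
        v = self.video_encoder(video_emb)                       # (B, Lv, D)
        joint = self.cross_encoder(torch.cat([t, v], dim=1))    # (B, Lt+Lv, D)
        out = self.decoder(target_emb, joint)                   # decode over joint context
        return self.lm_head(out)                                # (B, Ltgt, vocab)

# Smoke test with random embeddings in place of real token/video features.
model = UniVLSkeleton()
logits = model(torch.randn(2, 16, 512), torch.randn(2, 32, 512), torch.randn(2, 10, 512))
print(logits.shape)  # torch.Size([2, 10, 30522])
```

This only shows how the four components connect; UniVL's actual pre-training objectives and strategies (StagedP, EnhancedV) are not modeled here.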
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.