Language Models with Image Descriptors are Strong Few-Shot
Video-Language Learners
- URL: http://arxiv.org/abs/2205.10747v2
- Date: Tue, 24 May 2022 17:39:06 GMT
- Title: Language Models with Image Descriptors are Strong Few-Shot
Video-Language Learners
- Authors: Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong
Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, Shih-Fu Chang,
Mohit Bansal, Heng Ji
- Abstract summary: We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions, object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
- Score: 167.0346394848718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of this work is to build flexible video-language models that can
generalize to various video-to-text tasks from few examples, such as
domain-specific captioning, question answering, and future event prediction.
Existing few-shot video-language learners focus exclusively on the encoder,
resulting in the absence of a video-to-text decoder to handle generative tasks.
Video captioners have been pretrained on large-scale video-language datasets,
but they rely heavily on finetuning and lack the ability to generate text for
unseen tasks in a few-shot setting. We propose VidIL, a few-shot Video-language
Learner via Image and Language models, which demonstrates strong performance on
few-shot video-to-text tasks without the necessity of pretraining or finetuning
on any video datasets. We use the image-language models to translate the video
content into frame captions, object, attribute, and event phrases, and compose
them into a temporal structure template. We then instruct a language model,
with a prompt containing a few in-context examples, to generate a target output
from the composed content. The flexibility of prompting allows the model to
capture any form of text input, such as automatic speech recognition (ASR)
transcripts. Our experiments demonstrate the power of language models in
understanding videos on a wide variety of video-language tasks, including video
captioning, video question answering, video caption retrieval, and video future
event prediction. In particular, on video future event prediction, our few-shot
model significantly outperforms state-of-the-art supervised models trained on
large-scale video datasets. Code and resources are publicly available for
research purposes at https://github.com/MikeWangWZHL/VidIL .
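
The composition step described above is straightforward to sketch. The following Python snippet is a minimal illustration, not taken from the official repository: the template wording, ordinal markers, and all function and variable names are assumptions. It shows one way frame captions, object/attribute/event phrases, and optional ASR text might be assembled into a temporal-structure prompt with a few in-context examples before being passed to a text-only language model.

```python
# Illustrative sketch of a VidIL-style prompt composition: frame-level
# descriptors from image-language models are ordered temporally and combined
# with a few in-context examples for a text-only LLM. All names and the
# template wording are illustrative assumptions, not the paper's exact format.
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FrameDescriptors:
    """Per-frame outputs produced by an image-language model (e.g. a captioner)."""
    caption: str
    objects: list[str] = field(default_factory=list)
    attributes: list[str] = field(default_factory=list)
    events: list[str] = field(default_factory=list)


def temporal_block(frames: list[FrameDescriptors], asr: str | None = None) -> str:
    """Compose frame-level descriptors into a temporal-structure template."""
    # Ordinal markers make the frame order explicit to the language model.
    markers = ["First", "Then", "After that", "Finally"]
    lines = []
    for i, frame in enumerate(frames):
        marker = markers[min(i, len(markers) - 1)]
        phrases = ", ".join(frame.objects + frame.attributes + frame.events)
        lines.append(f"{marker}, {frame.caption} ({phrases})")
    if asr:
        lines.append(f"Subtitles: {asr}")
    return "\n".join(lines)


def build_prompt(task_instruction: str,
                 examples: list[tuple[list[FrameDescriptors], str]],
                 query: list[FrameDescriptors],
                 asr: str | None = None) -> str:
    """Prepend a few in-context examples, then append the query video."""
    parts = [task_instruction]
    for frames, target in examples:
        parts.append(temporal_block(frames) + f"\nSummary: {target}")
    parts.append(temporal_block(query, asr) + "\nSummary:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    demo = [
        FrameDescriptors("a chef slices onions", ["chef", "onion", "knife"], ["sharp"], ["slicing"]),
        FrameDescriptors("the pan is on the stove", ["pan", "stove"], ["hot"], ["frying"]),
    ]
    prompt = build_prompt(
        "Write a one-sentence caption for the video.",
        examples=[(demo, "A chef prepares onions and fries them in a pan.")],
        query=demo,
    )
    print(prompt)  # This string would be sent to a text-only language model.
```

Because the video is reduced to text before the language model sees it, any additional text stream (such as ASR transcripts) can be appended to the same prompt without changing the model.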
Related papers
- Towards Holistic Language-video Representation: the language model-enhanced MSR-Video to Text Dataset [4.452729255042396]
A more robust and holistic language-video representation is the key to pushing video understanding forward.
The current plain text descriptions and the visual-only focus of language-video tasks result in limited capacity on real-world natural language video retrieval tasks.
This paper introduces a method to automatically enhance video-language datasets, making them more modality and context-aware.
arXiv Detail & Related papers (2024-06-19T20:16:17Z)
- OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate that this simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its temporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- SPOT! Revisiting Video-Language Models for Event Understanding [31.49859545456809]
We introduce SPOT Prober to benchmark existing video-language models' capacity to distinguish event-level discrepancies.
We evaluate the existing video-language models with these positive and negative captions and find they fail to distinguish most of the manipulated events.
Based on our findings, we propose to plug in these manipulated event captions as hard negative samples and find them effective in enhancing models for event understanding.
arXiv Detail & Related papers (2023-11-21T18:43:07Z)
- VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically plausible contrastive changes in the video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replaced entities, replaced actions, and flipped event order.
Our model sets a new state of the art in zero-shot performance on temporally extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z)
- Analyzing Zero-Shot Abilities of Vision-Language Models on Video Understanding Tasks [6.925770576386087]
We propose a detailed study on the generalization abilities of image-text models when evaluated on video understanding tasks in a zero-shot setting.
Our experiments show that image-text models exhibit impressive performance on video AR, video RT, and video MC.
These findings shed light on the benefits of adapting foundational image-text models to an array of video tasks while avoiding the costly pretraining step.
arXiv Detail & Related papers (2023-10-07T20:57:54Z)
- Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning [93.6842670770983]
Vid2Seq is a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries.
The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks.
arXiv Detail & Related papers (2023-02-27T19:53:49Z)
- End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)