Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
- URL: http://arxiv.org/abs/2302.14115v2
- Date: Tue, 21 Mar 2023 11:01:09 GMT
- Title: Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
- Authors: Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic and Cordelia Schmid
- Abstract summary: Vid2Seq is a multi-modal single-stage dense event captioning model pretrained on narrated videos, which are readily available at scale.
We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries.
The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks.
- Score: 93.6842670770983
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we introduce Vid2Seq, a multi-modal single-stage dense event
captioning model pretrained on narrated videos which are readily available at
scale. The Vid2Seq architecture augments a language model with special time
tokens, allowing it to seamlessly predict event boundaries and textual
descriptions in the same output sequence. Such a unified model requires
large-scale training data, which is not available in current annotated
datasets. We show that it is possible to leverage unlabeled narrated videos for
dense video captioning, by reformulating sentence boundaries of transcribed
speech as pseudo event boundaries, and using the transcribed speech sentences
as pseudo event captions. The resulting Vid2Seq model pretrained on the
YT-Temporal-1B dataset improves the state of the art on a variety of dense
video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions.
Vid2Seq also generalizes well to the tasks of video paragraph captioning and
video clip captioning, and to few-shot settings. Our code is publicly available
at https://antoyang.github.io/vid2seq.html.
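To make the time-token formulation concrete, the following is a minimal sketch of how such a target sequence could be built. It is illustrative rather than the authors' implementation: the bin count (NUM_TIME_BINS), the token naming (<time_k>), and the exact interleaving order are assumptions; only the general ideas of quantized time tokens and of ASR sentences serving as pseudo event boundaries and pseudo captions come from the abstract.

```python
# Illustrative sketch only: quantize timestamps into special time tokens and
# interleave them with captions in a single output string, in the spirit of
# the time-token formulation described above.

NUM_TIME_BINS = 100  # assumed number of relative time bins

def time_token(t_seconds: float, video_duration: float) -> str:
    """Map an absolute timestamp to a discrete relative time token."""
    bin_idx = min(NUM_TIME_BINS - 1, int(t_seconds / video_duration * NUM_TIME_BINS))
    return f"<time_{bin_idx}>"

def build_target_sequence(events, video_duration):
    """Interleave start/end time tokens with captions in one sequence.

    `events` is a list of (start_s, end_s, caption) triples; during
    pretraining these can be ASR sentences whose timestamps act as pseudo
    event boundaries and whose text acts as pseudo captions.
    """
    parts = []
    for start_s, end_s, caption in events:
        parts.append(time_token(start_s, video_duration))
        parts.append(time_token(end_s, video_duration))
        parts.append(caption.strip())
    return " ".join(parts)

# Example: two ASR sentences from a 120-second narrated video become pseudo events.
asr_events = [
    (3.2, 11.5, "first we chop the onions"),
    (14.0, 26.8, "then we fry them until golden"),
]
print(build_target_sequence(asr_events, video_duration=120.0))
# "<time_2> <time_9> first we chop the onions <time_11> <time_22> then we fry them until golden"
```

Training a single encoder-decoder on sequences of this form lets event boundaries and their descriptions be predicted jointly in one pass, which is the unified formulation the abstract describes.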
Related papers
- Streaming Dense Video Captioning [85.70265343236687]
An ideal model for dense video captioning should be able to handle long input videos and predict rich, detailed textual descriptions.
Current state-of-the-art models process a fixed number of downsampled frames, and make a single full prediction after seeing the whole video.
We propose a streaming dense video captioning model that consists of two novel components.
arXiv Detail & Related papers (2024-04-01T17:59:15Z)
- VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically plausible contrastive changes in video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replaced entities, replaced actions, and flipped event order.
Our model sets a new state of the art in zero-shot performance on temporally extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z)
- HowToCaption: Prompting LLMs to Transform Video Annotations at Scale [72.69268311756082]
We propose to leverage the capabilities of large language models (LLMs) to obtain high-quality video descriptions aligned with videos at scale.
We introduce a prompting method that takes into account longer subtitle passages, allowing us to capture contextual information beyond a single sentence.
We apply our method to the subtitles of the HowTo100M dataset, creating a new large-scale dataset, HowToCaption.
arXiv Detail & Related papers (2023-10-07T19:32:55Z)
- Encoder-Decoder Based Long Short-Term Memory (LSTM) Model for Video Captioning [0.0]
This work demonstrates the implementation and use of an encoder-decoder model to perform a many-to-many mapping of video data to text captions.
The many-to-many mapping runs from an input temporal sequence of video frames to an output sequence of words forming a caption sentence (a minimal sketch of such an encoder-decoder appears after this list).
arXiv Detail & Related papers (2023-10-02T02:32:26Z)
- Scalable and Accurate Self-supervised Multimodal Representation Learning without Aligned Video and Text Data [18.479220305684837]
Recent advances in image captioning allow us to pre-train high-quality video models without parallel video-text data.
We show that image captioning pseudolabels work better for pre-training than the existing HowTo100M ASR captions.
arXiv Detail & Related papers (2023-04-04T19:11:05Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use image-language models to translate the video content into frame captions and object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content (a sketch of this prompt composition appears after this list).
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- End-to-end Generative Pretraining for Multimodal Video Captioning [82.79187814057313]
We present Multimodal Video Generative Pretraining (MV-GPT), a new pretraining framework for learning from unlabelled videos.
Unlike recent video-language pretraining frameworks, our framework trains both a multimodal video encoder and a sentence decoder jointly.
Our model achieves state-of-the-art performance for multimodal video captioning on four standard benchmarks.
arXiv Detail & Related papers (2022-01-20T16:16:21Z)
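As a companion to the encoder-decoder LSTM entry above, here is a minimal sketch of a frame-sequence-to-word-sequence captioner. It is a generic illustration under assumed hyperparameters and feature dimensions, not the paper's architecture or code.

```python
# Illustrative encoder-decoder LSTM for video captioning (assumed dimensions;
# not the paper's implementation).
import torch
import torch.nn as nn

class VideoCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, vocab_size=10000, embed_dim=300):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, hidden_dim, batch_first=True)   # frame features -> video state
        self.embed = nn.Embedding(vocab_size, embed_dim)                 # word ids -> vectors
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # word vectors -> hidden states
        self.out = nn.Linear(hidden_dim, vocab_size)                     # hidden states -> word logits

    def forward(self, frame_feats, caption_ids):
        # frame_feats: (batch, num_frames, feat_dim); caption_ids: (batch, num_words)
        _, video_state = self.encoder(frame_feats)      # summarize the temporal frame sequence
        dec_in = self.embed(caption_ids)                # teacher-forced caption tokens
        dec_out, _ = self.decoder(dec_in, video_state)  # condition word generation on the video
        return self.out(dec_out)                        # per-step vocabulary logits

model = VideoCaptioner()
frames = torch.randn(2, 16, 2048)            # 2 videos, 16 frame features each
captions = torch.randint(0, 10000, (2, 12))  # 2 captions of 12 word ids
logits = model(frames, captions)             # shape (2, 12, 10000)
```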
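Similarly, for the VidIL entry above, the following is a hypothetical sketch of composing a few-shot prompt from frame captions and object/attribute/event phrases; the field names, example data, and prompt wording are assumptions for illustration, not the paper's prompts.

```python
# Hypothetical few-shot prompt composition from frame-level visual tokens
# (illustrative; not the paper's prompt format). The frame captions and
# phrases are assumed to come from off-the-shelf image-language models.

def compose_prompt(examples, query):
    """Build a few-shot prompt pairing visual tokens with target captions."""
    def render(video):
        return (
            f"Frame captions: {'; '.join(video['frame_captions'])}\n"
            f"Objects: {', '.join(video['objects'])}\n"
            f"Events: {', '.join(video['events'])}\n"
        )

    blocks = []
    for ex in examples:                              # in-context demonstrations
        blocks.append(render(ex) + f"Video caption: {ex['target']}\n")
    blocks.append(render(query) + "Video caption:")  # query for the LLM to complete
    return "\n".join(blocks)

demo = {
    "frame_captions": ["a person slices a tomato", "a pan on a stove"],
    "objects": ["tomato", "knife", "pan"],
    "events": ["slicing", "cooking"],
    "target": "someone prepares a tomato dish on the stove",
}
query = {
    "frame_captions": ["a dog runs on grass", "a frisbee in the air"],
    "objects": ["dog", "frisbee", "grass"],
    "events": ["running", "catching"],
}
print(compose_prompt([demo], query))  # text prompt to send to a language model
```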
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.