VideoCon: Robust Video-Language Alignment via Contrast Captions
- URL: http://arxiv.org/abs/2311.10111v1
- Date: Wed, 15 Nov 2023 19:51:57 GMT
- Title: VideoCon: Robust Video-Language Alignment via Contrast Captions
- Authors: Hritik Bansal, Yonatan Bitton, Idan Szpektor, Kai-Wei Chang, Aditya
Grover
- Abstract summary: Video-language alignment models are not robust to semantically plausible contrastive changes in the video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replacing entities, replacing actions, and flipping event order.
Our model sets new state-of-the-art zero-shot performance on temporally extensive video-language tasks.
- Score: 80.08882631838914
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite being (pre)trained on a massive amount of data,
state-of-the-art video-language alignment models are not robust to
semantically plausible contrastive changes in the video captions. Our work
addresses this by identifying a broad spectrum of contrast misalignments,
such as replacing entities, replacing actions, and flipping event order,
which alignment models should be robust against. To this end, we introduce
VideoCon, a video-language
alignment dataset constructed by a large language model that generates
plausible contrast video captions and explanations for differences between
original and contrast video captions. Then, a generative video-language model
is finetuned with VideoCon to assess video-language entailment and generate
explanations. Our VideoCon-based alignment model significantly outperforms
current models. It exhibits a 12-point increase in AUC for the video-language
alignment task on human-generated contrast captions. Finally, our model sets
new state-of-the-art zero-shot performance on temporally extensive
video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video
question answering (ATP-Hard). Moreover, our model shows superior performance
on novel videos and human-crafted captions and explanations. Our code and data
are available at https://github.com/Hritikbansal/videocon.
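The abstract reports alignment quality as AUC over original versus contrast captions. Below is a minimal sketch of how such an evaluation could be computed, assuming the finetuned model already yields an entailment score in [0, 1] for each (video, caption) pair; the captions, labels, and score values are illustrative placeholders, not outputs of the released VideoCon model.

```python
# Sketch of a pairwise alignment (entailment) evaluation with made-up scores.
from sklearn.metrics import roc_auc_score

# Each video is paired with its original caption (label 1) and an LLM-generated
# contrast caption (label 0), e.g. a replaced entity or a flipped event order.
# The third field stands in for the entailment score a finetuned
# video-language model would assign to the (video, caption) pair.
pairs = [
    ("A man pours coffee and then reads the newspaper", 1, 0.91),
    ("A man reads the newspaper and then pours coffee", 0, 0.22),  # flipped event order
    ("A dog catches a frisbee in the park",             1, 0.87),
    ("A cat catches a frisbee in the park",             0, 0.35),  # replaced entity
]

labels = [label for _, label, _ in pairs]
scores = [score for _, _, score in pairs]

# AUC of 1.0 means originals always score above their contrast captions; 0.5 is chance.
print(f"video-language alignment AUC: {roc_auc_score(labels, scores):.3f}")
```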
Related papers
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Distilling Vision-Language Models on Millions of Videos [62.92789440875999]
We fine-tune a video-language model from a strong image-language baseline with synthesized instructional data.
The resulting video model, obtained via video-instruction-tuning (VIIT), is then used to auto-label millions of videos, generating high-quality captions.
As a side product, we generate the largest video caption dataset to date.
arXiv Detail & Related papers (2024-01-11T18:59:53Z)
- Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment [10.567291051485194]
We propose ZeroTA, a novel method for dense video captioning in a zero-shot manner.
Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time.
arXiv Detail & Related papers (2023-07-05T23:01:26Z)
- Contrastive Video-Language Learning with Fine-grained Frame Sampling [54.542962813921214]
FineCo is an approach to better learn video and language representations with a fine-grained contrastive objective operating on video frames.
It helps distil a video by selecting the frames that are semantically equivalent to the text, improving cross-modal correspondence.
arXiv Detail & Related papers (2022-10-10T22:48:08Z)
- Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners [167.0346394848718]
We propose VidIL, a few-shot Video-language Learner via Image and Language models.
We use the image-language models to translate the video content into frame captions and object, attribute, and event phrases.
We then instruct a language model, with a prompt containing a few in-context examples, to generate a target output from the composed content.
arXiv Detail & Related papers (2022-05-22T05:18:27Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.