Revisiting the "Video" in Video-Language Understanding
- URL: http://arxiv.org/abs/2206.01720v1
- Date: Fri, 3 Jun 2022 17:57:33 GMT
- Title: Revisiting the "Video" in Video-Language Understanding
- Authors: Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, Juan Carlos Niebles
- Abstract summary: We propose the atemporal probe (ATP), a new model for video-language analysis.
We characterize the limitations and potential of current video-language benchmarks.
We show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
- Score: 56.15777956496518
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: What makes a video task uniquely suited for videos, beyond what can be
understood from a single image? Building on recent progress in self-supervised
image-language models, we revisit this question in the context of video and
language tasks. We propose the atemporal probe (ATP), a new model for
video-language analysis which provides a stronger bound on the baseline
accuracy of multimodal models constrained by image-level understanding. By
applying this model to standard discriminative video and language tasks, such
as video question answering and text-to-video retrieval, we characterize the
limitations and potential of current video-language benchmarks. We find that
understanding of event temporality is often not necessary to achieve strong or
state-of-the-art performance, even compared with recent large-scale
video-language models and in contexts intended to benchmark deeper video-level
understanding. We also demonstrate how ATP can improve both video-language
dataset and model design. We describe a technique for leveraging ATP to better
disentangle dataset subsets with a higher concentration of temporally
challenging data, improving benchmarking efficacy for causal and temporal
understanding. Further, we show that effectively integrating ATP into full
video-level temporal models can improve efficiency and state-of-the-art
accuracy.
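In the paper, ATP operates over frozen image-language (CLIP-style) embeddings of sparsely sampled frames and learns to commit to a single frame's embedding, deliberately withholding temporal order so that its accuracy bounds what image-level understanding alone can achieve. Below is a minimal PyTorch sketch of that idea; the class name `AtemporalProbe`, the layer sizes, and the Gumbel-softmax selection step are illustrative assumptions rather than details given on this page.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AtemporalProbe(nn.Module):
    """Minimal ATP-style selector (illustrative sketch, not the authors' code).

    Scores frozen per-frame embeddings *without* positional encoding, so the
    probe cannot exploit frame order, and returns a single frame's embedding
    for the downstream discriminative task.
    """
    def __init__(self, dim: int = 512, num_layers: int = 2, num_heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.scorer = nn.Linear(dim, 1)

    def forward(self, frame_emb: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # frame_emb: (batch, num_frames, dim) frozen image-language features.
        h = self.encoder(frame_emb)                  # no positional info added
        logits = self.scorer(h).squeeze(-1)          # (batch, num_frames)
        if self.training:
            # Approximately one-hot, differentiable frame selection.
            weights = F.gumbel_softmax(logits, tau=tau, hard=True)
        else:
            weights = F.one_hot(logits.argmax(dim=-1),
                                logits.size(-1)).type_as(frame_emb)
        return (weights.unsqueeze(-1) * frame_emb).sum(dim=1)  # (batch, dim)

# Usage sketch: 16 sampled frames with 512-d embeddings (assumed sizes).
probe = AtemporalProbe(dim=512)
selected = probe(torch.randn(4, 16, 512))  # -> (4, 512), one frame per video
```

Because the probe never sees frame order, any benchmark accuracy it reaches upper-bounds image-level understanding; the dataset-disentanglement technique the abstract mentions can then be read as partitioning examples by whether this atemporal bound already suffices.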
Related papers
- Realizing Video Summarization from the Path of Language-based Semantic Understanding [19.825666473712197]
We propose a novel video summarization framework inspired by the Mixture of Experts (MoE) paradigm.
Our approach integrates multiple VideoLLMs to generate comprehensive and coherent textual summaries.
arXiv Detail & Related papers (2024-10-06T15:03:22Z)
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging because it requires modeling a video's spatiotemporal dynamics.
In this paper, we address this limitation in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- VideoCon: Robust Video-Language Alignment via Contrast Captions [80.08882631838914]
Video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions.
Our work identifies a broad spectrum of contrast misalignments, such as replaced entities and actions, and flipped event order.
Our model sets new state-of-the-art zero-shot performance on temporally extensive video-language tasks.
arXiv Detail & Related papers (2023-11-15T19:51:57Z)
- Test of Time: Instilling Video-Language Models with a Sense of Time [42.290970800790184]
Seven existing video-language models struggle to understand simple temporal relations.
We propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data.
We observe encouraging performance gains especially when the task needs higher time awareness.
arXiv Detail & Related papers (2023-01-05T14:14:36Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling [48.283659682112926]
We propose LiteVL, which adapts a pre-trained image-language model BLIP into a video-text model directly on downstream tasks.
We also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text (a sketch of this idea follows after this list).
arXiv Detail & Related papers (2022-10-21T13:03:49Z)
- Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z)
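On the LiteVL entry above: its text-conditioned, non-parametric pooling can be sketched as similarity-weighted averaging of fine-grained video tokens. The cosine-similarity weighting and the temperature `tau` below are assumptions on my part; this page does not specify the mechanism's exact form.

```python
import torch
import torch.nn.functional as F

def text_conditioned_pool(video_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """Parameter-free pooling sketch: reweight fine-grained video tokens by
    their similarity to the text embedding, then take a weighted sum.

    video_emb: (batch, num_tokens, dim); text_emb: (batch, dim).
    tau is an assumed softmax temperature (not specified in the summary).
    """
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    sim = torch.einsum('bnd,bd->bn', v, t) / tau   # cosine similarities
    weights = sim.softmax(dim=-1)                  # (batch, num_tokens)
    return torch.einsum('bn,bnd->bd', weights, video_emb)
```

Being parameter-free, such a pooling adds no weights on top of the pre-trained image-language model, which is consistent with the entry's claim of adapting BLIP to video-text tasks directly on downstream data.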