Test of Time: Instilling Video-Language Models with a Sense of Time
- URL: http://arxiv.org/abs/2301.02074v2
- Date: Sat, 25 Mar 2023 12:44:50 GMT
- Title: Test of Time: Instilling Video-Language Models with a Sense of Time
- Authors: Piyush Bagad and Makarand Tapaswi and Cees G. M. Snoek
- Abstract summary: Seven existing video-language models struggle to understand simple temporal relations.
We propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data.
We observe encouraging performance gains especially when the task needs higher time awareness.
- Score: 42.290970800790184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modelling and understanding time remains a challenge in contemporary video
understanding models. With language emerging as a key driver towards powerful
generalization, it is imperative for foundational video-language models to have
a sense of time. In this paper, we consider a specific aspect of temporal
understanding: consistency of time order as elicited by before/after relations.
We establish that seven existing video-language models struggle to understand
even such simple temporal relations. We then question whether it is feasible to
equip these foundational models with temporal awareness without re-training
them from scratch. Towards this, we propose a temporal adaptation recipe on top
of one such model, VideoCLIP, based on post-pretraining on a small amount of
video-text data. We conduct a zero-shot evaluation of the adapted models on six
datasets for three downstream tasks which require varying degrees of time
awareness. We observe encouraging performance gains especially when the task
needs higher time awareness. Our work serves as a first step towards probing
and instilling a sense of time in existing video-language models without the
need for data and compute-intense training from scratch.
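The abstract describes the adaptation recipe only at a high level. Below is a minimal sketch of how a time-order contrastive post-pretraining objective could look for a CLIP-style dual encoder; the `video_encoder`, `text_encoder`, and batch fields are hypothetical stand-ins, and the paper's actual VideoCLIP recipe may differ in its choice of negatives and loss.
```python
import torch
import torch.nn.functional as F


def time_order_contrastive_loss(video_encoder, text_encoder, batch, temperature=0.07):
    """Contrast correctly ordered video-text pairs against time-reversed negatives.

    Expected (hypothetical) batch fields:
      batch["video"]          : clips in their true temporal order
      batch["video_reversed"] : the same clips with the event order flipped
      batch["text"]           : captions such as "A before B"
      batch["text_reversed"]  : captions with before/after swapped ("B before A")
    """
    v     = F.normalize(video_encoder(batch["video"]), dim=-1)           # (B, D)
    v_rev = F.normalize(video_encoder(batch["video_reversed"]), dim=-1)  # (B, D)
    t     = F.normalize(text_encoder(batch["text"]), dim=-1)             # (B, D)
    t_rev = F.normalize(text_encoder(batch["text_reversed"]), dim=-1)    # (B, D)

    # Video-to-text: candidates are all captions in the batch plus each video's
    # own order-swapped caption as an extra hard negative (last column).
    logits_v2t = torch.cat([v @ t.T, (v * t_rev).sum(-1, keepdim=True)], dim=1) / temperature
    # Text-to-video: each caption must prefer the correctly ordered video over
    # its time-reversed counterpart.
    logits_t2v = torch.cat([t @ v.T, (t * v_rev).sum(-1, keepdim=True)], dim=1) / temperature

    targets = torch.arange(v.size(0), device=v.device)  # diagonal entries are the positives
    return 0.5 * (F.cross_entropy(logits_v2t, targets) + F.cross_entropy(logits_t2v, targets))
```
The extra logit column penalizes matching a video to its order-swapped caption, which is what pushes the model to attend to before/after order rather than content alone.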
Related papers
- On the Consistency of Video Large Language Models in Temporal Comprehension [57.985769348320616]
Video large language models (Video-LLMs) can temporally ground language queries and retrieve video moments.
We conduct a study on prediction consistency -- a key indicator for robustness and trustworthiness of temporal grounding.
arXiv Detail & Related papers (2024-11-20T00:47:17Z)
- Predicting Long-horizon Futures by Conditioning on Geometry and Time [49.86180975196375]
We explore the task of generating future sensor observations conditioned on the past.
We leverage the large-scale pretraining of image diffusion models which can handle multi-modality.
We create a benchmark for video prediction on a diverse set of videos spanning indoor and outdoor scenes.
arXiv Detail & Related papers (2024-04-17T16:56:31Z)
- VITATECS: A Diagnostic Dataset for Temporal Concept Understanding of Video-Language Models [27.280311932711847]
We present VITATECS, a diagnostic VIdeo-Text dAtaset for the evaluation of TEmporal Concept underStanding.
We first introduce a fine-grained taxonomy of temporal concepts in natural language in order to diagnose the capability of VidLMs to comprehend different temporal aspects.
We generate counterfactual video descriptions that differ from the original one only in the specified temporal aspect.
arXiv Detail & Related papers (2023-11-29T07:15:34Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z)
- Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling [28.530765643908083]
We decouple spatial-temporal modeling and integrate an image- and a video-language model to learn fine-grained visual understanding.
We propose a novel pre-training objective, Temporal Referring Modeling, which requires the model to identify temporal positions of events in video sequences.
Our model outperforms previous work pre-trained on orders of magnitude larger datasets.
arXiv Detail & Related papers (2022-10-08T07:03:31Z)
- Revisiting the "Video" in Video-Language Understanding [56.15777956496518]
We propose the atemporal probe (ATP), a new model for video-language analysis.
We characterize the limitations and potential of current video-language benchmarks.
We show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
arXiv Detail & Related papers (2022-06-03T17:57:33Z)
- Learning Temporal Dynamics from Cycles in Narrated Video [85.89096034281694]
We propose a self-supervised solution to the problem of learning to model how the world changes as time elapses.
Our model learns modality-agnostic functions to predict forward and backward in time, which must undo each other when composed.
We apply the learned dynamics model without further training to various tasks, such as predicting future action and temporally ordering sets of images.
arXiv Detail & Related papers (2021-01-07T02:41:32Z)
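For the last entry above, the following is a minimal sketch of the forward/backward cycle idea in an embedding space; the module names, MLP predictors, and losses here are assumptions for illustration, not the paper's exact formulation.
```python
import torch.nn as nn


class CycleDynamics(nn.Module):
    """Forward/backward temporal predictors trained to undo each other when composed."""

    def __init__(self, dim=512):
        super().__init__()
        self.forward_model = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.backward_model = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def cycle_loss(self, z_t, z_next):
        """z_t, z_next: embeddings of the same scene at two consecutive times."""
        pred_next = self.forward_model(z_t)         # predict the future state
        pred_back = self.backward_model(pred_next)  # then predict back in time
        # The forward prediction should match the observed future, and the
        # forward-then-backward round trip should return to where it started.
        return (pred_next - z_next).pow(2).mean() + (pred_back - z_t).pow(2).mean()
```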
This list is automatically generated from the titles and abstracts of the papers on this site.