Let's Think Frame by Frame with VIP: A Video Infilling and Prediction
Dataset for Evaluating Video Chain-of-Thought
- URL: http://arxiv.org/abs/2305.13903v3
- Date: Thu, 9 Nov 2023 06:50:26 GMT
- Title: Let's Think Frame by Frame with VIP: A Video Infilling and Prediction
Dataset for Evaluating Video Chain-of-Thought
- Authors: Vaishnavi Himakunthala, Andy Ouyang, Daniel Rose, Ryan He, Alex Mei,
Yujie Lu, Chinmay Sonar, Michael Saxon, William Yang Wang
- Abstract summary: We motivate framing video reasoning as the sequential understanding of a small number of keyframes.
We introduce VIP, an inference-time challenge dataset designed to explore models' reasoning capabilities through video chain-of-thought.
We benchmark GPT-4, GPT-3, and VICUNA on VIP, demonstrate the performance gap in complex video reasoning tasks, and encourage future work.
- Score: 62.619076257298204
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite exciting recent results showing vision-language systems' capacity to
reason about images using natural language, their capacity for video reasoning
remains under-explored. We motivate framing video reasoning as the sequential
understanding of a small number of keyframes, thereby leveraging the power and
robustness of vision-language while alleviating the computational complexities
of processing videos. To evaluate this novel application, we introduce VIP, an
inference-time challenge dataset designed to explore models' reasoning
capabilities through video chain-of-thought. Inspired by visually descriptive
scene plays, we propose two formats for keyframe description: unstructured
dense captions and structured scene descriptions that identify the focus,
action, mood, objects, and setting (FAMOuS) of the keyframe. To evaluate video
reasoning, we propose two tasks: Video Infilling and Video Prediction, which
test abilities to generate multiple intermediate keyframes and predict future
keyframes, respectively. We benchmark GPT-4, GPT-3, and VICUNA on VIP,
demonstrate the performance gap in these complex video reasoning tasks, and
encourage future work to prioritize language models for efficient and
generalized video reasoning.
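The FAMOuS keyframe format and the two evaluation tasks can be made concrete with a small sketch. The dataclass and prompt templates below are illustrative assumptions: the field names follow the focus/action/mood/objects/setting breakdown in the abstract, but the exact prompt wording and helper names are not from the paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FamousKeyframe:
    """Structured scene description of one keyframe: Focus, Action,
    Mood, Objects, and Setting (FAMOuS), as described in the abstract."""
    focus: str
    action: str
    mood: str
    objects: List[str]
    setting: str

    def to_prompt(self) -> str:
        # Serialize the structured description into text a language model can consume.
        return (
            f"Focus: {self.focus}\n"
            f"Action: {self.action}\n"
            f"Mood: {self.mood}\n"
            f"Objects: {', '.join(self.objects)}\n"
            f"Setting: {self.setting}"
        )

def video_infilling_prompt(before: List[FamousKeyframe],
                           after: List[FamousKeyframe],
                           n_missing: int) -> str:
    """Video Infilling: generate the intermediate keyframes that plausibly
    connect the given preceding and following context keyframes."""
    ctx_before = "\n\n".join(kf.to_prompt() for kf in before)
    ctx_after = "\n\n".join(kf.to_prompt() for kf in after)
    return (f"Preceding keyframes:\n{ctx_before}\n\n"
            f"Following keyframes:\n{ctx_after}\n\n"
            f"Describe the {n_missing} missing keyframes in the same FAMOuS format.")

def video_prediction_prompt(context: List[FamousKeyframe], n_future: int) -> str:
    """Video Prediction: continue the observed keyframe sequence."""
    ctx = "\n\n".join(kf.to_prompt() for kf in context)
    return (f"Observed keyframes:\n{ctx}\n\n"
            f"Predict the next {n_future} keyframes in the same FAMOuS format.")
```

In this framing, the resulting prompts would be sent to a language model such as GPT-4, GPT-3, or VICUNA, and the generated descriptions compared against reference keyframe descriptions; unstructured dense captions could be substituted for the structured fields without changing the task setup.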
Related papers
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z)
- FILS: Self-Supervised Video Feature Prediction In Semantic Language Space [11.641926922266347]
This paper demonstrates a self-supervised approach for learning semantic video representations.
We present FILS, a novel self-supervised video Feature prediction In semantic Language Space.
arXiv Detail & Related papers (2024-06-05T16:44:06Z)
- VideoDistill: Language-aware Vision Distillation for Video Question Answering [24.675876324457747]
We propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both vision perception and answer generation process.
VideoDistill generates answers only from question-related visual embeddings.
We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-01T07:44:24Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
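The CLIP-score-guided frame sampling mentioned in the VaQuitA summary above can be approximated generically: score each candidate frame against the text query with CLIP and keep the top-ranked frames instead of a uniform stride. The snippet below is a minimal sketch under that assumption, not VaQuitA's actual pipeline; the model checkpoint and selection details are placeholders.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def select_frames_by_clip_score(frames, query, k=8,
                                model_name="openai/clip-vit-base-patch32"):
    """Rank candidate frames (PIL images) by CLIP image-text similarity to
    `query` and return the top-k frames in their original temporal order.

    Generic sketch of CLIP-score-guided frame selection; the real system may
    differ in model choice, scoring, and post-processing."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[query], images=frames,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image has shape (num_frames, 1): similarity of each frame to the query.
    scores = outputs.logits_per_image.squeeze(-1)
    top_idx = scores.topk(min(k, len(frames))).indices.sort().values
    return [frames[i] for i in top_idx.tolist()]
```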
- A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video [20.579167394855197]
This paper proposes a practical multimodal video summarization task setting and dataset to train and evaluate the task.
The target task involves summarizing a given video into a number of keyframe-caption pairs and displaying them in a listable format so the video content can be grasped quickly.
This task is useful as a practical application and presents a highly challenging problem worthy of study.
arXiv Detail & Related papers (2023-12-04T02:17:14Z)
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
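The Hungarian-matching step in the VidCoM summary above corresponds to an optimal bipartite assignment between decomposed instruction parts and candidate video events. The snippet below is a generic illustration using cosine similarity over embeddings and scipy's assignment solver; it is not the InsOVER algorithm itself, and the embedding inputs are assumed to be precomputed.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_instructions_to_events(instruction_embs: np.ndarray,
                                 event_embs: np.ndarray):
    """Assign each decomposed instruction part to a video event via the
    Hungarian algorithm over a cosine-similarity cost matrix.

    instruction_embs: (n_parts, d) array; event_embs: (n_events, d) array.
    Returns a list of (instruction_index, event_index) pairs."""
    a = instruction_embs / np.linalg.norm(instruction_embs, axis=1, keepdims=True)
    b = event_embs / np.linalg.norm(event_embs, axis=1, keepdims=True)
    sim = a @ b.T
    # linear_sum_assignment minimizes cost, so negate similarity to maximize it.
    row_ind, col_ind = linear_sum_assignment(-sim)
    return list(zip(row_ind.tolist(), col_ind.tolist()))
```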
- Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling [79.49128866877922]
Video-Teller is a video-language foundation model that leverages multi-modal fusion and fine-grained modality alignment.
Video-Teller boosts the training efficiency by utilizing frozen pretrained vision and language modules.
It capitalizes on the robust linguistic capabilities of large language models, enabling the generation of both concise and elaborate video descriptions.
arXiv Detail & Related papers (2023-10-08T03:35:27Z)
- Animate-A-Story: Storytelling with Retrieval-Augmented Video Generation [69.20173154096]
We develop a framework comprised of two functional modules, Motion Structure Retrieval and Structure-Guided Text-to-Video Synthesis.
For the first module, we leverage an off-the-shelf video retrieval system and extract video depths as motion structure.
For the second module, we propose a controllable video generation model that offers flexible controls over structure and characters.
arXiv Detail & Related papers (2023-07-13T17:57:13Z)
- Look Before you Speak: Visually Contextualized Utterances [88.58909442073858]
We create a task for predicting utterances in a video using both visual frames and transcribed speech as context.
By exploiting the large number of instructional videos online, we train a model to solve this task at scale, without the need for manual annotations.
Our model achieves state-of-the-art performance on a number of downstream VideoQA benchmarks.
arXiv Detail & Related papers (2020-12-10T14:47:02Z)