Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering
- URL: http://arxiv.org/abs/2402.10698v2
- Date: Sun, 21 Jul 2024 00:42:15 GMT
- Title: Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering
- Authors: David Romero, Thamar Solorio,
- Abstract summary: We present Q-ViD, a simple approach for video question answering (video QA)
Q-ViD relies on a single instruction-aware open vision-language model (InstructBLIP) to tackle videoQA using frame descriptions.
- Score: 7.429984955853609
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Q-ViD, a simple approach for video question answering (video QA), that unlike prior methods, which are based on complex architectures, computationally expensive pipelines or use closed models like GPTs, Q-ViD relies on a single instruction-aware open vision-language model (InstructBLIP) to tackle videoQA using frame descriptions. Specifically, we create captioning instruction prompts that rely on the target questions about the videos and leverage InstructBLIP to obtain video frame captions that are useful to the task at hand. Subsequently, we form descriptions of the whole video using the question-dependent frame captions, and feed that information, along with a question-answering prompt, to a large language model (LLM). The LLM is our reasoning module, and performs the final step of multiple-choice QA. Our simple Q-ViD framework achieves competitive or even higher performances than current state of the art models on a diverse range of videoQA benchmarks, including NExT-QA, STAR, How2QA, TVQA and IntentQA.
Related papers
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z) - SparrowVQE: Visual Question Explanation for Course Content Understanding [12.926309478839652]
This paper introduces Visual Question Explanation (VQE), which enhances the ability of VQA to provide detailed explanations.
We trained our model with a three-stage training mechanism consisting of multimodal pre-training, instruction tuning, and domain fine-tuning.
Experimental results demonstrate that our SparrowVQE achieves better performance in our developed MLVQE dataset and outperforms state-of-the-art methods in the other five benchmark VQA datasets.
arXiv Detail & Related papers (2024-11-12T03:25:33Z) - Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts [3.6064695344878093]
Visual question answering (VQA) is known as an AI-complete task as it requires understanding, reasoning, and inferring about the vision and the language content.
This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline.
arXiv Detail & Related papers (2024-04-12T16:35:23Z) - MoReVQA: Exploring Modular Reasoning Models for Video Question Answering [101.25249395748794]
This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework.
Unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event, a grounding stage, and a final reasoning stage in conjunction with an external memory.
arXiv Detail & Related papers (2024-04-09T17:59:31Z) - Zero-Shot Video Question Answering with Procedural Programs [18.767610951412426]
We present Procedural Video Querying (ProViQ), which uses a large language model to generate such programs.
We provide ProViQ with modules intended for video understanding, allowing it to generalize to a wide variety of videos.
ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, and multimodal video question-answering datasets.
arXiv Detail & Related papers (2023-12-01T21:34:10Z) - Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z) - Zero-Shot Video Question Answering via Frozen Bidirectional Language
Models [89.71617065426146]
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training.
Recent methods consider zero-shot settings with no manual annotation of visual question-answer.
We build on frozen autoregressive language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA.
arXiv Detail & Related papers (2022-06-16T13:18:20Z) - NExT-QA:Next Phase of Question-Answering to Explaining Temporal Actions [80.60423934589515]
We introduce NExT-QA, a rigorously designed video question answering (VideoQA) benchmark.
We set up multi-choice and open-ended QA tasks targeting causal action reasoning, temporal action reasoning, and common scene comprehension.
We find that top-performing methods excel at shallow scene descriptions but are weak in causal and temporal action reasoning.
arXiv Detail & Related papers (2021-05-18T04:56:46Z) - End-to-End Video Question-Answer Generation with Generator-Pretester
Network [27.31969951281815]
We study a novel task, Video Question-Answer Generation (VQAG) for challenging Video Question Answering (Video QA) task in multimedia.
As captions neither fully represent a video, nor are they always practically available, it is crucial to generate question-answer pairs based on a video via Video Question-Answer Generation (VQAG)
We evaluate our system with the only two available large-scale human-annotated Video QA datasets and achieves state-of-the-art question generation performances.
arXiv Detail & Related papers (2021-01-05T10:46:06Z) - Hierarchical Conditional Relation Networks for Multimodal Video Question
Answering [67.85579756590478]
Video QA adds at least two more layers of complexity - selecting relevant content for each channel in the context of a linguistic query.
Conditional Relation Network (CRN) takes as input a set of tensorial objects translating into a new set of objects that encode relations of the inputs.
CRN is then applied for Video QA in two forms, short-form where answers are reasoned solely from the visual content, and long-form where associated information, such as subtitles, is presented.
arXiv Detail & Related papers (2020-10-18T02:31:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.