Zero-Shot Video Question Answering with Procedural Programs
- URL: http://arxiv.org/abs/2312.00937v1
- Date: Fri, 1 Dec 2023 21:34:10 GMT
- Title: Zero-Shot Video Question Answering with Procedural Programs
- Authors: Rohan Choudhury, Koichiro Niinuma, Kris M. Kitani, László A. Jeni
- Abstract summary: We present Procedural Video Querying (ProViQ), which uses a large language model to generate short procedural programs that answer questions about videos.
We provide ProViQ with modules intended for video understanding, allowing it to generalize to a wide variety of videos.
ProViQ achieves state-of-the-art results on a diverse range of benchmarks, with improvements of up to 25% on short, long, open-ended, and multimodal video question-answering datasets.
- Score: 18.767610951412426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose to answer zero-shot questions about videos by generating short
procedural programs that derive a final answer from solving a sequence of
visual subtasks. We present Procedural Video Querying (ProViQ), which uses a
large language model to generate such programs from an input question and an
API of visual modules in the prompt, then executes them to obtain the output.
Recent similar procedural approaches have proven successful for image question
answering, but videos remain challenging: we provide ProViQ with modules
intended for video understanding, allowing it to generalize to a wide variety
of videos. This code generation framework further enables ProViQ to perform
video tasks beyond question answering, such as multi-object tracking or basic
video editing. ProViQ achieves state-of-the-art
results on a diverse range of benchmarks, with improvements of up to 25% on
short, long, open-ended, and multimodal video question-answering datasets. Our
project page is at https://rccchoudhury.github.io/proviq2023.
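To make the program-generation idea concrete, the sketch below shows the kind of short procedural program such a framework might produce and then execute for a simple question. The module names (find, vqa) and the VideoClip wrapper are illustrative assumptions, not the actual ProViQ API, which is defined by the module prompt described in the paper.

```python
# Illustrative sketch only: the module names (find, vqa) and the VideoClip
# wrapper are hypothetical stand-ins, not the actual ProViQ API.
from typing import Any, List


class VideoClip:
    """Minimal placeholder for a decoded video: a list of frames."""

    def __init__(self, frames: List[Any]):
        self.frames = frames


def find(clip: VideoClip, object_name: str) -> List[int]:
    """Return indices of frames in which `object_name` is detected.

    Stub: in a real system this would be backed by an object detector.
    """
    raise NotImplementedError


def vqa(frame: Any, question: str) -> str:
    """Answer a question about a single frame.

    Stub: in a real system this would be backed by an image VQA model.
    """
    raise NotImplementedError


# The kind of short program an LLM might generate for the question
# "What color is the cup that appears in the video?"
def execute_query(clip: VideoClip) -> str:
    cup_frames = find(clip, "cup")  # locate frames containing a cup
    if not cup_frames:
        return "No cup is visible in the video."
    # Query the first frame in which the cup appears.
    return vqa(clip.frames[cup_frames[0]], "What color is the cup?")
```

In such a setup, each module call is backed by a pretrained vision model, and executing the generated program yields the final answer, matching the abstract's description of generating programs from the question and a module API and then executing them.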
Related papers
- Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering [7.429984955853609]
We present Q-ViD, a simple approach for video question answering (video QA).
Q-ViD relies on a single instruction-aware open vision-language model (InstructBLIP) to tackle video QA using frame descriptions.
arXiv Detail & Related papers (2024-02-16T13:59:07Z)
- VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models [96.55004961251889]
Video Instruction Diffusion (VIDiff) is a unified foundation model designed for a wide range of video tasks.
Our model can edit and translate the desired results within seconds based on user instructions.
We provide convincing generative results for diverse input videos and written instructions, both qualitatively and quantitatively.
arXiv Detail & Related papers (2023-11-30T18:59:52Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- Locate before Answering: Answer Guided Question Localization for Video Question Answering [70.38700123685143]
LocAns integrates a question locator and an answer predictor into an end-to-end model.
It achieves state-of-the-art performance on two modern long-term VideoQA datasets.
arXiv Detail & Related papers (2022-10-05T08:19:16Z)
- Video Question Answering with Iterative Video-Text Co-Tokenization [77.66445727743508]
We propose a novel multi-stream video encoder for video question answering.
We experimentally evaluate the model on several datasets, such as MSRVTT-QA, MSVD-QA, and IVQA.
Our model reduces the required GFLOPs from 150-360 to only 67, producing a highly efficient video question answering model.
arXiv Detail & Related papers (2022-08-01T15:35:38Z)
- Learning to Answer Visual Questions from Web Videos [89.71617065426146]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
For a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language bias and high-quality manual annotations.
arXiv Detail & Related papers (2022-05-10T16:34:26Z)
- End-to-End Video Question-Answer Generation with Generator-Pretester Network [27.31969951281815]
We study a novel task, Video Question-Answer Generation (VQAG), for the challenging Video Question Answering (Video QA) task in multimedia.
As captions neither fully represent a video nor are they always practically available, it is crucial to generate question-answer pairs from the video itself via Video Question-Answer Generation (VQAG).
We evaluate our system on the only two available large-scale human-annotated Video QA datasets and achieve state-of-the-art question generation performance.
arXiv Detail & Related papers (2021-01-05T10:46:06Z)
- Just Ask: Learning to Answer Questions from Millions of Narrated Videos [97.44376735445454]
We propose to avoid manual annotation and generate a large-scale training dataset for video question answering.
We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations.
We show our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA.
arXiv Detail & Related papers (2020-12-01T12:59:20Z)
- Video Question Answering on Screencast Tutorials [43.00474548031818]
We introduce a dataset of question, answer, and context triples drawn from tutorial videos for a software product.
A one-shot recognition algorithm is designed to extract visual cues, which helps enhance the performance of video question answering.
arXiv Detail & Related papers (2020-08-02T19:27:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.