MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
- URL: http://arxiv.org/abs/2404.06511v1
- Date: Tue, 9 Apr 2024 17:59:31 GMT
- Title: MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
- Authors: Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, Cordelia Schmid
- Abstract summary: This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework.
Unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory.
- Score: 101.25249395748794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage, modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However, through a simple and effective baseline, we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus, unlike traditional single-stage planning methods, we propose a multi-stage system consisting of an event parser, a grounding stage, and a final reasoning stage in conjunction with an external memory. All stages are training-free, and performed using few-shot prompting of large models, creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity, our method, MoReVQA, improves over prior work on standard videoQA benchmarks (NExT-QA, iVQA, EgoSchema, ActivityNet-QA) with state-of-the-art results, and extensions to related tasks (grounded videoQA, paragraph captioning).
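The abstract describes a three-stage, training-free pipeline whose stages communicate through an external memory. Below is a minimal, hypothetical sketch of how such a pipeline could be wired together; the function names (`answer_video_question`, `llm`, `grounder`) and prompt wording are illustrative assumptions, not the authors' implementation.
```python
# Hypothetical sketch of a MoReVQA-style multi-stage pipeline (NOT the
# authors' code): an event parser, a grounding stage, and a reasoning
# stage share an external memory, and each stage is a training-free,
# few-shot-prompted call to a large model.

from typing import Callable

def answer_video_question(
    video_frames: list,
    question: str,
    llm: Callable[[str], str],              # few-shot-prompted language model
    grounder: Callable[[list, str], list],  # vision model that localizes events
) -> str:
    memory: dict = {}  # external memory shared across all stages

    # Stage 1: event parsing -- decompose the question into the events it
    # asks about, without consulting the video.
    memory["events"] = llm(f"List the events this question asks about: {question}")

    # Stage 2: grounding -- localize the parsed events in the video and keep
    # the relevant frames/segments as evidence.
    memory["evidence"] = grounder(video_frames, memory["events"])

    # Stage 3: reasoning -- answer from the grounded evidence; the memory
    # contents remain inspectable, giving interpretable intermediate outputs.
    return llm(
        f"Question: {question}\nEvidence: {memory['evidence']}\nAnswer:"
    )
```
Keeping each stage a separate prompted call is what decomposes the planning complexity: a failure can be traced to a specific stage's intermediate output in memory rather than to one monolithic plan.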
Related papers
- Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos [35.974750867072345]
This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos.
We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence.
We then propose a novel architecture, termed as Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models.
arXiv Detail & Related papers (2024-08-26T17:58:47Z)
- Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering
We present Q-ViD, a simple approach for video question answering (video QA).
Q-ViD relies on a single instruction-aware open vision-language model (InstructBLIP) to tackle videoQA using frame descriptions.
arXiv Detail & Related papers (2024-02-16T13:59:07Z)
- STAIR: Spatial-Temporal Reasoning with Auditable Intermediate Results for Video Question Answering [42.173245795917026]
We propose STAIR, a Spatial-Temporal Reasoning model with Auditable Intermediate Results for video question answering.
STAIR is a neural module network that contains a program generator to decompose a given question into a hierarchical combination of sub-tasks.
We conduct extensive experiments on several video question answering datasets to show STAIR's performance, explainability, compatibility with pre-trained models, and applicability when program annotations are not available.
arXiv Detail & Related papers (2024-01-08T14:01:59Z)
- RTQ: Rethinking Video-language Understanding Based on Image-text Model [55.278942477715084]
Video-language understanding presents unique challenges due to the inclusion of highly complex semantic details.
We propose a novel framework called RTQ, which addresses these challenges simultaneously.
Our model demonstrates outstanding performance even in the absence of video-language pre-training.
arXiv Detail & Related papers (2023-12-01T04:51:01Z)
- Faithful Question Answering with Monte-Carlo Planning [78.02429369951363]
We propose FAME (FAithful question answering with MontE-carlo planning) to answer questions based on faithful reasoning steps.
We formulate the task as a discrete decision-making problem and solve it through the interaction of a reasoning environment and a controller.
FAME achieves state-of-the-art performance on the standard benchmark.
arXiv Detail & Related papers (2023-05-04T05:21:36Z)
- MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling [7.737755720567113]
This paper proposes MuLTI, a highly accurate and efficient video-and-language understanding model.
We design a Text-Guided MultiWay-Sampler based on adapt-pooling residual mapping and self-attention modules.
We also propose a new pretraining task named Multiple Choice Modeling.
arXiv Detail & Related papers (2023-03-10T05:22:39Z)
- mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video [89.19867891570945]
mPLUG-2 is a new unified paradigm with modularized design for multi-modal pretraining.
It shares common universal modules for modality collaboration while disentangling modality-specific modules to handle modality entanglement.
It is flexible to select different modules for different understanding and generation tasks across all modalities including text, image, and video.
arXiv Detail & Related papers (2023-02-01T12:40:03Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models [61.480085460269514]
We propose a framework for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models.
We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator.
arXiv Detail & Related papers (2020-09-01T23:45:42Z)
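As a concrete illustration of the decomposition described in the Text Modular Networks entry above, here is a minimal, hypothetical sketch of a ModularQA-style system: a multi-hop question is split into sub-questions routed to two existing solvers, a factoid QA model and a symbolic calculator. All names (`modular_answer`, `factoid_qa`, the `#i` placeholder convention) are illustrative assumptions, not the paper's implementation.
```python
import operator

# Hypothetical ModularQA-style decomposition: sub-questions go to a factoid
# QA sub-solver; an arithmetic step goes to a symbolic calculator.

OPS = {"+": operator.add, "-": operator.sub, "diff": lambda a, b: abs(a - b)}

def calculator(op: str, a: str, b: str) -> str:
    """Symbolic sub-solver for simple arithmetic over earlier answers."""
    return str(OPS[op](float(a), float(b)))

def factoid_qa(question: str, knowledge: dict) -> str:
    """Stand-in for a neural single-span QA model (a lookup table here)."""
    return knowledge[question]

def modular_answer(steps: list, knowledge: dict) -> str:
    """Execute a decomposition; '#i' tokens refer to the i-th earlier answer."""
    answers: list[str] = []
    for kind, payload in steps:
        if kind == "qa":
            answers.append(factoid_qa(payload, knowledge))
        else:  # "calc": payload is (op, arg, arg)
            op, a, b = (answers[int(t[1:])] if t.startswith("#") else t
                        for t in payload)
            answers.append(calculator(op, a, b))
    return answers[-1]

# Example: "How many years apart were Einstein and Tesla born?"
knowledge = {"When was Einstein born?": "1879", "When was Tesla born?": "1856"}
steps = [("qa", "When was Einstein born?"),
         ("qa", "When was Tesla born?"),
         ("calc", ("diff", "#0", "#1"))]
print(modular_answer(steps, knowledge))  # -> 23.0
```
The appeal of this pattern, as the abstract notes, is interpretability: each sub-question and intermediate answer is a human-readable artifact, so errors can be localized to a specific sub-solver.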
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.