Glance and Focus: Memory Prompting for Multi-Event Video Question
Answering
- URL: http://arxiv.org/abs/2401.01529v1
- Date: Wed, 3 Jan 2024 03:51:16 GMT
- Title: Glance and Focus: Memory Prompting for Multi-Event Video Question
Answering
- Authors: Ziyi Bai, Ruiping Wang, Xilin Chen
- Abstract summary: VideoQA has emerged as a vital tool to evaluate agents' ability to understand human daily behaviors.
Humans can easily tackle it by using a series of episode memories as anchors to quickly locate question-related key moments for reasoning.
We propose the Glance-Focus model to mimic this effective reasoning strategy.
- Score: 36.00733800536469
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video Question Answering (VideoQA) has emerged as a vital tool to evaluate
agents' ability to understand human daily behaviors. Despite the recent success
of large vision-language models in many multi-modal tasks, complex situation
reasoning over videos involving multiple human-object interaction events
remains challenging. In contrast, humans can easily tackle it by using a series
of episode memories as anchors to quickly locate question-related key moments
for reasoning. To mimic this effective reasoning strategy, we propose the
Glance-Focus model. One simple approach is to apply an action detection model to
predict a set of actions as key memories. However, actions drawn from a
closed-set vocabulary generalize poorly across video domains. Instead, we
train an encoder-decoder to generate a set of dynamic event memories at the
glancing stage. Apart from using supervised bipartite matching to obtain the
event memories, we further design an unsupervised memory generation method
that removes the dependence on event annotations. Next, at the
focusing stage, these event memories act as a bridge that links questions,
which refer to high-level event concepts, to the low-level, lengthy video
content. Given the question, the model first focuses on the
generated key event memory, then focuses on the most relevant moment for
reasoning through our designed multi-level cross-attention mechanism. We
conduct extensive experiments on four Multi-Event VideoQA benchmarks including
STAR, EgoTaskQA, AGQA, and NExT-QA. Our proposed model achieves
state-of-the-art results, surpassing current large models in various
challenging reasoning tasks. The code and models are available at
https://github.com/ByZ0e/Glance-Focus.
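The abstract mentions supervised bipartite matching between generated event memories and annotated events. The sketch below illustrates that general idea only; it is not the authors' implementation. The function name `match_events`, the `(label, start, end)` event format, and the cost function (label mismatch plus span distance) are all assumptions made for illustration, and the brute-force search stands in for a proper assignment solver.

```python
from itertools import permutations


def match_events(pred, gold):
    """Find the minimum-cost one-to-one assignment of gold events to predicted slots.

    pred, gold: lists of (label, start, end) tuples, times normalised to [0, 1].
    Assumes len(pred) >= len(gold). Brute force, so only suitable for a
    handful of memory slots (hypothetical setup, not the paper's code).
    """
    def cost(p, g):
        # Hypothetical cost: 1 for a label mismatch plus L1 distance on the span.
        label_cost = 0.0 if p[0] == g[0] else 1.0
        span_cost = abs(p[1] - g[1]) + abs(p[2] - g[2])
        return label_cost + span_cost

    best_total, best_assign = float("inf"), None
    # Try every way of picking one distinct predicted slot per gold event.
    for slots in permutations(range(len(pred)), len(gold)):
        total = sum(cost(pred[s], g) for s, g in zip(slots, gold))
        if total < best_total:
            best_total, best_assign = total, slots
    return list(best_assign), best_total


# Three predicted memory slots, two annotated events (made-up values).
pred = [("open door", 0.50, 0.70), ("pick up cup", 0.05, 0.30), ("sit down", 0.80, 0.95)]
gold = [("pick up cup", 0.00, 0.25), ("open door", 0.55, 0.75)]
assignment, total = match_events(pred, gold)
print(assignment, round(total, 2))
```

Here `assignment[i]` is the index of the predicted slot matched to `gold[i]`; unmatched slots (slot 2 in this example) would typically be treated as "no event" during training.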
Related papers
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
- Event-aware Video Corpus Moment Retrieval [79.48249428428802]
Video Corpus Moment Retrieval (VCMR) is a practical video retrieval task focused on identifying a specific moment within a vast corpus of untrimmed videos.
Existing methods for VCMR typically rely on frame-aware video retrieval, calculating similarities between the query and video frames to rank videos.
We propose EventFormer, a model that explicitly utilizes events within videos as fundamental units for video retrieval.
arXiv Detail & Related papers (2024-02-21T06:55:20Z)
- Semantic-aware Dynamic Retrospective-Prospective Reasoning for Event-level Video Question Answering [14.659023742381777]
Event-Level Video Question Answering (EVQA) requires complex reasoning across video events to provide optimal answers.
We propose a semantic-aware dynamic retrospective-prospective reasoning approach for video-based question answering.
Our proposed approach achieves superior performance compared to previous state-of-the-art models.
arXiv Detail & Related papers (2023-05-14T03:57:11Z)
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)
- MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering [73.61182342844639]
We introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA.
MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules.
Visual concepts at different granularities are then processed efficiently through an attention module.
arXiv Detail & Related papers (2022-12-19T15:05:40Z)
- Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos [92.18898962396042]
We propose a prompt-based framework, Bridge-Prompt, to model the semantics across adjacent actions.
We reformulate the individual action labels as integrated text prompts for supervision, which bridge the gap between individual action semantics.
Br-Prompt achieves state-of-the-art on multiple benchmarks.
arXiv Detail & Related papers (2022-03-26T15:52:27Z)
- iPerceive: Applying Common-Sense Reasoning to Multi-Modal Dense Video Captioning and Video Question Answering [0.0]
We propose iPerceive, a framework capable of understanding the "why" between events in a video.
We demonstrate the effectiveness of iPerceive and VideoQA as machine translation problems.
Our approach furthers the state-of-the-art in visual understanding.
arXiv Detail & Related papers (2020-11-16T05:44:45Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass on the more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.