Self-Chained Image-Language Model for Video Localization and Question Answering
- URL: http://arxiv.org/abs/2305.06988v2
- Date: Wed, 29 Nov 2023 21:24:35 GMT
- Title: Self-Chained Image-Language Model for Video Localization and Question Answering
- Authors: Shoubin Yu, Jaemin Cho, Prateek Yadav, Mohit Bansal
- Abstract summary: We propose the Self-Chained Video Localization-Answering (SeViLA) framework to tackle both temporal localization and QA on videos.
SeViLA consists of two modules, a Localizer and an Answerer, both parameter-efficiently fine-tuned from BLIP-2.
- Score: 66.86740990630433
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent studies have shown promising results on utilizing large pre-trained
image-language models for video question answering. While these image-language
models can efficiently bootstrap the representation learning of video-language
models, they typically concatenate uniformly sampled video frames as visual
inputs without explicit language-aware, temporal modeling. When only a portion
of a video input is relevant to the language query, such uniform frame sampling
can often lead to missing important visual cues. Although humans often find a
video moment to focus on and rewind the moment to answer questions, training a
query-aware video moment localizer often requires expensive annotations and
high computational costs. To address this issue, we propose Self-Chained Video
Localization-Answering (SeViLA), a novel framework that leverages a single
image-language model (BLIP-2) to tackle both temporal keyframe localization and
QA on videos. The SeViLA framework consists of two modules, a Localizer and an
Answerer, both parameter-efficiently fine-tuned from BLIP-2. We propose two
ways of chaining these modules for cascaded inference and self-refinement.
First, in the forward chain, the Localizer finds multiple language-aware
keyframes in a video, which the Answerer uses to predict the answer. Second, in
the reverse chain, the Answerer generates keyframe pseudo-labels to refine the
Localizer, alleviating the need for expensive video moment localization
annotations. Our SeViLA framework outperforms several strong baselines on 5
challenging video QA and event prediction benchmarks, and achieves the
state-of-the-art in both fine-tuning (NExT-QA, STAR) and zero-shot (NExT-QA,
STAR, How2QA, VLEP) settings. We also analyze the impact of the Localizer,
compare it with other temporal localization models, and examine the effects of
Localizer pre-training/self-refinement and of varying the number of keyframes.
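The two chaining directions can be read as a simple control flow. Below is a minimal Python sketch of that flow, assuming hypothetical `localizer` and `answerer` callables (frame-relevance scoring and answer generation, respectively) that stand in for the two BLIP-2-derived modules; it illustrates the cascade and the pseudo-labeling idea, not the authors' actual implementation or exact labeling criterion.

```python
# Sketch of SeViLA's forward and reverse chains. The localizer/answerer
# interfaces are hypothetical stand-ins for the two BLIP-2-derived modules.
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence

Frame = Any  # placeholder for a decoded video frame (e.g., an image tensor)


@dataclass
class SeViLAChainSketch:
    # Scores every frame for relevance to the language query (higher = more relevant).
    localizer: Callable[[Sequence[Frame], str], List[float]]
    # Generates an answer string from a set of keyframes and the question.
    answerer: Callable[[Sequence[Frame], str], str]
    num_keyframes: int = 4

    def forward_chain(self, frames: Sequence[Frame], question: str) -> str:
        """Localizer picks language-aware keyframes; Answerer answers from them."""
        scores = self.localizer(frames, question)
        ranked = sorted(range(len(frames)), key=lambda i: scores[i], reverse=True)
        keep = sorted(ranked[: self.num_keyframes])  # restore temporal order
        keyframes = [frames[i] for i in keep]
        return self.answerer(keyframes, question)

    def reverse_chain(
        self, frames: Sequence[Frame], question: str, gold_answer: str
    ) -> List[int]:
        """Answerer marks frames from which it answers correctly as keyframe
        pseudo-labels, which can then supervise the Localizer without manual
        moment annotations (the correctness test here is an assumption)."""
        return [
            i
            for i, frame in enumerate(frames)
            if self.answerer([frame], question) == gold_answer
        ]
```

In this reading, the forward chain is pure inference (select keyframes, then answer), while the reverse chain turns the Answerer's own correctness into weak supervision for the Localizer, which is what removes the need for expensive moment-localization annotations.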
Related papers
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
Vision-Language Models (VLMs) have shown remarkable capabilities across diverse visual tasks.
However, current VLMs lack a fundamental cognitive ability: learning to localize objects in a scene while taking the context into account.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- VideoDistill: Language-aware Vision Distillation for Video Question Answering [24.675876324457747]
We propose VideoDistill, a framework with language-aware (i.e., goal-driven) behavior in both vision perception and answer generation process.
VideoDistill generates answers only from question-related visual embeddings.
We conduct experimental evaluations on various challenging video question-answering benchmarks, and VideoDistill achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-04-01T07:44:24Z)
- Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment [10.567291051485194]
We propose ZeroTA, a novel method for dense video captioning in a zero-shot manner.
Our method does not require any videos or annotations for training; instead, it localizes and describes events within each input video at test time.
arXiv Detail & Related papers (2023-07-05T23:01:26Z)
- Revealing Single Frame Bias for Video-and-Language Learning [115.01000652123882]
We show that a single-frame trained model can achieve better performance than existing methods that use multiple frames for training.
This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets.
We propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling.
arXiv Detail & Related papers (2022-06-07T16:28:30Z)
- Rethinking Multi-Modal Alignment in Video Question Answering from Feature and Sample Perspectives [30.666823939595627]
This paper reconsiders the multi-modal alignment problem in VideoQA from feature and sample perspectives.
We adopt a heterogeneous graph architecture and design a hierarchical framework to align both trajectory-level and frame-level visual features with language features.
Our method outperforms all the state-of-the-art models on the challenging NExT-QA benchmark.
arXiv Detail & Related papers (2022-04-25T10:42:07Z)
- Fine-grained Iterative Attention Network for Temporal Language Localization in Videos [63.94898634140878]
Temporal language localization in videos aims to ground one video segment in an untrimmed video based on a given sentence query.
We propose a Fine-grained Iterative Attention Network (FIAN) that consists of an iterative attention module for bilateral query-video information extraction.
We evaluate the proposed method on three challenging public benchmarks: ActivityNet Captions, TACoS, and Charades-STA.
arXiv Detail & Related papers (2020-08-06T04:09:03Z)
- Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass more relevant information.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.