Natural Language Video Localization: A Revisit in Span-based Question Answering Framework
- URL: http://arxiv.org/abs/2102.13558v3
- Date: Tue, 2 Mar 2021 09:42:19 GMT
- Title: Natural Language Video Localization: A Revisit in Span-based Question Answering Framework
- Authors: Hao Zhang, Aixin Sun, Wei Jing, Liangli Zhen, Joey Tianyi Zhou, Rick Siow Mong Goh
- Abstract summary: Natural Language Video Localization (NLVL) aims to locate a target moment from an untrimmed video that semantically corresponds to a text query.
Existing approaches mainly solve the NLVL problem from the perspective of computer vision.
We address NLVL from a new perspective, i.e., span-based question answering (QA), by treating the input video as a text passage.
- Score: 56.649826885121264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural Language Video Localization (NLVL) aims to locate a target moment
from an untrimmed video that semantically corresponds to a text query. Existing
approaches mainly solve the NLVL problem from the perspective of computer
vision by formulating it as ranking, anchor, or regression tasks. These methods
suffer from large performance degradation when localizing on long videos. In
this work, we address NLVL from a new perspective, i.e., span-based
question answering (QA), by treating the input video as a text passage. We
propose a video span localizing network (VSLNet), on top of the standard
span-based QA framework (named VSLBase), to address NLVL. VSLNet tackles the
differences between NLVL and span-based QA through a simple yet effective
query-guided highlighting (QGH) strategy. QGH guides VSLNet to search for the
matching video span within a highlighted region. To address the performance
degradation on long videos, we further extend VSLNet to VSLNet-L by applying a
multi-scale split-and-concatenation strategy. VSLNet-L first splits the
untrimmed video into short clip segments; then, it predicts which clip segment
contains the target moment and suppresses the importance of other segments.
Finally, the clip segments are concatenated, with different confidences, to
locate the target moment accurately. Extensive experiments on three benchmark
datasets show that the proposed VSLNet and VSLNet-L outperform the
state-of-the-art methods; VSLNet-L addresses the issue of performance
degradation on long videos. Our study suggests that the span-based QA framework
is an effective strategy to solve the NLVL problem.
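To make the span-based QA formulation concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: clip-level video features play the role of a text passage, two heads predict the start and end boundaries of the matching span, and a sigmoid gate stands in for the query-guided highlighting (QGH) that suppresses clips outside the highlighted region. All module names, dimensions, and the attention-based fusion are illustrative assumptions.

```python
# Minimal sketch of the span-based QA view of NLVL (illustrative, not the paper's code).
import torch
import torch.nn as nn


class SpanLocalizerSketch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Cross-attention stands in for the context-query fusion; an assumption here.
        self.fuse = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.highlight = nn.Linear(dim, 1)   # QGH-style foreground score per clip
        self.start_head = nn.Linear(dim, 1)  # start-boundary logits
        self.end_head = nn.Linear(dim, 1)    # end-boundary logits

    def forward(self, video_feats, query_feats):
        # video_feats: (B, T, dim) clip-level features; query_feats: (B, L, dim) query tokens
        fused, _ = self.fuse(video_feats, query_feats, query_feats)
        gate = torch.sigmoid(self.highlight(fused))        # (B, T, 1) highlighted region
        gated = fused * gate                               # down-weight irrelevant clips
        start_logits = self.start_head(gated).squeeze(-1)  # (B, T)
        end_logits = self.end_head(gated).squeeze(-1)      # (B, T)
        return start_logits, end_logits, gate.squeeze(-1)


if __name__ == "__main__":
    model = SpanLocalizerSketch()
    v = torch.randn(2, 128, 256)   # 128 video clips
    q = torch.randn(2, 12, 256)    # 12 query tokens
    s, e, h = model(v, q)
    # Predicted span boundaries via argmax (the start <= end constraint is ignored here).
    print(s.argmax(-1), e.argmax(-1))
```

In the spirit of VSLNet-L, the same span heads could be applied per clip segment, with segment-level confidences re-weighting the concatenated features before boundary prediction; that multi-scale split-and-concatenation step is omitted from the sketch for brevity.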
Related papers
- PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance [44.08446730529495]
We propose a novel pooling strategy that simultaneously achieves token compression and instruction-aware visual feature aggregation.
Our model is termed Prompt-guided Pooling LLaVA, or PPLLaVA for short.
arXiv Detail & Related papers (2024-11-04T17:50:36Z)
- Too Many Frames, Not All Useful: Efficient Strategies for Long-Form Video QA [40.54207548074378]
Long-form videos that span wide temporal intervals are highly redundant in information.
All information necessary to generate a correct response can often be contained within a small subset of frames.
arXiv Detail & Related papers (2024-06-13T17:59:16Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- A Simple LLM Framework for Long-Range Video Question-Answering [63.50439701867275]
We present LLoVi, a language-based framework for long-range video question-answering (LVQA).
Our approach uses a frame/clip-level visual captioner coupled with a Large Language Model (GPT-3.5, GPT-4).
Our method achieves 50.3% accuracy, outperforming the previous best-performing approach by 18.1% (absolute gain).
arXiv Detail & Related papers (2023-12-28T18:58:01Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- VTimeLLM: Empower LLM to Grasp Video Moments [43.51980030572101]
Large language models (LLMs) have shown remarkable text understanding capabilities.
Video LLMs can only provide a coarse description of the entire video.
We propose VTimeLLM, a novel Video LLM for fine-grained video moment understanding.
arXiv Detail & Related papers (2023-11-30T10:49:56Z)
- HierVL: Learning Hierarchical Video-Language Embeddings [108.77600799637172]
HierVL is a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations.
We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level.
Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart as well as a long-term video representation that achieves SotA.
arXiv Detail & Related papers (2023-01-05T21:53:19Z)
- Span-based Localizing Network for Natural Language Video Localization [60.54191298092136]
Given an untrimmed video and a text query, natural language video localization (NLVL) aims to locate a matching span from the video that semantically corresponds to the query.
We propose a video span localizing network (VSLNet) to address NLVL.
arXiv Detail & Related papers (2020-04-29T02:47:04Z)