Span-based Localizing Network for Natural Language Video Localization
- URL: http://arxiv.org/abs/2004.13931v2
- Date: Sun, 14 Jun 2020 08:49:07 GMT
- Title: Span-based Localizing Network for Natural Language Video Localization
- Authors: Hao Zhang, Aixin Sun, Wei Jing, Joey Tianyi Zhou
- Abstract summary: Given an untrimmed video and a text query, natural language video localization (NLVL) aims to locate a span in the video that semantically corresponds to the query.
We propose a video span localizing network (VSLNet) to address NLVL.
- Score: 60.54191298092136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an untrimmed video and a text query, natural language video
localization (NLVL) aims to locate a span in the video that semantically
corresponds to the query. Existing solutions formulate NLVL either as a ranking
task, applying a multimodal matching architecture, or as a regression task that
directly regresses the target video span. In this work, we address the NLVL
task with a span-based QA approach by treating the input video as a text
passage. We propose a video span localizing network (VSLNet), built on top of
the standard span-based QA framework, to address NLVL. The proposed VSLNet
tackles the differences between NLVL and span-based QA through a simple yet
effective query-guided highlighting (QGH) strategy. QGH guides VSLNet to search
for the matching video span within a highlighted region. Through extensive
experiments on three benchmark datasets, we show that the proposed VSLNet
outperforms state-of-the-art methods and that adopting a span-based QA
framework is a promising direction for solving NLVL.
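The abstract frames NLVL as span-based QA: video frames play the role of passage tokens, the model predicts start/end boundaries of the matching span, and QGH restricts the search to a highlighted region. The following is a minimal sketch of that general formulation, not the authors' VSLNet implementation; the module names, feature dimensions, dot-product fusion, and the specific form of the highlighting term are assumptions for illustration only.

```python
# Illustrative span-based QA sketch for NLVL (hypothetical, not VSLNet itself).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpanLocalizerSketch(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.highlight = nn.Linear(2 * dim, 1)   # QGH-style per-frame foreground score
        self.start_head = nn.Linear(2 * dim, 1)  # start-boundary logits
        self.end_head = nn.Linear(2 * dim, 1)    # end-boundary logits

    def forward(self, video_feats, query_feats):
        # video_feats: (B, T, D) frame features; query_feats: (B, L, D) word features.
        # Simple dot-product cross-attention: each frame attends to the query words.
        attn = torch.softmax(video_feats @ query_feats.transpose(1, 2), dim=-1)  # (B, T, L)
        query_ctx = attn @ query_feats                                            # (B, T, D)
        fused = torch.cat([video_feats, query_ctx], dim=-1)                       # (B, T, 2D)

        # Highlighting: a per-frame foreground probability that re-weights features,
        # so boundary prediction concentrates on the highlighted region.
        h_score = torch.sigmoid(self.highlight(fused))                            # (B, T, 1)
        fused = fused * h_score

        start_logits = self.start_head(fused).squeeze(-1)                         # (B, T)
        end_logits = self.end_head(fused).squeeze(-1)                             # (B, T)
        return start_logits, end_logits, h_score.squeeze(-1)


def span_loss(start_logits, end_logits, h_score, start_idx, end_idx, fg_mask):
    # Cross-entropy on the boundary positions plus a BCE term on the highlight
    # scores, mirroring the usual span-QA objective with an added highlighting loss.
    return (F.cross_entropy(start_logits, start_idx)
            + F.cross_entropy(end_logits, end_idx)
            + F.binary_cross_entropy(h_score, fg_mask))


if __name__ == "__main__":
    B, T, L, D = 2, 128, 12, 256
    model = SpanLocalizerSketch(dim=D)
    s, e, h = model(torch.randn(B, T, D), torch.randn(B, L, D))
    loss = span_loss(s, e, h,
                     torch.randint(0, T, (B,)), torch.randint(0, T, (B,)),
                     torch.rand(B, T))
    loss.backward()
    print(s.shape, e.shape, h.shape, float(loss))
```

At inference time, as in standard span-based QA, the predicted moment would be the (start, end) frame pair that maximizes the joint boundary scores subject to start <= end.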
Related papers
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting [83.21164539349273]
Pre-trained language models (PLMs) have played an increasing role in multimedia research.
In this paper, we focus on exploring PLMs as a stand-alone model for vision-language reasoning tasks.
We propose a novel transfer learning approach for PLMs, termed Dynamic Visual Prompting (DVP).
arXiv Detail & Related papers (2023-06-01T07:19:28Z)
- GLIPv2: Unifying Localization and Vision-Language Understanding [161.1770269829139]
We present GLIPv2, a grounded Vision-Language (VL) understanding model that serves both localization tasks and VL understanding tasks.
GLIPv2 unifies localization pre-training and Vision-Language Pre-training with three pre-training tasks.
We show that a single GLIPv2 model achieves near SoTA performance on various localization and understanding tasks.
arXiv Detail & Related papers (2022-06-12T20:31:28Z)
- Natural Language Video Localization: A Revisit in Span-based Question Answering Framework [56.649826885121264]
Natural Language Video Localization (NLVL) aims to locate a target moment from an untrimmed video that semantically corresponds to a text query.
Existing approaches mainly solve the NLVL problem from the perspective of computer vision.
We address NLVL from a new perspective, i.e., span-based question answering (QA), by treating the input video as a text passage.
arXiv Detail & Related papers (2021-02-26T15:57:59Z)
- VLG-Net: Video-Language Graph Matching Network for Video Grounding [57.6661145190528]
Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query.
We recast this challenge into an algorithmic graph matching problem.
We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets.
arXiv Detail & Related papers (2020-11-19T22:32:03Z)