Skimming, Locating, then Perusing: A Human-Like Framework for Natural
Language Video Localization
- URL: http://arxiv.org/abs/2207.13450v1
- Date: Wed, 27 Jul 2022 10:59:33 GMT
- Title: Skimming, Locating, then Perusing: A Human-Like Framework for Natural
Language Video Localization
- Authors: Daizong Liu, Wei Hu
- Abstract summary: We propose a two-step human-like framework called Skimming-Locating-Perusing.
SLP consists of a Skimming-and-Locating (SL) module and a Bi-directional Perusing (BP) module.
Our SLP is superior to the state-of-the-art methods and localizes more precise segment boundaries.
- Score: 19.46938403691984
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses the problem of natural language video localization
(NLVL). Almost all existing works follow the "only look once" framework that
exploits a single model to directly capture the complex cross- and self-modal
relations among video-query pairs and retrieve the relevant segment. However,
we argue that these methods have overlooked two indispensable characteristics
of an ideal localization method: 1) Frame-differentiable: considering the
imbalance of positive/negative video frames, it is effective to highlight
positive frames and weaken negative ones during the localization. 2)
Boundary-precise: to predict the exact segment boundary, the model should
capture more fine-grained differences between consecutive frames since their
variations are often smooth. To this end, inspired by how humans perceive and
localize a segment, we propose a two-step human-like framework called
Skimming-Locating-Perusing (SLP). SLP consists of a Skimming-and-Locating (SL)
module and a Bi-directional Perusing (BP) module. The SL module first refers to
the query semantic and selects the best matched frame from the video while
filtering out irrelevant frames. Then, the BP module constructs an initial
segment based on this frame, and dynamically updates it by exploring its
adjacent frames until no frame shares the same activity semantic. Experimental
results on three challenging benchmarks show that our SLP is superior to the
state-of-the-art methods and localizes more precise segment boundaries.
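To make the two-step procedure concrete, below is a minimal, illustrative Python sketch (not the authors' implementation). It assumes per-frame query-relevance scores produced by some cross-modal encoder, and a simple score threshold stands in for the paper's learned check of whether an adjacent frame shares the same activity semantic; the function name and threshold are hypothetical.

    from typing import List, Tuple


    def localize_segment(frame_query_scores: List[float],
                         semantic_threshold: float = 0.5) -> Tuple[int, int]:
        """Return (start, end) frame indices of the localized segment.

        1) Skimming-and-Locating: pick the frame that best matches the query,
           implicitly filtering out low-scoring (negative) frames.
        2) Bi-directional Perusing: grow an initial one-frame segment outward
           in both directions while adjacent frames still appear to share the
           activity semantic (approximated here by a score threshold).
        """
        # Skimming-and-Locating: the best matched frame serves as the anchor.
        anchor = max(range(len(frame_query_scores)),
                     key=frame_query_scores.__getitem__)

        # Bi-directional Perusing: expand left and right from the anchor.
        start = end = anchor
        while start > 0 and frame_query_scores[start - 1] >= semantic_threshold:
            start -= 1
        while (end < len(frame_query_scores) - 1
               and frame_query_scores[end + 1] >= semantic_threshold):
            end += 1
        return start, end


    if __name__ == "__main__":
        # Toy per-frame relevance scores for a short video.
        scores = [0.1, 0.2, 0.7, 0.9, 0.8, 0.6, 0.2, 0.1]
        print(localize_segment(scores))  # -> (2, 5)

In the actual SLP framework both steps are driven by learned cross-modal representations rather than a fixed threshold; the sketch only illustrates the anchor-then-expand control flow described in the abstract.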
Related papers
- Self-Chained Image-Language Model for Video Localization and Question
Answering [66.86740990630433]
We propose the Self-Chained Video Localization-Answering (SeViLA) framework to tackle both temporal localization and QA on videos.
The SeViLA framework consists of two modules, Localizer and Answerer, both parameter-efficiently fine-tuned from BLIP-2.
arXiv Detail & Related papers (2023-05-11T17:23:00Z)
- Rethinking the Video Sampling and Reasoning Strategies for Temporal Sentence Grounding [64.99924160432144]
Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames.
arXiv Detail & Related papers (2023-01-02T03:38:22Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in the sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed as DCNet) which explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
- TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation [50.49396123016185]
Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames.
We propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI)
Our method outperforms other state-of-the-art methods in four widely-used VFI benchmarks.
arXiv Detail & Related papers (2022-07-19T03:37:49Z)
- Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation [46.02120172459727]
We propose to relax the requirement of reconstructing an intermediate frame as close to the ground-truth (GT) as possible.
We develop a texture consistency loss (TCL) upon the assumption that the interpolated content should maintain similar structures with their counterparts in the given frames.
arXiv Detail & Related papers (2022-03-19T10:37:06Z)
- Towards Tokenized Human Dynamics Representation [41.75534387530019]
We study how to segment and cluster videos into recurring temporal patterns in a self-supervised way.
We evaluate the frame-wise representation learning step by Kendall's Tau and the lexicon building step by normalized mutual information and language entropy.
On the AIST++ and PKU-MMD datasets, actons bring significant performance improvements compared to several baselines.
arXiv Detail & Related papers (2021-11-22T18:59:58Z)
- Multi-Scale Local-Temporal Similarity Fusion for Continuous Sign Language Recognition [4.059599144668737]
Continuous sign language recognition is a task of public significance that transcribes a sign language video into an ordered gloss sequence.
One promising way is to adopt a one-dimensional convolutional network (1D-CNN) to temporally fuse the sequential frames.
We propose to adaptively fuse local features via temporal similarity for this task.
arXiv Detail & Related papers (2021-07-27T12:06:56Z)
- TimeLens: Event-based Video Frame Interpolation [54.28139783383213]
We introduce Time Lens, a novel method that leverages the advantages of both synthesis-based and flow-based approaches.
We show an up to 5.21 dB improvement in terms of PSNR over state-of-the-art frame-based and event-based methods.
arXiv Detail & Related papers (2021-06-14T10:33:47Z)
- Boundary Proposal Network for Two-Stage Natural Language Video Localization [23.817486773852142]
Boundary Proposal Network (BPNet) is a universal two-stage framework for natural language video localization.
In the first stage, BPNet utilizes an anchor-free model to generate a group of quality candidate video segments with their boundaries.
In the second stage, a visual-language fusion layer is proposed to jointly model the multi-modal interaction between the candidate and the language query.
arXiv Detail & Related papers (2021-03-15T03:06:18Z)
- SF-Net: Single-Frame Supervision for Temporal Action Localization [60.202516362976645]
Single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead.
We propose a unified system called SF-Net to make use of such single-frame supervision.
SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization.
arXiv Detail & Related papers (2020-03-15T15:06:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information above and is not responsible for any consequences of its use.