Generation-Guided Multi-Level Unified Network for Video Grounding
- URL: http://arxiv.org/abs/2303.07748v1
- Date: Tue, 14 Mar 2023 09:48:59 GMT
- Title: Generation-Guided Multi-Level Unified Network for Video Grounding
- Authors: Xing Cheng, Xiangyu Wu, Dong Shen, Hezheng Lin, Fan Yang
- Abstract summary: Video grounding aims to locate the timestamps best matching the query description within an untrimmed video.
Moment-level approaches directly predict the probability that each transient moment is a boundary from a global perspective.
Clip-level ones aggregate the moments in different time windows into proposals and then deduce the most similar one, which gives them an advantage in fine-grained grounding.
- Score: 18.402093379973085
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video grounding aims to locate the timestamps best matching the query
description within an untrimmed video. Prevalent methods can be divided into
moment-level and clip-level frameworks. Moment-level approaches directly
predict the probability that each transient moment is a boundary from a
global perspective, and they usually perform better in coarse grounding. On
the other hand, clip-level ones aggregate the moments in different time
windows into proposals and then deduce the most similar one, which gives
them an advantage in fine-grained grounding. In this paper, we propose a
multi-level unified framework that enhances performance by leveraging the
merits of both moment-level and clip-level methods. Moreover, a novel
generation-guided paradigm is adopted at both levels: a multi-modal
generator produces implicit boundary features and clip features, which are
later used as queries by a discriminator to calculate boundary scores. The
generation-guided solution reframes video grounding from a matching task
between two separate modalities into a cross-modal attention task, which
steps beyond previous frameworks and obtains notable gains. The proposed
Generation-guided Multi-level Unified network (GMU) surpasses previous
methods and reaches state-of-the-art performance on benchmarks with
disparate features, e.g., Charades-STA and ActivityNet Captions.
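The abstract describes the generation-guided paradigm only at a high level, so the following is a minimal PyTorch sketch of the idea as stated: a multi-modal generator cross-attends the sentence query into the video to produce implicit boundary-query features, and a discriminator turns those queries into per-moment boundary scores through cross-modal attention. All module names, dimensions, and design details (BoundaryGenerator, BoundaryDiscriminator, the learnable start/end slots) are illustrative assumptions, not the authors' implementation.
```python
# Hypothetical sketch of the generation-guided idea from the abstract:
# a multi-modal generator produces implicit boundary-query features, and a
# discriminator converts them into per-moment boundary scores via attention.
import torch
import torch.nn as nn


class BoundaryGenerator(nn.Module):
    """Cross-attends the sentence query into the video to produce
    implicit boundary-query features (one learnable slot per boundary)."""

    def __init__(self, dim: int = 256, num_queries: int = 2, heads: int = 8):
        super().__init__()
        self.slots = nn.Parameter(torch.randn(num_queries, dim))  # e.g. start/end slots
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.video_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video_feats, text_feats):
        # video_feats: (B, T, D) moment-level features
        # text_feats:  (B, L, D) token-level query features
        B = video_feats.size(0)
        q = self.slots.unsqueeze(0).expand(B, -1, -1)        # (B, Q, D)
        q, _ = self.text_attn(q, text_feats, text_feats)     # absorb query semantics
        q, _ = self.video_attn(q, video_feats, video_feats)  # ground them in the video
        return q                                             # implicit boundary features


class BoundaryDiscriminator(nn.Module):
    """Scores each moment as a boundary by attending the generated
    queries over the moment features (a cross-modal attention match)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, boundary_queries, video_feats):
        # boundary_queries: (B, Q, D), video_feats: (B, T, D)
        scores = torch.einsum("bqd,btd->bqt", boundary_queries, video_feats) * self.scale
        return scores  # (B, Q, T): per-moment boundary scores, e.g. Q=2 for start/end


if __name__ == "__main__":
    B, T, L, D = 2, 128, 12, 256
    gen, disc = BoundaryGenerator(D), BoundaryDiscriminator(D)
    video, text = torch.randn(B, T, D), torch.randn(B, L, D)
    boundary_q = gen(video, text)
    print(disc(boundary_q, video).shape)  # torch.Size([2, 2, 128])
```
In this reading, the discriminator does not match two independently encoded modalities; it scores moments against queries that were themselves generated from both modalities, which is the shift from a two-modality matching task to a cross-modal attention task that the abstract highlights.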
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding (VPG) aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches rely heavily on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection [52.03562682785128]
Temporal video grounding (TVG) aims to retrieve the time interval matching a language query from an untrimmed video.
A significant challenge in TVG is the low "Semantic Noise Ratio (SNR)": performance degrades as the SNR decreases.
We propose a no-frills TVG model that consists of two core modules, namely multi-scale neighboring attention and zoom-in boundary detection.
arXiv Detail & Related papers (2023-07-20T04:12:10Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Hierarchical Local-Global Transformer for Temporal Sentence Grounding [58.247592985849124]
This paper studies the multimedia problem of temporal sentence grounding.
It aims to accurately determine the specific video segment in an untrimmed video according to a given sentence query.
arXiv Detail & Related papers (2022-08-31T14:16:56Z)
- Video Is Graph: Structured Graph Module for Video Action Recognition [34.918667614077805]
We transform a video sequence into a graph to obtain direct long-term dependencies among temporal frames.
In particular, the Structured Graph Module (SGM) divides the neighbors of each node into several temporal regions so as to extract global structural information.
The reported performance and analysis demonstrate that SGM can achieve outstanding precision with less computational complexity.
arXiv Detail & Related papers (2021-10-12T11:27:29Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)
- Boundary Proposal Network for Two-Stage Natural Language Video Localization [23.817486773852142]
Boundary Proposal Network (BPNet) is a universal two-stage framework for natural language video localization.
In the first stage, BPNet utilizes an anchor-free model to generate a group of quality candidate video segments with their boundaries.
In the second stage, a visual-language fusion layer is proposed to jointly model the multi-modal interaction between the candidate and the language query.
arXiv Detail & Related papers (2021-03-15T03:06:18Z)
- Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve the moment in a video that matches a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z)
- Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video [53.69956349097428]
Given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence.
We propose a two-stage model to tackle this problem in a coarse-to-fine manner.
arXiv Detail & Related papers (2020-01-25T13:07:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.