Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of
Sentence in Video
- URL: http://arxiv.org/abs/2001.09308v1
- Date: Sat, 25 Jan 2020 13:07:43 GMT
- Title: Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of
Sentence in Video
- Authors: Zhenfang Chen, Lin Ma, Wenhan Luo, Peng Tang, Kwan-Yee K. Wong
- Abstract summary: Given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence.
We propose a two-stage model to tackle this problem in a coarse-to-fine manner.
- Score: 53.69956349097428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the problem of weakly-supervised temporal grounding
of sentence in video. Specifically, given an untrimmed video and a query
sentence, our goal is to localize a temporal segment in the video that
semantically corresponds to the query sentence, with no reliance on any
temporal annotation during training. We propose a two-stage model to tackle
this problem in a coarse-to-fine manner. In the coarse stage, we first generate
a set of fixed-length temporal proposals using multi-scale sliding windows, and
match their visual features against the sentence features to identify the
best-matched proposal as a coarse grounding result. In the fine stage, we
perform a fine-grained matching between the visual features of the frames in
the best-matched proposal and the sentence features to locate the precise frame
boundary of the fine grounding result. Comprehensive experiments on the
ActivityNet Captions dataset and the Charades-STA dataset demonstrate that our
two-stage model achieves compelling performance.
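The coarse-to-fine procedure above lends itself to a short sketch. The following is a minimal illustration assuming pre-extracted frame features and a single sentence embedding; the window scales, stride, cosine-similarity scoring, and thresholding rule are placeholders for illustration, not the authors' exact configuration.
```python
import torch
import torch.nn.functional as F

def coarse_to_fine_grounding(frame_feats, sent_feat,
                             window_scales=(16, 32, 64), stride_ratio=0.5):
    """frame_feats: (T, D) per-frame visual features; sent_feat: (D,) sentence feature."""
    T = frame_feats.size(0)

    # Coarse stage: score multi-scale sliding-window proposals by cosine
    # similarity between mean-pooled window features and the sentence feature.
    best_score, best_span = -float("inf"), (0, min(window_scales[0], T))
    for w in window_scales:
        w = min(w, T)
        stride = max(1, int(w * stride_ratio))
        for start in range(0, T - w + 1, stride):
            pooled = frame_feats[start:start + w].mean(dim=0)
            score = F.cosine_similarity(pooled, sent_feat, dim=0).item()
            if score > best_score:
                best_score, best_span = score, (start, start + w)

    # Fine stage: frame-wise matching inside the best proposal; keep the span
    # of frames whose similarity exceeds the proposal's mean similarity.
    s, e = best_span
    frame_scores = F.cosine_similarity(frame_feats[s:e], sent_feat.unsqueeze(0), dim=1)
    keep = (frame_scores > frame_scores.mean()).nonzero(as_tuple=True)[0]
    if keep.numel() == 0:
        return best_span
    return s + keep.min().item(), s + keep.max().item() + 1

# Toy usage with random features.
print(coarse_to_fine_grounding(torch.randn(200, 512), torch.randn(512)))
```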
Related papers
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be predicted consistently across transformed versions of the video.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
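The entry above hinges on boundary predictions that remain consistent under video transformations. A minimal sketch of such an equivariant consistency loss follows, assuming a grounding model that returns per-clip boundary logits; the temporal flip used as the augmentation and the KL objective are illustrative choices, not the paper's exact formulation.
```python
import torch
import torch.nn.functional as F

def consistency_loss(model, video_feats, query_feats):
    """video_feats: (T, D) clip features; query_feats: (L, D) word features."""
    # Boundary distribution over clips for the original video.
    probs = F.softmax(model(video_feats, query_feats), dim=-1)       # (T,)

    # Apply a temporal transform (here, a flip) and re-run the model.
    flipped = torch.flip(video_feats, dims=[0])
    probs_aug = F.softmax(model(flipped, query_feats), dim=-1)       # (T,)

    # Equivariance: predicting on the transformed video should match
    # transforming the prediction made on the original video.
    target = torch.flip(probs, dims=[0]).detach()
    return F.kl_div(probs_aug.log(), target, reduction="batchmean")

# Toy usage: a dummy scorer standing in for a real grounding model.
toy_model = lambda v, q: v @ q.mean(dim=0)                           # (T,) clip scores
loss = consistency_loss(toy_model, torch.randn(32, 256), torch.randn(10, 256))
```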
- Generation-Guided Multi-Level Unified Network for Video Grounding [18.402093379973085]
Video grounding aims to locate the timestamps best matching the query description within an untrimmed video.
Moment-level approaches directly predict, from a global perspective, the probability that each transient moment is a boundary.
Clip-level approaches aggregate the moments in different time windows into proposals and then select the one most similar to the query, which gives them an advantage in fine-grained grounding.
arXiv Detail & Related papers (2023-03-14T09:48:59Z)
- Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework.
We propose a novel candidate-free framework, the Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z)
- Can Shuffling Video Benefit Temporal Bias Problem: A Novel Training Framework for Temporal Grounding [20.185272219985787]
Temporal grounding aims to locate a target video moment that semantically corresponds to the given sentence query in an untrimmed video.
Previous methods do not reason about target moment locations based on visual-textual semantic alignment, but instead over-rely on the temporal biases of queries in the training sets.
This paper proposes a novel training framework for grounding models that uses shuffled videos to address the temporal bias problem without losing grounding accuracy.
arXiv Detail & Related papers (2022-07-29T14:11:48Z)
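The shuffling idea above can be illustrated with a small helper that permutes a video's clip features and remaps the annotated moment accordingly, so that absolute temporal position alone is no longer predictive. The clip granularity and the label remapping rule are assumptions for illustration, not the paper's exact recipe.
```python
import torch

def shuffle_clips_with_labels(clip_feats, start_idx, end_idx):
    """clip_feats: (T, D) clip features; [start_idx, end_idx) is the annotated moment."""
    T = clip_feats.size(0)
    perm = torch.randperm(T)
    shuffled = clip_feats[perm]

    # Positions of the originally annotated clips after shuffling, usable as
    # a clip-level (no longer contiguous) target for the shuffled video.
    inverse = torch.empty_like(perm)
    inverse[perm] = torch.arange(T)
    moment_positions = inverse[start_idx:end_idx]
    return shuffled, moment_positions

# Toy usage on random clip features with an annotated moment of clips 20-27.
feats, positions = shuffle_clips_with_labels(torch.randn(64, 512), 20, 28)
```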
- Multi-Scale Self-Contrastive Learning with Hard Negative Mining for Weakly-Supervised Query-based Video Grounding [27.05117092371221]
We propose a self-contrastive learning framework to address the query-based video grounding task under a weakly-supervised setting.
Firstly, we propose a new grounding scheme that learns frame-wise matching scores with respect to the query semantics to predict the possible foreground frames.
Secondly, since some predicted frames are relatively coarse and exhibit similar appearance to their adjacent frames, we propose a coarse-to-fine contrastive learning paradigm.
arXiv Detail & Related papers (2022-03-08T04:01:08Z)
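As a rough illustration of the frame-wise matching scheme above, the sketch below scores each frame against the query with a simple bilinear head and thresholds the scores to obtain candidate foreground frames; the scoring head and threshold are assumptions, not the paper's actual architecture.
```python
import torch
import torch.nn as nn

class FrameQueryMatcher(nn.Module):
    """Scores each frame against the query; high scores mark foreground frames."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj_v = nn.Linear(dim, dim)
        self.proj_q = nn.Linear(dim, dim)

    def forward(self, frame_feats, query_feat):
        """frame_feats: (T, D) frame features; query_feat: (D,) sentence feature."""
        v = self.proj_v(frame_feats)                 # (T, D)
        q = self.proj_q(query_feat)                  # (D,)
        return torch.sigmoid((v * q).sum(dim=-1))    # (T,) matching scores in [0, 1]

# Toy usage: frames scoring above 0.5 are treated as predicted foreground.
matcher = FrameQueryMatcher()
scores = matcher(torch.randn(120, 512), torch.randn(512))
foreground = scores > 0.5
```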
- Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation [79.50868197788773]
We develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining.
We explore two cross-sentence relational constraints: (1) temporal ordering and (2) semantic consistency among sentences in a paragraph description of video activities.
Experiments on two publicly available activity localisation datasets show the advantages of our approach over the state-of-the-art weakly supervised methods.
arXiv Detail & Related papers (2021-07-23T20:04:01Z)
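The first constraint above, temporal ordering across sentences, can be sketched as a hinge penalty on predicted moment centres; the formulation and margin below are illustrative assumptions rather than the paper's exact loss.
```python
import torch

def ordering_loss(center_a, center_b, margin=0.05):
    """Hinge penalty when sentence A's predicted moment centre does not precede
    sentence B's; centres are normalised to [0, 1] and A comes before B in the paragraph."""
    return torch.clamp(center_a - center_b + margin, min=0.0)

# Toy usage: A is predicted after B, so the constraint is violated and the loss is positive.
loss = ordering_loss(torch.tensor(0.62), torch.tensor(0.35))
```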
- A Simple Yet Effective Method for Video Temporal Grounding with Cross-Modality Attention [31.218804432716702]
The task of language-guided video temporal grounding is to localize the particular video clip corresponding to a query sentence in an untrimmed video.
We propose a simple two-branch Cross-Modality Attention (CMA) module with intuitive structure design.
In addition, we introduce a new task-specific regression loss function, which improves the temporal grounding accuracy by alleviating the impact of annotation bias.
arXiv Detail & Related papers (2020-09-23T16:03:00Z)
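The two-branch cross-modality attention described above can be approximated with standard multi-head attention, one branch per direction. The sketch below uses torch.nn.MultiheadAttention as an illustrative stand-in and is not the paper's exact CMA module or regression loss.
```python
import torch
import torch.nn as nn

class CrossModalityAttention(nn.Module):
    """Two attention branches: video attends to query words, and words attend to video."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.v2q = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.q2v = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, video, query):
        """video: (B, T, D) clip features; query: (B, L, D) word features."""
        video_attn, _ = self.v2q(video, query, query)    # video branch attends to words
        query_attn, _ = self.q2v(query, video, video)    # query branch attends to clips
        return video + video_attn, query + query_attn    # residual fusion per modality

# Toy usage on random features for a batch of 2 videos and 12-word queries.
cma = CrossModalityAttention()
fused_video, fused_query = cma(torch.randn(2, 64, 512), torch.randn(2, 12, 512))
```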
- Regularized Two-Branch Proposal Networks for Weakly-Supervised Moment Retrieval in Videos [108.55320735031721]
Video moment retrieval aims to localize the target moment in a video according to the given sentence.
Most existing weakly-supervised methods apply a MIL-based framework to develop inter-sample confrontment.
We propose a novel Regularized Two-Branch Proposal Network to simultaneously consider the inter-sample and intra-sample confrontments.
arXiv Detail & Related papers (2020-08-19T04:42:46Z)
- Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences [107.0776836117313]
Given an untrimmed video and a declarative/interrogative sentence, STVG aims to localize the spatio-temporal tube of the queried object.
Existing methods cannot tackle the STVG task due to the ineffective tube pre-generation and the lack of novel object relationship modeling.
We present a Spatio-Temporal Graph Reasoning Network (STGRN) for this task.
arXiv Detail & Related papers (2020-01-19T19:53:22Z)