Rethinking the Video Sampling and Reasoning Strategies for Temporal
Sentence Grounding
- URL: http://arxiv.org/abs/2301.00514v1
- Date: Mon, 2 Jan 2023 03:38:22 GMT
- Title: Rethinking the Video Sampling and Reasoning Strategies for Temporal
Sentence Grounding
- Authors: Jiahao Zhu, Daizong Liu, Pan Zhou, Xing Di, Yu Cheng, Song Yang,
Wenzheng Xu, Zichuan Xu, Yao Wan, Lichao Sun, Zeyu Xiong
- Abstract summary: Temporal sentence grounding (TSG) aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel Siamese Sampling and Reasoning Network (SSRN) for TSG, which introduces a siamese sampling mechanism to generate additional contextual frames.
- Score: 64.99924160432144
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal sentence grounding (TSG) aims to identify the temporal boundary of a
specific segment from an untrimmed video by a sentence query. All existing
works first utilize a sparse sampling strategy to extract a fixed number of
video frames and then conduct multi-modal interactions with the query sentence
for reasoning. However, we argue that these methods have overlooked two
critical issues: 1) Boundary-bias: The annotated target segment generally
refers to two specific frames as the corresponding start and end timestamps. The
video downsampling process may lose these two frames and take adjacent
irrelevant frames as the new boundaries. 2) Reasoning-bias: Such incorrect new
boundary frames also lead to reasoning bias during frame-query interaction,
reducing the generalization ability of the model. To alleviate the above limitations,
in this paper, we propose a novel Siamese Sampling and Reasoning Network (SSRN)
for TSG, which introduces a siamese sampling mechanism to generate additional
contextual frames to enrich and refine the new boundaries. Specifically, a
reasoning strategy is developed to learn the inter-relationship among these
frames and generate soft labels on boundaries for more accurate frame-query
reasoning. Such a mechanism is also able to supplement the sampled sparse frames
with the absent consecutive visual semantics for fine-grained activity
understanding. Extensive experiments demonstrate the effectiveness of SSRN on
three challenging datasets.
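
To make the boundary-bias issue concrete, the following minimal sketch (not code from the paper) uses plain NumPy with hypothetical frame counts and segment annotations: it uniformly downsamples an untrimmed video to a fixed frame budget, shows how the annotated start/end frames get snapped to nearby sampled frames, and builds generic Gaussian soft labels around the snapped boundaries as a stand-in for the soft boundary labels the abstract mentions, rather than SSRN's actual siamese sampling and reasoning modules.

```python
# Illustrative sketch of boundary bias under sparse uniform sampling.
# All numbers and the Gaussian soft-label scheme are hypothetical.
import numpy as np


def uniform_sample(num_frames: int, num_samples: int) -> np.ndarray:
    """Pick `num_samples` frame indices evenly spaced over the video."""
    return np.linspace(0, num_frames - 1, num_samples).round().astype(int)


def snap_boundaries(sampled: np.ndarray, start: int, end: int):
    """Map the annotated boundary frames onto the closest sampled frames."""
    new_start = int(sampled[np.abs(sampled - start).argmin()])
    new_end = int(sampled[np.abs(sampled - end).argmin()])
    return new_start, new_end


def soft_boundary_labels(sampled: np.ndarray, boundary: int, sigma: float = 1.5) -> np.ndarray:
    """Gaussian soft labels over the sampled sequence, peaked at the sampled
    position closest to the (possibly shifted) boundary frame."""
    centre = int(np.abs(sampled - boundary).argmin())
    positions = np.arange(len(sampled))
    labels = np.exp(-((positions - centre) ** 2) / (2 * sigma ** 2))
    return labels / labels.sum()


if __name__ == "__main__":
    total_frames, budget = 1800, 64      # hypothetical: a 60 s video at 30 fps, 64-frame budget
    start_frame, end_frame = 412, 905    # hypothetical annotated segment boundaries
    sampled = uniform_sample(total_frames, budget)
    new_start, new_end = snap_boundaries(sampled, start_frame, end_frame)
    print(f"annotated boundaries:      ({start_frame}, {end_frame})")
    print(f"boundaries after sampling: ({new_start}, {new_end})")
    print("soft start labels around the snapped start:",
          np.round(soft_boundary_labels(sampled, new_start)[10:20], 3))
```

With the hypothetical numbers above, the annotated boundaries (412, 905) drift to (400, 914) after keeping only 64 of 1800 frames, which is the kind of boundary shift the siamese sampling mechanism is meant to compensate for.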
Related papers
- Temporal Sentence Grounding in Streaming Videos [60.67022943824329]
This paper aims to tackle a novel task, Temporal Sentence Grounding in Streaming Videos (TSGSV).
The goal of TSGSV is to evaluate the relevance between a video stream and a given sentence query.
We propose two novel methods: (1) a TwinNet structure that enables the model to learn about upcoming events; and (2) a language-guided feature compressor that eliminates redundant visual frames.
arXiv Detail & Related papers (2023-08-14T12:30:58Z) - RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GAN to real images.
Existing methods invert video frames individually, often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID).
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z) - Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot video object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z) - Multi-Scale Self-Contrastive Learning with Hard Negative Mining for
Weakly-Supervised Query-based Video Grounding [27.05117092371221]
We propose a self-contrastive learning framework to address the query-based video grounding task under a weakly-supervised setting.
Firstly, we propose a new grounding scheme that learns frame-wise matching scores referring to the query semantic to predict the possible foreground frames.
Secondly, since some predicted frames are relatively coarse and exhibit similar appearance to their adjacent frames, we propose a coarse-to-fine contrastive learning paradigm.
arXiv Detail & Related papers (2022-03-08T04:01:08Z) - Cross-Sentence Temporal and Semantic Relations in Video Activity
Localisation [79.50868197788773]
We develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining.
We explore two cross-sentence relational constraints: (1) temporal ordering and (2) semantic consistency among sentences in a paragraph description of video activities.
Experiments on two publicly available activity localisation datasets show the advantages of our approach over the state-of-the-art weakly supervised methods.
arXiv Detail & Related papers (2021-07-23T20:04:01Z) - Frame-wise Cross-modal Matching for Video Moment Retrieval [32.68921139236391]
Video moment retrieval aims to retrieve a moment in a video for a given language query.
The challenges of this task include 1) the requirement of localizing the relevant moment in an untrimmed video, and 2) bridging the semantic gap between the textual query and video contents.
We propose an Attentive Cross-modal Relevance Matching model which predicts the temporal boundaries based on interaction modeling.
arXiv Detail & Related papers (2020-09-22T10:25:41Z) - SF-Net: Single-Frame Supervision for Temporal Action Localization [60.202516362976645]
Single-frame supervision introduces extra temporal action signals while maintaining low annotation overhead.
We propose a unified system called SF-Net to make use of such single-frame supervision.
SF-Net significantly improves upon state-of-the-art weakly-supervised methods in terms of both segment localization and single-frame localization.
arXiv Detail & Related papers (2020-03-15T15:06:01Z)