Hypotheses Tree Building for One-Shot Temporal Sentence Localization
- URL: http://arxiv.org/abs/2301.01871v1
- Date: Thu, 5 Jan 2023 01:50:43 GMT
- Title: Hypotheses Tree Building for One-Shot Temporal Sentence Localization
- Authors: Daizong Liu, Xiang Fang, Pan Zhou, Xing Di, Weining Lu, Yu Cheng
- Abstract summary: One-shot temporal sentence localization (one-shot TSL) learns to retrieve the query information within the entire video with only one annotated frame.
We propose an effective and novel tree-structure baseline for one-shot TSL, called Multiple Hypotheses Segment Tree (MHST).
MHST captures query-aware discriminative frame-wise information under insufficient annotations.
- Score: 53.82714065005299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given an untrimmed video, temporal sentence localization (TSL) aims to
localize a specific segment according to a given sentence query. Though
respectable works have made decent achievements in this task, they rely
heavily on dense video frame annotations, which require a tremendous amount of
human effort to collect. In this paper, we target another more practical and
challenging setting: one-shot temporal sentence localization (one-shot TSL),
which learns to retrieve the query information within the entire video with only
one annotated frame. Particularly, we propose an effective and novel
tree-structure baseline for one-shot TSL, called Multiple Hypotheses Segment
Tree (MHST), to capture the query-aware discriminative frame-wise information
under insufficient annotations. Each video frame is taken as a leaf-node,
and adjacent frames sharing the same visual-linguistic semantics are
merged into an upper non-leaf node during tree building. Finally, each root node
is an individual segment hypothesis containing the consecutive frames of its
leaf-nodes. During tree construction, we also introduce a pruning strategy
to eliminate the interference of query-irrelevant nodes. With our designed
self-supervised loss functions, our MHST is able to generate high-quality
segment hypotheses for ranking and selection with the query. Experiments on two
challenging datasets demonstrate that MHST achieves competitive performance
compared to existing methods.
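To make the tree-building procedure concrete, here is a minimal, hypothetical sketch of growing segment hypotheses bottom-up from per-frame features and pruning them against the query. The feature shapes, the cosine-similarity merging rule, the mean-pooled parent features, and the two thresholds (`merge_thr`, `prune_thr`) are illustrative assumptions, not the paper's MHST implementation, which learns merging and pruning with self-supervised losses.

```python
# Hypothetical sketch of bottom-up hypothesis-tree building in the spirit of
# MHST. Features, thresholds, and the scoring rule are illustrative
# assumptions, not the paper's actual method.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def build_segment_hypotheses(frame_feats, query_feat, merge_thr=0.8, prune_thr=0.2):
    """frame_feats: (T, D) per-frame features; query_feat: (D,) sentence feature.
    Returns candidate segments [(start, end, score)] ranked by query relevance."""
    # Leaf nodes: one node per frame, each covering the span [t, t].
    nodes = [{"start": t, "end": t, "feat": frame_feats[t]}
             for t in range(len(frame_feats))]

    # Bottom-up merging: adjacent nodes whose visual semantics agree are fused
    # into a parent node; repeat until no adjacent pair exceeds merge_thr.
    merged = True
    while merged and len(nodes) > 1:
        merged, next_nodes, i = False, [], 0
        while i < len(nodes):
            if i + 1 < len(nodes) and cosine(nodes[i]["feat"], nodes[i + 1]["feat"]) > merge_thr:
                next_nodes.append({"start": nodes[i]["start"],
                                   "end": nodes[i + 1]["end"],
                                   "feat": (nodes[i]["feat"] + nodes[i + 1]["feat"]) / 2})
                i, merged = i + 2, True
            else:
                next_nodes.append(nodes[i])
                i += 1
        nodes = next_nodes

    # Pruning and ranking: drop root nodes with low query relevance, then sort
    # the surviving segment hypotheses by their similarity to the query.
    hypotheses = [(n["start"], n["end"], cosine(n["feat"], query_feat))
                  for n in nodes if cosine(n["feat"], query_feat) > prune_thr]
    return sorted(hypotheses, key=lambda h: h[2], reverse=True)
```

Under this greedy rule the root nodes are simply contiguous runs of mutually similar frames; in the paper, the merging and pruning decisions are query-aware and trained with the designed self-supervised losses before the hypotheses are ranked and selected with the query.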
Related papers
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation is that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works either tackle this task in a fully-supervised setting, which requires a large amount of manual annotation, or in a weakly-supervised setting, which cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z)
- Target Adaptive Context Aggregation for Video Scene Graph Generation [36.669700084337045]
This paper deals with the challenging task of video scene graph generation (VidSGG).
We present a new detect-to-track paradigm for this task by decoupling the context modeling for relation prediction from the complicated low-level entity tracking.
arXiv Detail & Related papers (2021-08-18T12:46:28Z)
- Context-aware Biaffine Localizing Network for Temporal Sentence Grounding [61.18824806906945]
This paper addresses the problem of temporal sentence grounding (TSG).
TSG aims to identify the temporal boundary of a specific segment from an untrimmed video by a sentence query.
We propose a novel localization framework that scores all pairs of start and end indices within the video simultaneously with a biaffine mechanism (see the sketch after this list).
arXiv Detail & Related papers (2021-03-22T03:13:05Z)
- Linguistically Driven Graph Capsule Network for Visual Question Reasoning [153.76012414126643]
We propose a hierarchical compositional reasoning model called the "Linguistically driven Graph Capsule Network".
The compositional process is guided by the linguistic parse tree. Specifically, we bind each capsule in the lowest layer to bridge the linguistic embedding of a single word in the original question with visual evidence.
Experiments on the CLEVR dataset, CLEVR compositional generation test, and FigureQA dataset demonstrate the effectiveness and composition generalization ability of our end-to-end model.
arXiv Detail & Related papers (2020-03-23T03:34:25Z)
- Look Closer to Ground Better: Weakly-Supervised Temporal Grounding of Sentence in Video [53.69956349097428]
Given an untrimmed video and a query sentence, our goal is to localize a temporal segment in the video that semantically corresponds to the query sentence.
We propose a two-stage model to tackle this problem in a coarse-to-fine manner.
arXiv Detail & Related papers (2020-01-25T13:07:43Z)
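As a side note on the Context-aware Biaffine Localizing Network entry above, the sketch below illustrates what scoring all (start, end) index pairs at once with a biaffine form can look like. The feature shapes, the weight and bias terms, and the masking are assumptions made for illustration, not that paper's exact formulation.

```python
# Hypothetical sketch of biaffine scoring over all (start, end) index pairs.
# Dimensions and the exact scoring form are illustrative assumptions.
import numpy as np

def biaffine_scores(h_start, h_end, W, b_start, b_end):
    """h_start, h_end: (T, D) boundary-aware frame features.
    W: (D, D) biaffine weight; b_start, b_end: (D,) linear terms.
    Returns a (T, T) map whose entry (i, j) scores the segment [i, j]."""
    scores = h_start @ W @ h_end.T            # bilinear start-end interaction
    scores += (h_start @ b_start)[:, None]    # start-dependent linear term
    scores += (h_end @ b_end)[None, :]        # end-dependent linear term
    # Only pairs with start <= end describe valid segments.
    T = len(h_start)
    valid = np.triu(np.ones((T, T))) > 0
    return np.where(valid, scores, -np.inf)
```

Computing the full T x T score map in one shot is what lets such a model compare every candidate segment simultaneously rather than scanning proposals one by one.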
This list is automatically generated from the titles and abstracts of the papers on this site.