A Survey on Temporal Sentence Grounding in Videos
- URL: http://arxiv.org/abs/2109.08039v2
- Date: Fri, 17 Sep 2021 01:49:19 GMT
- Title: A Survey on Temporal Sentence Grounding in Videos
- Authors: Xiaohan Lan, Yitian Yuan, Xin Wang, Zhi Wang and Wenwu Zhu
- Abstract summary: Temporal sentence grounding in videos (TSGV) aims to localize one target segment from an untrimmed video with respect to a given sentence query.
To the best of our knowledge, this is the first systematic survey on temporal sentence grounding.
- Score: 69.13365006222251
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal sentence grounding in videos (TSGV), which aims to localize one target segment from an untrimmed video with respect to a given sentence query, has drawn increasing attention in the research community over the past few years. Different from the task of temporal action localization, TSGV is more flexible, since it can locate complicated activities via natural language without restrictions from predefined action categories. Meanwhile, TSGV is more challenging, since it requires both textual and visual understanding for semantic alignment between the two modalities (i.e., text and video). In this survey, we give a comprehensive overview of TSGV, which i) summarizes the taxonomy of existing methods, ii) provides a detailed description of the evaluation protocols (i.e., datasets and metrics) used in TSGV, and iii) discusses in depth potential problems of current benchmark designs and research directions for further investigation. To the best of our knowledge, this is the first systematic survey on temporal sentence grounding. More specifically, we first discuss existing TSGV approaches by grouping them into four categories, i.e., two-stage methods, end-to-end methods, reinforcement learning-based methods, and weakly supervised methods. We then present the benchmark datasets and evaluation metrics used to assess current research progress. Finally, we discuss some limitations of TSGV by pointing out potential problems that remain improperly resolved in the current evaluation protocols, which may push forward more cutting-edge research in TSGV. In addition, we share our insights on several promising directions, including three typical tasks with new and practical settings based on TSGV.
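The evaluation metrics mentioned above typically include "R@n, IoU=m": the fraction of queries for which at least one of the top-n predicted segments overlaps the ground-truth segment with temporal IoU of at least m. The following is a minimal sketch of that metric, not code from the survey; segments are assumed to be (start, end) pairs in seconds.

```python
# Minimal sketch of the "R@n, IoU=m" evaluation (not code from the survey);
# segments are assumed to be (start, end) pairs in seconds.
from typing import List, Tuple

def temporal_iou(pred: Tuple[float, float], gt: Tuple[float, float]) -> float:
    """Temporal intersection-over-union of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n(ranked_preds: List[List[Tuple[float, float]]],
                gts: List[Tuple[float, float]],
                n: int = 1, iou_thresh: float = 0.5) -> float:
    """Fraction of queries whose top-n predictions contain at least one
    segment with IoU >= iou_thresh against the ground-truth segment."""
    hits = sum(
        any(temporal_iou(seg, gt) >= iou_thresh for seg in preds[:n])
        for preds, gt in zip(ranked_preds, gts)
    )
    return hits / len(gts)

# One query: the top-1 prediction overlaps the ground truth with IoU ~= 0.76.
print(recall_at_n([[(10.0, 25.0)]], [(12.0, 27.0)], n=1, iou_thresh=0.5))  # 1.0
```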
Related papers
- How Texts Help? A Fine-grained Evaluation to Reveal the Role of Language in Vision-Language Tracking [23.551036494221222]
Vision-language tracking (VLT) extends traditional single object tracking by incorporating textual information.
Current VLT trackers often underperform compared to single-modality methods on multiple benchmarks.
We propose VLTVerse, the first fine-grained evaluation framework for VLT trackers.
arXiv Detail & Related papers (2024-11-23T16:31:40Z) - Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z) - Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization [98.66318678030491]
Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training.
We propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages; a generic MIL scoring sketch is given after this list.
arXiv Detail & Related papers (2023-05-29T02:48:04Z) - Temporal Action Segmentation: An Analysis of Modern Techniques [43.725939095985915]
Temporal action segmentation (TAS) aims to densely label the frames of minutes-long videos that contain multiple action classes.
Despite the rapid growth of TAS techniques in recent years, no systematic survey of the area has been conducted.
This survey analyzes and summarizes the most significant contributions and trends.
arXiv Detail & Related papers (2022-10-19T07:40:47Z) - LocVTP: Video-Text Pre-training for Temporal Localization [71.74284893790092]
Video-Text Pre-training aims to learn transferable representations for various downstream tasks from large-scale web videos.
In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks.
We propose a novel localization-oriented Video-Text Pre-training framework, dubbed LocVTP.
arXiv Detail & Related papers (2022-07-21T08:43:51Z) - The Elements of Temporal Sentence Grounding in Videos: A Survey and Future Directions [60.54191298092136]
Temporal sentence grounding in videos (TSGV) aims to retrieve a temporal moment that semantically corresponds to a language query from an untrimmed video.
This survey attempts to provide a summary of fundamental concepts in TSGV and current research status, as well as future research directions.
arXiv Detail & Related papers (2022-01-20T09:10:20Z) - Weakly-Supervised Multi-Level Attentional Reconstruction Network for Grounding Textual Queries in Videos [73.4504252917816]
The task of temporally grounding textual queries in videos is to localize one video segment that semantically corresponds to the given query.
Most of the existing approaches rely on segment-sentence pairs (temporal annotations) for training, which are usually unavailable in real-world scenarios.
We present an effective weakly-supervised model, named Multi-Level Attentional Reconstruction Network (MARN), which relies only on video-sentence pairs during the training stage.
arXiv Detail & Related papers (2020-03-16T07:01:01Z)
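For the weakly-supervised proposal-classification entry above (P-MIL), the following is a generic sketch of the multiple-instance-learning idea it builds on, not the authors' implementation: score candidate proposals per class, pool the top-k scores into a video-level prediction, and supervise with the video-level label only. The function name and the top-k pooling choice are illustrative assumptions.

```python
# Generic MIL sketch (illustrative, not the P-MIL implementation):
# proposal scores -> top-k mean pooling -> video-level classification loss.
import torch
import torch.nn.functional as F

def video_level_logits(proposal_logits: torch.Tensor, k: int = 4) -> torch.Tensor:
    """proposal_logits: (num_proposals, num_classes) scores for one video.
    Returns (num_classes,) video-level logits via top-k mean pooling."""
    k = min(k, proposal_logits.shape[0])
    topk_scores, _ = proposal_logits.topk(k, dim=0)   # (k, num_classes)
    return topk_scores.mean(dim=0)                    # (num_classes,)

# Example: 20 candidate proposals, 5 action classes, video-level label = class 2.
proposal_logits = torch.randn(20, 5, requires_grad=True)  # stand-in for a scoring head
video_label = torch.tensor([2])
loss = F.cross_entropy(video_level_logits(proposal_logits).unsqueeze(0), video_label)
loss.backward()  # gradients flow back to the proposal scores
```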