Unsupervised Temporal Video Grounding with Deep Semantic Clustering
- URL: http://arxiv.org/abs/2201.05307v1
- Date: Fri, 14 Jan 2022 05:16:33 GMT
- Title: Unsupervised Temporal Video Grounding with Deep Semantic Clustering
- Authors: Daizong Liu, Xiaoye Qu, Yinzhen Wang, Xing Di, Kai Zou, Yu Cheng,
Zichuan Xu, Pan Zhou
- Abstract summary: Temporal video grounding aims to localize a target segment in a video according to a given sentence query.
In this paper, we explore whether a video grounding model can be learned without any paired annotations.
Considering there is no paired supervision, we propose a novel Deep Semantic Clustering Network (DSCNet) to leverage all semantic information from the whole query set.
- Score: 58.95918952149763
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal video grounding (TVG) aims to localize a target segment in a video
according to a given sentence query. Although existing works have achieved
decent performance on this task, they rely heavily on abundant video-query paired
data, which is expensive and time-consuming to collect in real-world scenarios.
In this paper, we explore whether a video grounding model can be learned
without any paired annotations. To the best of our knowledge, this paper is the
first work trying to address TVG in an unsupervised setting. Considering there
is no paired supervision, we propose a novel Deep Semantic Clustering Network
(DSCNet) to leverage all semantic information from the whole query set to
compose the possible activity in each video for grounding. Specifically, we
first develop a language semantic mining module, which extracts implicit
semantic features from the whole query set. Then, these language semantic
features serve as the guidance to compose the activity in video via a
video-based semantic aggregation module. Finally, we utilize a foreground
attention branch to filter out the redundant background activities and refine
the grounding results. To validate the effectiveness of our DSCNet, we conduct
experiments on both ActivityNet Captions and Charades-STA datasets. The results
demonstrate that DSCNet achieves competitive performance, and even outperforms
most weakly-supervised approaches.
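The abstract outlines a three-stage pipeline: mine semantic features from the whole query set, use them to compose candidate activities in the video, and refine the result with a foreground attention branch. The snippet below is a minimal, hypothetical PyTorch sketch of how such a pipeline could be wired together; the class name, tensor shapes, the use of learnable cluster centers as a stand-in for query-set semantic mining, and the boundary head are all assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class DSCNetSketch(nn.Module):
    """Illustrative sketch (NOT the paper's code) of the described pipeline:
    (1) language semantic mining over the query set,
    (2) video-based semantic aggregation guided by mined semantics,
    (3) a foreground attention branch that suppresses background clips."""

    def __init__(self, dim=256, num_clusters=16):
        super().__init__()
        # (1) Stand-in for mined query-set semantics: the paper derives these
        # from the whole query set; here they are free parameters for illustration.
        self.semantic_centers = nn.Parameter(torch.randn(num_clusters, dim))
        # (2) Cross-attention from video clips to the semantic centers.
        self.aggregate = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # (3) Foreground attention branch: one foreground score per clip.
        self.foreground = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )
        # Hypothetical grounding head: start/end logits per clip.
        self.boundary_head = nn.Linear(dim, 2)

    def forward(self, video_feats):
        # video_feats: (B, T, dim) clip-level video features.
        B = video_feats.size(0)
        semantics = self.semantic_centers.unsqueeze(0).expand(B, -1, -1)
        # Compose possible activities in the video guided by mined semantics.
        composed, _ = self.aggregate(video_feats, semantics, semantics)
        # Filter out redundant background activities via foreground scores.
        fg_score = torch.sigmoid(self.foreground(composed))       # (B, T, 1)
        refined = composed * fg_score
        # Predict temporal boundaries from the refined clip features.
        boundaries = self.boundary_head(refined)                  # (B, T, 2)
        return boundaries, fg_score.squeeze(-1)
```

As a usage sanity check, `DSCNetSketch()(torch.randn(2, 64, 256))` returns per-clip boundary logits of shape (2, 64, 2) and foreground scores of shape (2, 64); how the unsupervised training objective ties these outputs to the query set is specific to the paper and not reproduced here.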
Related papers
- Dense Video Object Captioning from Disjoint Supervision [77.47084982558101]
We propose a new task and model for dense video object captioning.
This task unifies spatial and temporal localization in video.
We show how our model improves upon a number of strong baselines for this new task.
arXiv Detail & Related papers (2023-06-20T17:57:23Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extended to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework.
We propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z)
- Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works either tackle this task in a fully-supervised setting that requires a large amount of manual annotations or in a weakly supervised setting that cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z)
- Watch and Learn: Mapping Language and Noisy Real-world Videos with Self-supervision [54.73758942064708]
We teach machines to understand visuals and natural language by learning the mapping between sentences and noisy video snippets without explicit annotations.
For training and evaluation, we contribute a new dataset, ApartmenTour, that contains a large number of online videos and subtitles.
arXiv Detail & Related papers (2020-11-19T03:43:56Z)