Weakly Supervised Temporal Adjacent Network for Language Grounding
- URL: http://arxiv.org/abs/2106.16136v1
- Date: Wed, 30 Jun 2021 15:42:08 GMT
- Title: Weakly Supervised Temporal Adjacent Network for Language Grounding
- Authors: Yuechen Wang, Jiajun Deng, Wengang Zhou, and Houqiang Li
- Abstract summary: We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
WSTAN learns cross-modal semantic alignment by exploiting temporal adjacent network in a multiple instance learning (MIL) paradigm.
An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination by self-supervising.
- Score: 96.09453060585497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal language grounding (TLG) is a fundamental and challenging problem
for vision and language understanding. Existing methods mainly focus on fully
supervised setting with temporal boundary labels for training, which, however,
suffers expensive cost of annotation. In this work, we are dedicated to weakly
supervised TLG, where multiple description sentences are given to an untrimmed
video without temporal boundary labels. In this task, it is critical to learn a
strong cross-modal semantic alignment between sentence semantics and visual
content. To this end, we introduce a novel weakly supervised temporal adjacent
network (WSTAN) for temporal language grounding. Specifically, WSTAN learns
cross-modal semantic alignment by exploiting temporal adjacent network in a
multiple instance learning (MIL) paradigm, with a whole description paragraph
as input. Moreover, we integrate a complementary branch into the framework,
which explicitly refines the predictions with pseudo supervision from the MIL
stage. An additional self-discriminating loss is devised on both the MIL branch
and the complementary branch, aiming to enhance semantic discrimination by
self-supervising. Extensive experiments are conducted on three widely used
benchmark datasets, \emph{i.e.}, ActivityNet-Captions, Charades-STA, and
DiDeMo, and the results demonstrate the effectiveness of our approach.
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video paragraph Grounding (WSVPG) to eliminate the need of temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z) - What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [55.574102714832456]
spatial-temporal grounding describes the task of localizing events in space and time.
Models for this task are usually trained with human-annotated sentences and bounding box supervision.
We combine local representation learning, which focuses on fine-grained spatial information, with a global representation that captures higher-level representations.
arXiv Detail & Related papers (2023-03-29T19:38:23Z) - Fine-grained Semantic Alignment Network for Weakly Supervised Temporal
Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework.
We propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z) - Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527]
Video scene segmentation is a task of temporally localizing scene boundaries in a video.
We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching and Pseudo-boundary Prediction.
We achieve the new state-of-the-art on the MovieNet-SSeg benchmark.
arXiv Detail & Related papers (2022-01-14T02:14:07Z) - Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works either tackle this task in a fully-supervised setting that requires a large amount of manual annotations or in a weakly supervised setting that cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z) - Reinforcement Learning for Weakly Supervised Temporal Grounding of
Natural Language in Untrimmed Videos [134.78406021194985]
We focus on the weakly supervised setting of this task that merely accesses to coarse video-level language description annotation without temporal boundary.
We propose a emphBoundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning to guide the process of progressively refining the temporal boundary.
arXiv Detail & Related papers (2020-09-18T03:32:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.