Weakly Supervised Temporal Adjacent Network for Language Grounding
- URL: http://arxiv.org/abs/2106.16136v1
- Date: Wed, 30 Jun 2021 15:42:08 GMT
- Title: Weakly Supervised Temporal Adjacent Network for Language Grounding
- Authors: Yuechen Wang, Jiajun Deng, Wengang Zhou, and Houqiang Li
- Abstract summary: We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
WSTAN learns cross-modal semantic alignment by exploiting a temporal adjacent network in a multiple instance learning (MIL) paradigm.
An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination by self-supervising.
- Score: 96.09453060585497
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal language grounding (TLG) is a fundamental and challenging problem
for vision and language understanding. Existing methods mainly focus on the fully
supervised setting with temporal boundary labels for training, which, however,
incurs an expensive annotation cost. In this work, we are dedicated to weakly
supervised TLG, where multiple description sentences are given to an untrimmed
video without temporal boundary labels. In this task, it is critical to learn a
strong cross-modal semantic alignment between sentence semantics and visual
content. To this end, we introduce a novel weakly supervised temporal adjacent
network (WSTAN) for temporal language grounding. Specifically, WSTAN learns
cross-modal semantic alignment by exploiting a temporal adjacent network in a
multiple instance learning (MIL) paradigm, with a whole description paragraph
as input. Moreover, we integrate a complementary branch into the framework,
which explicitly refines the predictions with pseudo supervision from the MIL
stage. An additional self-discriminating loss is devised on both the MIL branch
and the complementary branch, aiming to enhance semantic discrimination by
self-supervising. Extensive experiments are conducted on three widely used
benchmark datasets, i.e., ActivityNet-Captions, Charades-STA, and
DiDeMo, and the results demonstrate the effectiveness of our approach.
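The MIL idea in the abstract can be illustrated with a minimal sketch: under weak (video-level) supervision, per-proposal alignment scores are aggregated into a single bag-level score, and the loss is computed against the video-level label. This is only an illustrative toy, not WSTAN itself; the temporal adjacent network, complementary branch, and self-discriminating loss are not reproduced, and all names below are hypothetical.

```python
# Hedged sketch of MIL-style aggregation for weakly supervised grounding.
# Assumption: per-proposal sentence-video alignment scores (logits) are
# already computed; we max-pool them into a video-level score, following
# the standard MIL assumption that a positive bag contains at least one
# positive instance. Not the actual WSTAN architecture.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def video_level_score(proposal_scores):
    """Aggregate per-proposal alignment logits into one bag-level logit
    by max-pooling over the candidate temporal segments."""
    return max(proposal_scores)

def mil_loss(proposal_scores, label):
    """Binary cross-entropy on the aggregated score. label is 1 when the
    description is paired with the video, 0 for a mismatched pair."""
    p = sigmoid(video_level_score(proposal_scores))
    eps = 1e-12  # guard against log(0)
    return -(label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps))

# A matched pair whose best proposal scores highly incurs a small loss;
# the same scores with a negative (mismatched) label incur a large one.
scores = [0.2, 2.5, -1.0]
pos_loss = mil_loss(scores, 1)
neg_loss = mil_loss(scores, 0)
```

Training only ever supervises the pooled bag score, so localization emerges indirectly: the proposal that wins the max-pool is the model's implicit grounding prediction.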
Related papers
- Mitigating Semantic Leakage in Cross-lingual Embeddings via Orthogonality Constraint [6.880579537300643]
Current disentangled representation learning methods suffer from semantic leakage.
We propose a novel training objective, ORthogonAlity Constraint LEarning (ORACLE).
ORACLE builds upon two components: intra-class clustering and inter-class separation.
We demonstrate that training with the ORACLE objective effectively reduces semantic leakage and enhances semantic alignment within the embedding space.
arXiv Detail & Related papers (2024-09-24T02:01:52Z)
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding (VPG) aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [55.574102714832456]
Spatio-temporal grounding describes the task of localizing events in space and time.
Models for this task are usually trained with human-annotated sentences and bounding box supervision.
We combine local representation learning, which focuses on fine-grained spatial information, with a global representation that captures higher-level representations.
arXiv Detail & Related papers (2023-03-29T19:38:23Z)
- Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework.
We propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z)
- Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works tackle this task either in a fully supervised setting, which requires a large amount of manual annotation, or in a weakly supervised setting, which cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z)
- Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos [134.78406021194985]
We focus on the weakly supervised setting of this task, which merely has access to coarse video-level language descriptions without temporal boundary annotations.
We propose a Boundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning to guide the process of progressively refining the temporal boundary.
arXiv Detail & Related papers (2020-09-18T03:32:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.