Related papers: Weakly Supervised Temporal Adjacent Network for Language Grounding

Weakly Supervised Temporal Adjacent Network for Language Grounding

URL: http://arxiv.org/abs/2106.16136v1
Date: Wed, 30 Jun 2021 15:42:08 GMT
Title: Weakly Supervised Temporal Adjacent Network for Language Grounding
Authors: Yuechen Wang, Jiajun Deng, Wengang Zhou, and Houqiang Li
Abstract summary: We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding. WSTAN learns cross-modal semantic alignment by exploiting temporal adjacent network in a multiple instance learning (MIL) paradigm. An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination by self-supervising.
Score: 96.09453060585497
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Temporal language grounding (TLG) is a fundamental and challenging problem for vision and language understanding. Existing methods mainly focus on fully supervised setting with temporal boundary labels for training, which, however, suffers expensive cost of annotation. In this work, we are dedicated to weakly supervised TLG, where multiple description sentences are given to an untrimmed video without temporal boundary labels. In this task, it is critical to learn a strong cross-modal semantic alignment between sentence semantics and visual content. To this end, we introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding. Specifically, WSTAN learns cross-modal semantic alignment by exploiting temporal adjacent network in a multiple instance learning (MIL) paradigm, with a whole description paragraph as input. Moreover, we integrate a complementary branch into the framework, which explicitly refines the predictions with pseudo supervision from the MIL stage. An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination by self-supervising. Extensive experiments are conducted on three widely used benchmark datasets, \emph{i.e.}, ActivityNet-Captions, Charades-STA, and DiDeMo, and the results demonstrate the effectiveness of our approach.

Related papers

Collaborative Temporal Consistency Learning for Point-supervised Natural Language Video Localization [129.43937834515688]
We propose a new COllaborative Temporal consistEncy Learning (COTEL) framework to strengthen the video-language alignment. Specifically, we first design a frame- and a segment-level Temporal Consistency Learning (TCL) module that models semantic alignment across frame saliencies and sentence-moment pairs.
arXiv Detail & Related papers (2025-03-22T05:04:12Z)
Mitigating Semantic Leakage in Cross-lingual Embeddings via Orthogonality Constraint [6.880579537300643]
Current disentangled representation learning methods suffer from semantic leakage. We propose a novel training objective, ORthogonAlity Constraint LEarning (ORACLE) ORACLE builds upon two components: intra-class clustering and inter-class separation. We demonstrate that training with the ORACLE objective effectively reduces semantic leakage and enhances semantic alignment within the embedding space.
arXiv Detail & Related papers (2024-09-24T02:01:52Z)
Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. We introduce and explore Weakly-Supervised Video paragraph Grounding (WSVPG) to eliminate the need of temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions [55.574102714832456]
spatial-temporal grounding describes the task of localizing events in space and time. Models for this task are usually trained with human-annotated sentences and bounding box supervision. We combine local representation learning, which focuses on fine-grained spatial information, with a global representation that captures higher-level representations.
arXiv Detail & Related papers (2023-03-29T19:38:23Z)
Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description. Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework. We propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z)
Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video. Previous works either tackle this task in a fully-supervised setting that requires a large amount of manual annotations or in a weakly supervised setting that cannot achieve satisfactory performance. To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z)
Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos [134.78406021194985]
We focus on the weakly supervised setting of this task that merely accesses to coarse video-level language description annotation without temporal boundary. We propose a emphBoundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning to guide the process of progressively refining the temporal boundary.
arXiv Detail & Related papers (2020-09-18T03:32:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.