Fine-grained Semantic Alignment Network for Weakly Supervised Temporal
Language Grounding
- URL: http://arxiv.org/abs/2210.11933v1
- Date: Fri, 21 Oct 2022 13:10:27 GMT
- Title: Fine-grained Semantic Alignment Network for Weakly Supervised Temporal
Language Grounding
- Authors: Yuechen Wang, Wengang Zhou, Houqiang Li
- Abstract summary: Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most of the existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a MIL-based framework.
We propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
- Score: 148.46348699343991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal language grounding (TLG) aims to localize a video segment in an
untrimmed video based on a natural language description. To alleviate the
expensive cost of manual annotations for temporal boundary labels, we are
dedicated to the weakly supervised setting, where only video-level descriptions
are provided for training. Most of the existing weakly supervised methods
generate a candidate segment set and learn cross-modal alignment through a
multiple instance learning (MIL) based framework. However, the temporal
structure of the video and the complicated semantics of the sentence are lost
during learning. In this
work, we propose a novel candidate-free framework: Fine-grained Semantic
Alignment Network (FSAN), for weakly supervised TLG. Instead of viewing the
sentence and candidate moments as a whole, FSAN learns token-by-clip
cross-modal semantic alignment with an iterative cross-modal interaction
module, generates a fine-grained cross-modal semantic alignment map, and
performs grounding directly on top of the map. Extensive experiments are
conducted on two widely used benchmarks, ActivityNet-Captions and DiDeMo, where
our FSAN achieves state-of-the-art performance.
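The abstract describes FSAN only at a high level, so the following is a minimal, illustrative PyTorch sketch of the stated pipeline: iterative cross-modal interaction between word tokens and video clips, a token-by-clip alignment map, and grounding directly on that map. All module names, dimensions, and the span-selection heuristic are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: module names, sizes, and the grounding heuristic
# below are assumptions, not FSAN's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class IterativeCrossModalInteraction(nn.Module):
    """Alternates clip->word and word->clip attention for a few rounds."""
    def __init__(self, dim: int = 256, heads: int = 4, num_iters: int = 2):
        super().__init__()
        self.num_iters = num_iters
        self.w2c = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.c2w = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, words, clips):
        # words: (B, T, dim) token features; clips: (B, N, dim) clip features
        for _ in range(self.num_iters):
            clips = clips + self.c2w(clips, words, words)[0]  # clips attend to words
            words = words + self.w2c(words, clips, clips)[0]  # words attend to clips
        return words, clips

def alignment_map(words, clips):
    # Token-by-clip semantic alignment map via cosine similarity: (B, T, N).
    w = F.normalize(words, dim=-1)
    c = F.normalize(clips, dim=-1)
    return torch.einsum("btd,bnd->btn", w, c)

def ground_on_map(amap, thresh=0.5):
    # Candidate-free grounding: pool the map over tokens, normalize per-clip
    # scores to [0, 1], and return the span of clips above the threshold.
    score = amap.mean(dim=1)                                   # (B, N)
    rel = score - score.min(dim=1, keepdim=True).values
    rel = rel / rel.max(dim=1, keepdim=True).values.clamp(min=1e-6)
    idx = (rel[0] > thresh).nonzero().flatten()
    return idx.min().item(), idx.max().item()

B, T, N, D = 1, 12, 64, 256
interact = IterativeCrossModalInteraction(D)
words, clips = interact(torch.randn(B, T, D), torch.randn(B, N, D))
print(ground_on_map(alignment_map(words, clips)))  # (start_clip, end_clip)
```

The point of the candidate-free design is visible in `ground_on_map`: the prediction is read off the alignment map itself, so no candidate segment set has to be enumerated and scored.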
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914]
We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
arXiv Detail & Related papers (2023-03-11T11:00:16Z)
- Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works either tackle this task in a fully supervised setting, which requires a large amount of manual annotation, or in a weakly supervised setting, which cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z)
- Weakly Supervised Temporal Adjacent Network for Language Grounding [96.09453060585497]
We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
WSTAN learns cross-modal semantic alignment by exploiting a temporal adjacent network in a multiple instance learning (MIL) paradigm.
An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination through self-supervision.
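Since the exact WSTAN losses are not given in this summary, here is only a hedged sketch of the generic MIL ranking objective that such weakly supervised grounding methods build on; the margin form and score shapes are common stand-ins, not the paper's definitions.

```python
# Generic MIL ranking loss for weakly supervised grounding (illustrative).
import torch
import torch.nn.functional as F

def mil_ranking_loss(pos_scores, neg_scores, margin=0.2):
    # pos_scores / neg_scores: (B, K) alignment scores of K candidate segments
    # against the paired / an unpaired sentence. MIL supervises only the best
    # instance in each bag, since segment-level labels are unavailable.
    pos = pos_scores.max(dim=1).values
    neg = neg_scores.max(dim=1).values
    return F.relu(margin + neg - pos).mean()

print(mil_ranking_loss(torch.randn(4, 16), torch.randn(4, 16)))
```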
arXiv Detail & Related papers (2021-06-30T15:42:08Z)
- Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos [134.78406021194985]
We focus on the weakly supervised setting of this task, which only has access to coarse video-level language descriptions without temporal boundary annotations.
We propose a Boundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning to progressively refine the temporal boundary.
arXiv Detail & Related papers (2020-09-18T03:32:47Z)
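As a rough illustration of reinforcement-learning-style boundary refinement in the spirit of BAR: an agent nudges a [start, end] window and is rewarded when a cross-modal alignment scorer improves. The action set, the random stand-in policy, and the mock scorer below are all assumptions for illustration, not the paper's design.

```python
# Toy boundary-refinement loop (illustrative; not BAR's actual policy/reward).
import random

ACTIONS = {"left": (-1, -1), "right": (1, 1), "expand": (-1, 1), "shrink": (1, -1)}

def refine(score_fn, start, end, n_clips, steps=300):
    best = score_fn(start, end)
    for _ in range(steps):
        ds, de = random.choice(list(ACTIONS.values()))  # stand-in for a policy
        s = min(max(start + ds, 0), n_clips - 1)
        e = min(max(end + de, s), n_clips - 1)
        r = score_fn(s, e) - best                       # reward: score improvement
        if r >= 0:                                      # greedily keep non-worse moves;
            start, end, best = s, e, best + r           # a trained policy would act here
    return start, end

# Mock alignment scorer peaking at a ground-truth segment [10, 20].
score = lambda s, e: -abs(s - 10) - abs(e - 20)
print(refine(score, 0, 63, 64))  # typically converges near (10, 20)
```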