Reinforcement Learning for Weakly Supervised Temporal Grounding of
Natural Language in Untrimmed Videos
- URL: http://arxiv.org/abs/2009.08614v1
- Date: Fri, 18 Sep 2020 03:32:47 GMT
- Title: Reinforcement Learning for Weakly Supervised Temporal Grounding of
Natural Language in Untrimmed Videos
- Authors: Jie Wu, Guanbin Li, Xiaoguang Han, Liang Lin
- Abstract summary: We focus on the weakly supervised setting of this task, which only has access to coarse video-level language descriptions without temporal boundary annotations.
We propose a Boundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning to guide the process of progressively refining the temporal boundary.
- Score: 134.78406021194985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal grounding of natural language in untrimmed videos is a fundamental
yet challenging multimedia task facilitating cross-media visual content
retrieval. We focus on the weakly supervised setting of this task that merely
accesses to coarse video-level language description annotation without temporal
boundary, which is more consistent with reality as such weak labels are more
readily available in practice. In this paper, we propose a \emph{Boundary
Adaptive Refinement} (BAR) framework that resorts to reinforcement learning
(RL) to guide the process of progressively refining the temporal boundary. To
the best of our knowledge, we offer the first attempt to extend RL to the
temporal localization task with weak supervision. As it is non-trivial to obtain a
straightforward reward function in the absence of pairwise granular
boundary-query annotations, a cross-modal alignment evaluator is crafted to
measure the alignment degree of a segment-query pair and provide tailored
rewards. This refinement scheme completely abandons the traditional
sliding-window-based solution pattern and yields more efficient,
boundary-flexible and content-aware grounding results. Extensive experiments on
two public benchmarks, Charades-STA and ActivityNet, demonstrate that BAR
outperforms the state-of-the-art weakly-supervised method and even beats some
competitive fully-supervised ones.
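To make the refinement scheme concrete, here is a minimal sketch of how such an RL-driven refinement loop could be organized. The action set, step size, feature pooling, and the stand-in evaluator below are illustrative assumptions rather than the paper's actual implementation; in particular, the alignment evaluator here is a plain cosine similarity, whereas BAR trains a dedicated cross-modal alignment evaluator to produce the reward.

```python
import torch
import torch.nn as nn

# Hypothetical action set for adjusting a segment (s, e); BAR's actual
# action space is not reproduced here.
ACTIONS = ["shift_left", "shift_right", "expand", "shrink", "stop"]

class PolicyNet(nn.Module):
    """Scores actions from pooled segment features and a query embedding."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim * 2, 256), nn.ReLU(),
            nn.Linear(256, len(ACTIONS)),
        )

    def forward(self, seg_feat, query_feat):
        return self.mlp(torch.cat([seg_feat, query_feat], dim=-1))

def pool_segment(video_feats, s, e):
    """Mean-pool frame features inside the normalized segment [s, e]."""
    n = video_feats.shape[0]
    i = min(int(s * n), n - 1)
    j = max(i + 1, int(e * n))
    return video_feats[i:j].mean(dim=0)

def alignment_score(video_feats, s, e, query_feat):
    """Stand-in for BAR's learned evaluator: plain cosine similarity."""
    return torch.cosine_similarity(
        pool_segment(video_feats, s, e), query_feat, dim=0)

def apply_action(s, e, action, step=0.05):
    """Move or resize the boundary, clamped to [0, 1]."""
    if action == "shift_left":
        s, e = s - step, e - step
    elif action == "shift_right":
        s, e = s + step, e + step
    elif action == "expand":
        s, e = s - step, e + step
    elif action == "shrink":
        s, e = s + step, e - step
    s = min(max(s, 0.0), 1.0)
    e = min(max(e, s + 1e-3), 1.0)
    return s, e

def refine(video_feats, query_feat, policy, max_steps=10):
    """Progressively refine an initial guess; reward = alignment gain."""
    s, e = 0.25, 0.75  # coarse initial segment
    prev = alignment_score(video_feats, s, e, query_feat)
    trajectory = []  # (log_prob, reward) pairs for a policy-gradient update
    for _ in range(max_steps):
        logits = policy(pool_segment(video_feats, s, e), query_feat)
        dist = torch.distributions.Categorical(logits=logits)
        idx = dist.sample()
        if ACTIONS[idx.item()] == "stop":
            break
        s, e = apply_action(s, e, ACTIONS[idx.item()])
        score = alignment_score(video_feats, s, e, query_feat)
        trajectory.append((dist.log_prob(idx), score - prev))
        prev = score
    return (s, e), trajectory

# Toy usage with random features:
video_feats = torch.randn(128, 512)  # 128 frame-level features
query_feat = torch.randn(512)
boundary, trajectory = refine(video_feats, query_feat, PolicyNet())
```

The trajectory of (log-probability, reward) pairs is what a REINFORCE-style policy-gradient update would consume; using the change in alignment score as the per-step reward mirrors the paper's idea of letting the evaluator supply a tailored reward in place of missing boundary annotations.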
Related papers
- Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding [70.31050639330603]
Video paragraph grounding aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video.
Existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire.
We introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need for temporal annotations.
arXiv Detail & Related papers (2024-03-18T04:30:31Z)
- Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding [148.46348699343991]
Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
Most existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a multiple instance learning (MIL) framework (sketched after this entry).
We propose a novel candidate-free framework: Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG.
arXiv Detail & Related papers (2022-10-21T13:10:27Z)
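As a reference point for the MIL-based pattern that FSAN departs from, here is a minimal sketch: score each candidate segment against the query, max-pool the scores over candidates, and supervise the result with the video-level pairing label. The cosine scoring, temperature, and max-pooling choice are assumptions for illustration, not any specific paper's formulation.

```python
import torch
import torch.nn.functional as F

def mil_alignment_loss(cand_feats, query_feat, is_matching_pair):
    """MIL-style weak supervision over candidate segments.

    cand_feats: (N, D) features of N candidate segments.
    query_feat: (D,) sentence embedding.
    is_matching_pair: video-level label, the only supervision available.
    """
    # Per-candidate alignment scores (cosine similarity is an assumption).
    scores = F.cosine_similarity(cand_feats, query_feat.unsqueeze(0), dim=1)
    video_score = scores.max()               # MIL aggregation over the bag
    prob = torch.sigmoid(video_score / 0.1)  # temperature 0.1 is arbitrary
    target = torch.tensor(1.0 if is_matching_pair else 0.0)
    return F.binary_cross_entropy(prob, target)
```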
- Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z)
- Self-supervised Learning for Semi-supervised Temporal Language Grounding [84.11582376377471]
Temporal Language Grounding (TLG) aims to localize temporal boundaries of the segments that contain the specified semantics in an untrimmed video.
Previous works tackle this task either in a fully-supervised setting, which requires a large amount of manual annotation, or in a weakly supervised setting, which cannot achieve satisfactory performance.
To achieve good performance with limited annotations, we tackle this task in a semi-supervised way and propose a unified Semi-supervised Temporal Language Grounding (STLG) framework.
arXiv Detail & Related papers (2021-09-23T16:29:16Z)
- Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021 [33.840281113206444]
This report presents an overview of our solution used in the submission to the 2021 HACS Temporal Action Localization Challenge.
We use Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals.
We also adopt an additional module to transfer the knowledge from trimmed videos to untrimmed videos.
Our proposed scheme achieves 39.91 and 29.78 average mAP on the challenge testing sets of the supervised and weakly-supervised temporal action localization tracks, respectively.
arXiv Detail & Related papers (2021-07-27T06:18:21Z)
- Weakly Supervised Temporal Adjacent Network for Language Grounding [96.09453060585497]
We introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding.
WSTAN learns cross-modal semantic alignment by exploiting a temporal adjacent network within a multiple instance learning (MIL) paradigm.
An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination through self-supervision (see the sketch below).
arXiv Detail & Related papers (2021-06-30T15:42:08Z)
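The summary above does not spell out the self-discriminating loss, so the following is only a guess at the general pattern: treat the model's current top-scoring proposal as a pseudo-positive and push the remaining proposals toward the negative class, so that the branch supervises its own discrimination. The function name and formulation are hypothetical, not WSTAN's exact loss.

```python
import torch
import torch.nn.functional as F

def self_discriminating_loss(proposal_scores):
    """Illustrative self-supervision over proposal scores of shape (N,).

    The top-scoring proposal becomes a pseudo-positive; the rest are
    pushed toward zero. This is a generic stand-in, not WSTAN's exact loss.
    """
    probs = torch.sigmoid(proposal_scores)
    pseudo = torch.zeros_like(probs)
    pseudo[probs.argmax()] = 1.0  # self-generated pseudo-label
    return F.binary_cross_entropy(probs, pseudo.detach())
```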