Boosting Weakly-Supervised Temporal Action Localization with Text
Information
- URL: http://arxiv.org/abs/2305.00607v1
- Date: Mon, 1 May 2023 00:07:09 GMT
- Title: Boosting Weakly-Supervised Temporal Action Localization with Text
Information
- Authors: Guozhang Li, De Cheng, Xinpeng Ding, Nannan Wang, Xiaoyu Wang, Xinbo
Gao
- Abstract summary: We propose a Text-Segment Mining (TSM) mechanism, which constructs a text description based on the action class label, and regards the text as the query to mine all class-related segments.
We also introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantic-related segments from videos to complete the text sentence.
Surprisingly, we also find that our proposed method can be seamlessly applied to existing methods and improves their performance by a clear margin.
- Score: 94.48602948837664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Due to the lack of temporal annotation, current Weakly-supervised
Temporal Action Localization (WTAL) methods generally get stuck in
over-complete or incomplete localization. In this paper, we aim to leverage
text information to boost WTAL from two aspects: (a) a discriminative
objective to enlarge the inter-class difference, thus reducing over-complete
localization; and (b) a generative objective to enhance the intra-class
integrity, thus finding more complete temporal boundaries. For the
discriminative objective, we propose a Text-Segment Mining (TSM) mechanism,
which constructs a text description based on the action class label and
regards the text as the query to mine all class-related segments. Without
temporal annotations of actions, TSM compares the text query with entire
videos across the dataset to mine the best-matching segments while ignoring
irrelevant ones. Because different categories of videos share sub-actions,
applying TSM alone is too strict and neglects semantically related segments,
resulting in incomplete localization. We therefore further introduce a
generative objective named Video-text Language Completion (VLC), which
focuses on all semantic-related segments from videos to complete the text
sentence. We achieve state-of-the-art performance on THUMOS14 and
ActivityNet1.3. Surprisingly, we also find that our proposed method can be
seamlessly applied to existing methods and improves their performance by a
clear margin. The code is available at
https://github.com/lgzlIlIlI/Boosting-WTAL.
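To make the two objectives concrete, below is a minimal PyTorch sketch of how they might be wired up: TSM as a discriminative text-query-to-segment matcher, and VLC as a generative masked-word completion conditioned on video segments. The module names, dimensions, top-k pooling, and loss choices are illustrative assumptions, not the authors' released implementation (see the repository linked above for that).

```python
# Hypothetical sketch of TSM and VLC, assuming pre-extracted text and
# video-segment features; all names, dimensions, and losses are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TSMHead(nn.Module):
    """Text-Segment Mining: match class-label text queries against segments."""
    def __init__(self, dim=512):
        super().__init__()
        self.txt_proj = nn.Linear(dim, dim)  # projects text-query features
        self.seg_proj = nn.Linear(dim, dim)  # projects per-segment video features

    def forward(self, text_feat, seg_feats):
        # text_feat: (C, D), one query per action class; seg_feats: (B, T, D)
        q = F.normalize(self.txt_proj(text_feat), dim=-1)
        s = F.normalize(self.seg_proj(seg_feats), dim=-1)
        # (B, T, C) cosine similarity between every segment and every class query
        return torch.einsum('btd,cd->btc', s, q)

def tsm_loss(sim, video_labels, topk=8, scale=10.0):
    # Mine the best-matching segments per class via top-k pooling over time,
    # then supervise with video-level labels only (no temporal annotation).
    scores = scale * sim.topk(topk, dim=1).values.mean(dim=1)  # (B, C) logits
    return F.binary_cross_entropy_with_logits(scores, video_labels)

class VLCHead(nn.Module):
    """Video-text Language Completion: predict masked words from video context."""
    def __init__(self, dim=512, vocab=30522):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cls = nn.Linear(dim, vocab)

    def forward(self, word_feats, seg_feats):
        # word_feats: (B, L, D) sentence tokens with some positions masked;
        # attention lets all semantic-related segments contribute to completion.
        ctx, _ = self.attn(word_feats, seg_feats, seg_feats)
        return self.cls(ctx)  # (B, L, vocab) word logits

def vlc_loss(word_logits, target_ids):
    # Cross-entropy on masked positions only (other targets set to -100).
    return F.cross_entropy(word_logits.flatten(0, 1), target_ids.flatten(),
                           ignore_index=-100)
```

At localization time, the (B, T, C) similarity map from TSMHead can serve as a class activation sequence to be thresholded into temporal proposals, as is common in WTAL pipelines; the paper's actual heads and training schedule are in the linked repository.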
Related papers
- DIAL: Dense Image-text ALignment for Weakly Supervised Semantic Segmentation [8.422110274212503] (2024-09-24)
  Weakly supervised semantic segmentation approaches typically rely on class activation maps (CAMs) for initial seed generation.
  We introduce DALNet, which leverages text embeddings to enhance the comprehensive understanding and precise localization of objects across different levels of granularity.
  In particular, our approach enables a more efficient end-to-end process as a single-stage method.
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715] (2024-05-21)
  We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL), which capitalizes on latent shared semantics across modalities for text-video retrieval.
  Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
- Weakly-Supervised Temporal Action Localization with Bidirectional Semantic Consistency Constraint [83.36913240873236] (2023-04-25)
  Weakly Supervised Temporal Action Localization (WTAL) aims to classify actions and localize their temporal boundaries in a video.
  We propose a simple yet efficient method, named Bidirectional Semantic Consistency Constraint (Bi-SCC), to discriminate positive actions from co-scene actions.
  Experimental results show that our approach outperforms state-of-the-art methods on THUMOS14 and ActivityNet.
- Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos [57.830865926459914] (2023-03-11)
  We propose a vision-language learning framework for untrimmed videos, which automatically detects informative events.
  Instead of coarse-level video-language alignments, we present two dual pretext tasks to encourage fine-grained segment-level alignments.
  Our framework is easily extensible to tasks covering visually-grounded language understanding and generation.
- Fine-grained Semantic Alignment Network for Weakly Supervised Temporal Language Grounding [148.46348699343991] (2022-10-21)
  Temporal language grounding aims to localize a video segment in an untrimmed video based on a natural language description.
  Most existing weakly supervised methods generate a candidate segment set and learn cross-modal alignment through a multiple-instance learning (MIL) based framework.
  We propose a novel candidate-free framework, the Fine-grained Semantic Alignment Network (FSAN), for weakly supervised temporal language grounding.
- Boundary-aware Self-supervised Learning for Video Scene Segmentation [20.713635723315527] (2022-01-14)
  Video scene segmentation is the task of temporally localizing scene boundaries in a video.
  We introduce three novel boundary-aware pretext tasks: Shot-Scene Matching, Contextual Group Matching, and Pseudo-boundary Prediction.
  We achieve a new state of the art on the MovieNet-SSeg benchmark.
- Cross-Sentence Temporal and Semantic Relations in Video Activity Localisation [79.50868197788773] (2021-07-23)
  We develop a more accurate weakly-supervised solution by introducing Cross-Sentence Relations Mining.
  We explore two cross-sentence relational constraints: (1) temporal ordering and (2) semantic consistency among sentences in a paragraph description of video activities.
  Experiments on two publicly available activity localisation datasets show the advantages of our approach over state-of-the-art weakly supervised methods.