Cross-Video Contextual Knowledge Exploration and Exploitation for
Ambiguity Reduction in Weakly Supervised Temporal Action Localization
- URL: http://arxiv.org/abs/2308.12609v2
- Date: Sat, 9 Dec 2023 07:48:24 GMT
- Title: Cross-Video Contextual Knowledge Exploration and Exploitation for
Ambiguity Reduction in Weakly Supervised Temporal Action Localization
- Authors: Songchun Zhang and Chunhui Zhao
- Abstract summary: Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels.
Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset.
Our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
- Score: 23.94629999419033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised temporal action localization (WSTAL) aims to localize
actions in untrimmed videos using video-level labels. Despite recent advances,
existing approaches mainly follow a localization-by-classification pipeline,
generally processing each segment individually, thereby exploiting only limited
contextual information. As a result, the model lacks a comprehensive
understanding (e.g., of appearance and temporal structure) of various action
patterns, leading to ambiguity in classification learning and temporal
localization. Our work addresses this from a novel perspective, by exploring
and exploiting the cross-video contextual knowledge within the dataset to
recover the dataset-level semantic structure of action instances via weak
labels only, thereby indirectly improving the holistic understanding of
fine-grained action patterns and alleviating the aforementioned ambiguities.
Specifically, an end-to-end framework is proposed, including a Robust
Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge
Summarization and Aggregation (GKSA) module. First, the RMGCL module explores
the contrast and consistency of cross-video action features, assisting in
learning a more structured and compact embedding space, thus reducing ambiguity
in classification learning. Further, the GKSA module is used to efficiently
summarize and propagate the cross-video representative action knowledge in a
learnable manner to promote a holistic understanding of action patterns, which in
turn allows the generation of high-confidence pseudo-labels for self-learning,
thus alleviating ambiguity in temporal localization. Extensive experiments on
THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method
outperforms the state-of-the-art methods, and can be easily plugged into other
WSTAL methods.
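The following is a minimal, self-contained PyTorch sketch of the general pattern the abstract describes: a per-class memory of cross-video action knowledge, an InfoNCE-style contrastive loss against it (in the spirit of RMGCL), and an attention-style propagation of memory back into snippet features (in the spirit of GKSA). It is not the authors' implementation; every name, size, and hyper-parameter below (`MemoryGuidedContrast`, `dim=128`, `tau`, the EMA momentum) is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

class MemoryGuidedContrast(torch.nn.Module):
    """Toy cross-video memory: one prototype of 'representative action
    knowledge' per class, updated by exponential moving average."""

    def __init__(self, num_classes: int, dim: int = 128, momentum: float = 0.9):
        super().__init__()
        self.register_buffer(
            "memory", F.normalize(torch.randn(num_classes, dim), dim=1))
        self.momentum = momentum

    @torch.no_grad()
    def update_memory(self, feats, labels):
        # EMA update of class prototypes from labeled snippet features.
        for c in labels.unique():
            mean = feats[labels == c].mean(0)
            self.memory[c] = F.normalize(
                self.momentum * self.memory[c] + (1 - self.momentum) * mean,
                dim=0)

    def contrastive_loss(self, feats, labels, tau: float = 0.07):
        # InfoNCE against the memory: pull each snippet toward its class
        # prototype, push it away from the prototypes of other classes.
        logits = feats @ self.memory.t() / tau  # (N, C)
        return F.cross_entropy(logits, labels)

    def aggregate(self, feats):
        # GKSA-style propagation, sketched as soft attention over the memory:
        # each snippet feature absorbs relevant cross-video knowledge.
        attn = torch.softmax(feats @ self.memory.t(), dim=1)  # (N, C)
        return F.normalize(feats + attn @ self.memory, dim=1)

torch.manual_seed(0)
model = MemoryGuidedContrast(num_classes=20)
feats = F.normalize(torch.randn(64, 128), dim=1)  # snippet embeddings
labels = torch.randint(0, 20, (64,))              # video-level (pseudo) labels
loss = model.contrastive_loss(feats, labels)
model.update_memory(feats, labels)
enriched = model.aggregate(feats)
# High-confidence pseudo-labels for the self-learning step could then be
# obtained by thresholding the class posteriors of the enriched features:
conf, pseudo = (enriched @ model.memory.t()).softmax(dim=1).max(dim=1)
print(loss.item(), enriched.shape, (conf > 0.5).sum().item())
```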
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-supervised Skeleton-based Action Recognition [39.99711066167837]
This paper introduces a contrastive learning framework, the Spatiotemporal Clues Disentanglement Network (SCD-Net).
Specifically, the sequences are integrated with a feature extractor to derive explicit clues from the spatial and temporal domains, respectively (a toy sketch of this idea follows the entry).
We conduct evaluations on the NTU-RGB+D (60 & 120) and PKU-MMD (I & II) datasets, covering downstream tasks such as action recognition, action retrieval, and transfer learning.
arXiv Detail & Related papers (2023-09-11T21:32:13Z)
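As a rough illustration of the disentanglement idea sketched in the SCD-Net entry above: pool a skeleton sequence along time to get a spatial clue and along joints to get a temporal clue, then contrast the two views of the same clip against other clips. The real model uses dedicated encoders and masking strategies; the class name, pooling, and shapes below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

class ClueContrast(torch.nn.Module):
    """Toy spatiotemporal-clue contrast (not SCD-Net's actual encoders)."""

    def __init__(self, T: int = 32, J: int = 25, C: int = 3, dim: int = 64):
        super().__init__()
        self.spatial_head = torch.nn.Linear(J * C, dim)   # clue from joints
        self.temporal_head = torch.nn.Linear(T * C, dim)  # clue from motion

    def forward(self, x, tau: float = 0.1):
        # x: (B, T, J, C) skeleton sequences.
        s = F.normalize(self.spatial_head(x.mean(1).flatten(1)), dim=1)
        t = F.normalize(self.temporal_head(x.mean(2).flatten(1)), dim=1)
        # The two clues of the same clip are positives; other clips in the
        # batch act as negatives (standard InfoNCE).
        logits = s @ t.t() / tau
        return F.cross_entropy(logits, torch.arange(x.size(0)))

loss = ClueContrast()(torch.randn(16, 32, 25, 3))
print(loss.item())
```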
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature [26.7937345622207]
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos simultaneously.
Pseudo-label generation is a promising strategy to solve this challenging problem, but current methods ignore the natural temporal structure of the video.
We propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature.
arXiv Detail & Related papers (2023-03-22T06:08:34Z)
- Exploit CAM by itself: Complementary Learning System for Weakly Supervised Semantic Segmentation [59.24824050194334]
This paper turns to an interesting working mechanism in agent learning named Complementary Learning System (CLS).
Motivated by this simple but effective learning pattern, we propose a General-Specific Learning Mechanism (GSLM)
GSLM develops a General Learning Module (GLM) and a Specific Learning Module (SLM)
arXiv Detail & Related papers (2023-03-04T16:16:47Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations (a generic momentum-contrast sketch follows this entry).
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
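The USER entry above relies on momentum contrast but does not spell out the mechanism. Below is a generic MoCo-style skeleton (momentum-updated key encoder plus a queue of negatives), not USER's actual architecture; the linear "towers", queue size, and temperature are placeholder assumptions.

```python
import copy
import torch
import torch.nn.functional as F

class MomentumContrast(torch.nn.Module):
    """Generic momentum-contrast skeleton for cross-modal retrieval."""

    def __init__(self, encoder, dim=128, queue_size=1024, m=0.999):
        super().__init__()
        self.enc_q = encoder                 # query tower
        self.enc_k = copy.deepcopy(encoder)  # slowly-moving key tower
        for p in self.enc_k.parameters():
            p.requires_grad = False
        self.m = m
        self.register_buffer(
            "queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        for q, k in zip(self.enc_q.parameters(), self.enc_k.parameters()):
            k.mul_(self.m).add_(q, alpha=1 - self.m)

    def forward(self, x_q, x_k, tau=0.07):
        q = F.normalize(self.enc_q(x_q), dim=1)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.enc_k(x_k), dim=1)
        pos = (q * k).sum(1, keepdim=True)  # (B, 1) positive similarities
        neg = q @ self.queue.t()            # (B, K) negatives from the queue
        logits = torch.cat([pos, neg], dim=1) / tau
        labels = torch.zeros(q.size(0), dtype=torch.long)  # positives at 0
        b = k.size(0)                       # enqueue (wrap-around omitted)
        self.queue[self.ptr.item():self.ptr.item() + b] = k
        self.ptr[0] = (self.ptr.item() + b) % self.queue.size(0)
        return F.cross_entropy(logits, labels)

mc = MomentumContrast(torch.nn.Linear(512, 128))
print(mc(torch.randn(8, 512), torch.randn(8, 512)).item())
```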
- HiCLRE: A Hierarchical Contrastive Learning Framework for Distantly Supervised Relation Extraction [24.853265244512954]
We propose a Hierarchical Contrastive Learning framework for Distantly Supervised relation extraction (HiCLRE) to reduce noisy sentences.
Specifically, we propose a three-level hierarchical learning framework that interacts across levels, generating de-noised, context-aware representations (a skeleton of the multi-level loss follows this entry).
Experiments demonstrate that HiCLRE significantly outperforms strong baselines on various mainstream DSRE datasets.
arXiv Detail & Related papers (2022-02-27T12:48:26Z)
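The three-level idea in the HiCLRE entry can be pictured as one contrastive term per semantic level (entity, sentence, and bag in distantly supervised relation extraction), summed with per-level weights. The sketch below is a simplification built on that assumption; HiCLRE's actual cross-level interaction is richer than a weighted sum.

```python
import torch
import torch.nn.functional as F

def info_nce(z, z_pos, tau: float = 0.1):
    # Standard InfoNCE: row i of z and row i of z_pos form a positive pair.
    logits = F.normalize(z, dim=1) @ F.normalize(z_pos, dim=1).t() / tau
    return F.cross_entropy(logits, torch.arange(z.size(0)))

def hierarchical_contrastive_loss(levels, weights=(1.0, 1.0, 1.0)):
    # levels: [(entity, entity_pos), (sentence, sentence_pos), (bag, bag_pos)]
    return sum(w * info_nce(z, zp) for w, (z, zp) in zip(weights, levels))

# Toy usage: random features standing in for the three semantic levels.
lv = [(torch.randn(8, 64), torch.randn(8, 64)) for _ in range(3)]
print(hierarchical_contrastive_loss(lv).item())
```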
- Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021 [33.840281113206444]
This report presents an overview of our solution used in the submission to the 2021 HACS Temporal Action Localization Challenge.
We use Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals.
We also adopt an additional module to transfer the knowledge from trimmed videos to untrimmed videos.
Our proposed scheme achieves 39.91 and 29.78 average mAP on the challenge testing set for the supervised and weakly-supervised temporal action localization tracks, respectively.
arXiv Detail & Related papers (2021-07-27T06:18:21Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces, one for actions and one for their context (a toy two-subspace head follows this entry).
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
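The subspace entry above suggests a simple picture: two projection heads map the same snippet features into separate action and context subspaces, with a penalty discouraging the two projections from encoding the same evidence. The head below is a hypothetical sketch, not the paper's model; dimensions and the overlap penalty are assumptions.

```python
import torch
import torch.nn.functional as F

class ActionContextSubspaces(torch.nn.Module):
    """Toy two-subspace head for snippet features of an untrimmed video."""

    def __init__(self, in_dim: int = 1024, sub_dim: int = 128):
        super().__init__()
        self.to_action = torch.nn.Linear(in_dim, sub_dim)
        self.to_context = torch.nn.Linear(in_dim, sub_dim)

    def forward(self, feats):
        # feats: (B, T, in_dim) snippet features.
        a = F.normalize(self.to_action(feats), dim=-1)
        c = F.normalize(self.to_context(feats), dim=-1)
        # Penalize per-snippet similarity between the two projections so the
        # subspaces capture different evidence (action vs. its context).
        overlap = (a * c).sum(-1).abs().mean()
        return a, c, overlap

a, c, overlap = ActionContextSubspaces()(torch.randn(2, 100, 1024))
print(a.shape, c.shape, overlap.item())
```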
- Intra- and Inter-Action Understanding via Temporal Action Parsing [118.32912239230272]
We construct a new dataset developed on sport videos with manual annotations of sub-actions, and conduct a study on temporal action parsing on top of it.
Our study shows that a sport activity usually consists of multiple sub-actions and that the awareness of such temporal structures is beneficial to action recognition.
We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing their labels.
arXiv Detail & Related papers (2020-05-20T17:45:18Z)