Cross-Video Contextual Knowledge Exploration and Exploitation for
Ambiguity Reduction in Weakly Supervised Temporal Action Localization
- URL: http://arxiv.org/abs/2308.12609v2
- Date: Sat, 9 Dec 2023 07:48:24 GMT
- Title: Cross-Video Contextual Knowledge Exploration and Exploitation for
Ambiguity Reduction in Weakly Supervised Temporal Action Localization
- Authors: Songchun Zhang and Chunhui Zhao
- Abstract summary: Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels.
Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset.
Our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
- Score: 23.94629999419033
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Weakly supervised temporal action localization (WSTAL) aims to localize
actions in untrimmed videos using video-level labels. Despite recent advances,
existing approaches mainly follow a localization-by-classification pipeline,
generally processing each segment individually, thereby exploiting only limited
contextual information. As a result, the model lacks a comprehensive
understanding (e.g., of appearance and temporal structure) of various action
patterns, leading to ambiguity in classification learning and temporal
localization. Our work addresses this from a novel perspective, by exploring
and exploiting the cross-video contextual knowledge within the dataset to
recover the dataset-level semantic structure of action instances via weak
labels only, thereby indirectly improving the holistic understanding of
fine-grained action patterns and alleviating the aforementioned ambiguities.
Specifically, an end-to-end framework is proposed, including a Robust
Memory-Guided Contrastive Learning (RMGCL) module and a Global Knowledge
Summarization and Aggregation (GKSA) module. First, the RMGCL module explores
the contrast and consistency of cross-video action features, assisting in
learning a more structured and compact embedding space, thus reducing ambiguity
in classification learning. Further, the GKSA module is used to efficiently
summarize and propagate the cross-video representative action knowledge in a
learnable manner to promote a holistic understanding of action patterns, which in
turn allows the generation of high-confidence pseudo-labels for self-learning,
thus alleviating ambiguity in temporal localization. Extensive experiments on
THUMOS14, ActivityNet1.3, and FineAction demonstrate that our method
outperforms the state-of-the-art methods, and can be easily plugged into other
WSTAL methods.
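The following is a minimal, self-contained PyTorch sketch of the general pattern the abstract describes: a per-class memory of cross-video action knowledge, an InfoNCE-style contrastive loss against it (in the spirit of RMGCL), and an attention-style propagation of memory back into snippet features (in the spirit of GKSA). It is not the authors' implementation; every name, size, and hyper-parameter below (`MemoryGuidedContrast`, `dim=128`, `tau`, the EMA momentum) is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

class MemoryGuidedContrast(torch.nn.Module):
    """Toy cross-video memory: one prototype of 'representative action
    knowledge' per class, updated by exponential moving average."""

    def __init__(self, num_classes: int, dim: int = 128, momentum: float = 0.9):
        super().__init__()
        self.register_buffer(
            "memory", F.normalize(torch.randn(num_classes, dim), dim=1))
        self.momentum = momentum

    @torch.no_grad()
    def update_memory(self, feats, labels):
        # EMA update of class prototypes from labeled snippet features.
        for c in labels.unique():
            mean = feats[labels == c].mean(0)
            self.memory[c] = F.normalize(
                self.momentum * self.memory[c] + (1 - self.momentum) * mean,
                dim=0)

    def contrastive_loss(self, feats, labels, tau: float = 0.07):
        # InfoNCE against the memory: pull each snippet toward its class
        # prototype, push it away from the prototypes of other classes.
        logits = feats @ self.memory.t() / tau  # (N, C)
        return F.cross_entropy(logits, labels)

    def aggregate(self, feats):
        # GKSA-style propagation, sketched as soft attention over the memory:
        # each snippet feature absorbs relevant cross-video knowledge.
        attn = torch.softmax(feats @ self.memory.t(), dim=1)  # (N, C)
        return F.normalize(feats + attn @ self.memory, dim=1)

torch.manual_seed(0)
model = MemoryGuidedContrast(num_classes=20)
feats = F.normalize(torch.randn(64, 128), dim=1)  # snippet embeddings
labels = torch.randint(0, 20, (64,))              # video-level (pseudo) labels
loss = model.contrastive_loss(feats, labels)
model.update_memory(feats, labels)
enriched = model.aggregate(feats)
# High-confidence pseudo-labels for the self-learning step could then be
# obtained by thresholding the class posteriors of the enriched features:
conf, pseudo = (enriched @ model.memory.t()).softmax(dim=1).max(dim=1)
print(loss.item(), enriched.shape, (conf > 0.5).sum().item())
```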
Related papers
- Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL)
GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval.
Our method achieves performance comparable to the state of the art while being nearly 220 times faster in computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
- SCD-Net: Spatiotemporal Clues Disentanglement Network for Self-supervised Skeleton-based Action Recognition [39.99711066167837]
This paper introduces a contrastive learning framework, the Spatiotemporal Clues Disentanglement Network (SCD-Net).
Specifically, the sequences are integrated with a feature extractor to derive explicit clues from the spatial and temporal domains, respectively (a toy sketch of this idea follows the entry).
We conduct evaluations on the NTU-RGB+D (60 & 120) and PKU-MMD (I & II) datasets, covering downstream tasks such as action recognition, action retrieval, and transfer learning.
arXiv Detail & Related papers (2023-09-11T21:32:13Z)
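As a rough illustration of the disentanglement idea sketched in the SCD-Net entry above: pool a skeleton sequence along time to get a spatial clue and along joints to get a temporal clue, then contrast the two views of the same clip against other clips. The real model uses dedicated encoders and masking strategies; the class name, pooling, and shapes below are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

class ClueContrast(torch.nn.Module):
    """Toy spatiotemporal-clue contrast (not SCD-Net's actual encoders)."""

    def __init__(self, T: int = 32, J: int = 25, C: int = 3, dim: int = 64):
        super().__init__()
        self.spatial_head = torch.nn.Linear(J * C, dim)   # clue from joints
        self.temporal_head = torch.nn.Linear(T * C, dim)  # clue from motion

    def forward(self, x, tau: float = 0.1):
        # x: (B, T, J, C) skeleton sequences.
        s = F.normalize(self.spatial_head(x.mean(1).flatten(1)), dim=1)
        t = F.normalize(self.temporal_head(x.mean(2).flatten(1)), dim=1)
        # The two clues of the same clip are positives; other clips in the
        # batch act as negatives (standard InfoNCE).
        logits = s @ t.t() / tau
        return F.cross_entropy(logits, torch.arange(x.size(0)))

loss = ClueContrast()(torch.randn(16, 32, 25, 3))
print(loss.item())
```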
- Structured Video-Language Modeling with Temporal Grouping and Spatial Grounding [112.3913646778859]
We propose a simple yet effective video-language modeling framework, S-ViLM.
It includes two novel designs, inter-clip spatial grounding and intra-clip temporal grouping, to promote learning region-object alignment and temporal-aware features.
S-ViLM surpasses the state-of-the-art methods substantially on four representative downstream tasks.
arXiv Detail & Related papers (2023-03-28T22:45:07Z)
- Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature [26.7937345622207]
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos simultaneously.
Pseudo-label generation is a promising strategy to solve this challenging problem, but current methods ignore the natural temporal structure of the video.
We propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature.
arXiv Detail & Related papers (2023-03-22T06:08:34Z)
- Exploit CAM by itself: Complementary Learning System for Weakly Supervised Semantic Segmentation [59.24824050194334]
This paper turns to an interesting working mechanism in agent learning named Complementary Learning System (CLS).
Motivated by this simple but effective learning pattern, we propose a General-Specific Learning Mechanism (GSLM)
GSLM develops a General Learning Module (GLM) and a Specific Learning Module (SLM)
arXiv Detail & Related papers (2023-03-04T16:16:47Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations (a generic momentum-contrast sketch follows this entry).
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
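The USER entry above relies on momentum contrast but does not spell out the mechanism. Below is a generic MoCo-style skeleton (momentum-updated key encoder plus a queue of negatives), not USER's actual architecture; the linear "towers", queue size, and temperature are placeholder assumptions.

```python
import copy
import torch
import torch.nn.functional as F

class MomentumContrast(torch.nn.Module):
    """Generic momentum-contrast skeleton for cross-modal retrieval."""

    def __init__(self, encoder, dim=128, queue_size=1024, m=0.999):
        super().__init__()
        self.enc_q = encoder                 # query tower
        self.enc_k = copy.deepcopy(encoder)  # slowly-moving key tower
        for p in self.enc_k.parameters():
            p.requires_grad = False
        self.m = m
        self.register_buffer(
            "queue", F.normalize(torch.randn(queue_size, dim), dim=1))
        self.register_buffer("ptr", torch.zeros(1, dtype=torch.long))

    @torch.no_grad()
    def _momentum_update(self):
        for q, k in zip(self.enc_q.parameters(), self.enc_k.parameters()):
            k.mul_(self.m).add_(q, alpha=1 - self.m)

    def forward(self, x_q, x_k, tau=0.07):
        q = F.normalize(self.enc_q(x_q), dim=1)
        with torch.no_grad():
            self._momentum_update()
            k = F.normalize(self.enc_k(x_k), dim=1)
        pos = (q * k).sum(1, keepdim=True)  # (B, 1) positive similarities
        neg = q @ self.queue.t()            # (B, K) negatives from the queue
        logits = torch.cat([pos, neg], dim=1) / tau
        labels = torch.zeros(q.size(0), dtype=torch.long)  # positives at 0
        b = k.size(0)                       # enqueue (wrap-around omitted)
        self.queue[self.ptr.item():self.ptr.item() + b] = k
        self.ptr[0] = (self.ptr.item() + b) % self.queue.size(0)
        return F.cross_entropy(logits, labels)

mc = MomentumContrast(torch.nn.Linear(512, 128))
print(mc(torch.randn(8, 512), torch.randn(8, 512)).item())
```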
- HiCLRE: A Hierarchical Contrastive Learning Framework for Distantly Supervised Relation Extraction [24.853265244512954]
We propose a Hierarchical Contrastive Learning framework for Distantly Supervised relation extraction (HiCLRE) to reduce noisy sentences.
Specifically, we propose a three-level hierarchical learning framework that interacts across levels, generating de-noised, context-aware representations (a skeleton of the multi-level loss follows this entry).
Experiments demonstrate that HiCLRE significantly outperforms strong baselines on various mainstream DSRE datasets.
arXiv Detail & Related papers (2022-02-27T12:48:26Z)
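The three-level idea in the HiCLRE entry can be pictured as one contrastive term per semantic level (entity, sentence, and bag in distantly supervised relation extraction), summed with per-level weights. The sketch below is a simplification built on that assumption; HiCLRE's actual cross-level interaction is richer than a weighted sum.

```python
import torch
import torch.nn.functional as F

def info_nce(z, z_pos, tau: float = 0.1):
    # Standard InfoNCE: row i of z and row i of z_pos form a positive pair.
    logits = F.normalize(z, dim=1) @ F.normalize(z_pos, dim=1).t() / tau
    return F.cross_entropy(logits, torch.arange(z.size(0)))

def hierarchical_contrastive_loss(levels, weights=(1.0, 1.0, 1.0)):
    # levels: [(entity, entity_pos), (sentence, sentence_pos), (bag, bag_pos)]
    return sum(w * info_nce(z, zp) for w, (z, zp) in zip(weights, levels))

# Toy usage: random features standing in for the three semantic levels.
lv = [(torch.randn(8, 64), torch.randn(8, 64)) for _ in range(3)]
print(hierarchical_contrastive_loss(lv).item())
```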
- Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021 [33.840281113206444]
This report presents an overview of our solution used in the submission to the 2021 HACS Temporal Action Localization Challenge.
We use Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals.
We also adopt an additional module to transfer the knowledge from trimmed videos to untrimmed videos.
Our proposed scheme achieves 39.91 and 29.78 average mAP on the challenge testing set for the supervised and weakly-supervised temporal action localization tracks, respectively.
arXiv Detail & Related papers (2021-07-27T06:18:21Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces, one for actions and one for their context (a toy two-subspace head follows this entry).
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
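The subspace entry above suggests a simple picture: two projection heads map the same snippet features into separate action and context subspaces, with a penalty discouraging the two projections from encoding the same evidence. The head below is a hypothetical sketch, not the paper's model; dimensions and the overlap penalty are assumptions.

```python
import torch
import torch.nn.functional as F

class ActionContextSubspaces(torch.nn.Module):
    """Toy two-subspace head for snippet features of an untrimmed video."""

    def __init__(self, in_dim: int = 1024, sub_dim: int = 128):
        super().__init__()
        self.to_action = torch.nn.Linear(in_dim, sub_dim)
        self.to_context = torch.nn.Linear(in_dim, sub_dim)

    def forward(self, feats):
        # feats: (B, T, in_dim) snippet features.
        a = F.normalize(self.to_action(feats), dim=-1)
        c = F.normalize(self.to_context(feats), dim=-1)
        # Penalize per-snippet similarity between the two projections so the
        # subspaces capture different evidence (action vs. its context).
        overlap = (a * c).sum(-1).abs().mean()
        return a, c, overlap

a, c, overlap = ActionContextSubspaces()(torch.randn(2, 100, 1024))
print(a.shape, c.shape, overlap.item())
```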
- Intra- and Inter-Action Understanding via Temporal Action Parsing [118.32912239230272]
We construct a new dataset developed on sport videos with manual annotations of sub-actions, and conduct a study on temporal action parsing on top of it.
Our study shows that a sport activity usually consists of multiple sub-actions and that the awareness of such temporal structures is beneficial to action recognition.
We also investigate a number of temporal parsing methods, and thereon devise an improved method that is capable of mining sub-actions from training data without knowing their labels.
arXiv Detail & Related papers (2020-05-20T17:45:18Z)