ACGNet: Action Complement Graph Network for Weakly-supervised Temporal
Action Localization
- URL: http://arxiv.org/abs/2112.10977v1
- Date: Tue, 21 Dec 2021 04:18:44 GMT
- Title: ACGNet: Action Complement Graph Network for Weakly-supervised Temporal
Action Localization
- Authors: Zichen Yang, Jie Qin, Di Huang
- Abstract summary: Weakly-supervised temporal action localization (WTAL) in untrimmed videos has emerged as a practical but challenging task since only video-level labels are available.
Existing approaches typically leverage off-the-shelf segment-level features, which suffer from spatial incompleteness and temporal incoherence.
In this paper, we tackle this problem by enhancing segment-level representations with a simple yet effective graph convolutional network.
- Score: 39.377289930528555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Weakly-supervised temporal action localization (WTAL) in untrimmed videos has
emerged as a practical but challenging task since only video-level labels are
available. Existing approaches typically leverage off-the-shelf segment-level
features, which suffer from spatial incompleteness and temporal incoherence,
thus limiting their performance. In this paper, we tackle this problem from a
new perspective by enhancing segment-level representations with a simple yet
effective graph convolutional network, namely action complement graph network
(ACGNet). It enables the current video segment to perceive spatial-temporal
dependencies from other segments that potentially convey complementary clues,
implicitly mitigating the negative effects caused by the two issues above. In
this way, the segment-level features are more discriminative and robust to
spatial-temporal variations, contributing to higher localization accuracies.
More importantly, the proposed ACGNet works as a universal module that can be
flexibly plugged into different WTAL frameworks, while maintaining the
end-to-end training fashion. Extensive experiments are conducted on the
THUMOS'14 and ActivityNet1.2 benchmarks, where the state-of-the-art results
clearly demonstrate the superiority of the proposed approach.
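To make the idea above concrete, here is a minimal PyTorch sketch of a graph-based segment enhancer in the spirit of ACGNet: segment features form the graph nodes, a top-k cosine-similarity graph supplies the edges, and one propagation step with a residual connection injects complementary clues. The class name, top-k sparsification, and single linear projection are illustrative assumptions, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentComplementGraph(nn.Module):
    """Illustrative ACGNet-style module: each segment aggregates
    complementary clues from its most similar segments via one
    graph-convolution step with a residual connection."""

    def __init__(self, dim: int, top_k: int = 8):
        super().__init__()
        self.top_k = top_k
        self.proj = nn.Linear(dim, dim)  # hypothetical propagation weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, D) segment-level features of one untrimmed video
        z = F.normalize(x, dim=-1)
        sim = z @ z.t()  # (T, T) cosine similarities
        vals, idx = sim.topk(min(self.top_k, x.size(0)), dim=-1)
        adj = torch.zeros_like(sim).scatter_(-1, idx, vals)  # sparse affinity graph
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)  # row-normalize
        # one propagation step; the residual keeps the original signal intact
        return x + self.proj(adj @ x)
```

Since the module maps (T, D) segment features to same-shaped features, it can in principle be dropped between the feature extractor and the classification head of an existing WTAL pipeline, which reflects the plug-and-play property the abstract emphasizes.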
Related papers
- AttenScribble: Attentive Similarity Learning for Scribble-Supervised Medical Image Segmentation [5.8447004333496855]
In this paper, we present a straightforward yet effective scribble-supervised learning framework.
We create a pluggable spatial self-attention module that can be attached on top of any internal feature layer of an arbitrary fully convolutional network (FCN) backbone.
This attentive similarity leads to a novel regularization loss that imposes consistency between segmentation prediction and visual affinity.
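As a rough sketch of that regularization idea (hypothetical function name and MSE penalty; not AttenScribble's exact loss), predictions smoothed by a feature-affinity matrix are pushed to agree with the raw predictions:

```python
import torch
import torch.nn.functional as F

def affinity_consistency_loss(feat: torch.Tensor, pred: torch.Tensor) -> torch.Tensor:
    """Illustrative regularizer: class predictions smoothed by a
    feature-affinity matrix should agree with the raw predictions,
    so visually similar pixels receive similar labels."""
    # feat: (N, C) pixel features, pred: (N, K) class probabilities
    z = F.normalize(feat, dim=-1)
    aff = F.softmax(z @ z.t(), dim=-1)  # (N, N) row-stochastic affinity
    smoothed = aff @ pred               # affinity-weighted neighbour predictions
    return F.mse_loss(pred, smoothed)
```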
arXiv Detail & Related papers (2023-12-11T18:42:18Z)
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- Cross-Video Contextual Knowledge Exploration and Exploitation for Ambiguity Reduction in Weakly Supervised Temporal Action Localization [23.94629999419033]
Weakly supervised temporal action localization (WSTAL) aims to localize actions in untrimmed videos using video-level labels.
Our work addresses this from a novel perspective, by exploring and exploiting the cross-video contextual knowledge within the dataset.
Our method outperforms the state-of-the-art methods, and can be easily plugged into other WSTAL methods.
arXiv Detail & Related papers (2023-08-24T07:19:59Z)
- DDG-Net: Discriminability-Driven Graph Network for Weakly-supervised Temporal Action Localization [40.521076622370806]
We propose Discriminability-Driven Graph Network (DDG-Net), which explicitly models ambiguous snippets and discriminative snippets with well-designed connections.
Experiments on THUMOS14 and ActivityNet1.2 benchmarks demonstrate the effectiveness of DDG-Net.
arXiv Detail & Related papers (2023-07-31T05:48:39Z)
- Weakly-Supervised Temporal Action Localization by Inferring Salient Snippet-Feature [26.7937345622207]
Weakly-supervised temporal action localization aims to locate action regions and identify action categories in untrimmed videos simultaneously.
Pseudo-label generation is a promising strategy for this challenging problem, but current methods ignore the natural temporal structure of the video.
We propose a novel weakly-supervised temporal action localization method by inferring salient snippet-feature.
arXiv Detail & Related papers (2023-03-22T06:08:34Z)
- Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021 [33.840281113206444]
This report presents an overview of our solution used in the submission to the 2021 HACS Temporal Action Localization Challenge.
We use Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals.
We also adopt an additional module to transfer the knowledge from trimmed videos to untrimmed videos.
Our proposed scheme achieves 39.91 and 29.78 average mAP on the challenge testing set for the supervised and weakly-supervised temporal action localization tracks, respectively.
arXiv Detail & Related papers (2021-07-27T06:18:21Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
Weakly supervised temporal action localization (WS-TAL) methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for temporal action proposal generation (TAPG).
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
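A minimal sketch of that adaptive local graph idea, assuming PyTorch and purely illustrative layer choices (not ATAG's exact formulation): edge weights between adjacent snippets are predicted from their feature differences, so aggregation adapts to local content.

```python
import torch
import torch.nn as nn

class LocalDiffGraphConv(nn.Module):
    """Illustrative adaptive local graph convolution: edge weights between
    adjacent snippets come from their feature differences, so temporal
    context flows more strongly across similar neighbours."""

    def __init__(self, dim: int):
        super().__init__()
        self.edge = nn.Linear(dim, 1)   # scores an edge from a feature difference
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (T, D) snippet features in temporal order
        w = torch.sigmoid(self.edge(x[1:] - x[:-1]))  # (T-1, 1) adaptive edge weights
        left = torch.zeros_like(x)
        right = torch.zeros_like(x)
        left[1:] = w * self.proj(x[:-1])    # message from the left neighbour
        right[:-1] = w * self.proj(x[1:])   # message from the right neighbour
        return x + left + right
```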
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
- Video Self-Stitching Graph Network for Temporal Action Localization [54.1254121061467]
We propose a multi-level cross-scale solution dubbed video self-stitching graph network (VSGN).
VSGN has two key components: video self-stitching (VSS) and a cross-scale graph pyramid network (xGPN).
Our VSGN not only enhances the feature representations, but also generates more positive anchors for short actions and more short training samples.
arXiv Detail & Related papers (2020-11-30T07:44:52Z)
- Reinforcement Learning for Weakly Supervised Temporal Grounding of Natural Language in Untrimmed Videos [134.78406021194985]
We focus on the weakly supervised setting of this task, which merely accesses coarse video-level language descriptions without temporal boundary annotations.
We propose a Boundary Adaptive Refinement (BAR) framework that resorts to reinforcement learning to guide the process of progressively refining the temporal boundary.
arXiv Detail & Related papers (2020-09-18T03:32:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.