Bottom-Up Temporal Action Localization with Mutual Regularization
- URL: http://arxiv.org/abs/2002.07358v3
- Date: Fri, 26 Feb 2021 02:26:46 GMT
- Title: Bottom-Up Temporal Action Localization with Mutual Regularization
- Authors: Peisen Zhao, Lingxi Xie, Chen Ju, Ya Zhang, Yanfeng Wang, Qi Tian
- Abstract summary: State-of-the-art solutions for TAL involve evaluating the frame-level probabilities of three action-indicating phases.
We introduce two regularization terms to mutually regularize the learning procedure.
Experiments are performed on two popular TAL datasets, THUMOS14 and ActivityNet1.3.
- Score: 107.39785866001868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, temporal action localization (TAL), i.e., finding specific action
segments in untrimmed videos, has attracted increasing attentions of the
computer vision community. State-of-the-art solutions for TAL involves
evaluating the frame-level probabilities of three action-indicating phases,
i.e. starting, continuing, and ending; and then post-processing these
predictions for the final localization. This paper delves deep into this
mechanism, and argues that existing methods, by modeling these phases as
individual classification tasks, ignored the potential temporal constraints
between them. This can lead to incorrect and/or inconsistent predictions when
some frames of the video input lack sufficient discriminative information. To
alleviate this problem, we introduce two regularization terms to mutually
regularize the learning procedure: the Intra-phase Consistency (IntraC)
regularization is proposed to make the predictions verified inside each phase;
and the Inter-phase Consistency (InterC) regularization is proposed to keep
consistency between these phases. Jointly optimizing these two terms, the
entire framework is aware of these potential constraints during an end-to-end
optimization process. Experiments are performed on two popular TAL datasets,
THUMOS14 and ActivityNet1.3. Our approach clearly outperforms the baseline both
quantitatively and qualitatively. The proposed regularization also generalizes
to other TAL methods (e.g., TSA-Net and PGCN). code:
https://github.com/PeisenZhao/Bottom-Up-TAL-with-MR
Related papers
- TS-TCD: Triplet-Level Cross-Modal Distillation for Time-Series Forecasting Using Large Language Models [15.266543423942617]
We present a novel framework, TS-TCD, which introduces a comprehensive three-tiered cross-modal knowledge distillation mechanism.
Unlike prior work that focuses on isolated alignment techniques, our framework systematically integrates.
Experiments on benchmark time-series demonstrate that TS-TCD achieves state-of-the-art results, outperforming traditional methods in both accuracy and robustness.
arXiv Detail & Related papers (2024-09-23T12:57:24Z) - STAT: Towards Generalizable Temporal Action Localization [56.634561073746056]
Weakly-supervised temporal action localization (WTAL) aims to recognize and localize action instances with only video-level labels.
Existing methods suffer from severe performance degradation when transferring to different distributions.
We propose GTAL, which focuses on improving the generalizability of action localization methods.
arXiv Detail & Related papers (2024-04-20T07:56:21Z) - Weakly Supervised Co-training with Swapping Assignments for Semantic Segmentation [21.345548821276097]
Class activation maps (CAMs) are commonly employed in weakly supervised semantic segmentation (WSSS) to produce pseudo-labels.
We propose an end-to-end WSSS model incorporating guided CAMs, wherein our segmentation model is trained while concurrently optimizing CAMs online.
CoSA is the first single-stage approach to outperform all existing multi-stage methods including those with additional supervision.
arXiv Detail & Related papers (2024-02-27T21:08:23Z) - DIR-AS: Decoupling Individual Identification and Temporal Reasoning for
Action Segmentation [84.78383981697377]
Fully supervised action segmentation works on frame-wise action recognition with dense annotations and often suffers from the over-segmentation issue.
We develop a novel local-global attention mechanism with temporal pyramid dilation and temporal pyramid pooling for efficient multi-scale attention.
We achieve state-of-the-art accuracy, eg, 82.8% (+2.6%) on GTEA and 74.7% (+1.2%) on Breakfast, which demonstrates the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-04-04T20:27:18Z) - Fine-grained Temporal Contrastive Learning for Weakly-supervised
Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z) - Temporal Transductive Inference for Few-Shot Video Object Segmentation [27.140141181513425]
Few-shot object segmentation (FS-VOS) aims at segmenting video frames using a few labelled examples of classes not seen during initial training.
Key to our approach is the use of both global and local temporal constraints.
Empirically, our model outperforms state-of-the-art meta-learning approaches in terms of mean intersection over union on YouTube-VIS by 2.8%.
arXiv Detail & Related papers (2022-03-27T14:08:30Z) - Temporal Context Aggregation Network for Temporal Action Proposal
Refinement [93.03730692520999]
Temporal action proposal generation is a challenging yet important task in the video understanding field.
Current methods still suffer from inaccurate temporal boundaries and inferior confidence used for retrieval.
We propose TCANet to generate high-quality action proposals through "local and global" temporal context aggregation.
arXiv Detail & Related papers (2021-03-24T12:34:49Z) - Two-Stream Consensus Network for Weakly-Supervised Temporal Action
Localization [94.37084866660238]
We present a Two-Stream Consensus Network (TSCN) to simultaneously address these challenges.
The proposed TSCN features an iterative refinement training method, where a frame-level pseudo ground truth is iteratively updated.
We propose a new attention normalization loss to encourage the predicted attention to act like a binary selection, and promote the precise localization of action instance boundaries.
arXiv Detail & Related papers (2020-10-22T10:53:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.