HTNet: Anchor-free Temporal Action Localization with Hierarchical
Transformers
- URL: http://arxiv.org/abs/2207.09662v2
- Date: Thu, 21 Jul 2022 01:42:00 GMT
- Title: HTNet: Anchor-free Temporal Action Localization with Hierarchical
Transformers
- Authors: Tae-Kyung Kang, Gun-Hee Lee, and Seong-Whan Lee
- Abstract summary: Temporal action localization (TAL) is a task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate how our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
- Score: 19.48000379201692
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Temporal action localization (TAL) is a task of identifying a set of actions
in a video, which involves localizing the start and end frames and classifying
each action instance. Existing methods have addressed this task by using
predefined anchor windows or heuristic bottom-up boundary-matching strategies,
which are major bottlenecks in inference time. Additionally, the main challenge
is the inability to capture long-range actions due to a lack of global
contextual information. In this paper, we present a novel anchor-free
framework, referred to as HTNet, which predicts a set of <start time, end time,
class> triplets from a video based on a Transformer architecture. After the
prediction of coarse boundaries, we refine it through a background feature
sampling (BFS) module and hierarchical Transformers, which enables our model to
aggregate global contextual information and effectively exploit the inherent
semantic relationships in a video. We demonstrate how our method localizes
accurate action instances and achieves state-of-the-art performance on two TAL
benchmark datasets: THUMOS14 and ActivityNet 1.3.
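A minimal sketch of the idea, assuming a PyTorch-style model: an anchor-free head regresses start/end offsets and class scores at every temporal location, after which the coarse boundaries would be refined (e.g., by background feature sampling and hierarchical re-encoding). All module and parameter names below are illustrative assumptions, not the authors' released code.

```python
# Illustrative sketch of an anchor-free triplet head for TAL.
# Names, sizes, and structure are assumptions, not the HTNet release.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AnchorFreeTALHead(nn.Module):
    """Predicts a <start time, end time, class> triplet at every temporal location."""

    def __init__(self, dim: int = 256, num_classes: int = 20, num_layers: int = 4):
        super().__init__()
        # Transformer encoder aggregates global temporal context.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Regression branch: distances to the action's start and end.
        self.reg_head = nn.Linear(dim, 2)
        # Classification branch: per-location action class scores.
        self.cls_head = nn.Linear(dim, num_classes)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, T, dim) snippet-level video features.
        ctx = self.encoder(feats)
        # Softplus keeps the predicted start/end offsets non-negative.
        offsets = F.softplus(self.reg_head(ctx))   # (B, T, 2)
        logits = self.cls_head(ctx)                # (B, T, num_classes)
        t = torch.arange(feats.size(1), device=feats.device).float()
        starts = t.unsqueeze(0) - offsets[..., 0]  # coarse start times
        ends = t.unsqueeze(0) + offsets[..., 1]    # coarse end times
        return starts, ends, logits

# Usage: coarse boundaries from this head would then be refined, e.g. by
# sampling background features around them and re-encoding hierarchically.
head = AnchorFreeTALHead()
s, e, c = head(torch.randn(2, 100, 256))
```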
Related papers
- Technical Report for ActivityNet Challenge 2022 -- Temporal Action Localization [20.268572246761895]
We propose to locate the temporal boundaries of each action and predict the action class in untrimmed videos.
Faster-TAD simplifies the TAD pipeline and achieves remarkable performance.
arXiv Detail & Related papers (2024-10-31T14:16:56Z)
- Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims at learning the instance-level action pattern from the video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework.
It requires similar annotation costs but steadily improves localization performance compared to conventional weakly supervised methods.
By revealing that the performance bottleneck of existing approaches mainly comes from background errors, we find that a stronger action localizer can be trained with labels on background video frames rather than on action frames.
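As a hypothetical illustration of that idea (not the paper's implementation), a per-frame localizer could be penalized only at the frames a user clicked as background:

```python
# Hypothetical sketch of background-click supervision: per-frame class
# scores are supervised only at annotated (clicked) background frames.
# Function and variable names are illustrative assumptions.
import torch
import torch.nn.functional as F

def background_click_loss(logits: torch.Tensor, bg_idx: torch.Tensor) -> torch.Tensor:
    """logits: (T, C+1) per-frame scores, where class C is 'background';
    bg_idx: indices of frames clicked as background."""
    bg_class = logits.size(1) - 1
    target = torch.full((bg_idx.numel(),), bg_class, dtype=torch.long)
    # Push the clicked frames toward the background class.
    return F.cross_entropy(logits[bg_idx], target)

loss = background_click_loss(torch.randn(100, 21), torch.tensor([3, 40, 77]))
```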
arXiv Detail & Related papers (2021-11-24T12:02:52Z)
- Transferable Knowledge-Based Multi-Granularity Aggregation Network for Temporal Action Localization: Submission to ActivityNet Challenge 2021 [33.840281113206444]
This report presents an overview of our solution submitted to the 2021 HACS Temporal Action Localization Challenge.
We use Temporal Context Aggregation Network (TCANet) to generate high-quality action proposals.
We also adopt an additional module to transfer the knowledge from trimmed videos to untrimmed videos.
Our proposed scheme achieves 39.91 and 29.78 average mAP on the challenge testing sets of the supervised and weakly-supervised temporal action localization tracks, respectively.
arXiv Detail & Related papers (2021-07-27T06:18:21Z)
- End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z)
- ACM-Net: Action Context Modeling Network for Weakly-Supervised Temporal Action Localization [18.56421375743287]
We propose an action-context modeling network termed ACM-Net.
It integrates a three-branch attention module to simultaneously measure the likelihood of each temporal point being an action instance, context, or non-action background.
Our method outperforms current state-of-the-art methods and even achieves performance comparable to fully-supervised methods.
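A rough sketch of such a three-branch attention, under assumed shapes and names (not the ACM-Net release): a 1x1 convolution emits three logits per temporal point, normalized across the action/context/background branches.

```python
# Illustrative three-branch attention scoring each temporal point as
# action, context, or background. Names and shapes are assumptions.
import torch
import torch.nn as nn

class ThreeBranchAttention(nn.Module):
    def __init__(self, dim: int = 2048):
        super().__init__()
        # One logit per branch at each temporal point.
        self.attn = nn.Conv1d(dim, 3, kernel_size=1)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, dim, T) snippet features.
        logits = self.attn(feats)             # (B, 3, T)
        probs = torch.softmax(logits, dim=1)  # normalize across the branches
        action, context, background = probs.unbind(dim=1)
        return action, context, background    # each (B, T)

att = ThreeBranchAttention()
a, c, b = att(torch.randn(2, 2048, 100))
```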
arXiv Detail & Related papers (2021-04-07T07:39:57Z)
- Weakly Supervised Temporal Action Localization Through Learning Explicit Subspaces for Action and Context [151.23835595907596]
WS-TAL methods learn to localize the temporal starts and ends of action instances in a video under only video-level supervision.
We introduce a framework that learns two feature subspaces respectively for actions and their context.
The proposed approach outperforms state-of-the-art WS-TAL methods on three benchmarks.
arXiv Detail & Related papers (2021-03-30T08:26:53Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for TAPG.
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbing the result the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining position information and differences between adjacent features.
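A rough sketch, with assumed names and shapes rather than the ATAG implementation, of one way an adaptive adjacency could be built from differences between adjacent snippet features:

```python
# Illustrative adaptive temporal GCN: edge weights come from the
# difference between neighboring snippet features. Names are assumptions.
import torch
import torch.nn as nn

class AdaptiveTemporalGCN(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.edge = nn.Linear(dim, 1)    # scores edges from feature differences
        self.proj = nn.Linear(dim, dim)  # graph-convolution weight

    def forward(self, feats: torch.Tensor):
        # feats: (batch, T, dim) snippet features.
        B, T, _ = feats.shape
        # Difference between each snippet and its right neighbor.
        diff = feats[:, 1:] - feats[:, :-1]             # (B, T-1, dim)
        w = torch.sigmoid(self.edge(diff)).squeeze(-1)  # (B, T-1) edge weights
        adj = torch.zeros(B, T, T, device=feats.device)
        idx = torch.arange(T - 1, device=feats.device)
        adj[:, idx, idx + 1] = w  # connect each node to its neighbors,
        adj[:, idx + 1, idx] = w  # weighted by how much they differ
        # Add self-loops and propagate features over the adaptive graph.
        adj = adj + torch.eye(T, device=feats.device)
        return torch.relu(adj @ self.proj(feats))

gcn = AdaptiveTemporalGCN()
out = gcn(torch.randn(2, 100, 256))
```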
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
- ACSNet: Action-Context Separation Network for Weakly Supervised Temporal Action Localization [148.55210919689986]
We introduce an Action-Context Separation Network (ACSNet) that takes into account context for accurate action localization.
ACSNet outperforms existing state-of-the-art WS-TAL methods by a large margin.
arXiv Detail & Related papers (2021-03-28T09:20:54Z)
- Learning Salient Boundary Feature for Anchor-free Temporal Action Localization [81.55295042558409]
Temporal action localization is an important yet challenging task in video understanding.
We propose the first purely anchor-free temporal localization method.
Our model includes (i) an end-to-end trainable basic predictor, (ii) a saliency-based refinement module, and (iii) several consistency constraints.
arXiv Detail & Related papers (2021-03-24T12:28:32Z)