SimOn: A Simple Framework for Online Temporal Action Localization
- URL: http://arxiv.org/abs/2211.04905v1
- Date: Tue, 8 Nov 2022 04:50:54 GMT
- Title: SimOn: A Simple Framework for Online Temporal Action Localization
- Authors: Tuan N. Tang, Jungin Park, Kwonyoung Kim, Kwanghoon Sohn
- Abstract summary: We propose a framework, termed SimOn, that learns to predict action instances using the popular Transformer architecture.
Experimental results on the THUMOS14 and ActivityNet1.3 datasets show that our model remarkably outperforms the previous methods.
- Score: 51.27476730635852
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Online Temporal Action Localization (On-TAL) aims to immediately provide
action instances from untrimmed streaming videos. The model is not allowed to
utilize future frames or any processing techniques to modify past predictions,
making On-TAL much more challenging. In this paper, we propose a simple yet
effective framework, termed SimOn, that learns to predict action instances
using the popular Transformer architecture in an end-to-end manner.
Specifically, the model takes the current frame feature as a query and a set of
past context information as keys and values of the Transformer. Different from
the prior work that uses a set of outputs of the model as past contexts, we
leverage the past visual context and the learnable context embedding for the
current query. Experimental results on the THUMOS14 and ActivityNet1.3 datasets
show that our model remarkably outperforms the previous methods, achieving a
new state-of-the-art On-TAL performance. In addition, the evaluation for Online
Detection of Action Start (ODAS) demonstrates the effectiveness and robustness
of our method in the online setting. The code is available at
https://github.com/TuanTNG/SimOn
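The abstract describes the core mechanism: the current frame feature acts as the Transformer query, while the past visual context together with a learnable context embedding serve as keys and values. Below is a minimal PyTorch sketch of that attention scheme; the module names, feature dimensions, number of context tokens, and classification head are illustrative assumptions, not the authors' implementation (see the linked repository for the real code).

```python
# Minimal sketch of the query/key-value scheme described in the abstract.
# All names and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn

class OnlineContextAttention(nn.Module):
    def __init__(self, feat_dim=2048, d_model=256, n_heads=8,
                 n_ctx_tokens=4, n_classes=20):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)              # project frame features
        # Learnable context embedding used alongside the past visual context
        self.ctx_embed = nn.Parameter(torch.randn(n_ctx_tokens, d_model))
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cls_head = nn.Linear(d_model, n_classes + 1)      # +1 for background

    def forward(self, cur_feat, past_feats):
        """
        cur_feat:   (B, feat_dim)     feature of the current streaming frame
        past_feats: (B, T, feat_dim)  features of previously observed frames
        """
        q = self.proj(cur_feat).unsqueeze(1)                   # (B, 1, d_model) query
        past = self.proj(past_feats)                           # (B, T, d_model)
        ctx = self.ctx_embed.unsqueeze(0).expand(past.size(0), -1, -1)
        kv = torch.cat([past, ctx], dim=1)                     # keys/values: past context + learnable embedding
        out, _ = self.attn(q, kv, kv)                          # attend only to the past (online setting)
        return self.cls_head(out.squeeze(1))                   # per-frame action logits
```

Because only past features and the learnable embedding are attended to, the sketch respects the online constraint that no future frames are available at prediction time.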
Related papers
- HAT: History-Augmented Anchor Transformer for Online Temporal Action Localization [3.187381965457262]
We introduce the History-Augmented Anchor Transformer (HAT) Framework for OnTAL.
By integrating historical context, our framework enhances the synergy between long-term and short-term information.
We evaluate our model on both procedural egocentric (PREGO) datasets and standard non-PREGO OnTAL datasets.
arXiv Detail & Related papers (2024-08-12T18:29:48Z) - Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic VLM for the fine-grained action detection task, we carefully fine-tune it on the localized video region-text pairs.
arXiv Detail & Related papers (2024-05-17T14:52:47Z) - HTNet: Anchor-free Temporal Action Localization with Hierarchical
Transformers [19.48000379201692]
Temporal action localization (TAL) is the task of identifying a set of actions in a video.
We present a novel anchor-free framework, known as HTNet, which predicts a set of <start time, end time, class> triplets from a video.
We demonstrate that our method localizes accurate action instances and achieves state-of-the-art performance on two TAL benchmark datasets.
arXiv Detail & Related papers (2022-07-20T05:40:03Z) - Real-time Object Detection for Streaming Perception [84.2559631820007]
Streaming perception is proposed to jointly evaluate latency and accuracy with a single metric for online video perception.
We build a simple and effective framework for streaming perception.
Our method achieves competitive performance on Argoverse-HD dataset and improves the AP by 4.9% compared to the strong baseline.
arXiv Detail & Related papers (2022-03-23T11:33:27Z) - Background-Click Supervision for Temporal Action Localization [82.4203995101082]
Weakly supervised temporal action localization aims to learn instance-level action patterns from video-level labels, where a significant challenge is action-context confusion.
One recent work builds an action-click supervision framework.
It requires annotation costs similar to those of conventional weakly supervised methods, yet steadily improves localization performance.
In this paper, by revealing that the performance bottleneck of the existing approaches mainly comes from the background errors, we find that a stronger action localizer can be trained with labels on the background video frames rather than those on the action frames.
arXiv Detail & Related papers (2021-11-24T12:02:52Z) - With a Little Help from my Temporal Context: Multimodal Egocentric
Action Recognition [95.99542238790038]
We propose a method that learns to attend to surrounding actions in order to improve recognition performance.
To incorporate the temporal context, we propose a transformer-based multimodal model that ingests video and audio as input modalities.
We test our approach on EPIC-KITCHENS and EGTEA datasets reporting state-of-the-art performance.
arXiv Detail & Related papers (2021-11-01T15:27:35Z) - End-to-end Temporal Action Detection with Transformer [86.80289146697788]
Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video.
Here, we construct an end-to-end framework for TAD upon the Transformer, termed TadTR.
Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3.
arXiv Detail & Related papers (2021-06-18T17:58:34Z) - A Novel Online Action Detection Framework from Untrimmed Video Streams [19.895434487276578]
We propose a novel online action detection framework that considers actions as a set of temporally ordered subclasses.
We augment our data by varying the lengths of videos to allow the proposed method to learn about the high intra-class variation in human actions.
arXiv Detail & Related papers (2020-03-17T14:11:24Z)