Activity Graph Transformer for Temporal Action Localization
- URL: http://arxiv.org/abs/2101.08540v2
- Date: Thu, 28 Jan 2021 12:14:19 GMT
- Title: Activity Graph Transformer for Temporal Action Localization
- Authors: Megha Nawhal, Greg Mori
- Abstract summary: We introduce Activity Graph Transformer, an end-to-end learnable model for temporal action localization.
In this work, we capture this non-linear temporal structure by reasoning over the videos as non-sequential entities in the form of graphs.
Our results show that our proposed model outperforms the state-of-the-art by a considerable margin.
- Score: 41.69734359113706
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Activity Graph Transformer, an end-to-end learnable model for
temporal action localization, that receives a video as input and directly
predicts a set of action instances that appear in the video. Detecting and
localizing action instances in untrimmed videos requires reasoning over
multiple action instances in a video. The dominant paradigms in the literature
process videos temporally to either propose action regions or directly produce
frame-level detections. However, sequential processing of videos is problematic
when the action instances have non-sequential dependencies and/or non-linear
temporal ordering, such as overlapping action instances or re-occurrence of
action instances over the course of the video. In this work, we capture this
non-linear temporal structure by reasoning over the videos as non-sequential
entities in the form of graphs. We evaluate our model on challenging datasets:
THUMOS14, Charades, and EPIC-Kitchens-100. Our results show that our proposed
model outperforms the state-of-the-art by a considerable margin.
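The abstract frames localization as direct set prediction: the model takes video features as input and outputs a set of action instances. Below is a minimal PyTorch-style sketch of that interface, offered as an assumption-laden illustration rather than the authors' implementation; in particular, the plain nn.Transformer decoder stands in for the paper's graph-based reasoning, and all module names, dimensions, and the (center, length) segment encoding are invented for the example.

```python
# Minimal sketch of a set-prediction head for temporal action localization:
# pre-extracted clip features go in, a fixed-size set of scored (class, segment)
# hypotheses comes out. A plain nn.Transformer replaces the paper's graph
# reasoning; every name and dimension here is an illustrative assumption.
import torch
import torch.nn as nn


class SetPredictionTALHead(nn.Module):
    def __init__(self, feat_dim=2048, d_model=256, num_queries=100, num_classes=20):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)          # project clip features
        self.queries = nn.Embedding(num_queries, d_model)       # learned instance queries
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=8,
            num_encoder_layers=3, num_decoder_layers=3,
            batch_first=True,
        )
        self.class_head = nn.Linear(d_model, num_classes + 1)   # +1 for "no action"
        self.segment_head = nn.Linear(d_model, 2)               # (center, length) in [0, 1]

    def forward(self, clip_feats):
        # clip_feats: (batch, num_clips, feat_dim) features from a frozen backbone
        src = self.input_proj(clip_feats)
        tgt = self.queries.weight.unsqueeze(0).expand(clip_feats.size(0), -1, -1)
        decoded = self.transformer(src, tgt)                    # (batch, num_queries, d_model)
        return {
            "logits": self.class_head(decoded),                 # per-query class scores
            "segments": self.segment_head(decoded).sigmoid(),   # normalized (center, length)
        }


# Usage: one video represented by 64 clips of 2048-d features.
feats = torch.randn(1, 64, 2048)
out = SetPredictionTALHead()(feats)
print(out["logits"].shape, out["segments"].shape)  # (1, 100, 21) (1, 100, 2)
```

In a set-prediction setup like this one, training usually pairs predicted and ground-truth instances one-to-one (e.g., Hungarian matching) before applying classification and segment losses; the abstract does not spell out the objective, so that pairing scheme is likewise an assumption.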
Related papers
- Technical Report for ActivityNet Challenge 2022 -- Temporal Action Localization [20.268572246761895]
We propose to locate the temporal boundaries of each action and predict the action class in untrimmed videos.
Faster-TAD simplifies the pipeline of TAD and achieves remarkable performance.
arXiv Detail & Related papers (2024-10-31T14:16:56Z) - Video-Specific Query-Key Attention Modeling for Weakly-Supervised
Temporal Action Localization [14.43055117008746]
Weakly-supervised temporal action localization aims to identify and localize action instances in untrimmed videos using only video-level action labels.
We propose a network named VQK-Net with a video-specific query-key attention modeling that learns a unique query for each action category of each input video.
arXiv Detail & Related papers (2023-05-07T04:18:22Z) - Streaming Video Temporal Action Segmentation In Real Time [2.8728707559692475]
We propose an end-to-end multi-modality model for real-time temporal action segmentation of streaming video.
Our model segments human actions in real time using less than 40% of the computation of the state-of-the-art model and achieves 90% of the accuracy of the full-video state-of-the-art model.
arXiv Detail & Related papers (2022-09-28T03:27:37Z) - Video Action Detection: Analysing Limitations and Challenges [70.01260415234127]
We analyze existing datasets on video action detection and discuss their limitations.
We perform a bias study that analyzes a key property differentiating videos from static images: the temporal aspect.
Such extreme experiments show the existence of biases that have managed to creep into existing methods in spite of careful modeling.
arXiv Detail & Related papers (2022-04-17T00:42:14Z) - Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to unsupervisedly pre-train feature encoders for temporal action localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of the pasted pseudo action regions from the two synthetic videos and maximize the agreement between them (a minimal sketch of this cut-and-paste idea appears after the related papers list).
arXiv Detail & Related papers (2022-03-25T12:13:43Z) - EAN: Event Adaptive Network for Enhanced Action Recognition [66.81780707955852]
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects using a Transformer.
arXiv Detail & Related papers (2021-07-22T15:57:18Z) - FineAction: A Fined Video Dataset for Temporal Action Localization [60.90129329728657]
FineAction is a new large-scale fine-grained video dataset collected from existing video datasets and web videos.
This dataset contains 139K fine-grained action instances densely annotated in almost 17K untrimmed videos spanning 106 action categories.
Experimental results reveal that FineAction brings new challenges for action localization on fine-grained and multi-label instances with shorter duration.
arXiv Detail & Related papers (2021-05-24T06:06:32Z) - Revisiting Few-shot Activity Detection with Class Similarity Control [107.79338380065286]
We present a framework for few-shot temporal activity detection based on proposal regression.
Our model is end-to-end trainable, takes into account the frame rate differences between few-shot activities and untrimmed test videos, and can benefit from additional few-shot examples.
arXiv Detail & Related papers (2020-03-31T22:02:38Z) - Action Graphs: Weakly-supervised Action Localization with Graph
Convolution Networks [25.342482374259017]
We present a method for weakly-supervised action localization based on graph convolutions.
Our method utilizes similarity graphs that encode appearance and motion (see the sketch after this list), and pushes the state of the art on THUMOS '14, ActivityNet 1.2, and Charades for weakly-supervised action localization.
arXiv Detail & Related papers (2020-02-04T18:21:10Z) - Joint Visual-Temporal Embedding for Unsupervised Learning of Actions in
Untrimmed Sequences [25.299599341774204]
This paper proposes an approach for the unsupervised learning of actions in untrimmed video sequences based on a joint visual-temporal embedding space.
We show that the proposed approach is able to provide a meaningful visual and temporal embedding out of the visual cues present in contiguous video frames.
arXiv Detail & Related papers (2020-01-29T22:51:06Z)
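The Pseudo Action Localization (PAL) entry above describes a concrete cut-and-paste pretext task: a region from one video is pasted at different temporal positions into two other videos, and the encoder is trained so that the pasted regions' features agree. The sketch below reproduces that idea at the clip-feature level; the mean pooling, the cosine agreement loss, and all function names are assumptions for illustration, not the paper's code.

```python
# Hedged sketch of the PAL cut-and-paste pretext idea: paste one pseudo action
# region into two host videos and maximize agreement between the pasted
# regions' pooled features. Feature-level pasting and the cosine loss are
# illustrative simplifications.
import torch
import torch.nn.functional as F

def paste_region(host, region, start):
    # Overwrite host clips [start, start + len(region)) with the pseudo action.
    out = host.clone()
    out[start:start + region.size(0)] = region
    return out

def pal_agreement_loss(encoder, vid_a, vid_b, vid_c, region_len=8):
    # vid_*: (num_clips, feat_dim) clip features of three different videos.
    src = torch.randint(0, vid_a.size(0) - region_len + 1, (1,)).item()
    region = vid_a[src:src + region_len]                        # pseudo action

    pos_b = torch.randint(0, vid_b.size(0) - region_len + 1, (1,)).item()
    pos_c = torch.randint(0, vid_c.size(0) - region_len + 1, (1,)).item()
    synth_b = paste_region(vid_b, region, pos_b)                # synthetic video 1
    synth_c = paste_region(vid_c, region, pos_c)                # synthetic video 2

    emb_b = encoder(synth_b)[pos_b:pos_b + region_len].mean(0)  # pooled pasted-region feature
    emb_c = encoder(synth_c)[pos_c:pos_c + region_len].mean(0)
    return 1.0 - F.cosine_similarity(emb_b, emb_c, dim=0)       # maximize agreement

# Usage with a toy per-clip encoder over 2048-d features.
encoder = torch.nn.Sequential(
    torch.nn.Linear(2048, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128)
)
vids = [torch.randn(64, 2048) for _ in range(3)]
print(pal_agreement_loss(encoder, *vids).item())
```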
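Similarly, the Action Graphs entry relies on similarity graphs over video clips followed by graph convolutions. A hedged, minimal version of that mechanism might look like the following, where the row-normalized cosine-similarity adjacency and the single propagation step are illustrative simplifications rather than the paper's exact model.

```python
# Hedged sketch of per-clip scoring over a similarity graph: connect clips by
# feature similarity, propagate once with a graph convolution, then classify.
import torch
import torch.nn.functional as F

def similarity_graph_scores(clip_feats, weight, classifier):
    # clip_feats: (num_clips, feat_dim); weight: (feat_dim, hidden); classifier: (hidden, num_classes)
    normed = F.normalize(clip_feats, dim=1)
    adjacency = F.softmax(normed @ normed.t(), dim=1)   # row-normalized cosine-similarity graph
    hidden = F.relu(adjacency @ clip_feats @ weight)    # one graph-convolution step
    return hidden @ classifier                          # per-clip class scores

# Usage with random features: 64 clips, 1024-d features, 20 classes.
feats = torch.randn(64, 1024)
w = torch.randn(1024, 256) * 0.01
cls = torch.randn(256, 20) * 0.01
print(similarity_graph_scores(feats, w, cls).shape)     # torch.Size([64, 20])
```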