Modeling Multi-Label Action Dependencies for Temporal Action
Localization
- URL: http://arxiv.org/abs/2103.03027v2
- Date: Fri, 5 Mar 2021 02:13:00 GMT
- Title: Modeling Multi-Label Action Dependencies for Temporal Action
Localization
- Authors: Praveen Tirupattur, Kevin Duarte, Yogesh Rawat, Mubarak Shah
- Abstract summary: Real-world videos contain many complex actions with inherent relationships between action classes.
We propose an attention-based architecture that models these action relationships for the task of temporal action localization in untrimmed videos.
We show improved performance over state-of-the-art methods on multi-label action localization benchmarks.
- Score: 53.53490517832068
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world videos contain many complex actions with inherent relationships
between action classes. In this work, we propose an attention-based
architecture that models these action relationships for the task of temporal
action localization in untrimmed videos. As opposed to previous works that
leverage video-level co-occurrence of actions, we distinguish the relationships
between actions that occur at the same time-step and actions that occur at
different time-steps (i.e. those which precede or follow each other). We define
these distinct relationships as action dependencies. We propose to improve
action localization performance by modeling these action dependencies in a
novel attention-based Multi-Label Action Dependency (MLAD) layer. The MLAD layer
consists of two branches: a Co-occurrence Dependency Branch and a Temporal
Dependency Branch to model co-occurrence action dependencies and temporal
action dependencies, respectively. We observe that existing metrics used for
multi-label classification do not explicitly measure how well action
dependencies are modeled; therefore, we propose novel metrics that consider
both co-occurrence and temporal dependencies between action classes. Through
empirical evaluation and extensive analysis, we show improved performance over
state-of-the-art methods on multi-label action localization
benchmarks (MultiTHUMOS and Charades) in terms of f-mAP and our proposed metric.
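The two-branch idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `self_attention` and `mlad_like_layer` are hypothetical, the attention uses identity query/key/value projections instead of learned ones, and merging the two branches by a plain average is an assumption made only for illustration. The input `x[t][c]` is assumed to be a feature vector for action class `c` at time step `t`.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of equal-length vectors.

    Simplification (assumption): queries, keys, and values are the tokens
    themselves, i.e. no learned projection matrices.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, tokens)) for i in range(d)])
    return out


def mlad_like_layer(x):
    """Apply a co-occurrence branch and a temporal branch to x[t][c] -> vector."""
    num_steps, num_classes = len(x), len(x[0])

    # Co-occurrence branch: at each time step, class tokens attend to each other.
    cooc = [self_attention(x[t]) for t in range(num_steps)]

    # Temporal branch: for each class, its tokens attend across time steps.
    per_class = [
        self_attention([x[t][c] for t in range(num_steps)])
        for c in range(num_classes)
    ]
    temp = [[per_class[c][t] for c in range(num_classes)] for t in range(num_steps)]

    # Merge the branches (assumption: a simple unweighted average).
    return [
        [
            [0.5 * (a + b) for a, b in zip(cooc[t][c], temp[t][c])]
            for c in range(num_classes)
        ]
        for t in range(num_steps)
    ]


# Toy input: 3 time steps, 2 action classes, 2-dimensional features.
x = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0]],
    [[0.0, 2.0], [2.0, 0.0]],
]
y = mlad_like_layer(x)
```

The output `y` has the same time-by-class-by-feature layout as `x`, so a layer of this shape could in principle be stacked; a per-class classifier over each `y[t][c]` would then give frame-level multi-label predictions.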
Related papers
- An Effective-Efficient Approach for Dense Multi-Label Action Detection [23.100602876056165]
It is necessary to simultaneously learn (i) temporal dependencies and (ii) co-occurrence action relationships.
Recent approaches model temporal information by extracting multi-scale features through hierarchical transformer-based networks.
We argue that combining this with multiple sub-sampling processes in hierarchical designs can lead to further loss of positional information.
arXiv Detail & Related papers (2024-06-10T11:33:34Z)
- BIT: Bi-Level Temporal Modeling for Efficient Supervised Action Segmentation [34.88225099758585]
Supervised action segmentation aims to partition a video into non-overlapping segments, each representing a different action.
Recent works apply transformers to perform temporal modeling at the frame level, which incurs high computational cost.
We propose an efficient BI-level Temporal modeling framework that learns explicit action tokens to represent action segments.
arXiv Detail & Related papers (2023-08-28T20:59:15Z)
- Fine-grained Temporal Contrastive Learning for Weakly-supervised Temporal Action Localization [87.47977407022492]
This paper argues that learning by contextually comparing sequence-to-sequence distinctions offers an essential inductive bias in weakly-supervised action localization.
Under a differentiable dynamic programming formulation, two complementary contrastive objectives are designed, including Fine-grained Sequence Distance (FSD) contrasting and Longest Common Subsequence (LCS) contrasting.
Our method achieves state-of-the-art performance on two popular benchmarks.
arXiv Detail & Related papers (2022-03-31T05:13:50Z)
- ASM-Loc: Action-aware Segment Modeling for Weakly-Supervised Temporal Action Localization [36.90693762365237]
Weakly-supervised temporal action localization aims to recognize and localize action segments in untrimmed videos given only video-level action labels for training.
We propose ASM-Loc, a novel WTAL framework that enables explicit, action-aware segment modeling beyond standard MIL-based methods.
Our framework entails three segment-centric components: (i) dynamic segment sampling for compensating the contribution of short actions; (ii) intra- and inter-segment attention for modeling action dynamics and capturing temporal dependencies; (iii) pseudo instance-level supervision for improving action boundary prediction.
arXiv Detail & Related papers (2022-03-29T01:59:26Z)
- Graph Convolutional Module for Temporal Action Localization in Videos [142.5947904572949]
We claim that the relations between action units play an important role in action localization.
A more powerful action detector should not only capture the local content of each action unit but also allow a wider field of view on the context related to it.
We propose a general graph convolutional module (GCM) that can be easily plugged into existing action localization methods.
arXiv Detail & Related papers (2021-12-01T06:36:59Z)
- Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user's hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z)
- Partially Observed Exchangeable Modeling [14.466964173883948]
We propose a novel framework, partially observed exchangeable modeling (POEx).
POEx takes in a set of related partially observed instances and infers the conditional distribution for the unobserved dimensions over multiple elements.
Our approach jointly models the intra-instance (among features in a point) and inter-instance (among multiple points in a set) dependencies in data.
arXiv Detail & Related papers (2021-02-11T15:54:18Z)
- Learning Robust State Abstractions for Hidden-Parameter Block MDPs [55.31018404591743]
We leverage ideas of common structure from the HiP-MDP setting to enable robust state abstractions inspired by Block MDPs.
We derive instantiations of this new framework for both multi-task reinforcement learning (MTRL) and meta-reinforcement learning (Meta-RL) settings.
arXiv Detail & Related papers (2020-07-14T17:25:27Z)
- Weakly-Supervised Action Localization by Generative Attention Modeling [65.03548422403061]
Weakly-supervised temporal action localization is a problem of learning an action localization model with only video-level action labeling available.
We propose to model the class-agnostic frame-wise probability conditioned on the frame attention using a conditional Variational Auto-Encoder (VAE).
By maximizing the conditional probability with respect to the attention, the action and non-action frames are well separated.
arXiv Detail & Related papers (2020-03-27T14:02:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.