PAT: Position-Aware Transformer for Dense Multi-Label Action Detection
- URL: http://arxiv.org/abs/2308.05051v1
- Date: Wed, 9 Aug 2023 16:29:31 GMT
- Title: PAT: Position-Aware Transformer for Dense Multi-Label Action Detection
- Authors: Faegheh Sardari, Armin Mustafa, Philip J. B. Jackson, and Adrian
Hilton
- Abstract summary: We present PAT, a transformer-based network that learns complex temporal co-occurrence action dependencies in a video.
We embed relative positional encoding in the self-attention mechanism and exploit multi-scale temporal relationships.
We evaluate the performance of our proposed approach on two challenging dense multi-label benchmark datasets.
- Score: 36.39340228621982
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present PAT, a transformer-based network that learns complex temporal
co-occurrence action dependencies in a video by exploiting multi-scale temporal
features. In existing methods, the self-attention mechanism in transformers
loses the temporal positional information, which is essential for robust action
detection. To address this issue, we (i) embed relative positional encoding in
the self-attention mechanism and (ii) exploit multi-scale temporal
relationships by designing a novel non-hierarchical network, in contrast to the
recent transformer-based approaches that use a hierarchical structure. We argue
that joining the self-attention mechanism with multiple sub-sampling processes
in the hierarchical approaches results in increased loss of positional
information. We evaluate the performance of our proposed approach on two
challenging dense multi-label benchmark datasets, and show that PAT improves
the current state-of-the-art result by 1.1% and 0.6% mAP on the Charades and
MultiTHUMOS datasets, respectively, thereby achieving the new state-of-the-art
mAP at 26.5% and 44.6%, respectively. We also perform extensive ablation
studies to examine the impact of the different components of our proposed
network.
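The abstract's first idea, embedding relative positional information directly in self-attention, can be sketched as follows. This is a generic illustration of an additive relative positional bias in a single attention head, not the authors' actual implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_with_relative_bias(x, Wq, Wk, Wv, rel_bias):
    """Single-head self-attention with an additive relative positional bias.

    x:        (T, d) sequence of temporal features
    rel_bias: (2T-1,) learnable biases, indexed by relative offset j - i,
              so attention scores depend on relative, not absolute, position.
    """
    T, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    logits = (q @ k.T) / np.sqrt(d)                            # (T, T) content scores
    offsets = np.arange(T)[None, :] - np.arange(T)[:, None]    # j - i in [-(T-1), T-1]
    logits = logits + rel_bias[offsets + (T - 1)]              # add bias per relative offset
    return softmax(logits, axis=-1) @ v

rng = np.random.default_rng(0)
T, d = 8, 16
x = rng.standard_normal((T, d))
W = lambda: rng.standard_normal((d, d)) / np.sqrt(d)
out = self_attention_with_relative_bias(x, W(), W(), W(), rng.standard_normal(2 * T - 1))
print(out.shape)  # (8, 16)
```

Because the bias is indexed by `j - i` rather than by absolute index, the same bias table applies at every query position, which is what lets positional information survive the attention mixing.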
Related papers
- Multimodal Task Representation Memory Bank vs. Catastrophic Forgetting in Anomaly Detection [6.991692485111346]
Unsupervised Continuous Anomaly Detection (UCAD) faces significant challenges in multi-task representation learning.
We propose the Multimodal Task Representation Memory Bank (MTRMB) method through two key technical innovations.
Experiments on the MVTec AD and VisA datasets demonstrate MTRMB's superiority, achieving an average detection accuracy of 0.921 at the lowest forgetting rate.
arXiv Detail & Related papers (2025-02-10T06:49:54Z)
- Toward Relative Positional Encoding in Spiking Transformers [52.62008099390541]
Spiking neural networks (SNNs) are bio-inspired networks that model how neurons in the brain communicate through discrete spikes.
In this paper, we introduce an approximate method for relative positional encoding (RPE) in Spiking Transformers.
arXiv Detail & Related papers (2025-01-28T06:42:37Z)
- An Effective-Efficient Approach for Dense Multi-Label Action Detection [23.100602876056165]
It is necessary to simultaneously learn (i) temporal dependencies and (ii) co-occurrence action relationships.
Recent approaches model temporal information by extracting multi-scale features through hierarchical transformer-based networks.
We argue that combining this with multiple sub-sampling processes in hierarchical designs can lead to further loss of positional information.
arXiv Detail & Related papers (2024-06-10T11:33:34Z)
- Correlated Attention in Transformers for Multivariate Time Series [22.542109523780333]
We propose a novel correlated attention mechanism, which efficiently captures feature-wise dependencies, and can be seamlessly integrated within the encoder blocks of existing Transformers.
In particular, correlated attention operates across feature channels to compute cross-covariance matrices between queries and keys with different lag values, and selectively aggregate representations at the sub-series level.
This architecture facilitates automated discovery and representation learning of not only instantaneous but also lagged cross-correlations, while inherently capturing time series auto-correlation.
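The mechanism described above, feature-wise cross-covariance between queries and keys at different lag values, can be illustrated with a minimal sketch. This is one plausible reading of the summary, not the paper's actual mechanism; the function name, shapes, and normalization are assumptions.

```python
import numpy as np

def lagged_cross_covariance(q, k, max_lag):
    """Channel-wise cross-covariance between queries and keys at several lags.

    q, k: (T, d) arrays -- T time steps, d feature channels.
    Returns a (2*max_lag + 1, d, d) stack of covariance matrices, one per lag,
    capturing lagged cross-correlations between feature channels.
    """
    T, d = q.shape
    qc = q - q.mean(axis=0)   # center each channel
    kc = k - k.mean(axis=0)
    covs = []
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            a, b = qc[: T - lag], kc[lag:]      # shift keys forward by `lag`
        else:
            a, b = qc[-lag:], kc[: T + lag]     # shift keys backward
        covs.append(a.T @ b / a.shape[0])       # (d, d) covariance at this lag
    return np.stack(covs)

rng = np.random.default_rng(1)
covs = lagged_cross_covariance(rng.standard_normal((32, 4)),
                               rng.standard_normal((32, 4)), max_lag=3)
print(covs.shape)  # (7, 4, 4)
```

A correlated-attention layer would then weight or aggregate these per-lag covariance matrices to mix representations across feature channels, rather than only across time steps as standard attention does.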
arXiv Detail & Related papers (2023-11-20T17:35:44Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strengths of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
- Group Gated Fusion on Attention-based Bidirectional Alignment for Multimodal Emotion Recognition [63.07844685982738]
This paper presents a new model named as Gated Bidirectional Alignment Network (GBAN), which consists of an attention-based bidirectional alignment network over LSTM hidden states.
We empirically show that the attention-aligned representations outperform the last-hidden-states of LSTM significantly.
The proposed GBAN model outperforms existing state-of-the-art multimodal approaches on the IEMOCAP dataset.
arXiv Detail & Related papers (2022-01-17T09:46:59Z)
- Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer [41.44769642537572]
Unary-Pairwise Transformer is a two-stage detector that exploits unary and pairwise representations for HOIs.
We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches.
arXiv Detail & Related papers (2021-12-03T10:52:06Z)
- Cross-modal Consensus Network for Weakly Supervised Temporal Action Localization [74.34699679568818]
Weakly supervised temporal action localization (WS-TAL) is a challenging task that aims to localize action instances in the given video with video-level categorical supervision.
We propose a cross-modal consensus network (CO2-Net) to tackle this problem.
arXiv Detail & Related papers (2021-07-27T04:21:01Z)