Temporal Transformer Networks with Self-Supervision for Action
Recognition
- URL: http://arxiv.org/abs/2112.07338v1
- Date: Tue, 14 Dec 2021 12:53:53 GMT
- Title: Temporal Transformer Networks with Self-Supervision for Action
Recognition
- Authors: Yongkang Zhang, Jun Li, Guoming Wu, Han Zhang, Zhiping Shi, Zhaoxun
Liu, Zizhang Wu, Na Jiang
- Abstract summary: We introduce a novel Temporal Transformer Network with Self-supervision (TTSN).
TTSN consists of a temporal transformer module and a temporal sequence self-supervision module.
TTSN achieves state-of-the-art performance for action recognition.
- Score: 13.00827959393591
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, video action recognition based on 2D convolutional
networks has gained wide popularity. However, existing models are seriously
limited by their inability to model long-range non-linear temporal relations
and reverse motion information. To address this problem, we introduce a novel
Temporal Transformer Network with Self-supervision (TTSN), which consists of a
temporal transformer module and a temporal sequence self-supervision module.
The temporal transformer module models non-linear temporal dependencies among
non-local frames, significantly enhancing representations of complex motion.
The temporal sequence self-supervision module adopts a streamlined "random
batch random channel" strategy to reverse the sequence of video frames,
enabling robust extraction of motion information from the reversed temporal
dimension and improving the generalization capability of the model. Extensive
experiments on three widely used datasets (HMDB51, UCF101, and
Something-Something V1) demonstrate that TTSN achieves state-of-the-art
performance for action recognition.
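A minimal sketch of the two modules as we read them from the abstract; the tensor shapes, layer sizes, and the binary reversal-detection head are our assumptions, not the authors' released implementation:

```python
# Illustrative sketch only: shapes, sizes, and the reversal-detection head
# are assumptions drawn from the abstract, not the authors' code.
import torch
import torch.nn as nn

class TemporalTransformerModule(nn.Module):
    """Self-attention over the T per-frame features of a clip, so each
    frame attends to every other (non-local) frame."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x):            # x: (batch, T, dim) 2D-CNN features
        out, _ = self.attn(x, x, x)  # non-linear temporal dependencies
        return self.norm(x + out)    # residual keeps appearance features

def reverse_random_clips(x, p=0.5):
    """Temporal-sequence pretext task: reverse the frame order of a random
    subset of clips; an auxiliary head then detects the reversal (our
    reading of the "random batch random channel" strategy)."""
    flip = torch.rand(x.size(0)) < p         # which clips to reverse
    x = x.clone()
    x[flip] = torch.flip(x[flip], dims=[1])  # reverse along the time axis
    return x, flip.long()                    # features + 0/1 labels

feats, labels = reverse_random_clips(torch.randn(8, 16, 256))
clip = TemporalTransformerModule(256)(feats).mean(dim=1)  # pool over time
loss = nn.CrossEntropyLoss()(nn.Linear(256, 2)(clip), labels)
```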
Related papers
- Cascaded Temporal Updating Network for Efficient Video Super-Resolution [47.63267159007611]
Key components in recurrent-based VSR networks significantly impact model efficiency.
We propose a cascaded temporal updating network (CTUN) for efficient VSR.
CTUN achieves a favorable trade-off between efficiency and performance compared to existing methods.
arXiv Detail & Related papers (2024-08-26T12:59:32Z)
- TDS-CLIP: Temporal Difference Side Network for Image-to-Video Transfer Learning [6.329214318116305]
We propose a memory-efficient Temporal Difference Side Network (TDS-CLIP) to balance knowledge transfer and temporal modeling.
Specifically, we introduce a Temporal Difference Adapter (TD-Adapter), which can effectively capture local temporal differences in motion features.
We also design a Side Motion Enhancement Adapter (SME-Adapter) to guide the side network in efficiently learning the rich motion information in videos.
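The summary does not detail the TD-Adapter's internals; below is a hedged guess at its general shape, a bottleneck adapter applied to frame-to-frame feature differences (bottleneck width and residual placement are assumptions):

```python
# Hypothetical reading of a temporal-difference adapter; the bottleneck
# design and residual placement are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemporalDifferenceAdapter(nn.Module):
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):                 # x: (batch, T, dim) frame tokens
        diff = x[:, 1:] - x[:, :-1]       # local temporal differences
        diff = F.pad(diff, (0, 0, 1, 0))  # pad time axis back to T steps
        return x + self.up(F.gelu(self.down(diff)))  # residual adapter
```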
arXiv Detail & Related papers (2024-08-20T09:40:08Z)
- PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
The self-attention mechanism in the Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating pyramidal recurrent embeddings (PRE) with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
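From the title and summary we infer that PRE replaces positional embeddings with recurrent embeddings computed over a pyramid of temporal scales; the construction below is our illustration of that idea, not the paper's exact module:

```python
# Our illustration of the inferred idea: RNN state, computed at several
# temporal scales, stands in for positional embeddings. The pyramid
# construction and mixing layer are assumptions.
import torch
import torch.nn as nn

class PyramidalRecurrentEmbedding(nn.Module):
    def __init__(self, dim, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.grus = nn.ModuleList(nn.GRU(dim, dim, batch_first=True)
                                  for _ in scales)
        self.mix = nn.Linear(dim * len(scales), dim)

    def forward(self, x):  # x: (batch, T, dim) time-series tokens
        outs = []
        for s, gru in zip(self.scales, self.grus):
            h, _ = gru(x[:, ::s])  # recurrence encodes order at scale s
            h = h.repeat_interleave(s, dim=1)[:, :x.size(1)]
            outs.append(h)
        # Order-aware embeddings: the downstream Transformer encoder
        # no longer needs positional embeddings.
        return self.mix(torch.cat(outs, dim=-1))
```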
arXiv Detail & Related papers (2024-08-20T01:56:07Z)
- sTransformer: A Modular Approach for Extracting Inter-Sequential and Temporal Information for Time-Series Forecasting [6.434378359932152]
We review and categorize existing Transformer-based models into two main types: (1) modifications to the model structure and (2) modifications to the input data.
We propose sTransformer, which introduces the Sequence and Temporal Convolutional Network (STCN) to fully capture both sequential and temporal information.
We compare our model with linear models and existing forecasting models on long-term time-series forecasting, achieving new state-of-the-art results.
arXiv Detail & Related papers (2024-08-19T06:23:41Z)
- Transformer-based Video Saliency Prediction with High Temporal Dimension Decoding [12.595019348741042]
We propose a transformer-based video saliency prediction approach with high temporal dimension network decoding (THTDNet).
This architecture matches the performance of multi-branch and over-complicated models on common benchmarks such as DHF1K, UCF-Sports, and Hollywood-2.
arXiv Detail & Related papers (2024-01-15T20:09:56Z)
- F3-Pruning: A Training-Free and Generalized Pruning Strategy towards Faster and Finer Text-to-Video Synthesis [94.10861578387443]
We explore the inference process of two mainstream types of T2V models: transformer-based and diffusion-based.
We propose a training-free and generalized pruning strategy called F3-Pruning to prune redundant temporal attention weights.
Extensive experiments on three datasets using a classic transformer-based model CogVideo and a typical diffusion-based model Tune-A-Video verify the effectiveness of F3-Pruning.
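The summary says only that redundant temporal attention weights are pruned without retraining; here is a minimal sketch of that general recipe (the per-row top-k keep-ratio criterion is our placeholder, not the paper's scoring rule):

```python
# Minimal sketch of training-free attention pruning; the top-k keep-ratio
# criterion is a placeholder assumption, not F3-Pruning's actual rule.
import torch

def prune_attention(attn, keep_ratio=0.5):
    # attn: (..., T, T) softmaxed temporal-attention weights
    k = max(1, int(attn.size(-1) * keep_ratio))
    thresh = attn.topk(k, dim=-1).values[..., -1:]    # k-th largest per row
    pruned = torch.where(attn >= thresh, attn, torch.zeros_like(attn))
    return pruned / pruned.sum(dim=-1, keepdim=True)  # rows sum to 1 again

sparse = prune_attention(torch.softmax(torch.randn(2, 8, 16, 16), dim=-1))
```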
arXiv Detail & Related papers (2023-12-06T12:34:47Z)
- FormerTime: Hierarchical Multi-Scale Representations for Multivariate Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for multivariate time series classification.
It has three merits: (1) it learns hierarchical multi-scale representations from time series data, (2) it inherits the strengths of both transformers and convolutional networks, and (3) it tackles the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z)
- Towards Long-Term Time-Series Forecasting: Feature, Pattern, and Distribution [57.71199089609161]
Long-term time-series forecasting (LTTF) has become a pressing demand in many applications, such as wind power supply planning.
Transformer models have been adopted to deliver high prediction capacity, though the self-attention mechanism is computationally expensive.
We propose an efficient Transformer-based model, named Conformer, which differentiates itself from existing methods for LTTF in three aspects.
arXiv Detail & Related papers (2023-01-05T13:59:29Z)
- Slow-Fast Visual Tempo Learning for Video-based Action Recognition [78.3820439082979]
Action visual tempo characterizes the dynamics and the temporal scale of an action.
Previous methods capture the visual tempo either by sampling raw videos with multiple rates, or by hierarchically sampling backbone features.
We propose a Temporal Correlation Module (TCM) to extract action visual tempo from low-level backbone features at a single layer.
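A hedged guess at what a single-layer temporal correlation might look like: correlate each frame's features with neighbors at several temporal offsets, so slow and fast tempo produce different correlation profiles (the offsets, similarity measure, and boundary handling are our choices, not the paper's TCM):

```python
# Hedged illustration of single-layer temporal correlation; the offsets,
# cosine similarity, and wrap-around at clip boundaries are our choices.
import torch
import torch.nn.functional as F

def temporal_correlation(x, offsets=(1, 2, 4)):
    # x: (batch, T, dim) spatially pooled low-level backbone features
    x = F.normalize(x, dim=-1)
    corrs = []
    for d in offsets:
        nxt = torch.roll(x, shifts=-d, dims=1)  # frame t+d (wraps at end)
        corrs.append((x * nxt).sum(dim=-1))     # per-frame cosine similarity
    return torch.stack(corrs, dim=-1)           # (batch, T, len(offsets))
```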
arXiv Detail & Related papers (2022-02-24T14:20:04Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
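An InfoNCE-style sketch of the long-short idea as we read it: the short clip's embedding is trained to match the embedding of a longer clip from the same video (the encoders producing the embeddings and the temperature are placeholders):

```python
# InfoNCE-style sketch; the encoders behind these embeddings and the
# temperature are placeholders, not the paper's exact training setup.
import torch
import torch.nn.functional as F

def long_short_contrastive_loss(short_emb, long_emb, tau=0.07):
    # short_emb, long_emb: (batch, dim); row i of both comes from one video
    s = F.normalize(short_emb, dim=-1)
    l = F.normalize(long_emb, dim=-1)
    logits = s @ l.t() / tau            # batch x batch similarity matrix
    targets = torch.arange(s.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)
```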
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- TDN: Temporal Difference Networks for Efficient Action Recognition [31.922001043405924]
This paper presents a new video architecture, termed Temporal Difference Network (TDN).
The core of TDN is an efficient temporal difference module (TDM) that explicitly leverages a temporal difference operator.
Our TDN presents a new state of the art on the Something-Something V1 & V2 datasets and is on par with the best performance on the Kinetics-400 dataset.
arXiv Detail & Related papers (2020-12-18T06:31:08Z)