Related papers: Two-Stream temporal transformer for video action classification

URL: http://arxiv.org/abs/2601.14086v1
Date: Tue, 20 Jan 2026 15:47:00 GMT
Title: Two-Stream temporal transformer for video action classification
Authors: Nattapong Kurpukdee, Adrian G. Bors
Abstract summary: Motion representation plays an important role in video understanding and has many applications, including action recognition, robot and autonomous guidance, and others. Lately, transformer networks, through their self-attention mechanism capabilities, have proved their efficiency in many applications.
Score: 47.53991869205973
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Motion representation plays an important role in video understanding and has many applications, including action recognition, robot and autonomous guidance, and others. Lately, transformer networks, through their self-attention mechanism capabilities, have proved their efficiency in many applications. In this study, we introduce a new two-stream transformer video classifier, which extracts spatio-temporal information from content and optical flow representing movement information. The proposed model identifies self-attention features across the joint optical flow and temporal frame domain and represents their relationships within the transformer encoder mechanism. The experimental results show that our proposed methodology provides excellent classification results on three well-known video datasets of human activities.

Related papers

Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy [12.257725479880458]
Action recognition has become one of the popular research topics in computer vision. We propose a multi-view attention consistency method that computes the similarity between two attentions from two different views of the action videos. Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
arXiv Detail & Related papers (2024-05-02T14:43:21Z)
A Multi-Modal Transformer Network for Action Detection [15.104201344012347]
This paper proposes a novel multi-modal transformer network for detecting actions in untrimmed videos. We suggest an algorithm that corrects the motion distortion caused by camera movement. Our proposed algorithm outperforms the state-of-the-art methods on two public benchmarks.
arXiv Detail & Related papers (2023-05-31T07:50:38Z)
VDT: General-purpose Video Diffusion Transformers via Mask Modeling [62.71878864360634]
Video Diffusion Transformer (VDT) pioneers the use of transformers in diffusion-based video generation. We propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.
arXiv Detail & Related papers (2023-05-22T17:59:45Z)
Vision Transformers for Action Recognition: A Survey [41.69370782177517]
Vision transformers are emerging as a powerful tool to solve computer vision problems. Recent techniques have proven the efficacy of transformers beyond the image domain to solve numerous video-related tasks. Human action recognition is receiving special attention from the research community due to its widespread applications.
arXiv Detail & Related papers (2022-09-13T02:57:05Z)
Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better. Our method learns the temporal order of video frames as extra self-supervision and enforces the randomly shuffled frames to have low-confidence outputs. Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
arXiv Detail & Related papers (2022-07-19T04:44:08Z)
Hierarchical Multimodal Transformer to Summarize Videos [103.47766795086206]
Motivated by the great success of the transformer and the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization. To integrate the two kinds of information (audio and visual), they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer. In practice, extensive experiments show that HMT surpasses most traditional, RNN-based, and attention-based video summarization methods.
arXiv Detail & Related papers (2021-09-22T07:38:59Z)
CCVS: Context-aware Controllable Video Synthesis [95.22008742695772]
This presentation introduces a self-supervised learning approach to the synthesis of new video clips from old ones. It conditions the synthesis process on contextual information for temporal continuity and ancillary information for fine control.
arXiv Detail & Related papers (2021-07-16T17:57:44Z)
Actor-Transformers for Group Activity Recognition [43.60866347282833]
This paper strives to recognize individual actions and group activities from videos. We propose an actor-transformer model able to learn and selectively extract information relevant for group activity recognition.
arXiv Detail & Related papers (2020-03-28T07:21:58Z)
Motion-Attentive Transition for Zero-Shot Video Object Segmentation [99.44383412488703]
We present a Motion-Attentive Transition Network (MATNet) for zero-shot video object segmentation. An asymmetric attention block, called Motion-Attentive Transition (MAT), is designed within a two-stream encoder. In this way, the encoder becomes deeply interleaved, allowing for closely hierarchical interactions between object motion and appearance.
arXiv Detail & Related papers (2020-03-09T16:58:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.