EgoViT: Pyramid Video Transformer for Egocentric Action Recognition
- URL: http://arxiv.org/abs/2303.08920v1
- Date: Wed, 15 Mar 2023 20:33:50 GMT
- Title: EgoViT: Pyramid Video Transformer for Egocentric Action Recognition
- Authors: Chenbin Pan, Zhiqi Zhang, Senem Velipasalar, Yi Xu
- Abstract summary: Capturing the interaction of hands with objects is important for autonomously detecting human actions from egocentric videos.
We present a pyramid video transformer with a dynamic class token generator for egocentric action recognition.
- Score: 18.05706639179499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Capturing the interaction of hands with objects is important for
autonomously detecting human actions from egocentric videos. In this work, we present a pyramid
video transformer with a dynamic class token generator for egocentric action
recognition. Different from previous video transformers, which use the same
static embedding as the class token for diverse inputs, we propose a dynamic
class token generator that produces a class token for each input video by
analyzing the hand-object interaction and the related motion information. The
dynamic class token can diffuse such information to the entire model by
communicating with other informative tokens in the subsequent transformer
layers. With the dynamic class token, dissimilarity between videos can be more
prominent, which helps the model distinguish various inputs. In addition,
traditional video transformers explore temporal features globally, which
requires large amounts of computation. However, egocentric videos often contain
frequent background scene transitions, which cause discontinuities across
distant frames. In this case, blindly reducing the temporal sampling
rate will risk losing crucial information. Hence, we also propose a pyramid
architecture to hierarchically process the video from short-term high rate to
long-term low rate. With the proposed architecture, we significantly reduce the
computational cost as well as the memory requirement without sacrificing
model performance. We perform comparisons with different baseline video
transformers on the EPIC-KITCHENS-100 and EGTEA Gaze+ datasets. Both
quantitative and qualitative results show that the proposed model efficiently
improves performance on egocentric action recognition.
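
To make the core idea concrete, below is a minimal PyTorch sketch of a dynamic class token: instead of one static learnable embedding shared by all inputs, a per-video class token is generated from hand-object interaction features and prepended to the patch tokens so it can exchange information with them in the transformer layers. All module names, the mean-pooling step, the feature dimensions, and the 97-way classification head are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn


class DynamicClassTokenGenerator(nn.Module):
    """Produce a per-video class token from hand-object interaction
    features (hypothetical interface; the paper also injects motion
    information, which is omitted here for brevity)."""

    def __init__(self, feat_dim: int, embed_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)

    def forward(self, hoi_feats: torch.Tensor) -> torch.Tensor:
        # hoi_feats: (B, N, feat_dim) features of hand/object regions
        pooled = hoi_feats.mean(dim=1)      # (B, feat_dim), simple average pooling
        cls_token = self.proj(pooled)       # (B, embed_dim)
        return cls_token.unsqueeze(1)       # (B, 1, embed_dim)


class ToyVideoTransformer(nn.Module):
    """Prepend the dynamic class token to the spatio-temporal patch tokens
    and classify from its final state, as in a standard ViT-style head."""

    def __init__(self, embed_dim: int = 256, depth: int = 4, heads: int = 4,
                 num_classes: int = 97):  # e.g. 97 verb classes in EPIC-KITCHENS-100
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.cls_gen = DynamicClassTokenGenerator(embed_dim, embed_dim)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens: torch.Tensor,
                hoi_feats: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (B, T*P, embed_dim) spatio-temporal patch embeddings
        cls = self.cls_gen(hoi_feats)                 # (B, 1, D), input-dependent
        x = torch.cat([cls, patch_tokens], dim=1)     # (B, 1 + T*P, D)
        x = self.encoder(x)                           # class token attends to all tokens
        return self.head(x[:, 0])                     # logits from the class token


if __name__ == "__main__":
    patches = torch.randn(2, 196, 256)   # toy patch tokens
    hoi = torch.randn(2, 8, 256)         # toy hand-object region features
    print(ToyVideoTransformer()(patches, hoi).shape)  # torch.Size([2, 97])
```

Under the same assumptions, the pyramid part of the architecture would correspond to stacking several such encoder stages and merging temporally adjacent tokens between stages, so that early stages process short clips at a high frame rate while later stages cover the whole video at a lower rate.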
Related papers
- VDT: General-purpose Video Diffusion Transformers via Mask Modeling [62.71878864360634]
Video Diffusion Transformer (VDT) pioneers the use of transformers in diffusion-based video generation.
We propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.
arXiv Detail & Related papers (2023-05-22T17:59:45Z) - SViTT: Temporal Learning of Sparse Video-Text Transformers [65.93031164906812]
We propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention.
SViTT employs two forms of sparsity: edge sparsity that limits the query-key communication between tokens in self-attention, and node sparsity that discards uninformative visual tokens.
arXiv Detail & Related papers (2023-04-18T08:17:58Z) - Time Is MattEr: Temporal Self-supervision for Video Transformers [72.42240984211283]
We design simple yet effective self-supervised tasks for video models to learn temporal dynamics better.
Our method learns the temporal order of video frames as extra self-supervision and enforces the randomly shuffled frames to have low-confidence outputs.
Under various video action recognition tasks, we demonstrate the effectiveness of our method and its compatibility with state-of-the-art Video Transformers.
arXiv Detail & Related papers (2022-07-19T04:44:08Z) - Video Transformers: A Survey [42.314208650554264]
We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
arXiv Detail & Related papers (2022-01-16T07:31:55Z) - Efficient Video Transformers with Spatial-Temporal Token Selection [68.27784654734396]
We present STTS, a token selection framework that dynamically selects a few informative tokens in both temporal and spatial dimensions conditioned on input video samples.
Our framework achieves similar results while requiring 20% less computation.
arXiv Detail & Related papers (2021-11-23T00:35:58Z) - Leveraging Local Temporal Information for Multimodal Scene
Classification [9.548744259567837]
Video scene classification models should capture the spatial (pixel-wise) and temporal (frame-wise) characteristics of a video effectively.
Transformer models with self-attention, which are designed to produce contextualized representations for individual tokens given a sequence of tokens, are becoming increasingly popular in many computer vision tasks.
We propose a novel self-attention block that leverages both local and global temporal relationships between the video frames to obtain better contextualized representations for the individual frames.
arXiv Detail & Related papers (2021-10-26T19:58:32Z) - Object-Region Video Transformers [100.23380634952083]
We present Object-Region Video Transformers (ORViT), an object-centric approach that extends video transformer layers with object representations.
Our ORViT block consists of two object-level streams: appearance and dynamics.
We show strong improvement in performance across all tasks and datasets considered, demonstrating the value of a model that incorporates object representations into a transformer architecture.
arXiv Detail & Related papers (2021-10-13T17:51:46Z) - Generative Video Transformer: Can Objects be the Words? [22.788711301106765]
We propose the Object-Centric Video Transformer (OCVT) which utilizes an object-centric approach for decomposing scenes into tokens suitable for use in a generative video transformer.
By factoring video into objects, our fully unsupervised model is able to learn complex spatio-temporal dynamics of multiple objects in a scene and generate future frames of the video.
Our model is also significantly more memory-efficient than pixel-based models and thus able to train on videos of length up to 70 frames with a single 48GB GPU.
arXiv Detail & Related papers (2021-07-20T03:08:39Z) - VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 (SSV2) datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z) - ViViT: A Video Vision Transformer [75.74690759089529]
We present pure-transformer based models for video classification.
Our model extracts spatio-temporal tokens from the input video, which are then encoded by a series of transformer layers.
We show how we can effectively regularise the model during training and leverage pretrained image models to be able to train on comparatively small datasets.
arXiv Detail & Related papers (2021-03-29T15:27:17Z)