Co-training Transformer with Videos and Images Improves Action
Recognition
- URL: http://arxiv.org/abs/2112.07175v1
- Date: Tue, 14 Dec 2021 05:41:39 GMT
- Title: Co-training Transformer with Videos and Images Improves Action
Recognition
- Authors: Bowen Zhang, Jiahui Yu, Christopher Fifty, Wei Han, Andrew M. Dai,
Ruoming Pang, Fei Sha
- Abstract summary: In learning action recognition, models are typically pretrained on object recognition with images, such as ImageNet, and later finetuned on the target action recognition task with videos.
This approach has achieved good empirical performance, especially with recent transformer-based video architectures.
We show how video transformers benefit from joint training on diverse video datasets and label spaces.
- Score: 49.160505782802886
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In learning action recognition, models are typically pre-trained on object
recognition with images, such as ImageNet, and later fine-tuned on target
action recognition with videos. This approach has achieved good empirical
performance, especially with recent transformer-based video architectures. While
many recent works aim to design more advanced transformer architectures for
action recognition, less effort has been devoted to how to train video
transformers. In this work, we explore several training paradigms and present
two findings. First, video transformers benefit from joint training on diverse
video datasets and label spaces (e.g., Kinetics is appearance-focused while
SomethingSomething is motion-focused). Second, by further co-training with
images (as single-frame videos), the video transformers learn even better video
representations. We term this approach Co-training Videos and Images for
Action Recognition (CoVeR). In particular, when pretrained on ImageNet-21K
based on the TimeSFormer architecture, CoVeR improves Kinetics-400 Top-1
Accuracy by 2.4%, Kinetics-600 by 2.3%, and SomethingSomething-v2 by 2.3%. When
pretrained on larger-scale image datasets following previous state-of-the-art,
CoVeR achieves the best results on Kinetics-400 (87.2%), Kinetics-600 (87.9%),
Kinetics-700 (79.8%), SomethingSomething-v2 (70.9%), and Moments-in-Time
(46.1%), with a simple spatio-temporal video transformer.
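As a rough illustration of the co-training recipe described above, the sketch below (hypothetical PyTorch code, not the authors' implementation) uses one shared video backbone with a separate classification head per dataset and folds image batches in as single-frame videos; the per-dataset losses are simply summed at each step. Dataset names, shapes, and hyperparameters are placeholders.

```python
# Hypothetical sketch of CoVeR-style co-training (not the authors' code):
# a shared backbone, one classification head per dataset / label space,
# and images folded in as single-frame videos.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SharedVideoEncoder(nn.Module):
    """Toy stand-in for a spatio-temporal video transformer backbone."""

    def __init__(self, dim=256, patch=16):
        super().__init__()
        self.patch_embed = nn.Conv3d(3, dim, kernel_size=(1, patch, patch),
                                     stride=(1, patch, patch))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, video):                       # video: (B, 3, T, H, W)
        tokens = self.patch_embed(video)            # (B, dim, T, H', W')
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, T*H'*W', dim)
        return self.encoder(tokens).mean(dim=1)     # mean-pooled clip feature


# One head per dataset; the label spaces stay separate while weights are shared.
num_classes = {"kinetics400": 400, "ssv2": 174, "imagenet": 1000}
backbone = SharedVideoEncoder()
heads = nn.ModuleDict({k: nn.Linear(256, v) for k, v in num_classes.items()})
opt = torch.optim.AdamW(list(backbone.parameters()) + list(heads.parameters()), lr=1e-4)


def co_training_step(batches):
    """batches: dict name -> (clips, labels); image batches arrive with T == 1."""
    loss = 0.0
    for name, (clips, labels) in batches.items():
        feats = backbone(clips)                     # same backbone for every source
        loss = loss + F.cross_entropy(heads[name](feats), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return float(loss)


# Dummy batches: two video datasets plus images treated as single-frame videos.
step_loss = co_training_step({
    "kinetics400": (torch.randn(2, 3, 8, 64, 64), torch.randint(0, 400, (2,))),
    "ssv2":        (torch.randn(2, 3, 8, 64, 64), torch.randint(0, 174, (2,))),
    "imagenet":    (torch.randn(2, 3, 1, 64, 64), torch.randint(0, 1000, (2,))),
})
print(step_loss)
```

In practice the backbone would be a spatio-temporal transformer such as the one named in the abstract and the batches would come from real dataloaders; the point here is only the shared encoder, the per-label-space heads, and the image-as-single-frame-video trick.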
Related papers
- It Takes Two: Masked Appearance-Motion Modeling for Self-supervised
Video Transformer Pre-training [76.69480467101143]
Self-supervised video transformer pre-training has recently benefited from the mask-and-predict pipeline.
We explicitly investigate motion cues in videos as an extra prediction target and propose our Masked Appearance-Motion Modeling framework.
Our method learns generalized video representations and achieves 82.3% on Kinetics-400, 71.3% on Something-Something V2, 91.5% on UCF101, and 62.5% on HMDB51.
arXiv Detail & Related papers (2022-10-11T08:05:18Z)
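For the Masked Appearance-Motion Modeling paper above, the snippet below is a rough, hypothetical sketch of mask-and-predict with an extra motion target; frame differences stand in for the motion cues, and the real framework's targets, tokenizer, and decoder may differ.

```python
# Hypothetical sketch of mask-and-predict with appearance and motion targets.
# Frame differences stand in for motion cues; this is not the paper's pipeline.
import torch
import torch.nn as nn

B, T, N, D = 2, 8, 49, 128          # batch, frames, patches per frame, token dim
tokens = torch.randn(B, T, N, D)    # appearance tokens from some tokenizer

appearance_target = tokens.clone()
motion_target = tokens[:, 1:] - tokens[:, :-1]            # temporal differences

mask = torch.rand(B, T, N) < 0.75                         # mask 75% of the tokens
encoder_in = tokens.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked tokens

layer = nn.TransformerEncoderLayer(d_model=D, nhead=4, batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=2)
appearance_head = nn.Linear(D, D)
motion_head = nn.Linear(D, D)

encoded = encoder(encoder_in.flatten(1, 2)).reshape(B, T, N, D)

# Reconstruct both targets, but only at the masked positions.
app_loss = ((appearance_head(encoded) - appearance_target) ** 2)[mask].mean()
mot_mask = mask[:, 1:]
mot_loss = ((motion_head(encoded[:, 1:]) - motion_target) ** 2)[mot_mask].mean()
loss = app_loss + mot_loss
print(float(loss))
```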
- BEVT: BERT Pretraining of Video Transformers [89.08460834954161]
We introduce BEVT which decouples video representation learning into spatial representation learning and temporal dynamics learning.
We conduct extensive experiments on three challenging video benchmarks where BEVT achieves very promising results.
arXiv Detail & Related papers (2021-12-02T18:59:59Z)
- Improved Multiscale Vision Transformers for Classification and Detection [80.64111139883694]
We study Multiscale Vision Transformers (MViT) as a unified architecture for image and video classification, as well as object detection.
We present an improved version of MViT that incorporates decomposed relative positional embeddings and residual pooling connections.
We instantiate this architecture in five sizes and evaluate it for ImageNet classification, COCO detection and Kinetics video recognition.
arXiv Detail & Related papers (2021-12-02T18:59:57Z)
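For the improved MViT entry above, here is a schematic (illustrative only, not the paper's code) of a pooled attention block with a residual pooling connection, where the pooled query is added back onto the attention output; the decomposed relative positional embeddings are omitted for brevity, and all names and shapes are placeholders.

```python
# Illustrative pooled-attention block with a residual pooling connection:
# the pooled query is added back to the attention output. Decomposed relative
# positional embeddings are omitted; names and shapes are placeholders.
import torch
import torch.nn as nn


class PoolingAttentionWithResidual(nn.Module):
    def __init__(self, dim=96, heads=4, stride=2):
        super().__init__()
        self.qkv = nn.Linear(dim, dim * 3)
        self.pool = nn.AvgPool1d(kernel_size=stride, stride=stride)  # token pooling
        self.heads = heads
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (B, N, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Pool queries/keys/values along the token axis to reduce resolution.
        q, k, v = (self.pool(t.transpose(1, 2)).transpose(1, 2) for t in (q, k, v))
        B, Nq, D = q.shape
        h, d = self.heads, D // self.heads
        qh, kh, vh = (t.view(B, -1, h, d).transpose(1, 2) for t in (q, k, v))
        attn = torch.softmax(qh @ kh.transpose(-2, -1) / d ** 0.5, dim=-1)
        out = (attn @ vh).transpose(1, 2).reshape(B, Nq, D)
        return self.proj(out) + q                            # residual pooling connection


tokens = torch.randn(2, 64, 96)
print(PoolingAttentionWithResidual()(tokens).shape)          # torch.Size([2, 32, 96])
```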
- VideoLightFormer: Lightweight Action Recognition using Transformers [8.871042314510788]
We propose a novel, lightweight action recognition architecture, VideoLightFormer.
In a factorized fashion, we carefully extend the 2D convolutional Temporal Network with transformers.
We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 datasets.
arXiv Detail & Related papers (2021-07-01T13:55:52Z)
- Video Swin Transformer [41.41741134859565]
We advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off.
The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain.
Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks.
arXiv Detail & Related papers (2021-06-24T17:59:46Z)
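For the Video Swin Transformer entry above, the locality can be pictured as self-attention restricted to non-overlapping 3D windows over (time, height, width); below is a small, hypothetical partitioning helper, not the paper's implementation.

```python
# Hypothetical helper illustrating locality via non-overlapping 3D windows
# over (time, height, width); self-attention would then run within each window.
import torch


def window_partition_3d(x, window=(2, 4, 4)):
    """x: (B, T, H, W, C) -> (num_windows * B, wt * wh * ww, C)."""
    B, T, H, W, C = x.shape
    wt, wh, ww = window
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7).contiguous()
    return x.view(-1, wt * wh * ww, C)


feat = torch.randn(2, 8, 16, 16, 96)            # (B, T, H, W, C) video features
windows = window_partition_3d(feat)
print(windows.shape)                            # torch.Size([128, 32, 96])
```

A shifted-window variant would additionally roll the feature map (e.g., with torch.roll) between consecutive layers before partitioning, so that information can flow across window boundaries.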
- Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition [27.760120524736678]
We present empirical results for training a stronger video vision transformer on the EPIC-KITCHENS-100 Action Recognition dataset.
A single ViViT model achieves 47.4% on the validation set of the EPIC-KITCHENS-100 dataset.
We find that video transformers are especially good at predicting the noun in the verb-noun action prediction task.
arXiv Detail & Related papers (2021-06-09T13:26:02Z)
- VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text [60.97904439526213]
Video-Audio-Text Transformer (VATT) takes raw signals as inputs and extracts multimodal representations that are rich enough to benefit a variety of downstream tasks.
We train VATT end-to-end from scratch using multimodal contrastive losses and evaluate its performance by the downstream tasks of video action recognition, audio event classification, image classification, and text-to-video retrieval.
arXiv Detail & Related papers (2021-04-22T17:07:41Z)
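For the VATT entry above, "multimodal contrastive losses" can be sketched with a generic InfoNCE-style objective between two modality embeddings, where matching video/audio clips form the positive pairs; this is only a stand-in, not VATT's exact objectives (which also involve text).

```python
# Generic InfoNCE-style contrastive loss between two modality embeddings
# (e.g., video vs. audio from the same clip); a sketch, not VATT's exact losses.
import torch
import torch.nn.functional as F


def infonce(z_a, z_b, temperature=0.07):
    """z_a, z_b: (B, D) embeddings; row i of each comes from the same sample."""
    z_a = F.normalize(z_a, dim=-1)
    z_b = F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0))           # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


video_emb = torch.randn(8, 512)
audio_emb = torch.randn(8, 512)
print(float(infonce(video_emb, audio_emb)))
```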
- Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning [88.71867887257274]
We show that spatial augmentations such as cropping also work well for videos, but that previous implementations could not apply them at a scale sufficient for them to be effective.
To address this issue, we first introduce Feature Crop, a method to simulate such augmentations much more efficiently directly in feature space.
Second, we show that, as opposed to naive average pooling, the use of transformer-based attention improves performance significantly.
arXiv Detail & Related papers (2021-03-18T12:32:24Z)
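For the Space-Time Crop & Attend entry above, the two ingredients can be sketched roughly as (i) taking crops directly in feature space and (ii) pooling with a learned attention query instead of naive averaging; the snippet below is an illustrative sketch with made-up names and shapes, not the paper's code.

```python
# Illustrative sketch (not the paper's code): (i) a crop taken directly in
# feature space, and (ii) attention pooling with a learned query instead of
# naive average pooling over spatio-temporal features.
import torch
import torch.nn as nn


def feature_crop(feats, crop=4):
    """feats: (B, C, T, H, W) -> random spatial crop applied in feature space."""
    B, C, T, H, W = feats.shape
    y = torch.randint(0, H - crop + 1, (1,)).item()
    x = torch.randint(0, W - crop + 1, (1,)).item()
    return feats[:, :, :, y:y + crop, x:x + crop]


class AttentionPool(nn.Module):
    """A learned query attends over all spatio-temporal tokens."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats):                         # feats: (B, C, T, H, W)
        B, C = feats.shape[:2]
        tokens = feats.flatten(2).transpose(1, 2)     # (B, T*H*W, C)
        q = self.query.expand(B, -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled.squeeze(1)                      # (B, C)


feats = torch.randn(2, 128, 4, 7, 7)
cropped = feature_crop(feats)              # (2, 128, 4, 4, 4)
print(AttentionPool()(cropped).shape)      # torch.Size([2, 128])
```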