Knowledge Fusion Transformers for Video Action Recognition
- URL: http://arxiv.org/abs/2009.13782v2
- Date: Wed, 30 Sep 2020 03:53:44 GMT
- Title: Knowledge Fusion Transformers for Video Action Recognition
- Authors: Ganesh Samarth, Sheetal Ojha, Nikhil Pareek
- Abstract summary: We present a self-attention based feature enhancer to fuse action knowledge into the 3D inception-based spatio-temporal context of the video clip to be classified.
We show how using only a single stream, with little or no pretraining, can pave the way for performance close to the current state-of-the-art.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Knowledge Fusion Transformers for video action classification.
We present a self-attention based feature enhancer to fuse action knowledge
into the 3D inception-based spatio-temporal context of the video clip to be
classified. We show how using only single-stream networks, with little or no
pretraining, can pave the way for performance close to the current
state-of-the-art. Additionally, we present how different self-attention
architectures used at different levels of the network can be blended in to
enhance feature representation. Our architecture is trained and evaluated on
the UCF-101 and Charades datasets, where it is competitive with the state of
the art. It also exceeds single-stream networks with little or no pretraining
by a large margin.
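As a rough illustration of the idea in the abstract (not the authors' released code), the following minimal sketch shows how a self-attention block could act as a feature enhancer on top of clip-level spatio-temporal features, such as pooled outputs of a 3D inception backbone. The token count, feature dimension, and residual-addition fusion are assumptions made purely for illustration.

```python
# Minimal sketch (assumption: not the authors' implementation) of a
# self-attention feature enhancer applied to clip-level spatio-temporal
# features, e.g. pooled outputs of a 3D inception backbone.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_enhancer(tokens, w_q, w_k, w_v):
    """tokens: (num_tokens, dim) spatio-temporal features of one clip."""
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (num_tokens, num_tokens)
    enhanced = attn @ v
    return tokens + enhanced                        # residual fusion (assumed)

# Toy usage: 8 temporal tokens with 256-dim features from a 3D backbone.
rng = np.random.default_rng(0)
dim = 256
tokens = rng.standard_normal((8, dim)).astype(np.float32)
w_q, w_k, w_v = (rng.standard_normal((dim, dim)) * dim ** -0.5 for _ in range(3))
fused = self_attention_enhancer(tokens, w_q, w_k, w_v)
print(fused.shape)  # (8, 256)
```

In this sketch the attention weights let each spatio-temporal token aggregate evidence from the rest of the clip before classification; the paper's blending of different self-attention architectures at different network levels would correspond to inserting such blocks at several stages of the backbone.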
Related papers
- AU-vMAE: Knowledge-Guide Action Units Detection via Video Masked Autoencoder [38.04963261966939]
We propose a video-level pre-training scheme for facial action units (FAU) detection.
At the heart of our design is a pre-trained video feature extractor based on the video-masked autoencoder.
Our approach demonstrates substantial performance gains over existing state-of-the-art methods on the BP4D and DISFA FAU datasets.
arXiv Detail & Related papers (2024-07-16T08:07:47Z) - Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips [38.02945794078731]
We tackle the task of reconstructing hand-object interactions from short video clips.
Our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape.
We empirically evaluate our approach on egocentric videos, and observe significant improvements over prior single-view and multi-view methods.
arXiv Detail & Related papers (2023-09-11T17:58:30Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework could attain better performances than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Beyond the Field-of-View: Enhancing Scene Visibility and Perception with Clip-Recurrent Transformer [28.326852785609788]
In this paper, we propose the concept of online video inpainting for autonomous vehicles to expand the field of view.
The FlowLens architecture explicitly employs optical flow and implicitly incorporates a novel clip-recurrent transformer for feature propagation.
Experiments and user studies involving offline and online video inpainting, as well as beyond-FoV perception tasks, demonstrate that FlowLens achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-11-21T09:34:07Z) - Frozen CLIP Models are Efficient Video Learners [86.73871814176795]
Video recognition has been dominated by the end-to-end learning paradigm.
Recent advances in Contrastive Vision-Language Pre-training pave the way for a new route for visual recognition tasks.
We present Efficient Video Learning -- an efficient framework for directly training high-quality video recognition models.
arXiv Detail & Related papers (2022-08-06T17:38:25Z) - Activating More Pixels in Image Super-Resolution Transformer [53.87533738125943]
Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution.
We propose a novel Hybrid Attention Transformer (HAT) to activate more input pixels for better reconstruction.
Our overall method significantly outperforms the state-of-the-art methods by more than 1 dB.
arXiv Detail & Related papers (2022-05-09T17:36:58Z) - Event and Activity Recognition in Video Surveillance for Cyber-Physical Systems [0.0]
We show that long-term motion patterns alone play a pivotal role in the task of recognizing an event.
Only the temporal features are exploited using a hybrid Convolutional Neural Network (CNN) + Recurrent Neural Network (RNN) architecture.
arXiv Detail & Related papers (2021-11-03T08:30:38Z) - Efficient Modelling Across Time of Human Actions and Interactions [92.39082696657874]
We argue that current fixed-sized temporal kernels in 3D convolutional neural networks (CNNs) can be improved to better deal with temporal variations in the input.
We study how we can better handle variations between classes of actions by enhancing their feature differences over different layers of the architecture.
The proposed approaches are evaluated on several benchmark action recognition datasets and show competitive results.
arXiv Detail & Related papers (2021-10-05T15:39:11Z) - Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z) - Self-Supervised Learning via multi-Transformation Classification for Action Recognition [10.676377556393527]
We introduce a self-supervised video representation learning method based on the multi-transformation classification to efficiently classify human actions.
The representation of the video is learned in a self-supervised manner by classifying seven different transformations.
We conduct experiments on the UCF101 and HMDB51 datasets with C3D and 3D ResNet-18 as backbone networks.
arXiv Detail & Related papers (2021-02-20T16:11:26Z) - AssembleNet++: Assembling Modality Representations via Attention Connections [83.50084190050093]
We create a family of powerful video models which are able to: (i) learn interactions between semantic object information and raw appearance and motion features, and (ii) deploy attention in order to better learn the importance of features at each convolutional block of the network.
A new network component named peer-attention is introduced, which dynamically learns the attention weights using another block or input modality.
arXiv Detail & Related papers (2020-08-18T17:54:08Z)