Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action
Recognition
- URL: http://arxiv.org/abs/2208.01897v1
- Date: Wed, 3 Aug 2022 08:01:55 GMT
- Title: Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action
Recognition
- Authors: Mei Chee Leong, Haosong Zhang, Hui Li Tan, Liyuan Li, Joo Hwee Lim
- Abstract summary: We investigate two frameworks that combine CNN vision backbone and Transformer to enhance fine-grained action recognition.
Our experimental results show that both our Transformer encoder frameworks effectively learn latent temporal semantics and cross-modality association.
We achieve new state-of-the-art performance on the FineGym benchmark dataset for both proposed architectures.
- Score: 11.116921653535226
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Fine-grained action recognition is a challenging task in computer vision.
Because fine-grained datasets have small inter-class variations in spatial and
temporal space, a fine-grained action recognition model requires good temporal
reasoning and discrimination of attribute action semantics. Leveraging CNNs'
ability to capture high-level spatial-temporal feature representations and
Transformers' efficiency in modeling latent semantics and global dependencies,
we investigate two frameworks that combine a CNN vision backbone with a
Transformer encoder to enhance fine-grained action recognition: 1) a
vision-based encoder to learn latent temporal semantics, and 2) a multi-modal
video-text cross encoder to exploit additional text input and learn cross
association between visual and text semantics. Our experimental results show
that both Transformer encoder frameworks effectively learn latent temporal
semantics and cross-modality association, with improved recognition performance
over the CNN vision model. We achieve new state-of-the-art performance on the
FineGym benchmark dataset with both proposed architectures.
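The first framework described above (a CNN backbone feeding per-frame features into a Transformer encoder for temporal reasoning) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy convolutional backbone, feature dimension, class count, and pooling choices are all assumptions standing in for the paper's actual design.

```python
import torch
import torch.nn as nn

class CNNTransformerEncoder(nn.Module):
    """Hypothetical sketch: per-frame CNN features + temporal Transformer encoder."""

    def __init__(self, feat_dim=256, num_classes=99, num_layers=2, num_heads=4):
        super().__init__()
        # Lightweight stand-in for a pretrained CNN vision backbone.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):
        # clip: (batch, frames, channels, height, width)
        b, t = clip.shape[:2]
        # Run the backbone on every frame, then restore the temporal axis.
        feats = self.backbone(clip.flatten(0, 1)).view(b, t, -1)  # (B, T, D)
        # Self-attention across frames models latent temporal semantics.
        feats = self.encoder(feats)
        # Average-pool over time for a clip-level prediction.
        return self.head(feats.mean(dim=1))

model = CNNTransformerEncoder()
logits = model(torch.randn(2, 8, 3, 64, 64))
print(logits.shape)  # torch.Size([2, 99])
```

The second (multi-modal) framework would additionally cross-attend between these visual tokens and embedded text tokens; the single-stream sketch above only covers the vision-based encoder.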
Related papers
- Neuromorphic Synergy for Video Binarization [54.195375576583864]
Bimodal objects serve as a visual form to embed information that can be easily recognized by vision systems.
Neuromorphic cameras offer new capabilities for alleviating motion blur, but it is non-trivial to first de-blur and then binarize the images in a real-time manner.
We propose an event-based binary reconstruction method that leverages the prior knowledge of the bimodal target's properties to perform inference independently in both event space and image space.
We also develop an efficient integration method to propagate this binary image to high frame rate binary video.
arXiv Detail & Related papers (2024-02-20T01:43:51Z) - Deeply-Coupled Convolution-Transformer with Spatial-temporal
Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z) - Video-SwinUNet: Spatio-temporal Deep Learning Framework for VFSS
Instance Segmentation [10.789826145990016]
This paper presents a deep learning framework for medical video segmentation.
Our framework explicitly extracts features from neighbouring frames across the temporal dimension.
It incorporates them with a temporal feature blender, which then tokenises the high-level temporal feature to form a strong global feature encoded via a Swin Transformer.
arXiv Detail & Related papers (2023-02-22T12:09:39Z) - Image Captioning based on Feature Refinement and Reflective Decoding [0.0]
This paper introduces an encoder-decoder-based image captioning system.
It extracts spatial and global features for each region in the image using the Faster R-CNN with ResNet-101 as a backbone.
The decoder consists of an attention-based recurrent module and a reflective attention module to enhance the decoder's ability to model long-term sequential dependencies.
arXiv Detail & Related papers (2022-06-16T07:56:28Z) - COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding new state-of-the-art results on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z) - TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG)
arXiv Detail & Related papers (2021-11-16T09:10:39Z) - Unsupervised Motion Representation Learning with Capsule Autoencoders [54.81628825371412]
Motion Capsule Autoencoder (MCAE) models motion in a two-level hierarchy.
MCAE is evaluated on a novel Trajectory20 motion dataset and various real-world skeleton-based human action datasets.
arXiv Detail & Related papers (2021-10-01T16:52:03Z) - Spatial-Temporal Transformer for Dynamic Scene Graph Generation [34.190733855032065]
We propose a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input.
Our method is validated on the benchmark dataset Action Genome (AG)
arXiv Detail & Related papers (2021-07-26T16:30:30Z) - Dense Interaction Learning for Video-based Person Re-identification [75.03200492219003]
We propose a hybrid framework, Dense Interaction Learning (DenseIL), to tackle video-based person re-ID difficulties.
DenseIL contains a CNN encoder and a Dense Interaction (DI) decoder.
Our method consistently and significantly outperforms all the state-of-the-art methods on multiple standard video-based re-ID datasets.
arXiv Detail & Related papers (2021-03-16T12:22:08Z) - Temporal Distinct Representation Learning for Action Recognition [139.93983070642412]
Two-Dimensional Convolutional Neural Network (2D CNN) is used to characterize videos.
Different frames of a video share the same 2D CNN kernels, which may result in repeated and redundant information utilization.
We propose a sequential channel filtering mechanism to excite the discriminative channels of features from different frames step by step, and thus avoid repeated information extraction.
Our method is evaluated on the benchmark temporal reasoning datasets Something-Something V1 and V2, where it outperforms the best competitor by 2.4% and 1.3%, respectively.
arXiv Detail & Related papers (2020-07-15T11:30:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.