Texture-Based Input Feature Selection for Action Recognition
- URL: http://arxiv.org/abs/2303.00138v3
- Date: Sun, 23 Apr 2023 09:00:53 GMT
- Title: Texture-Based Input Feature Selection for Action Recognition
- Authors: Yalong Jiang
- Abstract summary: We propose a novel method to identify the task-irrelevant content in inputs that increases the domain discrepancy.
We show that our proposed model is superior to existing models for action recognition on the HMDB-51 dataset and the Penn Action dataset.
- Score: 3.9596068699962323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The performance of video action recognition has been significantly boosted by using motion representations within a two-stream Convolutional Neural Network (CNN) architecture. However, action recognition in real scenarios faces several challenging problems, e.g., variations in viewpoint and pose, and changes in background. The domain discrepancy between the training data and the test data causes a performance drop. To improve model robustness, we propose a novel method to identify the task-irrelevant content in inputs that increases the domain discrepancy. The method is based on a human parsing (HP) model which jointly conducts dense correspondence labelling and semantic part segmentation. The predictions from the HP model are also used to re-render the human regions in each video with the same set of textures, so that human appearance is identical across all classes. A revised dataset is generated for training and testing, which makes the action recognition (AR) model invariant to the irrelevant content in the inputs. Moreover, the predictions from the HP model are used to enrich the inputs to the AR model during both training and testing. Experimental results show that our proposed model is superior to existing models for action recognition on the HMDB-51 dataset and the Penn Action dataset.
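The abstract does not spell out how the HP predictions enrich the AR inputs. Below is a minimal sketch of one plausible scheme, assuming the HP model emits a per-pixel part-label map that is one-hot encoded and concatenated to the RGB frames as extra input channels; all names, shapes, and the part count are illustrative, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def enrich_frames(frames: torch.Tensor, part_maps: torch.Tensor, num_parts: int) -> torch.Tensor:
    """Concatenate one-hot part-segmentation maps onto RGB frames.

    frames:    (T, 3, H, W) float RGB frames
    part_maps: (T, H, W) integer per-pixel part labels from the HP model
    Returns:   (T, 3 + num_parts, H, W) tensor fed to the AR backbone,
               whose first conv layer must be widened to match.
    """
    one_hot = F.one_hot(part_maps.long(), num_parts)   # (T, H, W, P)
    one_hot = one_hot.permute(0, 3, 1, 2).float()      # (T, P, H, W)
    return torch.cat([frames, one_hot], dim=1)

# Toy usage with random tensors standing in for real frames and HP output.
frames = torch.rand(16, 3, 224, 224)
parts = torch.randint(0, 24, (16, 224, 224))
print(enrich_frames(frames, parts, num_parts=24).shape)  # (16, 27, 224, 224)
```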
Related papers
- MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring the pretrained models to downstream tasks may encounter task discrepancy, since pretraining is formulated as image classification or object discrimination.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z)
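A minimal sketch of multi-task supervised pretraining as the MTP summary describes it: one shared backbone with a head per task, and the per-task losses summed. The module names, head shapes, and placeholder losses are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MultiTaskPretrainer(nn.Module):
    """Shared backbone with one head per pretraining task."""
    def __init__(self, backbone: nn.Module, heads: dict):
        super().__init__()
        self.backbone = backbone
        self.heads = nn.ModuleDict(heads)

    def forward(self, images):
        feats = self.backbone(images)
        return {task: head(feats) for task, head in self.heads.items()}

# Hypothetical dense heads for the three SAMRS tasks named above.
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
model = MultiTaskPretrainer(backbone, {
    "semantic_seg": nn.Conv2d(64, 19, 1),
    "instance_seg": nn.Conv2d(64, 19, 1),
    "rotated_det":  nn.Conv2d(64, 6, 1),
})
outputs = model(torch.rand(2, 3, 64, 64))
loss = sum(o.pow(2).mean() for o in outputs.values())  # placeholder per-task losses
loss.backward()
```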
- Adversarial Augmentation Training Makes Action Recognition Models More Robust to Realistic Video Distribution Shifts [13.752169303624147]
Action recognition models often lack robustness when faced with natural distribution shifts between training and test data.
We propose two novel evaluation methods to assess model resilience to such distribution disparity.
We experimentally demonstrate the superior performance of the proposed adversarial augmentation approach over baselines across three state-of-the-art action recognition models.
arXiv Detail & Related papers (2024-01-21T05:50:39Z)
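The summary above does not detail the augmentation itself; a generic one-step (FGSM-style) adversarial perturbation of video clips is one common way to realize adversarial augmentation during training. The epsilon value and pixel range are assumptions.

```python
import torch

def adversarial_augment(model, clips, labels, loss_fn, eps=2 / 255):
    """Return adversarially perturbed clips for augmentation during training.

    clips: (B, C, T, H, W) video batch in [0, 1]; one gradient-sign step
    moves each pixel in the direction that increases the loss.
    """
    clips = clips.clone().detach().requires_grad_(True)
    loss = loss_fn(model(clips), labels)
    loss.backward()
    adv = clips + eps * clips.grad.sign()
    return adv.clamp(0.0, 1.0).detach()
```

The perturbed clips are then mixed into the training batches alongside the clean ones.

- Helping Hands: An Object-Aware Ego-Centric Video Recognition Model [60.350851196619296]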
We introduce an object-aware decoder to improve the performance of representations on ego-centric videos.
We show that the model can act as a drop-in replacement for an ego-centric video model, improving performance through visual-text grounding.
arXiv Detail & Related papers (2023-08-15T17:58:11Z)
- Unsupervised Continual Semantic Adaptation through Neural Rendering [32.099350613956716]
We study continual multi-scene adaptation for the task of semantic segmentation.
We propose training a Semantic-NeRF network for each scene by fusing the predictions of a segmentation model.
We evaluate our approach on ScanNet, where we outperform both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method.
arXiv Detail & Related papers (2022-11-25T09:31:41Z)
- Revisiting Classifier: Transferring Vision-Language Models for Video Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z)
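One common realization of "semantic targets from a pretrained language model" is to score video features against normalized text embeddings of the class names. A hedged sketch follows; the scale factor is an assumption, and how the text embeddings are produced is paper-specific.

```python
import torch
import torch.nn.functional as F

def text_target_logits(video_feats, class_text_embs, scale=100.0):
    """Cosine-similarity logits between video features and class-name embeddings.

    video_feats:     (B, D) pooled features from the video backbone
    class_text_embs: (K, D) frozen embeddings of the K class names,
                     produced by a pretrained language model
    """
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(class_text_embs, dim=-1)
    return scale * v @ t.t()
```

- CONVIQT: Contrastive Video Quality Estimator [63.749184706461826]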
Perceptual video quality assessment (VQA) is an integral component of many streaming and video sharing platforms.
Here we consider the problem of learning perceptually relevant video quality representations in a self-supervised manner.
Our results indicate that compelling representations with perceptual bearing can be obtained using self-supervised learning.
arXiv Detail & Related papers (2022-06-29T15:22:01Z)
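CONVIQT learns its representations contrastively; a standard InfoNCE loss over two augmented views of the same clips is one common way such self-supervised training is set up. This generic form is an assumption, not CONVIQT's exact objective.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss: matched rows of z1 and z2 are positive pairs.

    z1, z2: (B, D) projected features of two views of the same B clips.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature        # (B, B) similarities
    targets = torch.arange(z1.size(0))        # diagonal entries are positives
    return F.cross_entropy(logits, targets)
```

- Few Shot Activity Recognition Using Variational Inference [9.371378627575883]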
We propose a novel variational-inference-based architectural framework (HF-AR) for few-shot activity recognition.
Our framework leverages volume-preserving Householder Flow to learn a flexible posterior distribution of the novel classes.
This results in better performance compared with state-of-the-art few-shot approaches for human activity recognition.
arXiv Detail & Related papers (2021-08-20T03:57:58Z)
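Volume-preserving Householder flow is a known construction: each step reflects the posterior sample with H = I - 2vv^T/||v||^2, whose Jacobian determinant has absolute value 1, so no log-det correction is needed in the variational objective. A minimal sketch, with the chain length and dimensions illustrative:

```python
import torch

def householder_flow(z, vs):
    """Apply a chain of Householder reflections to posterior samples.

    z:  (B, D) samples from a diagonal-Gaussian posterior
    vs: iterable of (D,) learnable vectors; each reflection
        H = I - 2 v v^T / ||v||^2 satisfies |det H| = 1 (volume-preserving).
    """
    for v in vs:
        v = v / v.norm()
        z = z - 2.0 * (z @ v).unsqueeze(-1) * v  # z H, with H symmetric
    return z

z = torch.randn(8, 16)
vs = [torch.randn(16) for _ in range(4)]
print(householder_flow(z, vs).shape)  # (8, 16)
```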
- Spatio-Temporal Human Action Recognition Model with Flexible-interval Sampling and Normalization [0.0]
We propose a human action recognition system for Red-Green-Blue (RGB) video input with our own designed module.
We build a novel dataset with a similar background and discriminative actions for both human keypoint prediction and behavior recognition.
Experimental results demonstrate the effectiveness of the proposed model on our own human behavior recognition dataset and some public datasets.
arXiv Detail & Related papers (2021-08-12T10:02:20Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance while using far fewer trainable parameters, with high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
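A hedged sketch of the spatial half of STAR's idea: attention over skeleton joints restricted by a sparsity mask, with disallowed joint pairs set to -inf before the softmax. The projection layers and the toy mask are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sparse_joint_attention(x, q_proj, k_proj, v_proj, adj_mask):
    """Spatial attention over joints, restricted to allowed joint pairs.

    x:        (B, J, D) per-joint features for one frame
    adj_mask: (J, J) bool; True where a joint may attend to another
    """
    q, k, v = q_proj(x), k_proj(x), v_proj(x)
    scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)
    scores = scores.masked_fill(~adj_mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

d, joints = 32, 25
mask = torch.eye(joints, dtype=torch.bool)  # toy mask: self-attention only
out = sparse_joint_attention(torch.rand(2, joints, d),
                             nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d), mask)
```

In practice the mask would encode the skeleton adjacency rather than the identity used in this toy example.

- Spatial-Temporal Alignment Network for Action Recognition and Detection [80.19235282200697]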
This paper studies how to introduce viewpoint-invariant feature representations that can help action recognition and detection.
We propose a novel Spatial-Temporal Alignment Network (STAN) that aims to learn geometric invariant representations for action recognition and action detection.
We test our STAN model extensively on AVA, Kinetics-400, AVA-Kinetics, Charades, and Charades-Ego datasets.
arXiv Detail & Related papers (2020-12-04T06:23:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.