Universal-to-Specific Framework for Complex Action Recognition
- URL: http://arxiv.org/abs/2007.06149v1
- Date: Mon, 13 Jul 2020 01:49:07 GMT
- Title: Universal-to-Specific Framework for Complex Action Recognition
- Authors: Peisen Zhao, Lingxi Xie, Ya Zhang, Qi Tian
- Abstract summary: We propose an effective universal-to-specific (U2S) framework for complex action recognition.
The U2S framework is composed of three subnetworks: a universal network, a category-specific network, and a mask network.
Experiments on a variety of benchmark datasets demonstrate the effectiveness of the U2S framework.
- Score: 114.78468658086572
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based action recognition has recently attracted much attention in the
field of computer vision. To solve more complex recognition tasks, it has
become necessary to distinguish different levels of interclass variations.
Inspired by the human decision-making process, which first narrows down the
probable classes and then applies a "rethinking" process for finer-level
recognition, we propose an effective universal-to-specific
(U2S) framework for complex action recognition. The U2S framework is composed
of three subnetworks: a universal network, a category-specific network, and a
mask network. The universal network first learns universal feature
representations. The mask network then generates attention masks for confusing
classes through category regularization based on the output of the universal
network. The mask is further used to guide the category-specific network for
class-specific feature representations. The entire framework is optimized in an
end-to-end manner. Experiments on a variety of benchmark datasets, e.g., the
Something-Something, UCF101, and HMDB51 datasets, demonstrate the effectiveness
of the U2S framework; i.e., U2S can focus on discriminative spatiotemporal
regions for confusing categories. We further visualize the relationship between
different classes, showing that U2S indeed improves the discriminability of
learned features. Moreover, the proposed U2S model is a general framework and
may adopt any base recognition network.
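As a rough illustration of the three-subnetwork design described in the abstract, the following PyTorch-style sketch wires a universal network, a mask network, and a category-specific network together and supervises both heads end to end. The module names, tensor shapes, the toy 3D-convolutional backbones, and the way the mask is conditioned on the coarse scores are all assumptions made for illustration; the paper's actual category regularization and choice of base recognition network are not reproduced here.

```python
# Minimal sketch of the U2S idea: universal net -> mask net -> category-specific net,
# optimized end to end. Architectural details are illustrative assumptions only.
import torch
import torch.nn as nn


class UniversalNet(nn.Module):
    """Learns universal spatiotemporal features and coarse class scores."""
    def __init__(self, in_channels=3, feat_dim=256, num_classes=174):
        super().__init__()
        self.backbone = nn.Sequential(                       # stand-in for any base
            nn.Conv3d(in_channels, feat_dim, 3, padding=1),  # recognition network
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, video):                         # video: (B, C, T, H, W)
        feat = self.backbone(video)                   # (B, D, T, H, W)
        logits = self.fc(self.pool(feat).flatten(1))  # coarse predictions
        return feat, logits


class MaskNet(nn.Module):
    """Generates an attention mask over the universal features,
    conditioned on the coarse scores (the confusing classes)."""
    def __init__(self, feat_dim=256, num_classes=174):
        super().__init__()
        self.class_embed = nn.Linear(num_classes, feat_dim)
        self.to_mask = nn.Conv3d(feat_dim, 1, kernel_size=1)

    def forward(self, feat, logits):
        cond = self.class_embed(logits.softmax(dim=1))   # (B, D) class-conditioned vector
        gated = feat * cond[:, :, None, None, None]
        return torch.sigmoid(self.to_mask(gated))        # (B, 1, T, H, W) in [0, 1]


class CategorySpecificNet(nn.Module):
    """Refines recognition on mask-weighted features."""
    def __init__(self, feat_dim=256, num_classes=174):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv3d(feat_dim, feat_dim, 3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, feat, mask):
        refined = self.refine(feat * mask)
        return self.fc(self.pool(refined).flatten(1))    # fine predictions


class U2S(nn.Module):
    def __init__(self, num_classes=174):
        super().__init__()
        self.universal = UniversalNet(num_classes=num_classes)
        self.mask = MaskNet(num_classes=num_classes)
        self.specific = CategorySpecificNet(num_classes=num_classes)

    def forward(self, video):
        feat, coarse = self.universal(video)
        attn = self.mask(feat, coarse)
        fine = self.specific(feat, attn)
        return coarse, fine                              # both heads are supervised


if __name__ == "__main__":
    model = U2S(num_classes=174)                         # e.g. a Something-Something label space
    clip = torch.randn(2, 3, 8, 56, 56)                  # toy batch of short clips
    labels = torch.randint(0, 174, (2,))
    coarse, fine = model(clip)
    loss = nn.functional.cross_entropy(coarse, labels) + \
           nn.functional.cross_entropy(fine, labels)
    loss.backward()                                      # end-to-end optimization
```

The point of the sketch is the data flow: coarse predictions from the universal branch condition the mask, and the mask gates the features used by the category-specific branch, so gradients from the fine-level loss reach all three subnetworks.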
Related papers
- Siamese Transformer Networks for Few-shot Image Classification [9.55588609556447]
Humans exhibit remarkable proficiency in visual classification tasks, accurately recognizing and classifying new images with minimal examples.
Existing few-shot image classification methods often emphasize either global features or local features, with few studies considering the integration of both.
We propose a novel approach based on the Siamese Transformer Network (STN).
Our strategy effectively harnesses the potential of global and local features in few-shot image classification, circumventing the need for complex feature adaptation modules.
arXiv Detail & Related papers (2024-07-16T14:27:23Z)
- Label Anything: Multi-Class Few-Shot Semantic Segmentation with Visual Prompts [10.262029691744921]
We present Label Anything, an innovative neural network architecture designed for few-shot semantic segmentation (FSS).
Label Anything demonstrates remarkable generalizability across multiple classes with minimal examples required per class.
Our comprehensive experimental validation, particularly achieving state-of-the-art results on the COCO-$20^i$ benchmark, underscores Label Anything's robust generalization and flexibility.
arXiv Detail & Related papers (2024-07-02T09:08:06Z)
- Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models [64.24227572048075]
We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence in which external knowledge is usually incorporated into recognizing novel categories of objects.
arXiv Detail & Related papers (2023-08-22T04:24:45Z)
- Generalized Few-Shot Continual Learning with Contrastive Mixture of Adapters [59.82088750033897]
We set up a Generalized FSCL (GFSCL) protocol involving both class- and domain-incremental situations.
We find that common continual learning methods have poor generalization ability on unseen domains.
To address this, we propose a rehearsal-free framework based on the Vision Transformer (ViT), named Contrastive Mixture of Adapters (CMoA).
arXiv Detail & Related papers (2023-02-12T15:18:14Z)
- All Grains, One Scheme (AGOS): Learning Multi-grain Instance Representation for Aerial Scene Classification [31.412401135677744]
We propose a novel all grains, one scheme (AGOS) framework to tackle the challenges of aerial scene classification.
It consists of a multi-grain perception module (MGP), a multi-branch multi-instance representation module (MBMIR) and a self-aligned semantic fusion (SSF) module.
Our AGOS is flexible and can be easily adapted to existing CNNs in a plug-and-play manner.
arXiv Detail & Related papers (2022-05-06T17:10:44Z)
- Semantic-diversity transfer network for generalized zero-shot learning via inner disagreement based OOD detector [26.89763840782029]
Zero-shot learning (ZSL) aims to recognize objects from unseen classes, where the core problem is to transfer knowledge from seen classes to unseen classes.
Knowledge transfer in many existing works is limited, mainly because 1) the widely used visual features are global and not fully consistent with semantic attributes.
We propose a Semantic-diversity transfer Network (SetNet) to address the first two limitations, in which 1) a multiple-attention architecture and a diversity regularizer are proposed to learn multiple local visual features that are more consistent with semantic attributes, and 2) a projector ensemble geometrically takes the diverse local features as inputs.
arXiv Detail & Related papers (2022-03-17T01:31:27Z)
- AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation has some unique challenges, the most critical of which is foreground-background imbalance.
We propose the Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$_2$ significantly improves accuracy on three widely used aerial benchmarks while remaining as fast as mainstream methods.
arXiv Detail & Related papers (2022-02-18T10:14:45Z)
- Multi-level Second-order Few-shot Learning [111.0648869396828]
We propose a Multi-level Second-order (MlSo) few-shot learning network for supervised or unsupervised few-shot image classification and few-shot action recognition.
We leverage so-called power-normalized second-order base learner streams combined with features that express multiple levels of visual abstraction.
We demonstrate respectable results on standard datasets such as Omniglot, mini-ImageNet, tiered-ImageNet, Open MIC, fine-grained datasets such as CUB Birds, Stanford Dogs and Cars, and action recognition datasets such as HMDB51, UCF101, and mini-MIT.
arXiv Detail & Related papers (2022-01-15T19:49:00Z)
- Uni-EDEN: Universal Encoder-Decoder Network by Multi-Granular Vision-Language Pre-training [120.91411454661741]
We present a pre-trainable Universal-DEcoder Network (Uni-EDEN) to facilitate both vision-language perception and generation.
Uni-EDEN is a two-stream Transformer-based structure consisting of three modules, including object and sentence encoders that separately learn the representations of each modality.
arXiv Detail & Related papers (2022-01-11T16:15:07Z)
- Group Based Deep Shared Feature Learning for Fine-grained Image Classification [31.84610555517329]
We present a new deep network architecture that explicitly models shared features and removes their effect to achieve enhanced classification results.
We call this framework Group based deep Shared Feature Learning (GSFL) and the resulting learned network GSFL-Net.
A key benefit of our specialized autoencoder is that it is versatile and can be combined with state-of-the-art fine-grained feature extraction models and trained together with them to improve their performance directly.
arXiv Detail & Related papers (2020-04-04T00:01:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.