Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition
- URL: http://arxiv.org/abs/2010.09982v1
- Date: Tue, 20 Oct 2020 03:06:20 GMT
- Title: Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition
- Authors: Yuqian Fu, Li Zhang, Junke Wang, Yanwei Fu and Yu-Gang Jiang
- Abstract summary: Few-shot video recognition aims at learning new actions with only very few labeled samples.
We propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans can easily recognize actions from only a few examples,
while existing video recognition models still rely heavily on large-scale
labeled data. This observation has motivated growing interest in few-shot
video action recognition, which aims to learn new actions from only a few
labeled samples. In this paper, we propose a depth guided Adaptive
Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net.
Concretely, we tackle the few-shot recognition problem from three aspects:
firstly, we alleviate the extreme data scarcity by introducing depth
information as a carrier of the scene, which brings extra visual information
to our model; secondly, we fuse the representation of original RGB clips
with multiple non-strictly corresponding depth clips sampled by our temporal
asynchronization augmentation mechanism, which synthesizes new instances at
the feature level; thirdly, a novel Depth Guided Adaptive Instance
Normalization (DGAdaIN) fusion module is proposed to fuse the two-stream
modalities efficiently. Additionally, to better mimic the few-shot
recognition process, our model is trained in a meta-learning manner.
Extensive experiments on several action recognition benchmarks demonstrate
the effectiveness of our model.
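The two mechanisms named above lend themselves to a short illustration. Below is a minimal PyTorch sketch, under our own assumptions, of (a) temporal asynchronization, i.e. sampling a depth clip whose start index is randomly shifted away from the RGB clip's, and (b) a DGAdaIN-style fusion in which the normalized RGB feature is re-scaled and re-shifted by affine parameters predicted from the depth feature. The layer sizes, the per-vector statistics, and the helper names (`DGAdaIN`, `asynchronous_depth_clip`) are illustrative, not the authors' released code.

```python
import random

import torch
import torch.nn as nn


class DGAdaIN(nn.Module):
    """Depth guided adaptive instance normalization (illustrative sketch).

    The RGB clip feature is normalized to zero mean / unit variance, then
    modulated by a scale and shift predicted from the depth clip feature.
    """

    def __init__(self, dim: int, eps: float = 1e-5):
        super().__init__()
        self.eps = eps
        self.to_scale = nn.Linear(dim, dim)  # depth feature -> channel scale
        self.to_shift = nn.Linear(dim, dim)  # depth feature -> channel shift

    def forward(self, rgb_feat: torch.Tensor,
                depth_feat: torch.Tensor) -> torch.Tensor:
        # Per-sample statistics over the feature dimension.
        mu = rgb_feat.mean(dim=-1, keepdim=True)
        sigma = rgb_feat.std(dim=-1, keepdim=True)
        normed = (rgb_feat - mu) / (sigma + self.eps)
        # Depth features act as the "style" that re-modulates RGB content.
        return self.to_scale(depth_feat) * normed + self.to_shift(depth_feat)


def asynchronous_depth_clip(depth_frames: torch.Tensor, rgb_start: int,
                            clip_len: int, max_shift: int = 8) -> torch.Tensor:
    """Sample a depth clip whose start is randomly shifted relative to the
    RGB clip, so the two modalities are non-strictly aligned in time."""
    num_frames = depth_frames.shape[0]
    shift = random.randint(-max_shift, max_shift)
    start = min(max(rgb_start + shift, 0), num_frames - clip_len)
    return depth_frames[start:start + clip_len]
```

Because each RGB clip can be paired with several differently shifted depth clips, every pairing yields a distinct fused feature, which is how new instances are synthesized at the feature level without collecting new videos.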
Related papers
- A Survey on Backbones for Deep Video Action Recognition (arXiv, 2024-05-09)
Action recognition is a key technology in building interactive metaverses.
This paper reviews several action recognition methods based on deep neural networks.
We introduce these methods in three parts: 1) two-stream networks and their variants, which take RGB video frames and optical flow as input; 2) 3D convolutional networks, which exploit the RGB modality directly so that extracting motion information separately is no longer necessary; 3) Transformer-based methods, which bring models from natural language processing into computer vision and video understanding.
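The two-stream idea in part 1) can be made concrete in a few lines. The sketch below, with placeholder backbones of our own choosing, scores RGB frames and a stack of optical-flow fields separately and averages the class logits; real two-stream models use far deeper networks.

```python
import torch
import torch.nn as nn


class TwoStreamLateFusion(nn.Module):
    """Minimal two-stream sketch: separate networks score an RGB frame and
    stacked optical flow, and their class logits are averaged."""

    def __init__(self, num_classes: int, flow_stack: int = 10):
        super().__init__()
        # Toy backbones; stand-ins for the deep CNNs used in practice.
        self.rgb_net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))
        self.flow_net = nn.Sequential(
            nn.Conv2d(2 * flow_stack, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, num_classes))

    def forward(self, rgb: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
        # Late fusion: average the per-stream class logits.
        return 0.5 * (self.rgb_net(rgb) + self.flow_net(flow))
```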
- Multi-view Action Recognition via Directed Gromov-Wasserstein Discrepancy (arXiv, 2024-05-02)
Action recognition has become one of the most popular research topics in computer vision.
We propose a multi-view attention consistency method that computes the similarity between the attention maps obtained from two different views of the action videos.
Our approach applies the idea of Neural Radiance Field to implicitly render the features from novel views when training on single-view datasets.
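The consistency idea can be illustrated with a short loss function. Note that the paper measures agreement with a directed Gromov-Wasserstein discrepancy; the cosine form below is a deliberately simpler stand-in to show where such a loss plugs in, and the function name is ours.

```python
import torch
import torch.nn.functional as F


def attention_consistency_loss(attn_a: torch.Tensor,
                               attn_b: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between attention maps computed from two views
    of the same action video. attn_a, attn_b: (B, H, W) attention maps."""
    a = F.normalize(attn_a.flatten(1), dim=1)  # unit-norm per sample
    b = F.normalize(attn_b.flatten(1), dim=1)
    return (1.0 - (a * b).sum(dim=1)).mean()   # 0 when the maps fully agree
```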
- Video-based Person Re-identification with Long Short-Term Representation Learning (arXiv, 2023-08-07)
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapping cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
- DOAD: Decoupled One Stage Action Detection Network (arXiv, 2023-04-01)
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
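Reading the two summaries above together, the decoupling can be pictured as two parallel heads over shared features, one localizing people and one classifying actions, so no second stage is needed. The sketch below is our own schematic reading, not DOAD's actual architecture.

```python
import torch
import torch.nn as nn


class DecoupledOneStageHead(nn.Module):
    """Illustrative decoupling: two parallel heads on a shared feature map,
    one regressing person boxes and one producing action logits."""

    def __init__(self, in_dim: int, num_actions: int):
        super().__init__()
        self.box_head = nn.Conv2d(in_dim, 4, 1)               # box offsets
        self.action_head = nn.Conv2d(in_dim, num_actions, 1)  # action logits

    def forward(self, feat: torch.Tensor):
        # feat: (B, C, H, W) from a shared backbone; both heads run in
        # parallel, i.e. in a single stage.
        return self.box_head(feat), self.action_head(feat)
```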
- Few-Shot Fine-Grained Action Recognition via Bidirectional Attention and Contrastive Meta-Learning (arXiv, 2021-08-15)
Fine-grained action recognition is attracting increasing attention due to the emerging demand for specific action understanding in real-world applications.
We propose the few-shot fine-grained action recognition problem, which aims to recognize novel fine-grained actions with only a few samples given for each class.
Although progress has been made on coarse-grained actions, existing few-shot recognition methods encounter two issues when handling fine-grained actions.
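For readers unfamiliar with the few-shot setup this entry refers to, the sketch below shows the standard episodic baseline (class prototypes plus nearest-prototype classification). It is a generic illustration of the problem setting, not the paper's bidirectional attention or contrastive meta-learning method.

```python
import torch


def prototype_classify(support: torch.Tensor, support_labels: torch.Tensor,
                       query: torch.Tensor, n_way: int) -> torch.Tensor:
    """Generic few-shot episode: average each class's support embeddings
    into a prototype, then label queries by distance to the prototypes.

    support: (N, D) embeddings, support_labels: (N,), query: (Q, D).
    Returns (Q, n_way) class probabilities."""
    protos = torch.stack([support[support_labels == c].mean(0)
                          for c in range(n_way)])  # (n_way, D)
    dists = torch.cdist(query, protos)             # (Q, n_way)
    return (-dists).softmax(dim=1)                 # closer = more likely
```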
- EAN: Event Adaptive Network for Enhanced Action Recognition (arXiv, 2021-07-22)
We propose a unified action recognition framework to investigate the dynamic nature of video content.
First, when extracting local cues, we generate dynamic-scale spatio-temporal kernels to adaptively fit diverse events.
Second, to accurately aggregate these cues into a global video representation, we propose to mine interactions among only a few selected foreground objects with a Transformer.
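The second step can be sketched as scoring the spatial-temporal tokens, keeping the top-k as foreground, and letting only those interact in a Transformer encoder. The scoring head, the value of k, and the encoder depth are assumptions of ours.

```python
import torch
import torch.nn as nn


class SparseForegroundTransformer(nn.Module):
    """Illustrative sketch: keep only the top-k scored tokens and model
    their interactions with a small Transformer encoder."""

    def __init__(self, dim: int, k: int = 8):
        super().__init__()
        self.k = k
        self.score = nn.Linear(dim, 1)  # per-token foreground score
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D) spatial-temporal features; D divisible by nhead.
        scores = self.score(tokens).squeeze(-1)   # (B, N)
        idx = scores.topk(self.k, dim=1).indices  # (B, k) foreground tokens
        picked = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        return self.encoder(picked).mean(dim=1)   # pooled global cue
```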
- Human Action Recognition Based on Multi-scale Feature Maps from Depth Video Sequences (arXiv, 2021-01-19)
We present a novel framework focusing on multi-scale motion information to recognize human actions from depth video sequences.
We employ depth motion images (DMI) as the templates to generate the multi-scale static representation of actions.
We extract the multi-granularity descriptor called LP-DMI-HOG to provide more discriminative features.
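A depth motion image collapses a depth video into a single static template. The sketch below uses a common DMI formulation (accumulated absolute frame differences); the paper's LP-DMI builds a local pyramid on such templates before extracting HOG features, e.g. with skimage.feature.hog.

```python
import numpy as np


def depth_motion_image(depth_seq: np.ndarray) -> np.ndarray:
    """Collapse a depth video of shape (T, H, W) into one static template
    by accumulating absolute frame-to-frame differences."""
    diffs = np.abs(np.diff(depth_seq.astype(np.float32), axis=0))
    dmi = diffs.sum(axis=0)          # per-pixel motion energy
    return dmi / (dmi.max() + 1e-8)  # normalize to [0, 1]
```

Computing this template at several spatial scales and describing each scale with HOG yields a multi-scale static representation of the action.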
- A Comprehensive Study of Deep Video Action Recognition (arXiv, 2020-12-11)
Video action recognition is one of the representative tasks for video understanding.
We provide a comprehensive survey of over 200 existing papers on deep learning for video action recognition.
- Adaptive Context-Aware Multi-Modal Network for Depth Completion (arXiv, 2020-08-25)
We adopt graph propagation to capture the observed spatial contexts.
We then apply an attention mechanism to the propagation, which encourages the network to model the contextual information adaptively.
Finally, we introduce the symmetric gated fusion strategy to exploit the extracted multi-modal features effectively.
Our model, named Adaptive Context-Aware Multi-Modal Network (ACMNet), achieves the state-of-the-art performance on two benchmarks.
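The symmetric gated fusion can be pictured as each branch blending its own features with the other branch's under a learned gate. The wiring below is a minimal sketch under our own assumptions; ACMNet's exact design may differ.

```python
import torch
import torch.nn as nn


class SymmetricGatedFusion(nn.Module):
    """Blend two modality feature maps (e.g. image and depth branches) with
    learned gates, symmetrically: each output mixes both inputs."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate_a = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())
        self.gate_b = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 1), nn.Sigmoid())

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor):
        both = torch.cat([feat_a, feat_b], dim=1)    # (B, 2C, H, W)
        g_a, g_b = self.gate_a(both), self.gate_b(both)
        fused_a = g_a * feat_a + (1 - g_a) * feat_b  # branch-a output
        fused_b = g_b * feat_b + (1 - g_b) * feat_a  # branch-b output
        return fused_a, fused_b
```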
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.